The System Health dashboard provides an overview of the health state of all devices, based on user-defined metrics.

The dashboard is composed of multiple cards, each representing the health of a device, optionally divided into device groups.

In addition, the dashboard displays the DPOD health state based on Internal Health Alerts.

Metrics

  • The health of each device is based on several user-defined metrics (for example, the CPU of the device).
  • A metric is a criterion together with a set of thresholds that define whether the state of the device in that aspect is one of: Good / Warning / Error.
  • Metrics are based on the Alerts subsystem of DPOD:
    • The user can define which alerts are part of the System Health by selecting the "System Health Metric" option under [Manage → Alert → Setup Alerts →  Edit Alert].
    • Each alert can be used as a simple alert, as a System Health metric, or both. Using an alert both for alerting and as a System Health metric is recommended, since it makes sure the System Health dashboard will precisely reflect the sent alerts.
  • The following System Health metrics are defined by default:
    • Devices CPU Metric
    • Devices Memory Metric
    • Devices Load Metric
    • Devices Fan Metric
    • Devices Temperature Metric
    • Devices Voltage Metric
    • Devices Space Encrypted Metric
    • Devices Space Temp Metric
    • Devices Space Internal Metric
    • System Errors Metric
    • Device Availability Metric - an internal metric, based on the "Device Resources Monitoring" option selected at the device level, that checks whether the device is available.
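
The idea of a metric as a criterion plus thresholds can be illustrated with a minimal sketch. This is not DPOD code: the `Metric` class, its field names, the "higher value is worse" comparison, and the threshold values 70 and 90 are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    GOOD = "Good"
    WARNING = "Warning"
    ERROR = "Error"


@dataclass(frozen=True)
class Metric:
    """Hypothetical model of a metric: a name plus two thresholds."""
    name: str
    warning_threshold: float  # assumed: values at or above this are a Warning
    error_threshold: float    # assumed: values at or above this are an Error

    def evaluate(self, value: float) -> State:
        # Check the most severe threshold first.
        if value >= self.error_threshold:
            return State.ERROR
        if value >= self.warning_threshold:
            return State.WARNING
        return State.GOOD


# Illustrative thresholds only; real values are defined by the user in the alert setup.
cpu = Metric("Devices CPU Metric", warning_threshold=70.0, error_threshold=90.0)

for sample in (45.0, 75.0, 95.0):
    print(sample, cpu.evaluate(sample).value)
```
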

Prerequisites:

Device Health Calculation

  • The System Health dashboard calculates the health of each device over the past hour.
  • The past hour is divided into 5 parts:
    • Last 5 minutes (may be configured via "System Health Dashboard Sample Time Range (min.)" System Parameter)
    • Previous 10 minutes
    • 3 parts of 15 minutes (the rest of the hour)
  • Each part displays a single icon with the health of the device during that period of time:

| Icon | Description | Last 5 minutes | Other parts |
|---|---|---|---|
| (icon) | Good | No errors or warnings found in metric samples | (same) |
| (icon) | Warning | Warnings found in metric samples | (same) |
| (icon) | Error | Errors found in metric samples OR Warnings count exceeded threshold (see below) OR Warnings damage points exceeded threshold (see below) | (same) |
| (icon) | Unknown | - | No metric samples found (e.g. DPOD alert subsystem was down) OR The device was unavailable during the entire time period |
| (icon) + red background color | Critical | No metric samples found (e.g. DPOD alert subsystem was down) OR The device was unavailable in the last "Device Availability Metric" sample | - |
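
The division of the past hour and the per-part state rules can be sketched as follows. This is an illustrative sketch, not the DPOD implementation: the function names, the string state values, and the `device_available` and `thresholds_exceeded` flags are hypothetical. It also assumes that only the first part changes when the sample time range parameter is changed, which the text does not specify.

```python
from datetime import datetime, timedelta
from typing import List, Sequence, Tuple


def split_past_hour(now: datetime, sample_range_min: int = 5) -> List[Tuple[datetime, datetime]]:
    """Return (start, end) pairs for the 5 parts, newest part first."""
    # Last 5 minutes (configurable), previous 10 minutes, then 3 parts of 15 minutes.
    lengths = [sample_range_min, 10, 15, 15, 15]
    parts = []
    end = now
    for minutes in lengths:
        start = end - timedelta(minutes=minutes)
        parts.append((start, end))
        end = start
    return parts


def classify_part(
    sample_states: Sequence[str],
    is_latest_part: bool,
    device_available: bool = True,
    thresholds_exceeded: bool = False,
) -> str:
    """Map the metric sample states of one part to the state shown on the card.

    device_available (hypothetical flag): for the latest part, whether the last
    availability sample was OK; for other parts, whether the device was available
    at any point during the period.
    thresholds_exceeded (hypothetical flag): whether the warnings count or the
    warnings damage points exceeded the device thresholds.
    """
    if not sample_states or not device_available:
        return "Critical" if is_latest_part else "Unknown"
    if "Error" in sample_states:
        return "Error"
    if "Warning" in sample_states:
        return "Error" if thresholds_exceeded else "Warning"
    return "Good"


parts = split_past_hour(datetime(2024, 1, 1, 12, 0))
for start, end in parts:
    print(start.time(), "-", end.time())
```
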

Metric:

Device Health Settings:

  • For each device, the user can define whether the device is displayed in the System Health dashboard, as well as its Damage Points Threshold and Total Warnings Threshold.
  • For each device, the user can set thresholds and damage points per health metric (see device health settings).
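
How warnings might escalate to an Error through these two device-level thresholds can be sketched as below. The function name, the dictionary of damage points per metric, and all numeric values are hypothetical. The strict "greater than" comparison follows the wording "exceeded threshold" and is an assumption about the exact boundary behavior.

```python
from typing import Dict, Iterable


def warnings_escalate_to_error(
    warning_metrics: Iterable[str],
    damage_points_per_metric: Dict[str, int],
    damage_points_threshold: int,
    total_warnings_threshold: int,
) -> bool:
    """Return True if the warnings found for a device should be shown as an Error."""
    warnings = list(warning_metrics)
    total_warnings = len(warnings)
    # Sum the damage points of every metric that produced a warning.
    total_damage = sum(damage_points_per_metric.get(name, 0) for name in warnings)
    return (
        total_warnings > total_warnings_threshold
        or total_damage > damage_points_threshold
    )


# Illustrative settings only; real values come from the device health settings.
damage_points = {
    "Devices CPU Metric": 5,
    "Devices Memory Metric": 3,
    "Devices Fan Metric": 1,
}

escalated = warnings_escalate_to_error(
    ["Devices CPU Metric", "Devices Memory Metric"],
    damage_points,
    damage_points_threshold=6,
    total_warnings_threshold=3,
)
print("Escalate to Error:", escalated)
```

In this example the two warnings carry 8 damage points in total, which is above the hypothetical threshold of 6, so the device would be shown as Error even though the warnings count is below its threshold.
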

Device Group Settings:

System Parameters (DB) - default values for:

  • "System Health Dashboard Sample Time Range (min.)" - defaults to 5 minutes.

Device Card:

  • A single device card - includes:
    • Health states of the past hour divided into 5 parts, last X minutes and 4 parts of 15 minutes. The state of each part is determined as follows:
      • Good - if no Errors or Warnings are found in metric samples.
      • Warning - if a warning exists.
      • Error - if an error exists; or, if no errors are found but some warnings exist, if the total damage points for all metrics is bigger than the damage points the device can sustain, or the total number of warnings is bigger than the total warnings threshold of the device.
      • Unknown - if no metric samples are found, or if all "Device Availability" metric samples are Error.
      • Critical (the background color of the card is red) - if no metric samples are found, if all "Device Availability" metric samples are Error, or if the last "Device Availability" metric sample is Error.

  • Clicking a device card directs the user to the device health dashboard.
