The System Health dashboard provides an overview of the health state of all devices, based on user-defined metrics.
The dashboard is composed of multiple cards, each representing the health of a device, optionally divided into device groups.
In addition, the dashboard displays the DPOD health state based on Internal Health Alerts.
Metrics
- The health of each device is based on several user-defined metrics, for example, the CPU of the device.
- A metric is a criterion and a set of thresholds that together define whether the state of the device in that aspect is one of: Good / Warning / Error.
- Metrics are based on the Alerts subsystem of DPOD:
- The user can define which alerts are part of the System Health by selecting the "System Health Metric" option under [Manage → Alert → Setup Alerts → Edit Alert].
- Each alert can be used as a simple alert, as a System Health metric, or both. Using an alert both for alerting and as a System Health metric is recommended, because it ensures that the System Health dashboard precisely reflects the alerts that were sent.
- The following System Health metrics are defined by default:
- Devices CPU Metric
- Devices Memory Metric
- Devices Load Metric
- Devices Fan Metric
- Devices Temperature Metric
- Devices Voltage Metric
- Devices Space Encrypted Metric
- Devices Space Temp Metric
- Devices Space Internal Metric
- System Errors Metric
- Device Availability Metric - This is an internal metric, based on the "Device Resources Monitoring" option selected at the device level, that checks whether the device is available.
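
As an illustration of how a metric combines a criterion with thresholds, the following minimal sketch classifies a single sample as Good, Warning, or Error. It is not DPOD code: the `Metric` class, its field names, the "greater than or equal" comparison, and the threshold values 70 and 90 are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    GOOD = "Good"
    WARNING = "Warning"
    ERROR = "Error"


@dataclass(frozen=True)
class Metric:
    """Hypothetical model of a System Health metric (not a DPOD API)."""
    name: str
    warning_threshold: float  # assumed: values at or above this are Warning
    error_threshold: float    # assumed: values at or above this are Error

    def classify(self, value: float) -> State:
        # Check the more severe threshold first.
        if value >= self.error_threshold:
            return State.ERROR
        if value >= self.warning_threshold:
            return State.WARNING
        return State.GOOD


# Threshold values are illustrative only.
cpu = Metric("Devices CPU Metric", warning_threshold=70.0, error_threshold=90.0)
```

With these assumed thresholds, `cpu.classify(75.0)` yields `State.WARNING`.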
Prerequisites:
Device Health Calculation
- The System Health dashboard calculates the health of each device in the past hour.
- The past hour is divided into 5 parts:
- Last 5 minutes (may be configured via "System Health Dashboard Sample Time Range (min.)" System Parameter)
- Previous 10 minutes
- 3 parts of 15 minutes (the rest of the hour)
- Each part displays a single icon with the health of the device during that period of time:
Icon | Description | Last 5 minutes | Other parts |
---|---|---|---|
(icon) | Good | No errors or warnings found in metric samples | (same) |
(icon) | Warning | Warnings found in metric samples | (same) |
(icon) | Error | Errors found in metric samples OR | (same) |
(icon) | Unknown | - | No metric samples found (e.g. DPOD alert subsystem was down) OR |
(icon) + red background color | Critical | No metric samples found (e.g. DPOD alert subsystem was down) OR | - |
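
The division of the past hour into parts can be sketched as follows. This is only an illustration: the function name `health_windows` and its parameters are hypothetical, and the sketch assumes that changing the sample time range affects only the length of the most recent part.

```python
from datetime import datetime, timedelta


def health_windows(now, sample_minutes=5):
    """Return (start, end) tuples for the 5 parts, most recent first.

    sample_minutes stands in for the "System Health Dashboard Sample
    Time Range (min.)" parameter; the other lengths are fixed here
    (an assumption of this sketch).
    """
    durations = [sample_minutes, 10, 15, 15, 15]
    windows = []
    end = now
    for minutes in durations:
        start = end - timedelta(minutes=minutes)
        windows.append((start, end))
        end = start  # the next (older) part ends where this one starts
    return windows


parts = health_windows(datetime(2024, 1, 1, 12, 0))
```

With the default of 5 minutes, the five parts cover exactly the 60 minutes before `now`.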
Metric:
Device Health Settings:
- For each device, the user can define whether the device is displayed in the System Health dashboard, as well as its Damage Points Threshold and Total Warnings Threshold.
- For each device, the user can set thresholds and damage points per health metric (see device health settings).
Device Group Settings:
- For each device, the user can define the device group and the display order (see device group settings).
System Parameters (DB) - default values for:
- "System Health Dashboard Sample Time Range (min.)" - defaults to 5 minutes.
Device Card:
- A single device card - includes:
- Health states of the past hour, divided into 5 parts (last X minutes and 4 parts of 15 minutes):
...
- minutes
...
...
- If no Errors or Warnings found in metric samples
...
- If no Errors or Warnings found in metric samples
...
...
- If warning exists
...
- If warning exists
...
...
...
- If error exists
- If no errors are found but some warnings exist, and one of the following holds:
- The total damage points for all metrics exceed the damage points the device can sustain (Damage Points Threshold).
- The total number of warnings exceeds the number of warnings allowed for the device (Total Warnings Threshold).
...
...
- If no metric samples found
- If all "Device Availability" metric samples are Error
...
+ background color of card is red
...
- If no metric samples found
- If all "Device Availability" metric samples are Error
- If the "Device Availability" metric last sample is Error
...
- Clicking a device card opens the device health dashboard.
...
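
The escalation rules for a single part of the device card can be sketched as below. This is a simplified illustration under stated assumptions, not the DPOD implementation: the function `part_state` and its arguments are hypothetical, each sample is assumed to carry its own damage points, and the Critical state and the "Device Availability" rules are left out. The sketch models a part other than the most recent one, where missing samples are shown as Unknown.

```python
def part_state(samples, damage_points_threshold, total_warnings_threshold):
    """Compute the state of one time part.

    samples: list of (state, damage_points) tuples, where state is
    "Good", "Warning", or "Error" (a hypothetical representation).
    """
    if not samples:
        return "Unknown"  # no metric samples found
    if any(state == "Error" for state, _ in samples):
        return "Error"    # an error exists
    warning_points = [points for state, points in samples if state == "Warning"]
    if not warning_points:
        return "Good"     # no errors or warnings found
    # No errors, but warnings exist: escalate to Error when the device
    # exceeds either of its two thresholds.
    if sum(warning_points) > damage_points_threshold:
        return "Error"
    if len(warning_points) > total_warnings_threshold:
        return "Error"
    return "Warning"
```

For example, two warnings of 4 damage points each on a device with a Damage Points Threshold of 5 are escalated to Error, while a single warning of 2 points stays a Warning.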