The System Health dashboard provides an overview of the health state of all devices, based on user-defined metrics.
The dashboard is composed of multiple cards, each representing the health of a device, optionally organized into device groups.
In addition, the dashboard displays the DPOD health state.
Metrics
- The health of each device is based on several user-defined metrics.
- A metric is a criterion and a set of thresholds that determine whether the state of that criterion is good or bad.
- Metrics are based on the Alerts feature of DPOD. The user can define which alerts are part of the system health.
- The following System Health metrics are defined by default:
- Devices CPU Metric
- Devices Memory Metric
- Devices Load Metric
- Devices Fan Metric
- Devices Temperature Metric
- Devices Voltage Metric
- Devices Space Encrypted Metric
- Devices Space Temp Metric
- Devices Space Internal Metric
- System Errors Metric
- The user can define whether a metric is part of the system health by selecting the "System Health Metric" option.
- The user may edit the metric settings from [Manage → Alert → Setup Alerts → Edit Alert].
- In addition, there is an internal metric named "Device Availability" that is not editable. Its purpose is to sample device availability.
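The idea of a metric as a criterion plus thresholds can be illustrated with a minimal sketch. The class and field names below (`HealthMetric`, `warning_threshold`, `error_threshold`) are hypothetical and not taken from DPOD, and the sketch assumes that higher values are worse, which fits metrics such as CPU or memory but not necessarily all of them.

```python
from dataclasses import dataclass
from enum import Enum


class HealthState(Enum):
    GOOD = "Good"
    WARNING = "Warning"
    ERROR = "Error"


@dataclass(frozen=True)
class HealthMetric:
    # Hypothetical field names; they only mirror the concepts in the text.
    name: str
    warning_threshold: float
    error_threshold: float
    is_system_health_metric: bool = True

    def classify(self, value: float) -> HealthState:
        # Assumption: higher values are worse (e.g. CPU or memory usage).
        if value >= self.error_threshold:
            return HealthState.ERROR
        if value >= self.warning_threshold:
            return HealthState.WARNING
        return HealthState.GOOD


# Illustrative thresholds, not DPOD defaults.
cpu = HealthMetric("Devices CPU Metric", warning_threshold=70.0, error_threshold=90.0)
```

With these illustrative thresholds, `cpu.classify(75.0)` yields `HealthState.WARNING`.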
Assumptions:
- The existing alerts infrastructure (with a few additional fields) will provide all the data and logic needed to decide whether a sample should raise an alert and whether it is considered an error, a warning, or good.
- The alerts mechanism will be the only source of current and future metrics. New fields: whether the alert is used as a health metric, warning threshold, and damage points.
- Any health metric must be based on a detailed investigation screen.
Prerequisites:
Metric:
Device Health Settings:
- For each device, the user can define whether the device is displayed in the System Health dashboard, as well as its Damage Points Threshold and Total Warnings Threshold.
- For each device, the user can set thresholds and damage points per health metric (see device health settings).
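The text does not spell out how damage points and warning counts combine into a device state, so the following is only one plausible reading, stated as an assumption: damage points of metrics in Error are summed and compared with the Damage Points Threshold, and the number of metrics in Warning is compared with the Total Warnings Threshold. All names (`DeviceHealthSettings`, `device_state`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceHealthSettings:
    # Hypothetical field names modeled on the settings listed above.
    show_in_dashboard: bool
    damage_points_threshold: int
    total_warnings_threshold: int


def device_state(settings, metric_results):
    """Combine per-metric results into a single device state.

    metric_results is a list of (state, damage_points) tuples, one per metric.
    Assumed rule (not confirmed by the source):
      - "Error" when the damage points of metrics in Error reach the threshold
      - "Warning" when the count of metrics in Warning reaches the threshold
      - "Good" otherwise
    """
    damage = sum(points for state, points in metric_results if state == "Error")
    warnings = sum(1 for state, _ in metric_results if state == "Warning")
    if damage >= settings.damage_points_threshold:
        return "Error"
    if warnings >= settings.total_warnings_threshold:
        return "Warning"
    return "Good"
```

For example, with a Damage Points Threshold of 10, two metrics in Error worth 6 and 5 points would mark the device as Error under this rule.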
Device Group Settings:
- For each device, the user can define the device group and the display order (see device group settings).
System Parameters (DB) - default values for:
- "System Health Dashboard Sample Time Range (min.)" - defaults to 5 minutes.
Device Card:
- A single device card includes:
- Health states of the past hour, divided into 5 parts: the last X minutes and 4 parts of 15 minutes each:
| Icon | Description | Last X minutes (System Parameter) | 4 parts of 15 minutes |
|---|---|---|---|
| | Good | | |
| | Warning | | |
| | Error | | |
| | No metrics samples | - | |
| + background color of card is red | Critical | | - |
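The five time parts shown on a card can be sketched as follows. This is a minimal illustration, assuming that the four 15-minute parts cover the past hour and that the "last X minutes" window overlaps the most recent of them. The function name `card_windows` and its parameter names are hypothetical; the default of 5 comes from the "System Health Dashboard Sample Time Range (min.)" parameter.

```python
from datetime import datetime, timedelta


def card_windows(now, sample_range_minutes=5):
    """Return five (label, start, end) tuples for a device card.

    The first four entries are the 15-minute parts of the past hour,
    oldest first. The last entry is the most recent X minutes.
    """
    windows = []
    for i in range(4, 0, -1):
        start = now - timedelta(minutes=15 * i)
        end = now - timedelta(minutes=15 * (i - 1))
        windows.append((f"15-minute part {5 - i}", start, end))
    recent_start = now - timedelta(minutes=sample_range_minutes)
    windows.append((f"last {sample_range_minutes} minutes", recent_start, now))
    return windows


if __name__ == "__main__":
    for label, start, end in card_windows(datetime(2024, 1, 1, 12, 0)):
        print(label, start.time(), end.time())
```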
- Clicking a device card should open the device health dashboard.
- Device Health Dashboard (drill-down)
- A series of charts to display metric values per device - each metric with its own chart over time
- Each chart is of "Scatter" type, divided into 4 parts (each part represents 15 minutes).
- Each point in the chart should be colored according to its state (green for Good, red for Error, etc.) and show its value in the point's tooltip.
- All points should overlap slightly so that the display is compact and looks like a thick line built of points.
- Clicking a metric chart takes the user to further investigation in one of the product's existing analytics dashboards.
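As a rough sketch of how samples might be prepared for such a scatter chart, the function below assigns each sample to one of the four 15-minute parts and attaches a color and tooltip. The names (`to_chart_points`, `STATE_COLORS`) and the point dictionary layout are hypothetical and not tied to any charting library. Only green for Good and red for Error come from the text; the other colors are assumptions.

```python
from datetime import datetime, timedelta

# Green and red come from the text; the Warning color is an assumption.
STATE_COLORS = {"Good": "green", "Warning": "orange", "Error": "red"}
FALLBACK_COLOR = "gray"  # assumed color for any other state


def to_chart_points(samples, chart_end):
    """Convert (timestamp, value, state) samples into chart points.

    Only samples from the hour before chart_end are kept. Each point
    carries the index (0 to 3) of the 15-minute part it falls into.
    """
    chart_start = chart_end - timedelta(hours=1)
    points = []
    for ts, value, state in samples:
        if not (chart_start <= ts < chart_end):
            continue  # outside the displayed hour
        part = int((ts - chart_start) / timedelta(minutes=15))
        points.append({
            "x": ts,
            "y": value,
            "part": part,
            "color": STATE_COLORS.get(state, FALLBACK_COLOR),
            "tooltip": str(value),
        })
    return points
```

Keeping the part index on each point lets a renderer draw the four 15-minute separators without recomputing time boundaries.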