The System Health dashboard provides an overview of the health state of all devices, based on metrics.
The dashboard is composed of multiple cards, optionally divided into groups, each displaying the health of a device.
In addition, the dashboard displays the DPOD health state.
Assumptions:
- The existing alerts infrastructure (with a few additional fields) will provide all the data and logic needed to decide whether a sample should raise an alert and whether it is considered an Error, a Warning, or Good.
- The alerts mechanism will be the only source of current and future metrics. New fields: whether the alert is used as a health metric, warning threshold, damage points
- Any health metric must be based on a detailed investigation screen
Prerequisites:
Metric:
- List of default metrics:
- The user can define whether a metric is part of the system health by selecting the "System Health Metric" option
- The metric settings can be edited from [Manage → Alert → Setup Alerts → Edit Alert]
Device Health Settings:
- For each device, the user can define whether the device is displayed in the System Health dashboard, as well as its Damage Points Threshold and Total Warnings Threshold
- For each device, the user can set thresholds and damage points per health metric. - TODO add link to device settings
Device Group Settings:
- For each device, the user can define the device group and the display order - TODO add link to device groups
System Parameters (DB) - default values for:
- "System Health Dashboard Sample Time Range (min.)" - defaults to 5 minutes.
Device Card description:
- Health states:
Icon | Description |
---|---|
Good | |
Warning | |
Error | |
No metrics samples | |
+ background color of card is red | Critical |
- A single card displays the health of a device. It includes:
  - A big icon indicating the device health, based on the metrics sampled in the last X minutes (X is the "System Health Dashboard Sample Time Range (min.)" system parameter)
    - If no metrics are found, the device is considered dead and is marked as Critical.
    - If the last sample of the "Device Availability" metric is Error, or all samples within the last X minutes are Error, the device is marked as Critical.
    - If one or more errors occurred within the last X minutes, the device health is marked as Error, regardless of the order in which the health states occurred.
    - If no errors are found but some warnings exist, then:
      - If the total damage points for all metrics exceed the device's Damage Points Threshold (for the sample time range), the device health is Error.
      - If the total number of warnings exceeds the device's Total Warnings Threshold, the device health is Error.
      - Otherwise, the device health is Warning.
    - If no errors or warnings are found, the device health is Good.
  - Small icons displaying the health over the past hour, using the same logic as the current health
- Clicking a device card should take the user to the Device Health dashboard
- Device Health Dashboard (drill-down)
  - A series of charts displays metric values per device, each metric with its own chart over time
    - Each chart is of the "Scatter" type, divided into 4 parts (each part represents 15 minutes)
    - Each point in the chart should be colored according to its state (green for Good, red for Error, etc.) and show its value in a tooltip
    - All points should overlap a little, so that the display is compact and looks like a thick line built of points
  - Clicking a metric chart takes the user to one of the product's existing analytics dashboards for further investigation
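The device health rules for the big card icon can be sketched in code. The following Python is a minimal illustration, not the product's implementation. The names `Sample`, `Health`, and `device_health` are hypothetical. It assumes that each warning sample contributes its metric's damage points to the total, and that "all samples are Error" refers to every sample of any metric inside the time range; the rules above leave both points open.

```python
from dataclasses import dataclass
from enum import Enum


class Health(Enum):
    GOOD = "Good"
    WARNING = "Warning"
    ERROR = "Error"
    CRITICAL = "Critical"


@dataclass(frozen=True)
class Sample:
    """One metric sample as classified by the alerts mechanism (hypothetical shape)."""
    metric: str             # metric (alert) name
    timestamp: float        # seconds since the epoch
    state: Health           # GOOD, WARNING or ERROR
    damage_points: int = 0  # assumed: points added to the total when state is WARNING


def device_health(samples, now, *, window_min=5,
                  damage_threshold, warnings_threshold):
    """Return the health of one device for the last `window_min` minutes."""
    start = now - window_min * 60
    recent = [s for s in samples if start <= s.timestamp <= now]

    # No metrics found: the device is considered dead.
    if not recent:
        return Health.CRITICAL

    # Last "Device Availability" sample is Error, or every sample is Error.
    availability = [s for s in recent if s.metric == "Device Availability"]
    if availability:
        last = max(availability, key=lambda s: s.timestamp)
        if last.state is Health.ERROR:
            return Health.CRITICAL
    if all(s.state is Health.ERROR for s in recent):
        return Health.CRITICAL

    # One or more errors, regardless of their order in time.
    if any(s.state is Health.ERROR for s in recent):
        return Health.ERROR

    # No errors: warnings escalate to Error when a device threshold is exceeded.
    warnings = [s for s in recent if s.state is Health.WARNING]
    if warnings:
        if sum(s.damage_points for s in warnings) > damage_threshold:
            return Health.ERROR
        if len(warnings) > warnings_threshold:
            return Health.ERROR
        return Health.WARNING

    return Health.GOOD
```

Under the same assumptions, the small past-hour icons could be produced by calling `device_health` repeatedly with earlier values of `now`, since they use the same logic as the current health.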