The System Health dashboard provides an overview of the health state of all devices, based on user-defined metrics.

The dashboard is composed of multiple cards, each representing the health of a device, optionally divided into device groups.

In addition, the dashboard displays the DPOD health state based on Internal Health Alerts.

Assumptions:

The existing alerts infrastructure (with a few additional fields) will provide all the data and logic needed to decide whether a sample should trigger an alert and whether it is considered Error, Warning, or Good.
The alerts mechanism will be the only source of current and future metrics. New fields: whether the alert is used as a health metric, warning threshold, damage points.
Any health metric must be based on a detailed investigation screen.

Prerequisites:

Metric:

...

Metrics

The health of each device is based on several user-defined metrics. For example, the CPU of the device.
A metric is a criterion and a set of thresholds that together define whether the state of the device in that aspect is one of: Good / Warning / Error.
Metrics are based on the Alerts subsystem of DPOD:
- The user can define which alerts are part of the System Health by selecting the "System Health Metric" option under [Manage → Alert → Setup Alerts → Edit Alert].
- Each alert can be used as a simple alert, as a System Health metric, or both. Using an alert both for alerting and as a System Health metric is recommended, since it makes sure the System Health dashboard will precisely reflect the sent alerts.
The following System Health metrics are defined by default:
- Devices CPU Metric
- Devices Memory Metric
- Devices Load Metric
- Devices Fan Metric
- Devices Temperature Metric
- Devices Voltage Metric
- Devices Space Encrypted Metric
- Devices Space Temp Metric
- Devices Space Internal Metric
- System Errors Metric
In addition, there is an internal metric called "Device Availability", which is not editable. It is based on the "Device Resources Monitoring" option selected at the device level, and it samples whether the device is available or not.
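
To illustrate how a metric's thresholds could turn a sample value into one of the three states, here is a minimal sketch. It is not DPOD's implementation: the `Metric` class, the "at or above" comparison, and the threshold values 70 and 90 are assumptions made only for this example.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    GOOD = "Good"
    WARNING = "Warning"
    ERROR = "Error"


@dataclass
class Metric:
    """Hypothetical model of a System Health metric (not DPOD's schema)."""
    name: str
    warning_threshold: float  # assumed: values at or above this are Warning
    error_threshold: float    # assumed: values at or above this are Error
    damage_points: int = 0    # severity weight used while the state is Warning


def classify(metric: Metric, value: float) -> State:
    """Map one sample value to a state using the metric's thresholds."""
    if value >= metric.error_threshold:
        return State.ERROR
    if value >= metric.warning_threshold:
        return State.WARNING
    return State.GOOD


# Illustrative thresholds only; real values are defined per alert.
cpu = Metric("Devices CPU Metric", warning_threshold=70, error_threshold=90, damage_points=2)
states = [classify(cpu, value).value for value in (35, 75, 95)]
print(states)  # ['Good', 'Warning', 'Error']
```

Metrics where a lower value is worse (for example, free space) would need the comparison reversed, which this sketch does not cover.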

Device Health

...

For each device, the user can define whether the device is displayed in the System Health dashboard, as well as its Damage Points Threshold and Total Warnings Threshold.
For each device, the user can set thresholds and damage points per health metric (see Device Health Settings).

Device Group Settings:

For each device, the user can define the device group and the display order (see Device Group Settings).

System Parameters (DB) - default values for:

Calculation

The System Health dashboard calculates the health of each device in the past hour.
The past hour is divided into 5 parts:
- Last 5 minutes (may be configured via "System Health Dashboard Sample Time Range (min.)")
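
The split of the past hour can be sketched as follows, using the part lengths this page lists (the last 5 minutes, the previous 10 minutes, and 3 parts of 15 minutes). The function name `hour_parts` and the fixed example timestamp are assumptions for illustration. How the remaining parts shift when the sample time range is reconfigured is not specified here, so the lengths are passed in explicitly.

```python
from datetime import datetime, timedelta


def hour_parts(now, lengths=(5, 10, 15, 15, 15)):
    """Split the period ending at `now` into consecutive parts, newest first.

    `lengths` holds the duration of each part in minutes.
    Returns a list of (start, end) tuples.
    """
    parts = []
    end = now
    for minutes in lengths:
        start = end - timedelta(minutes=minutes)
        parts.append((start, end))
        end = start  # the next (older) part ends where this one starts
    return parts


now = datetime(2024, 1, 1, 12, 0)  # arbitrary example time
parts = hour_parts(now)
for start, end in parts:
    print(f"{start:%H:%M} - {end:%H:%M}")
```

With the default lengths, the parts add up to exactly 60 minutes, so the oldest part starts one hour before `now`.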

...

Device Card:

A single device card includes:
- Health states of the past hour divided into 5 parts (the last X minutes and 4 parts of 15 minutes):

...

Icon	Displayed when
(image removed)	No errors or warnings are found in the metric samples
(image removed)	A warning exists
(image removed)	An error exists, OR no errors are found but warnings exist and either the total damage points of all metrics exceed the damage points the device can sustain or the number of warnings exceeds the number of warnings allowed for the device
(image removed)	No metric samples are found OR all "Device Availability" metric samples are Error
(image removed) + red card background	No metric samples are found OR all "Device Availability" metric samples are Error OR the last "Device Availability" metric sample is Error

Clicking a device card should open the Device Health dashboard.

Device Health Dashboard (drill-down)
- A series of charts to display metric values per device - each metric with its own chart over time
- Each chart is of "Scatter" type, divided into 4 parts (each part represents 15 minutes)
- Each point in the chart should be displayed in the right color (green for Good, red for Error, etc.) and show its value in the tooltip of the point
- All points should overlap a little so the display is compact and looks like a thick line built of points
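
In addition to the last 5 minutes, the past hour shown on each device card is divided into the following parts:
- Previous 10 minutes
- 3 parts of 15 minutes (the rest of the hour)
Each part displays a single icon with the health of the device during that period of time: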

Icon	Description	Last 5 minutes	Other parts
(icon)	Good	No errors or warnings found in metric samples	(same)
(icon)	Warning	Warnings found in metric samples	(same)
(icon)	Error	Errors found in metric samples OR Warnings count exceeded threshold (see below) OR Warnings damage points exceeded threshold (see below)	(same)
(icon)	Unknown	-	No metric samples found (e.g. DPOD alert subsystem was down) OR The device was unavailable during the entire time period
(icon) + red background color	Critical	No metric samples found (e.g. DPOD alert subsystem was down) OR The device was unavailable in the last "Device Availability Metric" sample	-

Each device may have a total warnings threshold which sets the health of that device to Error in case the number of metrics that are at Warning state exceeds that threshold in a specific time period (see Device Health Settings).
Each device may also have a warning damage points threshold which sets the health of that device to Error in case the sum of the damage points of all metrics that are at Warning state exceeds that threshold in a specific time period (see Device Health Settings).
- Each System Health metric may be assigned damage points, which should reflect the severity of that warning.
For each device, the user can set thresholds and damage points per health metric, which override the default thresholds and damage points defined at the System Health metric level (see Device Health Settings).
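
The rules above for a single time part can be sketched roughly as follows. This is an illustration under stated assumptions, not DPOD's code. The function `part_health`, its parameters, and the sample tuples are hypothetical. The order in which the conditions are checked (availability first, then errors, then warning thresholds) is an assumption. Each metric is counted once even if several of its samples are at Warning state.

```python
from enum import Enum


class Health(Enum):
    GOOD = "Good"
    WARNING = "Warning"
    ERROR = "Error"
    UNKNOWN = "Unknown"
    CRITICAL = "Critical"


def part_health(samples, latest, unavailable=False,
                max_warnings=None, max_damage=None):
    """Derive the health of one device for one time part.

    samples: list of (metric_name, state, damage_points) tuples, where state
        is "Good", "Warning" or "Error".
    latest: True for the most recent part (the last 5 minutes).
    unavailable: for the latest part, the last availability sample failed;
        for other parts, the device was unavailable for the whole part.
    max_warnings / max_damage: optional per-device thresholds.
    """
    if not samples or unavailable:
        return Health.CRITICAL if latest else Health.UNKNOWN
    if any(state == "Error" for _, state, _ in samples):
        return Health.ERROR
    # Collect the damage points of every metric that is at Warning state.
    warned = {name: points for name, state, points in samples
              if state == "Warning"}
    if max_warnings is not None and len(warned) > max_warnings:
        return Health.ERROR
    if max_damage is not None and sum(warned.values()) > max_damage:
        return Health.ERROR
    return Health.WARNING if warned else Health.GOOD


samples = [
    ("Devices CPU Metric", "Warning", 3),
    ("Devices Memory Metric", "Warning", 4),
    ("Devices Load Metric", "Good", 1),
]
print(part_health(samples, latest=True).value)                  # Warning
print(part_health(samples, latest=True, max_damage=5).value)    # Error (3 + 4 > 5)
print(part_health(samples, latest=True, max_warnings=2).value)  # Warning (2 is not above 2)
print(part_health([], latest=True).value)                       # Critical
print(part_health([], latest=False).value)                      # Unknown
```

Per-device overrides of thresholds and damage points could be applied before calling such a function, by replacing the metric-level defaults with the device-level values.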

Devices Display Options

For each device, the user can define whether the device is displayed in the System Health dashboard.
The user may define device groups:
- Each device group has a name and a display order of that group
- Devices are assigned to one or more device groups with a defined display order
- For example: Production, Non-production
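
As a rough sketch of how these display options could drive the card layout, the example below orders groups and the devices inside them, and skips devices that are set not to be displayed. The `DeviceGroup` class, the `layout` function, and the device names are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DeviceGroup:
    """Hypothetical model of a device group (not DPOD's schema)."""
    name: str
    display_order: int
    # Each entry is (device name, display order within this group).
    devices: list = field(default_factory=list)


def layout(groups, hidden=frozenset()):
    """Return [(group name, [device names])] in display order.

    Devices listed in `hidden` are left out of every group.
    """
    result = []
    for group in sorted(groups, key=lambda g: g.display_order):
        ordered = sorted(group.devices, key=lambda d: d[1])
        names = [name for name, _ in ordered if name not in hidden]
        result.append((group.name, names))
    return result


groups = [
    DeviceGroup("Non-production", 2, [("dev-gw", 1), ("test-gw", 2)]),
    DeviceGroup("Production", 1, [("prod-gw-2", 2), ("prod-gw-1", 1), ("test-gw", 3)]),
]
cards = layout(groups, hidden={"dev-gw"})
print(cards)
```

Note that `test-gw` appears in both groups, which matches the statement that a device may be assigned to more than one device group.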

Device Health Dashboard

Clicking a device card in the System Health dashboard opens the Device Health dashboard, which displays a detailed view of the health of a specific device.
This dashboard displays all metrics that were part of the device health calculation.
For each metric, all samples are displayed in a chart, which displays the value of each sample when hovering over it.
To analyze a System Health metric, the user may click "Analyze" when hovering over the metric values. This will take the user to the appropriate dashboard for analyzing the values.
- Each System Health metric may be assigned a drill-down dashboard.

Versions Compared

Old Version 7

New Version Current
