eG Monitoring
 

Measures reported by HdpNodMgrStTest

The NodeManager is the per-machine slave that is responsible for launching the applications’ containers, monitoring their resource usage (CPU, memory, disk, network), and reporting that usage to the ResourceManager.

The NodeManager runs services to determine the health of the node it is executing on. These services perform checks on the disk as well as any user-specified tests. If any health check fails, the NodeManager marks the node as unhealthy.
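As an illustration of a user-specified test, the sketch below shows what a simple disk check could look like. It assumes the Apache Hadoop convention that a health script flags a node as unhealthy by printing a line that begins with ERROR. The function name `health_line` and the 90% threshold are hypothetical and chosen only for this example.

```python
import shutil


def health_line(used_bytes, total_bytes, max_used_fraction=0.9):
    """Return the line a health script would print for one disk.

    Assumption: a line starting with "ERROR" marks the node unhealthy;
    any other output leaves it healthy.
    """
    used_fraction = used_bytes / total_bytes
    if used_fraction > max_used_fraction:
        return f"ERROR disk is {used_fraction:.0%} full"
    return "OK"


def check_path(path="/"):
    """Check the disk that holds `path` and print the resulting line."""
    usage = shutil.disk_usage(path)
    line = health_line(usage.used, usage.total)
    print(line)
    return line
```

Calling `check_path("/")` from a script that the NodeManager is configured to run would print either `OK` or an `ERROR` line for that disk.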

The HdpNodMgrStTest test monitors the NodeManager on each slave node in a cluster, retrieves the results of the health checks the NodeManager performs on each node, and reports the current status of that node. This way, the test promptly alerts administrators to unhealthy nodes.

Outputs of the test: One set of results for each slave node in the cluster

The measures made by this test are as follows:

Measurement: Health_status
Description: Indicates the health status of this node as reported by the NodeManager.
Interpretation: The values that this measure reports and their corresponding numeric values are as follows:

Measure Value    Numeric Value
Unhealthy        0
Healthy          1

Note: By default, this measure reports the Measure Values listed above to indicate how healthy a node is. In the graph of this measure, however, only the numeric equivalents are used.

Measurement: Last_update_interval
Description: Indicates the time (in seconds) that has elapsed since the NodeManager on this node last communicated its health status to the ResourceManager.
Measurement Unit: Seconds
Interpretation: Ideally, the value of this measure should be low. A high value indicates that the NodeManager has not reported the node status to the ResourceManager for a long time. This is a cause for concern, as it implies that there could be operational issues with the NodeManager. A thorough investigation is warranted.
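The following is a minimal sketch of how the two measures could be derived from a node's health report. It is not necessarily how the test itself collects them. The field names `nodeHealthy` and `lastNodeUpdateTime` (milliseconds since the epoch) are assumptions modeled on the NodeManager REST API's node information resource, and `derive_measures` is a hypothetical helper.

```python
import time

# Numeric equivalents used in the graph of the Health_status measure.
HEALTH_VALUES = {"Unhealthy": 0, "Healthy": 1}


def derive_measures(node_info, now_ms=None):
    """Map a node health report to the two measures of this test.

    Assumption: node_info carries a boolean "nodeHealthy" and an epoch
    timestamp in milliseconds under "lastNodeUpdateTime".
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    label = "Healthy" if node_info["nodeHealthy"] else "Unhealthy"
    # Clamp at zero so clock skew never yields a negative interval.
    elapsed_s = max(0.0, (now_ms - node_info["lastNodeUpdateTime"]) / 1000.0)
    return {
        "Health_status": label,
        "Health_status_numeric": HEALTH_VALUES[label],
        "Last_update_interval": elapsed_s,
    }


# Example with made-up values: a node that last reported 45 seconds ago.
sample = {"nodeHealthy": False, "lastNodeUpdateTime": 1_000_000}
measures = derive_measures(sample, now_ms=1_045_000)
```

In this example `measures` reports Unhealthy (numeric value 0) with a Last_update_interval of 45 seconds. An alerting threshold for the interval is left out deliberately, since the appropriate limit depends on the cluster.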