eG Monitoring
 

Measures reported by HdpNodMgrTest

The Hadoop YARN NodeManager is the per-machine (per-node) framework agent that is responsible for containers, monitors their resource usage, and reports that usage to the ResourceManager. The NodeManager also tracks the health of the node on which it is running and controls the auxiliary services that different YARN applications may use at any point in time. It can execute any computation that the ApplicationMaster requires by creating a container for each task.

The NodeManager runs services to determine the health of the node it is executing on. These services perform checks on the disk as well as any user-specified tests. If any health check fails, the NodeManager marks the node as unhealthy and communicates this to the ResourceManager, which then stops assigning containers to the node. The node status is communicated as part of the heartbeat between the NodeManager and the ResourceManager. Based on the status reports received from the NodeManagers, the ResourceManager schedules jobs and allocates resources to the nodes.
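
The health-check flow described above can be illustrated with a minimal Python sketch. It is not the NodeManager's implementation: the function names, the check names, and the heartbeat fields are all hypothetical. The one convention it borrows, to the best of our understanding of the Hadoop health-script mechanism, is that a check reports failure by printing a line that begins with ERROR.

```python
from typing import Dict, List


def failed_checks(check_outputs: Dict[str, str]) -> List[str]:
    """Return the names of checks whose output has a line starting with ERROR."""
    failed = []
    for name, output in check_outputs.items():
        if any(line.startswith("ERROR") for line in output.splitlines()):
            failed.append(name)
    return sorted(failed)


def build_heartbeat(node_id: str, check_outputs: Dict[str, str]) -> Dict[str, object]:
    """Build an illustrative (hypothetical) heartbeat payload with the node's health."""
    failed = failed_checks(check_outputs)
    return {
        "node_id": node_id,
        # The node is healthy only if no check failed.
        "healthy": not failed,
        "failed_checks": failed,
    }


# Hypothetical node with a passing disk check and a failing user-specified test.
heartbeat = build_heartbeat(
    "worker-01",
    {"disk": "OK", "custom_script": "ERROR: /data3 is read-only"},
)
print(heartbeat)
```

A single failing check is enough to flag the node, which mirrors the "if any health check fails" rule in the text.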

Administrators need to be able to quickly spot unhealthy NodeManagers, so they can dig deep and figure out which health check failed and why. Administrators also need to ensure that the communication between the NodeManager and ResourceManager is alive at all times, as a break or delay in transmission of heartbeats can severely impair the ResourceManager's operations. This is where the HdpNodMgrTest test helps!

This test monitors the NodeManagers running in a cluster and reports the count of managers in different states. The administrator is notified if even one manager is unhealthy, inactive, or incommunicado. The test further reveals the count of managers that have been and/or are being decommissioned, so that administrators can keep track of the progress of a cluster down-scaling exercise that they may have triggered.
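
As a rough illustration of the counting this test performs, the sketch below tallies NodeManagers by state and maps each state to one of this test's measure names. How the eG agent actually collects the states is not described here, so the input list, the state-to-measure mapping, and the choice to count every listed node towards the total are assumptions. The state strings follow the node states that YARN reports (RUNNING, UNHEALTHY, LOST, and so on).

```python
from collections import Counter
from typing import Dict, Iterable

# Assumed mapping from YARN node states to this test's measure names.
STATE_TO_MEASURE = {
    "RUNNING": "Active_node_mgr",
    "UNHEALTHY": "Unhealthy_node_mgr",
    "LOST": "Lost_node_mgr",
    "REBOOTED": "Rebooted_node_mgr",
    "DECOMMISSIONING": "Decomissioning_node_mgr",
    "DECOMMISSIONED": "Decommisioned_node_mgr",
}


def count_node_managers(states: Iterable[str]) -> Dict[str, int]:
    """Count NodeManagers per measure; states without a mapping only add to the total."""
    states = list(states)
    measures = {name: 0 for name in STATE_TO_MEASURE.values()}
    for state, count in Counter(states).items():
        measure = STATE_TO_MEASURE.get(state)
        if measure is not None:
            measures[measure] += count
    # Assumption: the total covers every node in the input, whatever its state.
    measures["Total_node_mgr"] = len(states)
    return measures


# Hypothetical cluster of five NodeManagers.
sample = ["RUNNING", "RUNNING", "UNHEALTHY", "LOST", "DECOMMISSIONED"]
measures = count_node_managers(sample)
print(measures)
```

With counts in this shape, an alerting rule such as "notify if Unhealthy_node_mgr or Lost_node_mgr is greater than 0" becomes a simple comparison.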

Outputs of the test: One set of results for the Hadoop cluster being monitored

The measures made by this test are as follows:

| Measurement | Description | Measurement Unit | Interpretation |
|---|---|---|---|
| Total_node_mgr | Indicates the total number of NodeManagers in the cluster. | Number | |
| Active_node_mgr | Indicates the number of NodeManagers in the cluster that are currently active. | Number | Ideally, this value should be close to the value of the Total node managers measure. |
| Unhealthy_node_mgr | Indicates the number of NodeManagers for which one or more health checks failed. | Number | Ideally, the value of this measure should be 0. A non-zero value implies that one or more nodes in the cluster are unhealthy, so the ResourceManager stops assigning containers to them. To keep the cluster's capacity intact, you may have to identify the node that is unhealthy and diagnose the reason for its poor health. Use the HdpNodMgrTest test to identify the unhealthy node. |
| Lost_node_mgr | Indicates the number of NodeManagers in the cluster that are currently lost. | Number | If the NodeManager on a node has not sent heartbeats to the ResourceManager for longer than a configured period of time, that node/NodeManager is considered lost. Ideally, the value of this measure should be 0. |
| Rebooted_node_mgr | Indicates the number of NodeManagers that were rebooted. | Number | |
| Decomissioning_node_mgr | Indicates the number of NodeManagers that are being decommissioned. | Number | This refers to NodeManagers for which decommissioning is in progress. Typically, lost nodes are decommissioned. Decommissioning is also performed as part of a regular cluster down-sizing procedure. |
| Decommisioned_node_mgr | Indicates the number of NodeManagers that have been decommissioned. | Number | |
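
The "lost" condition described for the Lost_node_mgr measure (no heartbeat within a configured period) can be sketched as a timeout comparison. This is only an illustration: the function name and the sample timestamps are hypothetical, and the 600000 ms default is an assumption based on our understanding of the YARN property `yarn.nm.liveness-monitor.expiry-interval-ms`. Check your cluster's configuration for the value actually in effect.

```python
from typing import Dict, List

# Assumed expiry interval in milliseconds (10 minutes); verify against your cluster.
EXPIRY_INTERVAL_MS = 600_000


def lost_node_managers(
    last_heartbeat_ms: Dict[str, int],
    now_ms: int,
    expiry_ms: int = EXPIRY_INTERVAL_MS,
) -> List[str]:
    """Return nodes whose last heartbeat is older than the expiry interval."""
    return sorted(
        node
        for node, last_seen in last_heartbeat_ms.items()
        if now_ms - last_seen > expiry_ms
    )


# Hypothetical last-heartbeat timestamps, in milliseconds.
now_ms = 1_000_000
last_seen = {
    "worker-01": 990_000,  # 10 seconds ago
    "worker-02": 300_000,  # 700 seconds ago, beyond the interval
    "worker-03": 400_000,  # exactly 600 seconds ago, not yet beyond it
}
lost = lost_node_managers(last_seen, now_ms)
print(lost)  # ['worker-02']
```

The comparison is strict, so a node whose last heartbeat is exactly at the interval boundary is not yet counted as lost in this sketch.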