eG Monitoring
 

Measures reported by HdpDatNodHrtbtTest

A 'heartbeat' is a signal that a DataNode periodically sends to the NameNode to indicate that it is alive. If the NameNode stops receiving this signal, it is understood that there are health issues or technical problems with the DataNode or the TaskTracker.

The default heartbeat interval is 3 seconds. If the NameNode does not receive any heartbeats from a DataNode for roughly 10 minutes, a 'Heartbeat Lost' condition occurs and the corresponding DataNode is deemed dead/unavailable.
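
The timeout arithmetic can be sketched as follows. This is a minimal illustration, assuming the stock HDFS settings `dfs.heartbeat.interval` (3 seconds) and `dfs.namenode.heartbeat.recheck-interval` (300000 milliseconds) and the formula HDFS uses to combine them. Neither the property names nor the formula appear on this page, and the function name is hypothetical. With these defaults, the result is slightly over the 10 minutes quoted above.

```python
# Sketch: how stock HDFS derives the dead-DataNode timeout from two settings.
# The property names and defaults are standard HDFS configuration values,
# not something stated by this test's documentation.

def heartbeat_expire_interval_ms(heartbeat_interval_s: int = 3,
                                 recheck_interval_ms: int = 300_000) -> int:
    """Milliseconds of silence after which the NameNode marks a DataNode dead."""
    # 2 * dfs.namenode.heartbeat.recheck-interval + 10 * dfs.heartbeat.interval
    return 2 * recheck_interval_ms + 10 * 1000 * heartbeat_interval_s


if __name__ == "__main__":
    timeout_ms = heartbeat_expire_interval_ms()
    print(f"{timeout_ms / 60000:.1f} minutes")  # 10.5 minutes with the defaults
```

If either setting has been changed on the cluster, the effective timeout changes accordingly, so the window available for reacting to missed heartbeats should be recomputed from the cluster's own values.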

To avoid the loss of heartbeats and the consequent failure of a DataNode, administrators must keep a close watch on the heartbeats sent by each DataNode to the NameNode, detect issues in the transmission of heartbeats, and clear the bottlenecks well before the configured timeout period expires and the DataNode is declared dead. This can be achieved using the HdpDatNodHrtbtTest test.

This test monitors the heartbeats that each DataNode sends to the NameNode. In the process, the test reports the count of heartbeats that every DataNode sends during a measure period, the rate at which the heartbeats were sent, and the average time taken for the transmission. Alerts are promptly sent out if a DataNode does not send out any heartbeat or takes too much time to do so. This way, administrators can proactively detect problems in heartbeat communication and can resolve them before DataNodes die.
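
How the test collects its data is not described here. As an illustration only, the count and rate for a measure period could be derived from two readings of a cumulative heartbeat counter, as in the sketch below. The `Sample` type, its fields, and the function name are hypothetical and do not reflect the test's actual implementation.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Sample:
    """One reading of a DataNode's cumulative heartbeat counter (hypothetical)."""
    timestamp_s: float      # when the reading was taken, in seconds
    heartbeats_total: int   # heartbeats sent since the DataNode started


def heartbeat_measures(prev: Sample, curr: Sample) -> Tuple[int, float]:
    """Return (heartbeats sent in the period, heartbeats per second)."""
    elapsed_s = curr.timestamp_s - prev.timestamp_s
    if elapsed_s <= 0:
        raise ValueError("samples must be in increasing time order")
    sent = curr.heartbeats_total - prev.heartbeats_total
    if sent < 0:
        # Counter went backwards: assume the DataNode restarted and the
        # counter began again from zero.
        sent = curr.heartbeats_total
    return sent, sent / elapsed_s


if __name__ == "__main__":
    count, rate = heartbeat_measures(Sample(0.0, 1000), Sample(300.0, 1100))
    print(count, round(rate, 3))
```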

Outputs of the test: One set of results for each DataNode in the target Hadoop cluster

The measures made by this test are as follows:

| Measurement | Description | Measurement Unit | Interpretation |
|---|---|---|---|
| Hear_beats | Indicates the count of heartbeats sent by this DataNode to the NameNode during the last measurement period. | Number | By default, heartbeats are sent every 3 seconds, and the default frequency of a test is 5 minutes. In the default scenario, therefore, 100 heartbeats (300 / 3) should have been sent in a single measure period. If fewer or no heartbeats were sent during a measure period, it could imply a problem with the DataNode. If the heartbeat loss occurred owing to a disk failure on the DataNode, you may have to replace a disk on the DataNode host or perform a disk hot swap for DataNodes. If a DataNode could not send heartbeats for any other reason, you may have to recommission that DataNode to add it back to the cluster. |
| Heart_beat_rate | Indicates the rate at which this DataNode sent heartbeats to the NameNode. | Heartbeats/Sec | |
| Avg_heart_beat_time | Indicates the average time taken to send a heartbeat from this DataNode to the NameNode. | Milliseconds | A high value or a consistent increase in the value of this measure is a cause for concern, as it means that the DataNode is sending heartbeats slowly to the NameNode. A bad network connection between the DataNode and NameNode is one of the common causes of slow heartbeat transmission. |
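
The interpretation guidance above can be sketched as a simple check. This is a minimal example, assuming the default 3-second heartbeat interval and 5-minute measure period. The function names, the 10% tolerance, and the 1000 ms ceiling on average heartbeat time are hypothetical values chosen for illustration, not thresholds defined by the test.

```python
from typing import List


def expected_heartbeats(measure_period_s: int = 300,
                        heartbeat_interval_s: int = 3) -> int:
    """Heartbeats a healthy DataNode should send in one measure period."""
    return measure_period_s // heartbeat_interval_s


def assess_datanode(heart_beats: int,
                    avg_heart_beat_time_ms: float,
                    *,
                    measure_period_s: int = 300,
                    heartbeat_interval_s: int = 3,
                    tolerance: float = 0.10,            # hypothetical
                    max_avg_time_ms: float = 1000.0     # hypothetical
                    ) -> List[str]:
    """Return a list of findings; an empty list means nothing looks wrong."""
    findings = []
    expected = expected_heartbeats(measure_period_s, heartbeat_interval_s)
    if heart_beats == 0:
        findings.append("no heartbeats sent")
    elif heart_beats < expected * (1 - tolerance):
        findings.append(
            f"only {heart_beats} of about {expected} expected heartbeats sent")
    if avg_heart_beat_time_ms > max_avg_time_ms:
        findings.append("slow heartbeat transmission")
    return findings


if __name__ == "__main__":
    print(expected_heartbeats())            # 100 with the defaults (300 / 3)
    print(assess_datanode(60, 2000.0))
```

A shortfall in the count points to the DataNode itself (for example, a failed disk), whereas a high average time with a normal count is more consistent with the network causes noted in the table.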