eG Monitoring
 

Measures reported by HdpRMUptimeTest

The resource manager is the master daemon of YARN and is responsible for assigning and managing resources across all applications. Whenever the resource manager receives a processing request, it forwards the request to the corresponding node manager and allocates the resources needed to complete it. An unscheduled reboot or unexpected downtime of the resource manager can therefore adversely impact resource allocation to applications. Similarly, if a scheduled reboot does not occur, resource manager performance may suffer, which in turn can prevent applications from running.

To avoid this, administrators must watch for problem conditions such as a sudden reboot of the resource manager or an unusually long uptime. The HdpRMUptimeTest test helps here. It continuously tracks and reports the uptime of the resource manager, revealing whether or not a scheduled reboot occurred. The test also alerts administrators to unexpected reboots, so that they can quickly investigate and eliminate the cause. In this way, the test helps administrators ensure high uptime of the resource manager.
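As a rough illustration of how an uptime value can be derived, the sketch below assumes the resource manager's start time is read from the cluster information resource of the YARN ResourceManager REST API (`/ws/v1/cluster/info`), whose `clusterInfo.startedOn` field is a timestamp in milliseconds since the epoch. How HdpRMUptimeTest itself collects this value is not stated here. The function name `rm_uptime_seconds` and the sample payload are hypothetical.

```python
import json


def rm_uptime_seconds(cluster_info_json: str, now_ms: int) -> float:
    """Return the seconds elapsed since the resource manager started.

    cluster_info_json: body of a cluster-info response (assumed shape).
    now_ms: current time in milliseconds since the epoch.
    """
    info = json.loads(cluster_info_json)["clusterInfo"]
    started_ms = int(info["startedOn"])
    # Clamp at zero in case the agent's clock lags behind the cluster's.
    return max(0, now_ms - started_ms) / 1000.0


# Hypothetical response body, trimmed to the single field used here.
sample = '{"clusterInfo": {"startedOn": 1700000000000}}'
print(rm_uptime_seconds(sample, now_ms=1700000300000))  # 300.0
```

Passing the current time as an argument, rather than reading the clock inside the function, keeps the calculation easy to check against known values.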

Outputs of the test: One set of results for the target Hadoop cluster

The measures made by this test are as follows:

| Measurement | Description | Measurement Unit | Interpretation |
|---|---|---|---|
| Is_rebooted | Indicates whether or not the resource manager has been rebooted during the last measurement period. | | If this measure shows 1, the resource manager was rebooted during the last measurement period. By checking the time periods when this metric changes from 0 to 1, an administrator can determine when the resource manager was rebooted. The detailed diagnosis of this measure, if enabled, displays when the resource manager was shut down, how long it stayed down, when it was restarted, and whether or not the resource manager is in maintenance. |
| Diff_uptime | Indicates the time period that the resource manager has been up since the last time this test ran. | Seconds | If the resource manager has not been rebooted during the last measurement period and the agent has been running continuously, this value will be equal to the measurement period. If the resource manager was rebooted during the last measurement period, this value will be less than the measurement period of the test. For example, if the measurement period is 300 seconds and the resource manager was rebooted 120 seconds ago, this metric will report a value of 120 seconds. The accuracy of this metric depends on the measurement period: the smaller the measurement period, the greater the accuracy. |
| Uptime | Indicates the total time that the resource manager has been up since its last reboot. | | This measure displays the number of years, months, days, hours, minutes and seconds since the last reboot. Administrators may wish to be alerted if a server has been running without a reboot for a very long period. Setting a threshold for this metric allows administrators to detect such conditions. |
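The relationship between the three measures can be sketched as follows. This is a minimal illustration, not the test's actual implementation: it assumes the agent runs continuously once per measurement period and that the current uptime in seconds is already known. The names `derive_measures` and `format_uptime` are hypothetical, and the formatter stops at days because years and months depend on the calendar.

```python
def derive_measures(uptime_s: float, period_s: float) -> dict:
    """Derive the three measures from the current uptime and the period."""
    # An uptime shorter than the measurement period means the resource
    # manager started after the previous run, i.e. it was rebooted.
    rebooted = uptime_s < period_s
    return {
        "Is_rebooted": 1 if rebooted else 0,
        # Uptime accrued since the previous run can never exceed the period.
        "Diff_uptime": min(uptime_s, period_s),
        "Uptime": uptime_s,
    }


def format_uptime(seconds: float) -> str:
    """Render an uptime in days, hours, minutes and seconds."""
    days, rem = divmod(int(seconds), 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{days}d {hours}h {minutes}m {secs}s"


# Rebooted 120 seconds ago with a 300-second measurement period.
print(derive_measures(120, 300))
# Up for just over a day: no reboot in the last period.
print(derive_measures(90061, 300))
print(format_uptime(90061))  # 1d 1h 1m 1s
```

The first call reproduces the example from the table: Is_rebooted is 1 and Diff_uptime is 120. An alert on extended uptime would then be a simple comparison of the `Uptime` value against an administrator-chosen threshold.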