eG Monitoring
 

Measures reported by HdpAppStatsTest

Apache YARN (Yet Another Resource Negotiator) is Hadoop’s cluster resource management system.

YARN provides its core services via two types of long-running daemons: a resource manager (one per cluster), which manages the use of resources across the cluster, and node managers, which run on every node in the cluster to launch and monitor containers. A container executes an application-specific process with a constrained set of resources (memory, CPU, and so on).

To run an application on YARN, a client contacts the resource manager and asks it to run an application master process. The resource manager is the master daemon of YARN and is responsible for resource assignment and management across all applications. Whenever it receives a processing request, it forwards the request to the corresponding node manager and allocates the resources needed to complete it. The node manager launches the application master in a container. What the application master does once it is running depends on the application: it could simply run a computation in its own container and return the result to the client, or it could request more containers from the resource manager and use them to run a distributed computation.

In an ideal world, the requests that a YARN application makes would be granted immediately. In the real world, however, resources are limited, and on a busy cluster, an application will often need to wait to have some of its requests fulfilled. The Scheduler component of the resource manager manages resource allocations to applications. Typically, the scheduler places applications in queues and allocates resources to them based on pre-defined policies.
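For instance, with Hadoop's Capacity Scheduler, the queues and their resource shares are defined in capacity-scheduler.xml. A minimal, illustrative sketch is below; the queue names (etl, adhoc) and capacity percentages are assumptions for the example, not values from this document:

```xml
<configuration>
  <!-- Illustrative layout: two child queues under the root queue -->
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>etl,adhoc</value>
  </property>
  <!-- Percentage of cluster resources guaranteed to each queue -->
  <property>
    <name>yarn.scheduler.capacity.root.etl.capacity</name>
    <value>70</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.adhoc.capacity</name>
    <value>30</value>
  </property>
</configuration>
```

Applications submitted to a queue compete only for that queue's share, which is why per-queue job counts (as reported by this test) are a useful lens on scheduler behavior.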

The true test of the scheduler's efficiency is how quickly it makes resources available to the applications/jobs in its queues. Delayed resource allocation is one of the common reasons jobs fail or remain pending in the queues for a long time. By tracking the status of jobs in the queues, an administrator can proactively detect potential application slowness. This is where the HdpAppStatsTest test helps!

This test auto-discovers the queues spawned by the YARN scheduler and, for each queue, reports the count of jobs in different states of activity. In the process, the test points administrators to queues that have one or more failed jobs, long-running jobs, or jobs that have not yet been assigned a container. This way, administrators can be proactively alerted to potential issues in the scheduler's resource allocations.
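Queue-level statistics like these are exposed by the Resource Manager's REST API (GET http://&lt;rm-host&gt;:8088/ws/v1/cluster/scheduler). A minimal sketch of walking that response to collect per-queue application counts is below; the SAMPLE payload is an abbreviated, illustrative stand-in for a real response, and the queue names in it are assumptions:

```python
# Sketch: collect (queueName, numApplications) pairs from the JSON
# returned by the RM scheduler REST endpoint. Real responses nest
# child queues under "queues" -> "queue" and carry many more fields;
# this abbreviated sample only illustrates the shape.
SAMPLE = {
    "scheduler": {
        "schedulerInfo": {
            "queueName": "root",
            "queues": {
                "queue": [
                    {"queueName": "etl", "numApplications": 4},
                    {"queueName": "adhoc", "numApplications": 1},
                ]
            },
        }
    }
}

def queue_app_counts(node):
    """Recursively walk dicts/lists, collecting application counts
    from every queue object that reports them."""
    found = []
    if isinstance(node, dict):
        if "queueName" in node and "numApplications" in node:
            found.append((node["queueName"], node["numApplications"]))
        for value in node.values():
            found.extend(queue_app_counts(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(queue_app_counts(item))
    return found

print(dict(queue_app_counts(SAMPLE)))  # {'etl': 4, 'adhoc': 1}
```

In a live deployment, the SAMPLE dict would be replaced by the parsed JSON body fetched from the Resource Manager.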

Outputs of the test: One set of results for each queue spawned by the scheduler component of the resource manager of the target Hadoop cluster.

The measures made by this test are as follows:

| Measurement | Description | Measurement Unit | Interpretation |
| --- | --- | --- | --- |
| Submitted_applications | Indicates the number of jobs submitted to this queue during the last measurement period. | Number | This is a good indicator of the workload on the queue. |
| Completed_applications | Indicates the number of jobs in this queue that were completed during the last measurement period. | Number | Ideally, the value of this measure should be high. |
| Failed_applications | Indicates the number of jobs in this queue that failed during the last measurement period. | Number | Ideally, the value of this measure should be 0. A non-zero value indicates that one or more jobs in the queue have failed. The AM (Application Master) container is the brain of a YARN job; it controls the job's whole life cycle. For any YARN job failure or performance issue, always start by checking the AM log. Then, to know which application jobs failed, look up the client log (e.g., the Hive log, Spark log, or a custom application log). You can identify the node on which the AM container ran by checking the RM log or the RM UI for the YARN job. The AM container log reveals which container failed; to find where the failed container ran, use the RM log. |
| Killed_applications | Indicates the number of jobs in this queue that were killed during the last measurement period. | Number | |
| Running_applications | Indicates the number of jobs in this queue that are currently running. | Number | |
| Pending_applications | Indicates the number of jobs in this queue that are pending processing. | Number | A low value is desired for this measure. A high value indicates that many jobs are pending processing, which could increase the length of the job queue. A consistent increase in the value of this measure hints at a delay in resource allocation by the scheduler, causing jobs to wait too long to run. |
| Running_0to60 | Indicates the number of jobs in this queue that have been running for less than 60 minutes. | Number | |
| Running_60to300 | Indicates the number of jobs in this queue that have been running for 1 - 5 hours. | Number | |
| Running_300to1400 | Indicates the number of jobs in this queue that have been running for 5 - 24 hours. | Number | Ideally, the values of the running-duration measures should be low. A high value could indicate problems in resource allocation. |
| Running_1400_more | Indicates the number of jobs in this queue that have been running for over 24 hours. | Number | |
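The running-duration measures above bucket jobs by elapsed time. A minimal sketch of that bucketing is below, assuming elapsed times are available in minutes; the thresholds 60, 300, and 1440 follow the duration descriptions above, and the function and variable names are illustrative, not part of the product:

```python
def bucket_running_jobs(elapsed_minutes):
    """Count running jobs per duration bucket, mirroring the
    Running_0to60 / Running_60to300 / Running_300to1400 /
    Running_1400_more measures (thresholds in minutes)."""
    buckets = {"0to60": 0, "60to300": 0, "300to1400": 0, "1400_more": 0}
    for m in elapsed_minutes:
        if m < 60:
            buckets["0to60"] += 1
        elif m < 300:          # under 5 hours
            buckets["60to300"] += 1
        elif m < 1440:         # under 24 hours
            buckets["300to1400"] += 1
        else:
            buckets["1400_more"] += 1
    return buckets

print(bucket_running_jobs([15, 90, 600, 2000]))
# {'0to60': 1, '60to300': 1, '300to1400': 1, '1400_more': 1}
```

A job landing in the higher buckets is not necessarily stuck, but a steadily growing count there is the signal this test uses to flag potential resource-allocation problems.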