Measures reported by HdpRMRpcActTest
The ResourceManager (RM) is the master that arbitrates all available cluster resources and thus helps manage the distributed applications running on the YARN system. It works together with the per-node NodeManagers (NMs) and the per-application ApplicationMasters (AMs).
Clients communicate with the RM via RPC to submit applications, terminate applications, obtain queue information, and retrieve cluster statistics. Nodes in the cluster interact with the RM over RPC for registration, for submitting resource requests, and for routing heartbeats to the YARN scheduler. ApplicationMasters also communicate with the RM via RPC for registration and for submitting termination, unregister, container-allocation, and container-deallocation requests to the YARN scheduler. Additionally, the RM manages the secret keys used to authenticate/authorize requests on the various RPC interfaces.
Typically, upon receipt of RPC calls, the RM puts them in a queue for execution. If the RM is unable to process the queued requests quickly, the queue length will keep increasing. If this processing bottleneck is not resolved rapidly, the RM may end up overloaded with RPC requests, which may eventually cause it to choke and stop responding. Administrators hence need to keep an eye on the RPC operations performed via each RPC interface on the RM, so that they can promptly detect overload conditions and slow RPC activity.
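The queue-growth condition described above can be checked mechanically. The following is only a minimal sketch, assuming the queue length of one RPC interface has already been sampled at regular intervals; the function name `is_steadily_growing`, the `min_points` parameter, and the sample values are hypothetical and are not part of the test.

```python
from typing import Sequence


def is_steadily_growing(samples: Sequence[float], min_points: int = 5) -> bool:
    """Return True if the last `min_points` samples never decrease
    and the newest one is higher than the oldest one in that window."""
    if len(samples) < min_points:
        return False  # not enough history to judge a trend
    window = list(samples[-min_points:])
    never_decreases = all(b >= a for a, b in zip(window, window[1:]))
    return never_decreases and window[-1] > window[0]


# Hypothetical queue-length samples for one RPC interface, oldest first.
queue_lengths = [3, 4, 4, 7, 12, 20]
print(is_steadily_growing(queue_lengths))  # prints True
```

A window of several samples is used rather than a single comparison so that one momentary burst of calls is not mistaken for a sustained backlog.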
Also, since the RM manages the authentication/authorization requests, administrators need to be able to rapidly capture and investigate authentication/authorization failures, so that the cluster is protected from malicious attacks.
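One way to notice an unusual burst of authentication or authorization failures is to compare the newest failure rate with the recent baseline. The sketch below is illustrative only: the function `is_failure_spike`, the `factor` and `floor` thresholds, and the sample rates are assumptions chosen for the example, not values prescribed by the test.

```python
from statistics import mean
from typing import Sequence


def is_failure_spike(rates: Sequence[float],
                     factor: float = 3.0,
                     floor: float = 1.0) -> bool:
    """Flag the newest sample if it is at least `floor` failures/sec
    and more than `factor` times the mean of the earlier samples."""
    if len(rates) < 2:
        return False  # no baseline to compare against
    *history, latest = rates
    baseline = mean(history)
    return latest >= floor and latest > factor * baseline


# Hypothetical failure rates (failures/sec) for one interface, oldest first.
print(is_failure_spike([0.1, 0.2, 0.1, 5.0]))  # prints True
print(is_failure_spike([0.1, 0.2, 0.1, 0.2]))  # prints False
```

The `floor` guard keeps a tiny absolute rate from being flagged merely because the baseline was close to zero.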
With the help of the HdpRMRpcActTest test, administrators can observe RPC activity on each RPC interface. In the process, they can:
Identify the exact RPC interface that is overloaded with connections;
Detect slowness in RPC request processing well before users complain, and precisely pinpoint the slow interface;
Be alerted to repeated authentication/authorization failures on any interface.
Outputs of the test: One set of results for each RPC interface on the ResourceManager
The measures made by this test are as follows:
| Measurement | Description | Measurement Unit | Interpretation |
|---|---|---|---|
| Rpc_call_rate | Indicates the rate at which RPC calls were received by the RM via this interface. | Calls/Sec | A high value is indicative of high RPC activity on an interface. |
| Avg_queue_time | Indicates the average time RPC requests received via this interface spent in the queue. | Milliseconds | If the value of this measure grows continuously, it is indicative of latency in request processing. |
| Avg_proc_time | Indicates the average processing time of RPC requests received via this interface. | Milliseconds | If the value of the Avg_queue_time measure increases consistently for an interface, take a look at the value of this measure for the same interface. If this measure is also increasing alongside Avg_queue_time, it is a clear indication of a processing bottleneck on that interface. |
| Authen_succ_rate | Indicates the rate at which RPC interactions via this interface were successfully authenticated. | Successes/Sec | Ideally, the value of this measure should be high. |
| Authen_fail_rate | Indicates the rate at which RPC communications via this interface failed authentication. | Failures/Sec | A low value is desired for this measure. A significant and unexpected spike in this value could indicate attempts to break into the cluster. Such access attempts should be investigated closely. |
| Authorz_succ_rate | Indicates the rate at which RPC calls made via this interface were successfully authorized. | Successes/Sec | Ideally, the value of this measure should be high. |
| Authorz_fail_rate | Indicates the rate at which RPC calls made via this interface failed authorization. | Failures/Sec | A low value is desired for this measure. A significant and unexpected spike in this value could indicate attempts to break into the cluster. Such access attempts should be investigated closely. |
| Queue_length | Indicates the number of RPC calls received via this interface that are currently in the queue. | Number | If this value keeps increasing over time for any interface, it indicates that RPC request processing is probably bottlenecked on that interface. |
| Num_open_conns | Indicates the number of RPC connections currently open on this interface. | Number | This is a good indicator of the current load on an interface. Compare the value of this measure across interfaces to identify the overloaded interface. You can also use the detailed diagnosis of this measure to see which user has how many connections open via the interface. In the event of an overload, these detailed metrics will point you to the precise user responsible for it. |
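The interpretation guidance for Avg_queue_time, Avg_proc_time, and Num_open_conns can be combined into a simple triage routine. This is a rough sketch under stated assumptions: the helper names (`rising`, `diagnose`, `busiest_interface`), the interface labels, and the sample values are all hypothetical, and "rising" is simplified to mean that the newest sample is higher than the oldest.

```python
from typing import Dict, Sequence


def rising(samples: Sequence[float]) -> bool:
    """Simplified trend check: at least two points, newest above oldest."""
    return len(samples) >= 2 and samples[-1] > samples[0]


def diagnose(queue_ms: Sequence[float], proc_ms: Sequence[float]) -> str:
    """Classify one interface from its Avg_queue_time and Avg_proc_time samples."""
    if rising(queue_ms) and rising(proc_ms):
        return "processing bottleneck"
    if rising(queue_ms):
        return "queue time rising; check processing time"
    return "ok"


def busiest_interface(open_conns: Dict[str, int]) -> str:
    """Return the interface label with the most open connections."""
    return max(open_conns, key=lambda name: open_conns[name])


# Hypothetical samples (milliseconds), oldest first, for one interface.
print(diagnose([2.0, 3.5, 6.0], [1.0, 1.8, 4.2]))  # prints processing bottleneck

# Hypothetical Num_open_conns values keyed by made-up interface labels.
conns = {"client": 12, "scheduler": 85, "resource-tracker": 40}
print(busiest_interface(conns))  # prints scheduler
```

In practice a sturdier trend test over a longer window would be preferable; the two-point comparison is kept here only to make the decision logic easy to follow.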