eG Monitoring
 

Measures reported by HdpNNRpcActTest

Clients communicate with the NameNode in a Hadoop cluster using the RPC protocol. Similarly, DataNodes and the NameNode in the cluster exchange data using the RPC protocol. By design, the NameNode never initiates any RPCs. Instead, it only responds to RPC requests issued by DataNodes or clients.

Typically, clients and DataNodes request HDFS operations by issuing RPC calls to the NameNode. On receiving these calls, the NameNode places them in a queue for execution by handler threads. If the NameNode cannot process the queued requests quickly enough, the queue length keeps increasing. If this processing bottleneck is not resolved rapidly, the NameNode may become overloaded with RPC requests, and may eventually choke and stop responding. Administrators therefore need to keep an eye on the RPC operations performed via each RPC interface, so that they can promptly capture overload conditions and slow RPC activity.
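The queue build-up described above can be illustrated with a small simulation. This is a minimal sketch and not NameNode code: the function name `simulate_queue` and the arrival and service rates are hypothetical, chosen only to show that the backlog keeps growing whenever calls arrive faster than they are processed.

```python
def simulate_queue(arrivals_per_tick, served_per_tick, ticks):
    """Return the queue length after each tick of a simple FIFO backlog model.

    All parameters are illustrative; they do not correspond to Hadoop settings.
    """
    queue_length = 0
    history = []
    for _ in range(ticks):
        # New RPC calls are added to the queue.
        queue_length += arrivals_per_tick
        # The server processes as many queued calls as it can in this tick.
        queue_length -= min(queue_length, served_per_tick)
        history.append(queue_length)
    return history


# Arrival rate above service rate: the backlog grows every tick.
overloaded = simulate_queue(arrivals_per_tick=120, served_per_tick=100, ticks=5)
# Arrival rate below service rate: the queue is drained every tick.
healthy = simulate_queue(arrivals_per_tick=80, served_per_tick=100, ticks=5)
print(overloaded)  # [20, 40, 60, 80, 100]
print(healthy)     # [0, 0, 0, 0, 0]
```

In this model, a steadily rising queue length is the signature of a processing bottleneck.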

Also, in highly secure environments, it is common practice to authenticate and authorize RPC interactions between the clients and the cluster, so as to protect them from eavesdropping. Authentication requires end users/clients interacting with Hadoop over RPC to be authenticated by Kerberos. Authorization ensures that end users/clients have the necessary, pre-configured permissions to access the service. Frequent authentication and/or authorization failures should be viewed with suspicion, as they may be attempts to hack the cluster. Administrators should watch for such failures and investigate them promptly.
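One way to watch for such failures is to turn a cumulative failure counter into a per-second rate and flag sudden jumps. The sketch below assumes a monitoring agent that samples a monotonically increasing failure counter at a known interval. The function names, the `factor` and `floor` defaults, and the sample numbers are hypothetical and are not part of the test or of Hadoop.

```python
def failure_rate(prev_count, curr_count, interval_seconds):
    """Convert two samples of a cumulative counter into failures per second."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    # A counter that went backwards usually means the process restarted;
    # in that case, treat the current value as the failures since the restart.
    delta = curr_count - prev_count if curr_count >= prev_count else curr_count
    return delta / interval_seconds


def is_suspicious(rate, baseline, factor=5.0, floor=1.0):
    """Flag a rate that exceeds both an absolute floor and a multiple of the baseline."""
    return rate >= floor and rate > baseline * factor


# Hypothetical samples taken 60 seconds apart.
rate = failure_rate(prev_count=1200, curr_count=1800, interval_seconds=60)
print(rate)                               # 10.0
print(is_suspicious(rate, baseline=0.5))  # True
```

Combining an absolute floor with a baseline multiple keeps a quiet interface from raising alerts over a handful of stray failures.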

With the help of the HdpNNRpcActTest test, administrators can observe RPC activity on each RPC interface. In the process, they can:

  • Identify the exact RPC interface that is overloaded with connections;

  • Detect slowness in RPC request processing well before users complain, and precisely pinpoint the slow interface;

  • Be alerted to repeated authentication/authorization failures on any interface.
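The first two checks in this list can be sketched as follows, assuming the per-interface measures have already been collected into a dictionary. The interface names, field names, readings, and the queue-time limit are hypothetical placeholders, not values produced by the test.

```python
def most_loaded_interface(samples):
    """Return the interface with the highest number of open connections."""
    return max(samples, key=lambda name: samples[name]["open_conns"])


def slow_interfaces(samples, queue_ms_limit):
    """Return, sorted by name, the interfaces whose average queue time exceeds the limit."""
    return sorted(
        name for name, m in samples.items() if m["avg_queue_ms"] > queue_ms_limit
    )


# Hypothetical readings for two RPC interfaces.
samples = {
    "interface_a": {"open_conns": 340, "avg_queue_ms": 85.0},
    "interface_b": {"open_conns": 25, "avg_queue_ms": 2.5},
}
print(most_loaded_interface(samples))                 # interface_a
print(slow_interfaces(samples, queue_ms_limit=50.0))  # ['interface_a']
```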

Outputs of the test: One set of results for each RPC interface on the NameNode

The measures made by this test are as follows:

Measurement: Rpc_call_rate
Description: Indicates the rate at which RPC calls were received by the NameNode via this interface.
Measurement Unit: Calls/Sec
Interpretation: A high value is indicative of high RPC activity on an interface.

Measurement: Avg_queue_time
Description: Indicates the average time RPC requests received via this interface spent in the queue.
Measurement Unit: Milliseconds
Interpretation: If the value of this measure grows continuously, it indicates that request processing is slowing down.

By default, the NameNode puts RPC requests in a FIFO call queue. Though FIFO is fair in the sense of first-come, first-served, it is unfair in the sense that users who perform more I/O operations on the NameNode will be served more than users who perform fewer I/O operations. This means that a single user submitting a very large number of requests can easily overwhelm the service, causing degraded service for all other users.

This situation can be improved by replacing the FIFO queue with the Fair Call Queue. In this approach, the Fair Call Queue places incoming RPC calls into a number of queues based on the call volume of the user who made the call. The scheduler keeps track of recent calls, and prioritizes calls from lighter users over calls from heavy users.

Measurement: Avg_proc_time
Description: Indicates the average processing time of RPC requests received via this interface.
Measurement Unit: Milliseconds
Interpretation: If the value of the Average queue time measure increases consistently for an interface, then take a look at the value of this measure for the same interface. If the value of this measure is also increasing alongside the value of the Average queue time measure for an interface, it is a clear indication of a processing bottleneck on that interface.

Measurement: Authen_succ_rate
Description: Indicates the rate at which RPC interactions via this interface were successfully authenticated.
Measurement Unit: Successes/Sec
Interpretation: Ideally, the value of this measure should be high.

Measurement: Authen_fail_rate
Description: Indicates the rate at which RPC communications via this interface failed authentication.
Measurement Unit: Failures/Sec
Interpretation: Ideally, the value of this measure should be low. Frequent authentication failures should be viewed with suspicion, as they may be attempts to hack the cluster.

Measurement: Authorz_succ_rate
Description: Indicates the rate at which RPC calls made via this interface were successfully authorized.
Measurement Unit: Successes/Sec
Interpretation: Ideally, the value of this measure should be high.

Measurement: Authorz_fail_rate
Description: Indicates the rate at which RPC calls made via this interface failed authorization.
Measurement Unit: Failures/Sec
Interpretation: A low value is desired for this measure. A significant and unexpected spike in this value could indicate attempts to hack the cluster. Such accesses should be pulled up for closer scrutiny.

Measurement: Queue_length
Description: Indicates the count of RPC calls received via this interface that are currently in the queue.
Measurement Unit: Number
Interpretation: If this value keeps increasing with time for any interface, it indicates that RPC request processing is probably bottlenecked on that interface.

Measurement: Num_open_conns
Description: Indicates the count of RPC connections currently open on this interface.
Measurement Unit: Number
Interpretation: This is a good indicator of the current load on an interface. Compare the value of this measure across interfaces to identify the overloaded interface. You can also use the detailed diagnosis of this measure to see which user has how many connections open via the interface. In the event of an overload, these detailed metrics will point you to the precise user responsible for it.
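
The Fair Call Queue idea described under Avg_queue_time can be illustrated with a toy model. This is a simplified sketch and not Hadoop's implementation: the class name `FairQueueSketch`, the threshold values, and the strict-priority draining are assumptions made for illustration. It only shows the principle that calls from users with a small share of recent traffic are served ahead of calls from heavy users.

```python
from collections import Counter, deque


class FairQueueSketch:
    """Toy multi-level call queue that favours users with fewer recent calls."""

    def __init__(self, thresholds=(0.25, 0.5)):
        # Share-of-traffic cut-offs (hypothetical values): each threshold a
        # user's share exceeds pushes that user's calls one level lower.
        self.thresholds = thresholds
        self.levels = [deque() for _ in range(len(thresholds) + 1)]
        self.recent_calls = Counter()

    def _priority(self, user):
        total = sum(self.recent_calls.values())
        share = self.recent_calls[user] / total
        # Level 0 is served first; heavier users get a higher level number.
        return sum(share > t for t in self.thresholds)

    def put(self, user, call):
        self.recent_calls[user] += 1
        self.levels[self._priority(user)].append((user, call))

    def take(self):
        # Strict priority for simplicity: serve lighter users' levels first.
        for level in self.levels:
            if level:
                return level.popleft()
        return None


q = FairQueueSketch()
for i in range(8):
    q.put("heavy", f"h{i}")   # one user floods the queue
q.put("light", "l0")          # a light user arrives last
first = q.take()
print(first)                  # ('light', 'l0')
```

With a plain FIFO queue, the light user's call would have waited behind all eight calls from the heavy user. Here it is served first.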