eG Monitoring

Measures reported by HdpDatNodActTest

The Apache Hadoop HDFS architecture follows a master/slave design, where a cluster comprises a single NameNode (the master node) and all the other nodes are DataNodes (slave nodes).

DataNodes are the slave nodes in HDFS. The actual data is stored on DataNodes. A functional filesystem has more than one DataNode, with data replicated across them.

On startup, a DataNode connects to the NameNode, spinning until that service comes up. It then responds to requests from the NameNode for filesystem operations. Local and remote client applications can talk directly to a DataNode once the NameNode has provided the location of the data. Similarly, MapReduce operations farmed out to TaskTracker instances near a DataNode talk directly to the DataNode to access the files. DataNodes also periodically perform block verification to identify corrupt blocks.

The NameNode also initiates replication of blocks on the DataNodes as and when necessary. Moreover, DataNodes cache blocks in off-heap memory based on caching instructions they receive from the NameNode.

Each of these operations imposes load on a DataNode. Since I/O load should be uniformly distributed across the DataNodes in a Hadoop cluster, an administrator needs to closely observe the I/O activity on every DataNode, so that load-balancing irregularities are captured promptly. Administrators should also assess how each DataNode is processing I/O requests, so that they can proactively detect bottlenecks in request servicing on any DataNode. They should also check whether block verification has occurred on any DataNode, so that verification failures are detected quickly. Furthermore, because block caching improves read performance, administrators should ensure that adequate blocks are cached on every DataNode. With the help of the HdpDatNodActTest test, administrators can monitor all the activities discussed above on every DataNode. In the process, they can rapidly identify overloaded DataNodes, slow DataNodes, DataNodes where block verification has failed, and DataNodes where caching is sub-optimal.
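To give a concrete sense of where such activity numbers come from, the sketch below turns two successive snapshots of a DataNode's cumulative DataNodeActivity counters (which Hadoop exposes over the DataNode's JMX HTTP endpoint) into per-second rates. The host name, the port (9864 is the default DataNode HTTP port in Hadoop 3.x), the polling interval, and the sample counter values are all illustrative assumptions; eG's own collection mechanism may differ.

```python
import json
from urllib.request import urlopen

# JMX query for the DataNode activity counters (cumulative since startup).
# Host and port are assumptions for illustration only.
JMX_URL = "http://datanode1:9864/jmx?qry=Hadoop:service=DataNode,name=DataNodeActivity*"

def fetch_activity(url=JMX_URL):
    """Fetch the DataNodeActivity bean and return its counters as a dict."""
    with urlopen(url) as resp:
        return json.load(resp)["beans"][0]

def per_second_rates(prev, curr, interval_secs):
    """Convert two cumulative counter snapshots into per-second rates."""
    counters = ["BytesRead", "BlocksRead", "ReadsFromLocalClient",
                "ReadsFromRemoteClient", "BytesWritten", "BlocksWritten"]
    return {name: (curr[name] - prev[name]) / interval_secs
            for name in counters if name in prev and name in curr}

# Two hypothetical snapshots taken 60 seconds apart:
snap1 = {"BytesRead": 1_000_000, "BlocksRead": 120,
         "ReadsFromLocalClient": 300, "ReadsFromRemoteClient": 60,
         "BytesWritten": 500_000, "BlocksWritten": 45}
snap2 = {"BytesRead": 1_600_000, "BlocksRead": 180,
         "ReadsFromLocalClient": 420, "ReadsFromRemoteClient": 90,
         "BytesWritten": 800_000, "BlocksWritten": 75}

rates = per_second_rates(snap1, snap2, 60)
print(rates["BlocksRead"])  # blocks read per second over the interval
```

Sampling cumulative counters at a fixed interval and differencing them is what turns raw activity totals into the Reads/Sec, Blocks/Sec and Operations/Sec figures reported below.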

Outputs of the test: One set of results for each DataNode in the target Hadoop cluster

The measures made by this test are as follows:

| Measurement | Description | Measurement Unit | Interpretation |
|---|---|---|---|
| Data_read_rate | Indicates the rate at which data was read from this DataNode. | Reads/Sec | Compare the value of these measures across DataNodes to identify the node that is slowest in processing read requests. |
| Block_read_rate | Indicates the rate at which blocks were read from this DataNode. | Blocks/Sec | |
| Read_oper_localclient | Indicates the rate at which local clients performed read operations on this DataNode. | Operations/Sec | Compare the values of these measures across DataNodes to figure out whether the load of read operations is uniformly distributed across all DataNodes. In the process, you can identify those DataNodes that are handling significantly more read requests than the rest. |
| Read_oper_remoteclient | Indicates the rate at which remote clients performed read operations on this DataNode. | Operations/Sec | |
| Data_write_rate | Indicates the rate at which data was written to this DataNode. | Writes/Sec | Compare the value of these measures across DataNodes to identify the node that is slowest in processing write requests. |
| Block_write_rate | Indicates the rate at which blocks were written to this DataNode. | Blocks/Sec | |
| Write_oper_localclient | Indicates the rate at which local clients performed write operations on this DataNode. | Operations/Sec | Compare the values of these measures across DataNodes to figure out whether the load of write operations is uniformly distributed across all DataNodes. In the process, you can identify those DataNodes that are handling significantly more write requests than the rest. |
| Write_oper_remoteclient | Indicates the rate at which remote clients performed write operations on this DataNode. | Operations/Sec | |
| Block_replicate_rate | Indicates the rate at which blocks on this DataNode were replicated. | Blocks/Sec | |
| Block_remove_rate | Indicates the rate at which blocks were removed from this DataNode. | Blocks/Sec | |
| Blocks_verified_rate | Indicates the rate at which blocks on this DataNode were verified. | Blocks/Sec | Block verification is used to identify corrupt blocks on a DataNode. During a write operation, when a DataNode writes data into HDFS, it computes a checksum for that data; this checksum helps detect corruption during data transmission. When the same data is read from HDFS, the client verifies the checksum returned by the DataNode against a checksum it computes from the data, exposing any corruption that may have occurred while the data was stored on the DataNode. In addition, every DataNode periodically runs block verification on all the blocks it stores, which helps identify and fix corrupt data before a read operation. With the block verification service, HDFS can identify and fix corruption early. |
| Block_verfiy_fail_rate | Indicates the rate at which block verifications failed on this DataNode. | Blocks/Sec | If block verification fails, corrupted blocks may continue to remain on a DataNode. This means that ideally, the value of this measure should be 0. |
| Blocks_cached_rate | Indicates the rate at which blocks on this node were cached. | Blocks/Sec | Whenever a request is made to a DataNode to read a block, the DataNode reads the block from disk. If you know that a block will be read many times, it is a good idea to cache that block in memory. Hadoop allows you to cache a block: you can specify which file (or directory) to cache and for how long; the blocks are cached in off-heap memory. Hadoop provides centralized cache management for these block caches. All cache requests are made to the NameNode, which instructs the respective DataNodes to cache the blocks in their off-heap caches. Caching minimizes I/O processing overheads and improves the overall performance of a DataNode. This is why a high rate of caching on a DataNode is ideal, and a high value is desired for this measure. If the value of this measure is very low for a DataNode, that node may process requests slowly. |
| Blocks_Uncached_rate | Indicates the rate at which blocks on this DataNode are uncached. | Blocks/Sec | |
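The cross-node comparisons recommended in the table above can be automated. The sketch below flags DataNodes whose read-operation rate deviates markedly from the cluster mean; the node names, the sample rates, and the 50% deviation threshold are illustrative assumptions, not values defined by the test itself.

```python
from statistics import mean

def find_imbalanced(read_rates, tolerance=0.5):
    """Return DataNodes whose read-operation rate deviates from the
    cluster mean by more than `tolerance` (a fraction of the mean)."""
    avg = mean(read_rates.values())
    return sorted(node for node, rate in read_rates.items()
                  if avg > 0 and abs(rate - avg) / avg > tolerance)

# Hypothetical per-node Read_oper_localclient values (operations/sec):
node_rates = {"dn1": 120.0, "dn2": 115.0, "dn3": 310.0, "dn4": 118.0}
print(find_imbalanced(node_rates))  # dn3 is handling far more reads than the rest
```

The same comparison can be applied to the write-operation and block-rate measures to spot nodes that are either overloaded or unusually idle.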