eG Monitoring
 

Measures reported by CassRdRprTest

Over time, data in a replica can become inconsistent with other replicas due to the distributed nature of the Cassandra database. Node repair corrects these inconsistencies so that eventually all nodes hold the same, most up-to-date data. It is an important part of regular maintenance for every Cassandra cluster. Read repair, in turn, improves consistency in a Cassandra cluster with every read request.

In a read, the coordinator node sends a data request to one replica node and, for a consistency level (CL) greater than ONE, digest requests to other replicas. If all nodes return consistent data, the coordinator returns it to the client.

In read repair, Cassandra sends a digest request to each replica not directly involved in the read. Cassandra compares all replicas and writes the most recent version to any replica node that does not have it. If the query's consistency level is above ONE, Cassandra performs this process on all replica nodes in the foreground before the data is returned to the client. Read repair repairs any node queried by the read. This means that for a consistency level of ONE, no data is repaired because no comparison takes place. For QUORUM, only the nodes that the query touches are repaired, not all nodes.
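The comparison and repair step described above can be illustrated with a small simulation. This is a minimal sketch, not Cassandra's implementation: the `Cell` class, the `digest` helper, and the `read_with_repair` function are hypothetical names, a SHA-256 hash stands in for the digest response, and the newest write timestamp is assumed to decide which version wins.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Cell:
    value: str
    timestamp: int  # write timestamp; the highest timestamp is treated as most recent


def digest(cell: Cell) -> str:
    """Hash of value and timestamp, standing in for a replica's digest response."""
    return hashlib.sha256(f"{cell.value}:{cell.timestamp}".encode()).hexdigest()


def read_with_repair(replicas: dict[str, Cell], queried: list[str]) -> tuple[str, list[str]]:
    """Return the newest value among the queried replicas and the names that were repaired."""
    digests = {digest(replicas[name]) for name in queried}
    newest = max((replicas[name] for name in queried), key=lambda c: c.timestamp)
    repaired = []
    if len(digests) > 1:  # digest mismatch: at least one queried replica is stale
        for name in queried:
            if replicas[name].timestamp < newest.timestamp:
                # Write the most recent version to the stale replica.
                replicas[name] = Cell(newest.value, newest.timestamp)
                repaired.append(name)
    return newest.value, repaired


# Hypothetical three-replica row: n1 holds a newer write than n2 and n3.
replicas = {"n1": Cell("v2", 200), "n2": Cell("v1", 100), "n3": Cell("v1", 100)}
value, repaired = read_with_repair(replicas, ["n1", "n2"])  # a QUORUM-style read
print(value, repaired)
```

In this sketch only the replicas named in `queried` are compared and repaired, which mirrors the statement that a QUORUM read repairs only the nodes it touches. Passing a single replica models a consistency level of ONE, where no comparison takes place and nothing is repaired.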

There are three types of read requests that a coordinator can send to a replica:

  • A direct read request
  • A digest request
  • A background read repair request

In a direct read request, the coordinator node contacts one replica node. The coordinator then sends a digest request to a number of replicas determined by the consistency level specified by the client. The digest request checks the data in the replica node to make sure it is up to date. Next, the coordinator sends a digest request to all remaining replicas. If any replica node has out-of-date data, a background read repair request is sent. Read repair requests ensure that the requested row is made consistent on all replicas involved in a read query.
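The sequence of requests described above can be sketched as a simple plan. This is an illustrative simplification under stated assumptions: the function names `blocking_replicas` and `plan_read` are hypothetical, only the consistency levels ONE, QUORUM, and ALL are modeled, QUORUM is computed as more than half of the replicas, and replicas are taken in list order rather than chosen by the coordinator's own selection logic.

```python
def blocking_replicas(consistency_level: str, replication_factor: int) -> int:
    """Number of replicas that must respond before a result is returned to the client."""
    levels = {
        "ONE": 1,
        "QUORUM": replication_factor // 2 + 1,  # a majority of the replicas
        "ALL": replication_factor,
    }
    return levels[consistency_level]


def plan_read(replicas: list[str], consistency_level: str) -> dict[str, list[str]]:
    """Split the replicas into the three groups of requests described in the text."""
    required = blocking_replicas(consistency_level, len(replicas))
    return {
        "direct_read": replicas[:1],             # one replica returns the full data
        "digest": replicas[1:required],          # digests needed to satisfy the consistency level
        "remaining_digest": replicas[required:], # digests checked afterwards; a mismatch leads to a background read repair
    }


print(plan_read(["n1", "n2", "n3"], "QUORUM"))
```

With three replicas and QUORUM, one replica receives the direct read, one receives a digest request as part of the consistency level, and the last one is checked afterwards.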

In some environments, network issues or the failure of multiple nodes can occasionally prevent data from being replicated to all nodes. If a node becomes available again after a short hiatus, it may receive a sudden influx of read repair requests so that its outdated data can be replaced with up-to-date data. Such an influx is a cause for concern because it shows that the nodes are not being updated at regular intervals. It is therefore essential to monitor read repairs frequently. To identify such erratic behavior in read repair requests and in the background repairs performed on the nodes, administrators can use the CassRdRprTest test.

Using this test, administrators can determine the count of background read repairs coordinated by the node and the read repairs attempted by the node. In addition, this test reveals the full data digests coordinated by the node. By analyzing the count of read repairs at regular intervals, administrators can spot erratic patterns in the read repairs performed on the node and identify the exact cause of such erratic behavior in updating the node with the latest data.
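One way to analyze read repair counts collected at regular intervals is to look for sudden jumps relative to recent history. The following is a minimal sketch, not part of the test itself. It assumes the collected values are cumulative counters (if the values are already per-interval counts, the `deltas` step can be skipped), and the function names, the window size, the threshold, and the `min_delta` floor are all hypothetical choices for illustration.

```python
from statistics import mean, pstdev


def deltas(samples: list[int]) -> list[int]:
    """Convert cumulative counter readings into per-interval increases.

    A negative difference (for example, after a counter reset) is treated as zero.
    """
    return [max(later - earlier, 0) for earlier, later in zip(samples, samples[1:])]


def flag_spikes(samples: list[int], window: int = 5,
                threshold: float = 3.0, min_delta: int = 10) -> list[int]:
    """Return indices in the per-interval series that stand out from the preceding window.

    An interval is flagged when its increase exceeds mean + threshold * standard deviation
    of the previous `window` intervals and is at least `min_delta`.
    """
    rates = deltas(samples)
    spikes = []
    for i in range(window, len(rates)):
        history = rates[i - window:i]
        limit = mean(history) + threshold * pstdev(history)
        if rates[i] > limit and rates[i] >= min_delta:
            spikes.append(i)
    return spikes


# Hypothetical readings of a read repair counter taken at regular intervals.
readings = [0, 2, 4, 6, 8, 10, 12, 60, 62]
print(flag_spikes(readings))
```

In the sample readings, the counter grows by 2 per interval and then jumps by 48 in a single interval, which is the kind of sudden influx that would follow a node returning after an outage.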

Outputs of the test: One set of results for each Request type on the target Cassandra Database node being monitored.

Descriptor of the test: Request type

The measures made by this test are as follows:

Measurement: Background_repaired
Description: Indicates the number of background read repairs coordinated by the node.
Measurement Unit: Number
Interpretation: A high value for this measure indicates that the requested data is out of date and is being updated. This may indicate a delay in the update process or a sudden unavailability of the node, for example due to network issues.

Measurement: Read_repair_attempts
Description: Indicates the number of read repairs attempted by the node.
Measurement Unit: Number
Interpretation: A high value indicates that read repair is being carried out to update replica nodes that are out of date. This may also indicate a delay in the update process or a sudden unavailability of the node, for example due to network issues.

Measurement: Repaired_blocking
Description: Indicates the number of full data digests coordinated by the node.
Measurement Unit: Number
Interpretation: On any consistency level that involves more than one node (i.e., all except ANY and ONE), if the read digests do not match, read repair is done in a blocking fashion before results are returned. This means that if the repair does not complete in time, the read requests fail.