eG Monitoring
 

Measures reported by CassHintTest

Over time, data in a Cassandra replica can become inconsistent with other replicas due to the distributed nature of the database. Node repair corrects these inconsistencies so that eventually all nodes hold the same, most up-to-date data. It is an important part of regular maintenance for every Cassandra cluster.

Cassandra provides the following repair processes:

  • Hinted Handoff
  • Read Repair
  • Anti-Entropy Repair

Occasionally, a node may become unresponsive while data is being written. This unresponsiveness may be due to hardware problems, network issues, or overloaded nodes that experience long garbage collection (GC) pauses. If a node is unable to receive a particular write, the write's coordinator node preserves the data to be written as a set of hints. When the node comes back online, the coordinator effects repair by handing off the hints so that the node can catch up with the writes it missed. This type of repair process is termed Hinted Handoff. Hints are saved for an unresponsive node only for the period given by the max_hint_window_in_ms setting in cassandra.yaml. Once this window expires, the coordinator stops saving hints for that node.
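The hint lifecycle described above can be sketched as a small toy simulation. This is only an illustration of the idea, not Cassandra's implementation: the Replica and Coordinator classes, the MAX_HINT_WINDOW_MS constant, and its 3-hour value are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative hint window (3 hours); stands in for the cassandra.yaml setting.
MAX_HINT_WINDOW_MS = 3 * 60 * 60 * 1000


@dataclass
class Replica:
    """Hypothetical stand-in for a node that stores data."""
    name: str
    up: bool = True
    down_since_ms: Optional[int] = None
    data: dict = field(default_factory=dict)


@dataclass
class Coordinator:
    """Hypothetical coordinator that keeps hints per unresponsive replica."""
    hints: dict = field(default_factory=dict)  # replica name -> [(key, value)]

    def write(self, replica, key, value, now_ms):
        if replica.up:
            replica.data[key] = value
            return "written"
        # Replica is down: save a hint only while the hint window is open.
        if now_ms - replica.down_since_ms <= MAX_HINT_WINDOW_MS:
            self.hints.setdefault(replica.name, []).append((key, value))
            return "hinted"
        # Window expired: no hint is saved, so another repair process
        # has to bring this write to the replica later.
        return "dropped"

    def replay(self, replica):
        """Hand off the stored hints once the replica is back online."""
        pending = self.hints.pop(replica.name, [])
        for key, value in pending:
            replica.data[key] = value
        return len(pending)


coordinator = Coordinator()
node = Replica("node-b")
node.up, node.down_since_ms = False, 0          # node becomes unresponsive

first = coordinator.write(node, "k1", "v1", now_ms=1_000)
second = coordinator.write(node, "k2", "v2", now_ms=MAX_HINT_WINDOW_MS + 1)

node.up = True                                   # node comes back online
replayed = coordinator.replay(node)
```

In this sketch the first write is kept as a hint and replayed on recovery, while the second arrives after the window has expired and is never hinted, which is why the other repair processes are still needed.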

Hinted Handoff is an optional part of writes whose primary purpose is to provide extreme write availability when consistency is not required. Secondarily, Hinted Handoff can reduce the time required for a temporarily failed node to become consistent again with live ones. This is especially useful when a flaky network causes false-positive failures. If hinted handoff is not enabled, the node may hold outdated data for longer, so users may be served stale data and the user experience may suffer. It is therefore necessary to monitor the status of hinted handoff round the clock. The CassHintTest test helps administrators in this regard.

By closely monitoring the Cassandra Database node, this test helps administrators determine whether hinted handoff is enabled. In addition, this test reports the total number of hints that the node needs to be updated with and the number of hints that are active for replay. An abnormal increase in the count of hints may indicate potential database performance degradation.

Outputs of the test: One set of results for the target Cassandra Database node being monitored.

The measures made by this test are as follows:

Measurement: Is_hint_handoff_enabled
Description: Indicates whether or not hinted handoff is enabled.
Interpretation: The values that this measure can report and their corresponding numeric values are listed below:

  Numeric Value    Measure Value
  0                No
  1                Yes

  Note: By default, this measure reports the above-mentioned Measure Values to indicate whether or not hinted handoff is enabled. However, in the graph of this measure, these values are represented using their numeric equivalents only: 0 or 1.

Measurement: Hints_in_progress
Description: Indicates the number of hints that are currently active for replay.
Measurement Unit: Number
Interpretation: Ideally, the value of this measure should be 0.

Measurement: Total_hints
Description: Indicates the total number of hints.
Measurement Unit: Number
Interpretation: An abnormal increase in the value of this measure may indicate potential database performance degradation, for example a node that has been unresponsive to writes for an extended period.
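
As a rough illustration of how the three measures might be read together, the sketch below turns one node's readings into a list of warnings. The function name evaluate_hint_measures, the growth_factor threshold, and the idea of comparing the latest Total_hints sample with the average of earlier samples are all assumptions made for this example; the test itself does not prescribe how an "abnormal increase" is detected.

```python
from statistics import mean


def evaluate_hint_measures(enabled, hints_in_progress, total_hints_history,
                           growth_factor=3.0):
    """Return warnings for one node (hypothetical helper).

    enabled             -- Is_hint_handoff_enabled as 0 (No) or 1 (Yes)
    hints_in_progress   -- latest Hints_in_progress value
    total_hints_history -- Total_hints samples, oldest first, latest last
    growth_factor       -- assumed threshold for an "abnormal" increase
    """
    warnings = []

    if enabled == 0:
        warnings.append("hinted handoff is disabled")

    if hints_in_progress > 0:
        warnings.append(
            f"{hints_in_progress} hints currently active for replay (ideally 0)"
        )

    # Compare the latest sample with the average of the earlier ones.
    if len(total_hints_history) >= 2:
        *previous, current = total_hints_history
        baseline = max(mean(previous), 1)
        if current > baseline * growth_factor:
            warnings.append("abnormal increase in total hints")

    return warnings


healthy = evaluate_hint_measures(1, 0, [10, 12, 11, 11])
unhealthy = evaluate_hint_measures(0, 4, [10, 12, 11, 90])
```

Here healthy is an empty list, while unhealthy contains one warning for each of the three measures. A fixed multiple of the recent average is only one possible choice of threshold; a site-specific baseline may suit a given cluster better.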