eG Monitoring
 

Measures reported by CassKeySpaceTest

A keyspace in Cassandra is a namespace that defines how data is replicated across nodes. CQL stores data in tables, whose schema defines the layout of the data in the table, and those tables are grouped in keyspaces. On each node, table data is held in memory in memtables and on disk in SSTables. A keyspace defines a number of options that apply to all the tables it contains, the most prominent of which is the replication strategy used by the keyspace. It is generally encouraged to use one keyspace per application, and thus many clusters may define only one keyspace.

The keyspace is the top-level database object that controls the replication of the objects it contains in each datacenter in the cluster. Keyspaces contain tables, materialized views, and user-defined types, functions and aggregates.
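
To illustrate how a keyspace carries its replication settings for each datacenter, the sketch below builds a CQL CREATE KEYSPACE statement as a string. The helper name create_keyspace_cql, the keyspace name app_ks, and the datacenter names dc1 and dc2 are hypothetical, and the replication factors are illustrative only. The helper does not connect to a cluster and does not escape or validate identifiers.

```python
def create_keyspace_cql(name, replicas_per_dc):
    """Build a CREATE KEYSPACE statement that uses NetworkTopologyStrategy.

    replicas_per_dc maps a datacenter name to its replication factor.
    """
    if not replicas_per_dc:
        raise ValueError("at least one datacenter is required")
    # The replication map always names the strategy class first.
    parts = ["'class': 'NetworkTopologyStrategy'"]
    for datacenter, factor in replicas_per_dc.items():
        parts.append("'{}': {}".format(datacenter, int(factor)))
    return "CREATE KEYSPACE IF NOT EXISTS {} WITH replication = {{{}}};".format(
        name, ", ".join(parts)
    )


# Hypothetical keyspace and datacenter names, for illustration only.
statement = create_keyspace_cql("app_ks", {"dc1": 3, "dc2": 2})
print(statement)
```

Every table created in app_ks would then inherit this replication configuration, which is why the keyspace is the natural unit for this test to report on.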

In the read path, Cassandra merges data on disk (in SSTables) with data in RAM (in memtables). To avoid checking every SSTable data file for the partition being requested, Cassandra employs a data structure known as a bloom filter. Bloom filters are maintained per SSTable, i.e. each SSTable on disk gets a corresponding bloom filter in memory.

Bloom filters are a probabilistic data structure that allows Cassandra to determine one of two possible states:
  • The data definitely does not exist in the given file, or
  • The data probably exists in the given file.

While bloom filters cannot guarantee that the data exists in a given SSTable, they can be made more accurate by allowing them to consume more RAM. As accuracy improves (that is, as bloom_filter_fp_chance, the bloom filter false-positive chance, gets closer to 0), memory usage increases non-linearly. For example, a bloom filter with bloom_filter_fp_chance = 0.01 requires about three times as much memory as the same table with bloom_filter_fp_chance = 0.1.

If the number of bloom filter false positives increases rapidly, memory usage may be lower, but the disk overhead can increase manifold, because each false positive makes Cassandra read an SSTable that does not hold the requested data. Therefore, it is essential to contain bloom filter false positives before the disk is bombarded with requests. Similarly, the read and write requests to each keyspace should be monitored closely, so that administrators can confirm that the data is available in the keyspace and keep the disk overhead of the requests low. The CassKeySpaceTest test helps administrators monitor the keyspaces and contain bloom filter false positives.
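
The two possible answers of a bloom filter can be seen in a minimal sketch. This is not Cassandra's implementation: the class name BloomFilter, the default sizes, and the use of SHA-256 to derive bit positions are assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter: answers 'definitely absent' or 'probably present'."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch simple; real filters pack bits.
        self.bits = bytearray(num_bits)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256("{}:{}".format(seed, key).encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for position in self._positions(key):
            self.bits[position] = 1

    def might_contain(self, key):
        # False: the key was definitely never added.
        # True: the key was probably added (could be a false positive).
        return all(self.bits[position] for position in self._positions(key))


sstable_filter = BloomFilter()
for partition_key in ("user:1", "user:2", "user:3"):
    sstable_filter.add(partition_key)

print(sstable_filter.might_contain("user:2"))   # True: probably present
print(sstable_filter.might_contain("user:99"))  # usually False; True would be a false positive
```

A key that was added always yields True, so a False answer lets the read path skip the file entirely. Giving the filter more bits lowers the chance that an absent key maps only to bits that are already set, which is the RAM-versus-accuracy trade-off described above.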

This test auto-discovers the keyspaces on the target Cassandra Database node and, for each keyspace, reports the count of SSTables and memory tables (memtables) available. In addition, this test reveals the count of bloom filter false positives on each keyspace and the space used by the bloom filters. The test also provides insights into the read and write latency of each keyspace, so that administrators can identify the keyspace that is lagging in serving requests.

Outputs of the test: One set of results for the target Cassandra Database node being monitored.

The measures made by this test are as follows:

Measurement | Description | Measurement Unit | Interpretation
Blmfilter_false_postive | Indicates the number of bloom filter false positives in this keyspace. | Number | Typical values for bloom_filter_fp_chance are between 0.01 (1%) and 0.1 (10%) false-positive chance. On a false positive, Cassandra may scan an SSTable for a row, only to find that it does not exist on the disk. The parameter should be tuned by use case:
  • Users with more RAM and slower disks may benefit from setting the bloom_filter_fp_chance to a numerically lower number (such as 0.01) to avoid excess IO operations.
  • Users with less RAM, more dense nodes, or very fast disks may tolerate a higher bloom_filter_fp_chance in order to save RAM at the expense of excess IO operations.
  • In workloads that rarely read, or that only perform reads by scanning the entire data set (such as analytics workloads), setting the bloom_filter_fp_chance to a much higher number is acceptable.
Blmfilter_false_pst_rate | Indicates the bloom filter false positive ratio in this keyspace. | Percent | A low value is desired for this measure.
Blmfilter_space_used | Indicates the disk space used by the bloom filter in this keyspace. | MB | A high value indicates that the data is available in the keyspace.
Live_SStable_count | Indicates the number of SSTables that are currently live/active in this keyspace. | Number | Compare the value of this measure across the keyspaces to identify the keyspace that has too many live/active SSTables.
Live_dsk_spce_used | Indicates the disk space utilized by the SSTables that are live/active in this keyspace. | MB | A continuously increasing value of this measure indicates that the SSTables are up to date with the data.
Memtab_Col_count | Indicates the number of columns present in the memory table available in this keyspace. | Number |
Memtab_switch_count | Indicates the rate at which flushes caused the memory table available in this keyspace to be switched out. | Switches/second |
Memtable_live_data_size | Indicates the size of the data stored in the memory table available in this keyspace. | MB | A continuously increasing value of this measure indicates that the memory tables are not flushing data to the SSTables. Administrators should therefore check whether adequate space is allocated to the SSTables.
Memtable_off_heap_size | Indicates the off-heap memory size of the memory table available in this keyspace. | MB |
Memtable_on_heap_size | Indicates the on-heap memory size of the memory table available in this keyspace. | MB |
Recent_blmfltr_fals_pstv | Indicates the number of bloom filter false positives recently encountered in this keyspace. | Number |
Recnt_blmfltr_falspt_rat | Indicates the recent bloom filter false positive ratio in this keyspace. | Percent |
Avg_read_latency | Indicates the average time taken by this keyspace to respond to read requests. | Milliseconds/request | Compare the value of this measure across the keyspaces to determine the keyspace that is taking too long to respond to read requests.
Read_lat_99thpct | Indicates the average 99th percentile time taken by this keyspace to respond to read requests. | Milliseconds |
Avg_write_latency | Indicates the average time taken by this keyspace to write the data for the requests. | Milliseconds/request | Compare the value of this measure across the keyspaces to determine the keyspace that is taking too long to write the data for the requests received.
Write_lat_99thpct | Indicates the average 99th percentile time taken by this keyspace to respond to each write request. | Milliseconds |
Avg_range_latency | Indicates the average time taken by this keyspace to respond to range requests. | Milliseconds/request |
Range_lat_99thpct | Indicates the average 99th percentile time taken by this keyspace to respond to range requests. | Milliseconds |
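
The interpretations above rely on a false positive ratio and on comparing latencies across keyspaces. The sketch below shows one way such values might be derived. The function names, the keyspace names, and the sample numbers are hypothetical. The ratio is assumed here to be false positives divided by all positive bloom filter answers; this document does not state the formula the test uses, and it may differ.

```python
def false_positive_percent(false_positives, true_positives):
    """Share of positive bloom filter answers that were wrong, in percent.

    Assumed definition: false / (false + true). Returns 0.0 when the
    filter has produced no positive answers yet.
    """
    total = false_positives + true_positives
    if total == 0:
        return 0.0
    return 100.0 * false_positives / total


def slowest_keyspace(latency_ms_by_keyspace):
    """Return the keyspace name with the highest latency value."""
    return max(latency_ms_by_keyspace, key=latency_ms_by_keyspace.get)


# Made-up sample values, for illustration only.
print(false_positive_percent(false_positives=5, true_positives=95))
read_latency_ms = {"ks_orders": 1.2, "ks_audit": 4.8, "ks_users": 0.9}
print(slowest_keyspace(read_latency_ms))
```

With these sample values, ks_audit would be the keyspace to investigate first for slow reads.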