eG Monitoring
 

Measures reported by AzrRedisCacheTest

The Azure Redis Cache is a high-performance caching service that provides an in-memory data store for faster retrieval of data. It is based on the open-source Redis cache. By reducing the need to perform slow I/O operations, it ensures low latency and high throughput. It also provides high availability, scalability, and security.

When a user uses an application, the application first tries to read data from the cache. If the requested data is not available in the cache, the application fetches it from the actual data source and then stores it in the cache for subsequent requests. The next request to the application is then served from the cache, without a trip to the actual data source. This improves application performance, because the data lives in memory. It also increases the availability of the application if the database becomes unavailable.
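The read-through flow described above can be sketched in a few lines of Python, with plain dictionaries standing in for Azure Cache for Redis and for the backing data source (all names and sample data here are illustrative, not part of the test):

```python
# Cache-aside read pattern: consult the cache first, fall back to the
# data source on a miss, and populate the cache for later requests.

cache = {}                      # stand-in for the Redis cache
database = {"user:1": "Alice"}  # stand-in for the actual data source

def read(key):
    """Return the value for key, consulting the cache first."""
    if key in cache:            # cache hit: no trip to the data source
        return cache[key]
    value = database.get(key)   # cache miss: fall back to the source
    if value is not None:
        cache[key] = value      # store for subsequent requests
    return value

first = read("user:1")   # miss: fetched from the database, then cached
second = read("user:1")  # hit: served straight from the cache
```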

If the cache does not serve its purpose - i.e., if the cache repeatedly fails to service application requests or is slow at it - the performance of the application will significantly deteriorate. To ensure the high availability and optimal usage of the cache, administrators need to track requests to and responses from the cache, promptly detect cache misses and latencies, diagnose the reasons for them, and eliminate those reasons quickly. This is where the AzrRedisCacheTest helps!

For each Redis cache that is created for a target Azure subscription, this test tracks the status of that cache, and alerts administrators if any cache is in an Error/Unknown state. The test also monitors the requests to and responses from every cache. In the process, the test measures the time taken by each cache to process application requests. If any cache takes an unusually long time to service requests, the test notifies administrators of the same. Similarly, if any cache fails to service requests consistently, the same is brought to the attention of the administrator. Additionally, the test also reveals the probable reasons for these anomalies - is it because the cache is running out of memory? is it because the cache is blocking/rejecting connections to it? is it owing to heavy load on the cache? or is it because of a failed failover attempt? The test also reveals if the cache is sized with adequate bandwidth and CPU for its operations. Alerts are sent out if the cache is running low on such critical resources.

Outputs of the test: One set of results for every Azure Cache for Redis that is configured in each resource group of the target Azure subscription

The measures made by this test are as follows:

Measurement Description Measurement Unit Interpretation
Redis_Status Indicates the current status of this cache.   The values reported by this measure and its numeric equivalents are mentioned in the table below:

Measure Value Numeric Value
Succeeded 1
Updating 2
Error 3
Unknown 0


Note:

By default, this measure reports the Measure Values listed in the table above to indicate the current provisioning status. In the graph of this measure however, the same is represented using the numeric equivalents only.

Use the detailed diagnosis to know the host name, port, and version of the cache.
Latency Indicates the time taken by this cache to respond to requests. Seconds A high value is indicative of a slow cache. This can adversely impact application performance.
Cmd_prcsd_per_sec Indicates the rate at which commands were processed by this cache. Commands/second A consistent rise in the value of this measure is a sign of good health. On the other hand, a steady drop in the value of this measure hints at processing bottlenecks. In such a situation, look up the Memory used and CPU usage measures, and cache reads and writes to figure out if there are any resource contentions. A poorly sized cache can often be sluggish when responding to requests. You may want to consider increasing the resource allocations/limits for the cache, so that the cache has more processing power at its disposal.
Hit_rate Indicates the ratio of successful cache lookups to the total number of requests received by this cache. Percent A value less than 80% is a cause for concern, as it implies that the cache has failed to service a significant proportion of the requests to it. In such a situation, check the value of the Server load measure to see if there is any abnormal increase in load, causing the cache server to time out without completing requests. This can cause cache misses. You can also check the Memory used and Memory fragmentation ratio measures to see if the cache has sufficient memory for storing data. Memory contention on the cache is one of the common causes for poor cache performance.
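For illustration, the hit rate can be computed from the keyspace hit and miss counters that Redis exposes in its INFO output; the sample figures below are assumptions, not values from a real cache:

```python
# Hit rate as a percentage, from keyspace_hits / keyspace_misses style
# counters. 750 hits and 250 misses are made-up illustrative numbers.

def hit_rate(hits, misses):
    total = hits + misses
    if total == 0:
        return 100.0          # no lookups yet, so nothing has missed
    return 100.0 * hits / total

rate = hit_rate(hits=750, misses=250)  # 75%: below the 80% threshold
needs_attention = rate < 80.0          # would warrant investigation
```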
Mem_used Indicates the amount of memory used by this cache. MB If the value of this measure is close to that of the Maximum memory measure, it means that the cache is about to exhaust its memory allocation. Without enough memory, the cache will not be able to store data. This can result in cache misses, which in turn will affect application performance. To avoid a memory contention therefore, consider the following:

  • You can implement a cluster setup with multiple Redis nodes to enhance the memory capacity of the cache.

  • If the cache cluster is already in place, then you may want to reduce the amount of memory that is available for non-cache operations, so that more memory is available for caching. For that, reduce the maxmemory-reserved setting of the cluster.

  • The maxfragmentationmemory-reserved setting configures the amount of memory, in MB per instance in a cluster, that is reserved to accommodate memory fragmentation. When this value is set, the Redis server experience is more consistent when the cache is full or close to full and the fragmentation ratio is high. Memory reserved for such operations is unavailable for storing cached data. By reducing the value of this setting, you can make more memory available for caching.

  • Change the eviction policy



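A rough sketch of how the reserved-memory settings above eat into the memory left for cached data; the sizes are illustrative assumptions, and real values depend on the cache's pricing tier and configuration:

```python
# Back-of-the-envelope view of the memory actually available for
# cached data once the two reserved-memory settings are taken out.
# All figures below are illustrative, not defaults.

maxmemory_mb = 1024                       # total memory of the instance
maxmemory_reserved_mb = 100               # reserved for non-cache operations
maxfragmentationmemory_reserved_mb = 100  # reserved for fragmentation

available_for_data_mb = (maxmemory_mb
                         - maxmemory_reserved_mb
                         - maxfragmentationmemory_reserved_mb)
# Reducing either reserved setting frees memory for caching.
```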
Mem_frag_ratio Indicates the percentage of memory in this cache that is fragmented. Percent Fragmentation is likely to be caused when a load pattern stores data with high variation in size. For example, fragmentation might happen when the stored data varies between 1 KB and 1 MB in size. When a 1-KB key is deleted from existing memory, a 1-MB key cannot fit into the freed space, causing fragmentation. Similarly, if a 1-MB key is deleted and a 1.5-MB key is added, the new key cannot fit into the existing reclaimed memory. This causes unused free memory and results in more fragmentation.

The fragmentation can cause issues when:

  • Memory usage is close to the max memory limit for the cache, or

  • Used memory is higher than the Max Memory limit, potentially resulting in page faulting in memory
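A common way to gauge fragmentation is the ratio of the process's resident set size to the memory Redis believes it is using, as reported in Redis INFO statistics; the byte counts and the 1.5 alert threshold below are illustrative assumptions:

```python
# Fragmentation ratio: RSS (memory the OS has actually given the
# process) divided by Redis's logical memory use. Sample byte counts
# are made up for illustration.

def fragmentation_ratio(used_memory_rss, used_memory):
    return used_memory_rss / used_memory

ratio = fragmentation_ratio(used_memory_rss=1_600_000,
                            used_memory=1_000_000)
fragmented = ratio > 1.5   # noticeably more RSS than logical usage
swapping = ratio < 1.0     # RSS below logical usage hints at paging
```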

Evicted_keys Indicates the number of keys that have been evicted from this cache. Number If the cache is under severe memory pressure, you will want this measure to report a high value, as evictions free up memory for new data. To increase the number of keys evicted, you may want to change the eviction policy. The default policy for Azure Cache for Redis is volatile-lru, which means that only keys that have a TTL value set are eligible for eviction. If no keys have a TTL value, then the system won't evict any keys. If you want the system to allow any key to be evicted when under memory pressure, then you may want to consider the allkeys-lru policy.
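The difference between the two eviction policies can be sketched as follows; the key names and TTLs are made-up examples:

```python
# Which keys each policy may evict under memory pressure: volatile-lru
# considers only keys carrying a TTL, allkeys-lru considers every key.

keys = {"session:1": 300, "session:2": 60, "config": None}  # key -> TTL (None = no TTL)

def evictable(keys, policy):
    if policy == "volatile-lru":
        return {k for k, ttl in keys.items() if ttl is not None}
    if policy == "allkeys-lru":
        return set(keys)
    raise ValueError(f"unhandled policy: {policy}")

volatile = evictable(keys, "volatile-lru")  # excludes "config"
allkeys = evictable(keys, "allkeys-lru")    # every key is fair game
```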
Blocked_clients Indicates the number of client connections to this cache that were blocked. Number Ideally, the value of this measure should be 0.
Conctd_clients Indicates the number of clients currently connected to this cache. Number The maxclients setting governs the maximum number of connected clients allowed at the same time. If the value of this measure is equal to the maxclients setting, Redis closes all the new connections, returning a ‘max number of clients reached’ error.
Conctd_slaves Indicates the number of slaves connected to this cache currently. Number  
Last_intrctn_tym Indicates the time since the master and slave last interacted. Seconds A high value for this measure could be a sign that there are issues in master-slave communication. If these issues persist, failover attempts may fail, and the whole purpose of an HA configuration for the cache will be defeated. Moreover, if slaves do not communicate with the master, delays in data replication become inevitable. Timely replication of data between the master and slaves is key to ensuring that the data replicas are always in sync with the master. If they are not, the slaves may not be able to service cache requests effectively when the master is down.
Total_keys Indicates the number of keys in this cache's database. Number  
Rdb_last_save_time Indicates when data in this cache was written to disk last. Seconds  
Rdb_chngs_snc_lst_sve Indicates the number of changes that have been written to this cache's database since the last database update happened. Number  
Rejectd_conctn Indicates the number of connections to this cache that were rejected. Number Ideally, the value of this measure should be 0. However, if this measure reports a non-zero value consistently, it could be because the maxclients setting is not set commensurate to the connection load on the cache.

The maxclients setting governs the maximum number of connected clients allowed at the same time. If the value of the Connected clients measure is equal to the maxclients setting, then new connections are closed/rejected.
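The maxclients behaviour described above amounts to a simple gate on incoming connections; this sketch uses an illustrative limit of 3, and the error text mirrors the 'max number of clients reached' message mentioned above:

```python
# Once the connected-client count reaches maxclients, further
# connection attempts are rejected. The limit of 3 is illustrative.

maxclients = 3

def try_connect(connected_clients):
    """Return (accepted, error) for an incoming connection attempt."""
    if connected_clients >= maxclients:
        return False, "ERR max number of clients reached"
    return True, None

accepted, _ = try_connect(connected_clients=2)      # below the limit
rejected, err = try_connect(connected_clients=3)    # at the limit
```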
Key_misses Indicates the number of failed key lookups in this cache. Number A healthy, optimal cache is one that is capable of servicing all requests to it. Ideally therefore, the value of this measure should be 0. A high value for this measure indicates that the cache has failed to service requests to it. In such a situation, check the value of the Server load measure to see if there is any abnormal increase in load, causing the cache server to time out without completing requests. This can cause cache misses. You can also check the Memory used and Memory fragmentation ratio measures to see if the cache has sufficient memory for storing data. Memory contention on the cache is one of the common causes for poor cache performance.
master_link_down Indicates the duration for which the link between the master and slave was down. Seconds A high value is a cause for concern. The longer the link is down, the longer replication will be delayed. Failover attempts will also fail during this period, rendering the cache unavailable for servicing requests.
Mem_used_rss Indicates the amount of memory used in resident set size. MB  
Key_hits Indicates the number of successful key lookups in this cache. Number A high value is desired for this measure.
Miss_rate Indicates the percentage of key lookups in this cache that failed. Percent A value close to 100% is a cause for concern, as it implies that the cache has failed to service almost all of the requests to it. In such a situation, check the value of the Server load measure to see if there is any abnormal increase in load, causing the cache server to time out without completing requests. This can cause cache misses. You can also check the Memory used and Memory fragmentation ratio measures to see if the cache has sufficient memory for storing data. Memory contention on the cache is one of the common causes for poor cache performance.
Max_conn Indicates the maximum number of simultaneous connections that this cache is allowed to entertain. Number If the value of the Connected clients measure is equal to that of this measure, Redis closes all the new connections, returning a ‘max number of clients reached’ error.
Total_cmd Indicates the total number of commands processed by this cache. Number  
Server_load Indicates the current load on this cache server. Percent High server load means the Redis server is busy and unable to keep up with requests, leading to timeouts.

Following are some options to consider for high server load.

  • Scale out to add more shards, so that load is distributed across multiple Redis processes. Also, consider scaling up to a larger cache size with more CPU cores.

  • Avoid client connection spikes

  • Identify and eliminate long-running commands.

  • If your Azure Cache for Redis underwent a failover, all client connections from the node that went down are transferred to the node that is still running. The server load could spike because of the increased connections. You can try rebooting your client applications so that all the client connections get recreated and redistributed among the two nodes.

Gets Indicates the number of get operations from this cache. Number  
Sets Indicates the number of set operations from this cache. Number  
Cache_reads Indicates the rate at which data was read from this cache. KB/Second These measures are good indicators of the bandwidth used by the cache. Whenever there is a bandwidth contention, you can compare the value of these measures to know where the maximum bandwidth is spent - when reading from the cache? or when writing to it?
Cache_Writes Indicates the rate at which data was written to this cache. KB/Second
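A quick sketch of the read-versus-write comparison suggested above; the KB/second figures are illustrative samples, not real measurements:

```python
# Compare the two rates to see where cache bandwidth is spent.

cache_reads_kbps = 900.0   # rate of data read from the cache
cache_writes_kbps = 300.0  # rate of data written to the cache

total_kbps = cache_reads_kbps + cache_writes_kbps
read_share = 100.0 * cache_reads_kbps / total_kbps   # reads' share of bandwidth
dominant = "reads" if cache_reads_kbps > cache_writes_kbps else "writes"
```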
Cpu_usage Indicates the percentage of CPU resources utilized by this cache. Percent A value close to 100% indicates excessive CPU usage. This can adversely impact cache performance. You may want to determine the root cause of this excess, so that it can be removed and normalcy restored to the cache.