eG Monitoring
 

Measures reported by AWSRegRShiftTest

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. The first step to create such a data warehouse is to launch an Amazon Redshift cluster. An Amazon Redshift cluster is a collection of computing resources called nodes. Each cluster runs an Amazon Redshift engine and contains one or more databases. Each cluster has a leader node and one or more compute nodes. The leader node receives queries from client applications, parses the queries, and develops query execution plans. The leader node then coordinates the parallel execution of these plans with the compute nodes, aggregates the intermediate results from these nodes, and finally returns the results back to the client applications. Compute nodes execute the query execution plans and transmit data among themselves to serve these queries. The intermediate results are sent back to the leader node for aggregation before being sent back to the client applications.
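The division of labour between the leader node and the compute nodes described above follows a scatter-gather pattern. The following is a minimal sketch of that pattern only, not of Redshift's actual engine: the function names `leader_node` and `compute_node`, the sample data slices, and the use of threads are all assumptions made for illustration, and the data exchange between compute nodes is left out.

```python
from concurrent.futures import ThreadPoolExecutor


def compute_node(rows, predicate):
    """Hypothetical compute node: run the plan on this node's slice of data
    and return an intermediate result (partial sum, partial row count)."""
    matching = [r for r in rows if predicate(r)]
    return sum(matching), len(matching)


def leader_node(slices, predicate):
    """Hypothetical leader node: hand the same plan to every compute node,
    then aggregate the intermediate results into the final answer (an average)."""
    if not slices:
        return None
    with ThreadPoolExecutor(max_workers=len(slices)) as pool:
        partials = list(pool.map(lambda s: compute_node(s, predicate), slices))
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return total / count if count else None


# Illustrative data only: three compute nodes, each holding one slice.
slices = [[10, 20, 30], [40, 50], [60]]
result = leader_node(slices, lambda value: value >= 20)
print(result)
```

Each compute node returns only a partial sum and count, so the leader can combine the partial results without seeing the underlying rows.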

Where Redshift is in use, query performance, and consequently the performance of dependent client applications, depends on the following factors:

  • Cluster availability

  • How the cluster and its nodes use the CPU, network, and storage resources of the cluster

  • Responsiveness of the nodes in the cluster to I/O requests from client applications

To accurately assess whether cluster performance is at the desired level, an administrator requires real-time insight into each of the factors listed above. The AWSRegRShiftTest test provides these insights. By reporting the current health status of each cluster managed by Redshift, this test brings unavailable clusters to light. The resource usage of each cluster is also reported, so that potential resource contention can be proactively isolated. Optionally, you can configure this test to report metrics for individual nodes in the cluster as well. If this is done, administrators can instantly drill down from a resource-hungry cluster to the exact node in the cluster that could be hogging the resources. At the node level, the latency and throughput of each node are also revealed. This way, when users complain of degradation in the performance of client applications, you can quickly identify the cluster, and the precise node in that cluster, that is slowing down I/O processing and consequently impacting application performance.
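The drill-down from a resource-hungry cluster to its busiest node can be sketched as follows. This is only an illustration under stated assumptions: the `readings` dictionary, the `drill_down` function, and the use of a simple average as the cluster-level figure are hypothetical and are not part of the test itself. The 50% threshold is the figure quoted in the interpretation of the cpu_util measure.

```python
from statistics import mean

# Hypothetical sample readings: cluster -> node -> cpu_util (percent).
readings = {
    "cluster-a": {"node-0": 35.0, "node-1": 42.0, "node-2": 38.0},
    "cluster-b": {"node-0": 55.0, "node-1": 96.0, "node-2": 59.0},
}


def drill_down(readings, cluster_threshold=50.0):
    """Return {cluster: (cluster_cpu, busiest_node, node_cpu)} for every
    cluster whose CPU usage exceeds the threshold.

    Assumption: the cluster-level value is taken as the mean of its nodes.
    """
    findings = {}
    for cluster, nodes in readings.items():
        if not nodes:
            continue  # nothing reported for this cluster
        cluster_cpu = mean(nodes.values())
        if cluster_cpu > cluster_threshold:
            busiest = max(nodes, key=nodes.get)
            findings[cluster] = (cluster_cpu, busiest, nodes[busiest])
    return findings


findings = drill_down(readings)
for cluster, (cpu, node, node_cpu) in findings.items():
    print(f"{cluster}: {cpu:.1f}% overall, busiest node {node} at {node_cpu:.1f}%")
```

The same comparison applies to any measure that is reported at both the cluster and node levels, such as disk space usage or network throughput.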

Output of the test : One set of results for each cluster and/or node in the monitored AWS region

First-level descriptor : Cluster

Second-level descriptor : Node

The measures made by this test are as follows:

Measurement Description Measurement Unit Interpretation
cpu_util Indicates the percentage of CPU utilized by this cluster/node. Percent For a cluster, this measure reports the aggregate CPU usage of all nodes in the cluster. If the value of this measure is consistently above 50% for a cluster, serious resource contention may occur on that cluster unless additional processing power is provided to it. In such a case, you may want to consider adding more nodes to the cluster, or adding more CPUs to the existing nodes.

You can also compare the CPU usage of the nodes in the resource-hungry cluster to determine whether one or more nodes are hogging the CPU. If so, you may want to tweak the load-balancing algorithm of your cluster to ensure uniform load distribution.
db_connection Indicates the number of connections to the databases in this cluster. Number This measure is only reported at the cluster-level and not the node-level.
health_status Indicates the current health status of this cluster.   This measure is only reported at the cluster-level and not the node-level.

Every minute the cluster connects to its database and performs a simple query. If it is able to perform this operation successfully, then the value of this measure will be Healthy. Otherwise, the value of this measure will be Unhealthy. An Unhealthy status can occur when the cluster database is under extremely heavy load or if there is a configuration problem with a database on the cluster.

The numeric values that correspond to the measure values mentioned above are as follows:

Measure Value Numeric Value
Healthy 1
Unhealthy 0

Note:

This measure will report one of the Measure Values listed above to indicate the current state of a cluster. In the graph of this measure, however, cluster status is indicated using only the numeric equivalents.

maintance_mode Indicates whether or not this cluster is currently in maintenance mode.   The values that this measure can report and their corresponding numeric values are listed in the table below:

Measure Value Numeric Value
Yes 1
No 0

Note:

  • This measure will report one of the Measure Values listed above to indicate whether or not a cluster is in maintenance mode. In the graph of this measure, however, the same is indicated using only the numeric equivalents.

  • This measure is reported only at the cluster-level and not the node-level.

  • Even though your cluster might be unavailable due to maintenance tasks, the health_status measure of the test will report the value Healthy for that cluster.

nw_recve_through Indicates the rate at which this cluster or node receives data. KB/Sec For a cluster, a consistent increase in the value of the nw_recve_through and nw_trans_through measures indicates excessive usage of network resources by the cluster.

In such a case, compare the values of these measures across the nodes of the cluster to identify the nodes that are over-utilizing network bandwidth.
nw_trans_through Indicates the rate at which this cluster or node sends data. KB/Sec
percnt_diskspcused Indicates the percentage of disk space used by this cluster/node. Percent If the value of this measure is close to 100% for a cluster, it indicates that the cluster is rapidly running out of storage resources. You may want to consider adding more nodes to the cluster to increase the storage space available. Alternatively, you can add fewer nodes and still significantly increase the cluster's resources by opting for larger node types that come with considerably more storage space.

When a cluster's storage resources are rapidly depleting, you may want to compare the space usage of the nodes in the cluster, so that you can quickly isolate the node that is consuming the most space. Tweaking your load-balancing algorithm could go a long way in eliminating such node overloads.
read_iops Indicates the average number of disk read operations performed by this node per second. Reads/Sec A high value is desired for this measure, as that's the trait of a healthy node. You can compare the value of this measure across nodes to identify the node that is slowest in processing read requests.
read_laten Indicates the average amount of time taken by this node for disk read I/O operations. Secs Ideally, the value of this measure should be very low. It's good practice to compare the value of this measure across the nodes of a cluster and isolate those nodes where the value is abnormally high. Such nodes slow down I/O processing and adversely affect application performance.
read_through Indicates the average amount of data read from disk by this node per second. KB/Sec A high throughput signifies faster processing of read I/O requests. A low throughput is indicative of slow read request processing. Compare the value of this measure across the nodes of a cluster to isolate those nodes that have registered an abnormally low value for this measure. Such nodes affect not only cluster performance, but also the performance of dependent client applications.
write_iops Indicates the average number of disk write operations performed by this node per second. Writes/Sec A high value is desired for this measure, as that's the trait of a healthy node. You can compare the value of this measure across nodes to identify the node that is slowest in processing write requests.
write_laten Indicates the average amount of time taken by this node for disk write I/O operations. Secs Ideally, the value of this measure should be very low. It's good practice to compare the value of this measure across the nodes of a cluster and isolate those nodes where the value is abnormally high. Such nodes slow down I/O processing and adversely affect application performance.
write_through Indicates the average amount of data written to disk by this node per second. KB/Sec A high throughput signifies faster processing of write I/O requests. A low throughput is indicative of slow write request processing. Compare the value of this measure across the nodes of a cluster to isolate those nodes that have registered an abnormally low value for this measure. Such nodes affect not only cluster performance, but also the performance of dependent client applications.
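The numeric equivalents of the health_status and maintance_mode measures listed in the table above can be decoded together, which also makes the maintenance caveat explicit. The following is a minimal sketch: the numeric mappings come from the table, while the function name `describe_cluster` and the wording of the summaries are hypothetical and are not produced by the test.

```python
# Numeric equivalents as listed for the two measures.
HEALTH_STATUS = {1: "Healthy", 0: "Unhealthy"}
MAINTENANCE_MODE = {1: "Yes", 0: "No"}


def describe_cluster(health_value, maintenance_value):
    """Translate the graphed numeric values back into measure values and
    add a one-line summary (hypothetical wording)."""
    health = HEALTH_STATUS[health_value]
    maintenance = MAINTENANCE_MODE[maintenance_value]
    if maintenance == "Yes":
        # health_status reports Healthy during maintenance even though the
        # cluster might be unavailable, so maintenance is checked first.
        summary = "in maintenance; may be unavailable regardless of health_status"
    elif health == "Healthy":
        summary = "healthy"
    else:
        summary = "unhealthy; check database load and configuration"
    return health, maintenance, summary


print(describe_cluster(1, 1))
print(describe_cluster(0, 0))
```

Checking maintance_mode before health_status avoids reading a Healthy value as proof of availability while maintenance tasks are running.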