Measures reported by AWSRedShiftTest
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. The first step in creating such a data warehouse is to launch an Amazon Redshift cluster, which is a collection of computing resources called nodes. Each cluster runs an Amazon Redshift engine and contains one or more databases. Each cluster has a leader node and one or more compute nodes. The leader node receives queries from client applications, parses the queries, and develops query execution plans. It then coordinates the parallel execution of these plans with the compute nodes, aggregates the intermediate results from these nodes, and returns the final results to the client applications. Compute nodes execute the query execution plans and transmit data among themselves to serve these queries.
Where Redshift is in use, query performance, and consequently the performance of the dependent client applications, depends upon the following factors:
Cluster availability
How the cluster and its nodes use the CPU, network, and storage resources of the cluster
Responsiveness of the nodes in the cluster to I/O requests from client applications
To accurately assess whether cluster performance is at the desired level, an administrator requires real-time insights into each of the factors listed above. The AWSRedShiftTest test provides administrators with these insights. By reporting the current health status of each cluster managed by Redshift, this test brings unavailable clusters to light. The resource usage of the cluster is also reported, so that potential resource contentions can be proactively isolated. Optionally, you can configure this test to report metrics for individual nodes in the cluster as well. If this is done, administrators can instantly drill down from a resource-hungry cluster to the exact node in the cluster that could be hogging the resources. At the node level, the latency and throughput of each node are also revealed. This way, when users complain of degradation in the performance of client applications, you can quickly identify the cluster, and the precise node in that cluster, that is slowing down I/O processing and consequently impacting application performance.
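The drill-down described above, from a resource-hungry cluster to the node responsible, can be sketched in a few lines of Python. This is only an illustration and not how the test itself works: the function name `find_hot_nodes`, the median-based comparison, the `factor` threshold, and the node names and readings are all assumptions made for the example.

```python
from statistics import median


def find_hot_nodes(node_cpu, factor=1.5):
    """Return the nodes whose CPU usage exceeds `factor` times the cluster median.

    `node_cpu` maps a node name to its cpu_util reading (percent).
    The median-based rule is an assumption chosen for this sketch.
    """
    if not node_cpu:
        return []
    baseline = median(node_cpu.values())
    hot = [name for name, cpu in node_cpu.items() if cpu > factor * baseline]
    # List the heaviest consumer first.
    return sorted(hot, key=lambda name: node_cpu[name], reverse=True)


# Hypothetical per-node cpu_util readings for a single cluster.
readings = {
    "Leader": 22.0,
    "Compute-0": 35.0,
    "Compute-1": 91.0,
    "Compute-2": 38.0,
}

print(find_hot_nodes(readings))
```

The same comparison could be applied to any node-level measure, for example read_laten or write_laten, to single out the node that deviates most from its peers.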
Output of the test : One set of results for each cluster and/or node in every AWS region of the monitored cloud
First-level descriptor : AWS Region
Second-level descriptor : Cluster
Third-level descriptor : Node
The measures made by this test are as follows:
| Measurement |
Description |
Measurement Unit |
Interpretation |
| cpu_util |
Indicates the percentage of CPU utilized by this cluster/node. |
Percent |
For a cluster, this measure will report the aggregate CPU usage of all nodes in the cluster. If the value of this measure is consistently above 50% for a cluster, it indicates that serious resource contention may occur on that cluster unless additional processing power is provided to it. In such a case, you may want to consider adding more nodes to the cluster, or adding more CPUs to the existing nodes.
You can also compare the CPU usage of nodes in the resource-hungry cluster to determine whether one or more nodes are hogging the CPU. If so, you may want to tweak the load-balancing algorithm of your cluster to ensure uniform load distribution. |
| db_connection |
Indicates the number of connections to the databases in this cluster. |
Number |
This measure is only reported at the cluster-level and not the node-level. |
| health_status |
Indicates the current health status of this cluster. |
|
This measure is only reported at the cluster-level and not the node-level.
Every minute the cluster connects to its database and performs a simple query. If it is able to perform this operation successfully, then the value of this measure will be Healthy. Otherwise, the value of this measure will be Unhealthy. An Unhealthy status can occur when the cluster database is under extremely heavy load or if there is a configuration problem with a database on the cluster.
The numeric values that correspond to the measure values mentioned above are as follows:
| Measure Value |
Numeric Value |
| Healthy |
1 |
| Unhealthy |
0 |
Note:
This measure will report one of the Measure Values listed above to indicate the current state of a cluster. In the graph of this measure however, cluster status will be indicated using the numeric equivalents only.
|
| maintance_mode |
Indicates whether or not this cluster is currently in maintenance mode. |
|
The values that this measure can report and their corresponding numeric values are listed in the table below:
| Measure Value |
Numeric Value |
| Yes |
1 |
| No |
0 |
Note:
This measure will report one of the Measure Values listed above to indicate whether or not a cluster is in maintenance mode. In the graph of this measure, however, the same will be indicated using the numeric equivalents only.
This measure is reported only at the cluster-level and not the node-level.
Even though your cluster might be unavailable due to maintenance tasks, the health_status measure of the test will report the value Healthy for that cluster.
|
| nw_recve_through |
Indicates the rate at which this cluster or node receives data. |
KB/Sec |
For a cluster, a consistent increase in the value of these measures is indicative of excessive usage of network resources by the cluster.
In such a case, compare the value of these measures across the nodes of a cluster to identify the nodes that are over-utilizing network bandwidth. |
| nw_trans_through |
Indicates the rate at which this cluster or node sends data. |
KB/Sec |
| percnt_diskspcused |
Indicates the percentage of disk space used by this cluster/node. |
Percent |
If the value of this measure is close to 100% for a cluster, it indicates that the cluster is rapidly running out of storage resources. You may want to consider adding more nodes to the cluster to increase the storage space available. Alternatively, you can add fewer nodes and yet significantly increase the cluster resources by opting for node types that are by default large-sized and hence come bundled with considerable storage space.
When a cluster's storage resources are rapidly depleting, you may want to compare the space usage of the nodes in the cluster, so that you can quickly isolate the node that is eroding the space. Tweaking your load-balancing algorithm could go a long way in eliminating such node overloads. |
| read_iops |
Indicates the average number of disk read operations performed by this node per second. |
Reads/Sec |
A high value is desired for this measure, as that's the trait of a healthy node. You can compare the value of this measure across nodes to identify the node that is slowest in processing read requests. |
| read_laten |
Indicates the average amount of time taken by this node for disk read I/O operations. |
Secs |
Ideally, the value of this measure should be very low. It's good practice to compare the value of this measure across nodes of a cluster and isolate those nodes in the cluster where the value of this measure is abnormally high. Such nodes slow down I/O processing and adversely affect application performance. |
| read_through |
Indicates the average rate at which this node reads data from disk. |
KB/Sec |
A high throughput signifies faster processing of read I/O requests. A low throughput is indicative of slow read request processing. Compare the value of this measure across nodes of a cluster to isolate those nodes that have registered an abnormally low value for this measure. Such nodes not only affect cluster performance, but also the performance of dependent client applications. |
| write_iops |
Indicates the average number of disk write operations performed by this node per second. |
Writes/Sec |
A high value is desired for this measure, as that's the trait of a healthy node. You can compare the value of this measure across nodes to identify the node that is slowest in processing write requests. |
| write_laten |
Indicates the average amount of time taken by this node for disk write I/O operations. |
Secs |
Ideally, the value of this measure should be very low. It's good practice to compare the value of this measure across nodes of a cluster and isolate those nodes in the cluster where the value of this measure is abnormally high. Such nodes slow down I/O processing and adversely affect application performance. |
| write_through |
Indicates the average rate at which this node writes data to disk. |
KB/Sec |
A high throughput signifies faster processing of write I/O requests. A low throughput is indicative of slow write request processing. Compare the value of this measure across nodes of a cluster to isolate those nodes that have registered an abnormally low value for this measure. Such nodes not only affect cluster performance, but also the performance of dependent client applications. |
|
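Because health_status continues to report Healthy while a cluster is in maintenance, the two state measures are best read together. The following is a minimal Python sketch of that reading, using the numeric equivalents from the tables above. The function name `cluster_state` and the labels it returns are hypothetical and exist only for this illustration; they are not part of the test's output.

```python
# Numeric equivalents taken from the health_status and maintance_mode tables.
HEALTH_STATUS = {"Healthy": 1, "Unhealthy": 0}
MAINTENANCE_MODE = {"Yes": 1, "No": 0}


def cluster_state(health_status, maintance_mode):
    """Combine the two cluster-level state measures into a single label.

    The returned labels are hypothetical and used only in this sketch.
    """
    health = HEALTH_STATUS[health_status]
    maintenance = MAINTENANCE_MODE[maintance_mode]
    # Check maintenance first: health_status reports Healthy even when
    # the cluster is unavailable because of maintenance tasks.
    if maintenance == 1:
        return "in maintenance"
    return "available" if health == 1 else "unhealthy"


print(cluster_state("Healthy", "Yes"))
print(cluster_state("Unhealthy", "No"))
```

Checking maintance_mode before health_status avoids treating a cluster under maintenance as fully available.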