eG Monitoring
 

Measures reported by AlibabaECSTest

Elastic Compute Service (ECS) is a high-performance, stable, reliable, and scalable IaaS-level service provided by Alibaba Cloud. ECS eliminates the need to invest in IT hardware up front and allows you to quickly scale computing resources on demand. This makes ECS more convenient and efficient than physical servers.

An ECS instance is a virtual machine that contains basic computing components such as the vCPU, memory, operating system, network, and disk. You can fully customize and modify all configurations of an ECS instance. After you log on to the Alibaba Cloud Management console, you can manage resources and configure the environment of your ECS instances.

The lifecycle of an ECS instance begins when the instance is created and ends when the instance is released. During this lifecycle, an ECS instance goes through many states. Tracking these states can help administrators quickly resolve user complaints about the unavailability or inaccessibility of an instance, which in turn improves the user experience with that instance.

ECS instances are categorized into different instance families based on business scenarios. An instance family contains different instance types based on their vCPU and memory specifications. Instance types can have different vCPU and memory specifications, such as the CPU model and clock speed. As business requirements change, organizations may want to switch to an instance type that better suits their requirements. It is the responsibility of an administrator to monitor how an instance uses its vCPU and memory allocation over time, spot potential resource contention, and urge the organization to upgrade or downgrade to an appropriate instance type, so that business runs smoothly and without interruption.

An ECS instance must contain a system disk to store the operating system and core configurations. An image is used to initialize a system disk and determines the operating system and initial software configurations of an ECS instance. Typically, the capacity of system disks is small. Therefore, it is good practice for administrators to continuously track the usage of and I/O activity on the system disks of every instance, and identify those instances with storage space that is insufficient for their needs. By adding more disks to such instances, administrators can enable the instances to boot up without a glitch, thus allowing end-users on-demand access.

Besides vCPU, memory, and disk usage, administrators should also pay attention to the bandwidth usage of instances, so that bandwidth-hungry instances can be identified.

With the help of the AlibabaECSTest test, administrators can achieve all of the above. This test auto-discovers the ECS instances deployed in an Alibaba Cloud account. For each instance, the test reports the state of that instance and alerts administrators if any instance is in an abnormal state (e.g., expired, expiring, or locked). When instance owners complain of being unable to access their instances, administrators can instantly figure out whether the inaccessibility can be attributed to the abnormal state of the instances. In addition, the test keeps a close watch on the resource (vCPU, memory, disk, and network) usage of each ECS instance in a monitored Alibaba Cloud account. In the process, administrators can quickly and accurately identify instances that are over-utilizing resources, and initiate measures to right-size such instances, e.g., by recommending an upgrade to an instance type with a higher vCPU/memory configuration, or by adding more disks to instances that are running out of disk space.

Outputs of the test: One set of results for each instance in the Alibaba Cloud account that is being monitored.

The measures made by this test are as follows:

Measurement Description Measurement Unit Interpretation
Status Indicates the current state of this instance.   The values that this measure reports and their corresponding numeric values are listed below:

Measure Value                  Numeric Value
Running                        1
Preparing                      2
Starting                       3
Expiring                       4
Stopping                       5
Stopped                        6
Expired                        7
Expired and being recycled     8
Overdue and being recycled     9
Locked                         10
Release pending                11

Some of the Measure Values listed in the table above are described below:

  • Preparing: After an instance is created, it is in this state before it enters the Running state. If the instance remains in this state for an extended period of time, an exception occurs.

  • Starting: When you start or restart an instance by using the ECS console or by calling an API operation, the instance enters this state before it enters the Running state. If the instance remains in this state for an extended period of time, an exception occurs.

  • Running: If an instance runs properly, it is in this state.

  • Expiring: A subscription instance remains in the Expiring state for 15 days before it expires. If your instance enters the Expiring state, we recommend that you renew the instance in a timely manner.

  • Stopping: When you stop an instance by using the ECS console or by calling an API operation, the instance enters this state before it enters the Stopped state. If the instance remains in this state for an extended period of time, an exception occurs.

  • Stopped: After an instance is stopped or after an instance is created but has not started, it is in the Stopped state.

  • Expired: When a subscription instance expires or when a pay-as-you-go instance is stopped due to overdue payments, the instance enters the Expired state.

  • Locked: If you have an overdue payment in your account or if your account is insecure, your instance enters the Locked state. You can submit a ticket to unlock the instance.

  • Release pending: If you apply for a refund for a subscription instance before the instance expires, the instance enters the Release pending state.

Note:

This measure reports the Measure Values listed in the table above to indicate the current state of an ECS instance. In the graph of this measure, however, states are represented using their numeric equivalents only.

Use the detailed diagnosis of this measure to know more about the instance. The details displayed include the instance type, when it was created, the operating system of the instance, the region and zone to which the instance belongs, the image from which the instance was created, and the network type, IP addresses, VPC, and security group of the instance.
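
The numeric equivalents listed for the Status measure can be translated back into state names when reading graph data. The following is a minimal sketch of such a lookup, not part of the test itself. The function names are hypothetical, and the set of states treated as abnormal is an assumption extrapolated from the examples given earlier (expired, expiring, locked); adjust it to your own policy.

```python
# Numeric values reported by the Status measure, as listed in the table above.
STATUS_BY_VALUE = {
    1: "Running",
    2: "Preparing",
    3: "Starting",
    4: "Expiring",
    5: "Stopping",
    6: "Stopped",
    7: "Expired",
    8: "Expired and being recycled",
    9: "Overdue and being recycled",
    10: "Locked",
    11: "Release pending",
}

# Assumption: which states count as abnormal is a local policy choice.
# This set extends the examples in the text (expired, expiring, locked).
ABNORMAL_STATES = {
    "Expiring",
    "Expired",
    "Expired and being recycled",
    "Overdue and being recycled",
    "Locked",
    "Release pending",
}


def describe_status(value: int) -> str:
    """Return the state name for a numeric Status value."""
    return STATUS_BY_VALUE.get(value, f"Unknown ({value})")


def is_abnormal(value: int) -> bool:
    """Return True if the numeric Status value maps to an abnormal state."""
    return describe_status(value) in ABNORMAL_STATES


if __name__ == "__main__":
    for value in (1, 4, 10):
        print(value, describe_status(value), is_abnormal(value))
```
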

Total_cpu Indicates the total number of CPU cores configured for this instance. Number  
Total_memory Indicates the memory configuration of this instance. MB  
Total_bandwidth Indicates the total inbound and outbound bandwidth usage of this instance. Kbps Compare the value of this measure across instances to know which instance is making the most use of the bandwidth resources.
Network_inboud Indicates the maximum bandwidth used by traffic flowing into this instance from the public network. Kbps These metrics will give administrators an idea as to where public bandwidth resources are spent.
Network_outboud Indicates the maximum bandwidth used by traffic flowing out of this instance to the public network. Kbps
CPU_utilization Indicates the percentage of allocated CPU units that is currently used by this instance. Percent A value close to 100% for an instance indicates that such an instance is overutilizing the CPU resources allocated to it.
Intra_traffic_recive Indicates the bandwidth consumed by traffic flowing into this instance from the intranet. Kbps By comparing the value of this measure across instances, you can accurately identify the instance that is receiving bandwidth-intensive intranet traffic.
Intra_traffic_sent Indicates the bandwidth consumed by traffic flowing out of this instance to the intranet. Kbps By comparing the value of this measure across instances, you can accurately identify the instance that is sending bandwidth-intensive intranet traffic.
Intranet_bandwidth Indicates the total bandwidth consumed by intranet traffic flowing into and out of this instance. Kbps Compare the value of this measure across instances to identify the instance handling bandwidth-intensive intranet traffic. You can then compare the values of the Intranet traffic received and Intranet traffic sent measures of that instance to determine whether incoming or outgoing intranet traffic is hogging the bandwidth resources.
Internet_bandwidth Indicates the total bandwidth consumed by internet traffic flowing into and out of this instance. Kbps Compare the value of this measure across instances to identify the instance handling bandwidth-intensive internet traffic. You can then compare the values of the Internet traffic received and Internet traffic sent measures of that instance to determine whether incoming or outgoing internet traffic is hogging the bandwidth resources.
Inter_traffic_recive Indicates the bandwidth consumed by traffic flowing into this instance from the internet. Kbps By comparing the value of this measure across instances, you can accurately identify the instance that is receiving bandwidth-intensive internet traffic.
Inter_traffic_sent Indicates the bandwidth consumed by traffic flowing out of this instance to the internet. Kbps By comparing the value of this measure across instances, you can accurately identify the instance that is sending bandwidth-intensive internet traffic.
Total_disk_iops Indicates the rate at which I/O operations were performed on the disks of this instance. Operations/Sec Compare the value of this measure across instances to know which instance is experiencing unusually high levels of I/O activity. In such a situation, you can compare the values of the Disk read operations and Disk write operations measures for that instance to isolate whether a high rate of read operations or of write operations caused the I/O overload.
Disk_reads_iops Indicates the rate at which disk read operations were performed by this instance. Operations/Sec By comparing the value of this measure across instances, you can accurately identify the instance that is experiencing a high level of disk read operations.
Disk_writes_iops Indicates the rate at which disk write operations were performed by this instance. Operations/Sec By comparing the value of this measure across instances, you can accurately identify the instance that is experiencing a high level of disk write operations.
Total_disk_bps Indicates the bandwidth consumed by disk read/write operations on this instance. KB/Sec If this measure is very high for an instance, it means that the I/O activity on the disks of that instance is consuming bandwidth excessively. In such a situation, you can compare the values of the Disk read bandwidth and Disk write bandwidth measures of that instance to understand whether read activity or write activity is contributing to the unusual bandwidth consumption.
Disk_reads_bps Indicates the bandwidth consumed by disk read operations on this instance. KB/Sec Compare the value of this measure across instances to know which instance is engaged in bandwidth-intensive disk reads.
Disk_writes_bps Indicates the bandwidth consumed by disk write operations on this instance. KB/Sec Compare the value of this measure across instances to know which instance is engaged in bandwidth-intensive disk writes.
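
The read-versus-write comparison suggested for the Total_disk_iops and Total_disk_bps measures can be sketched as below. This is an illustrative helper only: the function name and the 60% share used to call one side dominant are assumptions, not values defined by the test.

```python
def dominant_component(read_value: float, write_value: float,
                       share: float = 0.6) -> str:
    """Classify disk activity as read-heavy, write-heavy, mixed, or idle.

    read_value and write_value are the read and write measures of one
    instance (for example Disk_reads_iops and Disk_writes_iops).
    share is a hypothetical cut-off: the fraction of the total that one
    side must reach to be considered dominant.
    """
    total = read_value + write_value
    if total <= 0:
        return "idle"
    if read_value / total >= share:
        return "read"
    if write_value / total >= share:
        return "write"
    return "mixed"


if __name__ == "__main__":
    # Sample values are made up for illustration.
    print(dominant_component(900.0, 100.0))
    print(dominant_component(480.0, 520.0))
```

The same comparison applies to the received and sent measures for intranet and internet traffic.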
CPU_usage Indicates the number of CPU credits consumed by this instance. Number This measure is reported only for burstable instances.

Burstable instances are an economical instance type that is intended to cope with burstable performance requirements in entry-level computing scenarios. These instances use CPU credits to ensure computing performance, and are suited for scenarios where CPU usage is typically low but bursts in CPU usage occur on occasion. You can accumulate CPU credits that can be used to increase the computing performance of burstable instances when required by your workloads. The CPU credit mechanism allows you to minimize the consumption of resources during off-peak hours, and scale resources out during peak hours at no extra cost.

When you create a burstable instance, 30 CPU credits are provisioned for each vCPU of the instance, which are initial CPU credits. These credits enable you to complete deployment tasks after you start the instance. When a burstable instance is started, it starts to consume CPU credits to maintain its computing performance. The value of this measure denotes the number of CPU credits so spent.

By comparing the value of this measure across burstable instances, you can quickly identify the instance that is consuming too many CPU credits.
CPU_credit_balance Indicates the CPU credits that are still to be used by this instance. Number As mentioned earlier, once a burstable instance is started, it begins consuming the 30 initial CPU credits provisioned for each of its vCPUs. Meanwhile, the burstable instance also earns CPU credits at a fixed rate that is determined by the instance type. The number of CPU credits that a vCPU can earn per hour is based on its baseline performance, i.e., the amount of vCPU capacity that is continuously provisioned to a burstable instance. For example, a 25% baseline performance for instance A indicates that the CPU credits that a vCPU of the instance earns per hour can keep the vCPU running at 25% utilization for an hour or at 100% utilization for 15 minutes (60 × 25%). At this baseline, each vCPU earns 15 CPU credits per hour. Therefore, if instance A has two vCPUs, it earns 30 CPU credits per hour.

If the CPU credits so earned exceed the credits consumed, the net credits are accrued as CPU credit balance. This is the value that is reported by the CPU credit balance measure. A high value is desired for this measure, as a high CPU credit balance for a burstable instance means that CPU resources are guaranteed to that instance for a maximum of 24 hours.
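
The credit arithmetic in the example above can be reproduced with a short calculation. This is a sketch under one assumption implied by that example: one CPU credit keeps one vCPU at 100% utilization for one minute, so a vCPU earns 60 × baseline credits per hour. The function names are hypothetical.

```python
INITIAL_CREDITS_PER_VCPU = 30  # provisioned at creation, per the text above


def credits_earned_per_hour(vcpus: int, baseline_percent: float) -> float:
    """CPU credits an instance earns per hour at its baseline performance.

    Assumes one credit equals one vCPU at 100% utilization for one minute.
    """
    return vcpus * 60 * baseline_percent / 100


def initial_credits(vcpus: int) -> int:
    """Initial CPU credits provisioned when a burstable instance is created."""
    return vcpus * INITIAL_CREDITS_PER_VCPU


if __name__ == "__main__":
    # Instance A from the example: two vCPUs at a 25% baseline.
    print(credits_earned_per_hour(1, 25))  # credits per vCPU per hour
    print(credits_earned_per_hour(2, 25))  # credits for the whole instance
    print(initial_credits(2))
```
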
Total_disk Indicates the total number of disks currently used by this instance. Number Use the detailed diagnosis of this measure to know which disks are used by the instance, the type of each disk, when every disk was created, the image that stores a copy of that disk's data, and when the disk was attached to the instance.
Disk_size Indicates the total capacity of disks used by this instance. GB  
Cpu_wait Indicates the percentage of CPU time spent waiting for I/O operations to complete. Percent A high value indicates frequent I/O operations on an instance.
Free_memory Indicates the percentage of memory allocated to this instance that is still unused. Percent A high value is desired for this measure. A value close to 0% indicates that the instance is running out of memory.
Used_memory Indicates the percentage of allocated memory that is used by this instance. Percent A low value is desired for this measure. A value close to 100% is a cause for concern, as it indicates that the instance is rapidly running out of memory. If the instance appears to be consistently over-utilizing its memory, you may want to consider upgrading to a different instance type to meet its memory demand.
System_load Indicates the average load on this instance during the last 5 minutes. Percent A high value indicates that the instance is busy.
Total_snapshot Indicates the total number of snapshots created for disks used by this instance. Number The Alibaba Cloud snapshot service allows you to create crash-consistent snapshots for all disk categories. You can use snapshots for the following scenarios:

  • Disaster recovery and backup: You can create a snapshot for a disk and then use the snapshot to create another disk to implement zone- or geo-disaster recovery.

  • Environment clone: You can use a system disk snapshot to create a custom image and then use the custom image to create ECS instances with identical environments.

  • Data development: Snapshots can provide near-real-time production data for applications such as data mining, report queries, and development and testing.

  • Enhancement of fault tolerance: You can roll a disk back to a previous point in time by using a snapshot to reduce the risk of data loss caused by incorrect operations.

Snapshot_size Indicates the total size of the snapshots created for the disks used by this instance. Number