eG Monitoring
 

Measures reported by AlibabaSQLTest

ApsaraDB for RDS is a stable, reliable, and scalable online database service. It is designed based on the Apsara Distributed File System and high-performance SSD storage media of Alibaba Cloud. It supports five database engines: MySQL, SQL Server, PostgreSQL, PPAS (compatible with Oracle), and MariaDB.

SQL Server supports a high-availability architecture and can restore data to a specific point in time. This allows SQL Server to support a wide range of enterprise applications. In addition, SQL Server is provided with a Microsoft-issued license, which removes the need to purchase one separately. ApsaraDB RDS for SQL Server also provides advanced features such as ApsaraDB for MyBase dedicated clusters, read-only RDS instances to which read requests can be offloaded during an overload, read/write splitting, and disk encryption.

To ensure a satisfactory user experience with a SQL Server instance, the administrator should first measure how quickly the instance processes queries. This will reveal query processing bottlenecks, if any. Since poor buffer usage is a common cause of slow query execution, the administrator should then monitor buffer usage. If the SQL buffers turn out to be servicing too few queries, the administrator should investigate the reason and fix it, so that queries execute faster.

In addition, the administrator should ascertain whether the server instance can recover from a memory failure with no data loss. For this, the administrator should track checkpoint writes to disk and isolate slowness (if any) in the process. Locking activity should also be monitored, so that unhealthy locking behavior (for example, deadlock conditions and lock waits) is brought to the administrator's attention quickly. It is also important that administrators are forewarned of storage space contention on an instance, so that they can free up space on that instance before it is too late.

To monitor all of these performance aspects of a SQL Server instance and capture anomalies (if any) on the fly, the administrator can use the AlibabaSQLTest test.

For each SQL Server instance that is configured using ApsaraDB for RDS, this test reports the rate at which queries and transactions are processed by that instance, thus shedding light on real or potential bottlenecks in query and transaction processing. If query processing is slow, the administrator can use the buffer usage metric reported by the test to determine whether poor buffer cache usage is the reason query performance is suffering. Using the test, the administrator can also check whether each instance is keeping full table scans to a minimum. To reveal how resilient a SQL Server instance is to power failures, the test also measures the speed with which checkpoints are written to disk by every instance; if the writes are performed slowly, the likelihood of data loss at the time of a power failure becomes high. The test also monitors the locking activity on each instance, and alerts administrators to frequent deadlock conditions and lock waits. Additionally, the test monitors space usage at the instance level, and points administrators to those instances that are running out of space.
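
The checks described above can be sketched as a small function that inspects one instance's measures and lists what looks unhealthy. This is only an illustration, not how the test itself is implemented: the `diagnose` function, the dictionary layout, and the `max_full_scans_per_sec` cut-off are assumptions made for the example. Only the 80% buffer threshold and the "0 or very low" guidance for deadlocks and lock waits come from the interpretation of the measures in this document.

```python
from typing import Dict, List

# Threshold taken from the Buffer_rate interpretation in this document.
MIN_BUFFER_RATE_PERCENT = 80.0


def diagnose(measures: Dict[str, float],
             max_full_scans_per_sec: float = 1.0) -> List[str]:
    """Return findings for one instance's measures.

    `max_full_scans_per_sec` is a hypothetical cut-off; the document only
    says that the value of Full_scans should be very low.
    """
    findings = []
    if measures["Buffer_rate"] < MIN_BUFFER_RATE_PERCENT:
        findings.append("Buffer_rate below 80%: poor buffer cache usage")
    if measures["Full_scans"] > max_full_scans_per_sec:
        findings.append("Full_scans high: queries may lack usable indexes")
    if measures["Dead_lock"] > 0:
        findings.append("Dead_lock above 0: deadlocks are occurring")
    if measures["Lock_wait"] > 0:
        findings.append("Lock_wait above 0: requests are waiting on locks")
    return findings


if __name__ == "__main__":
    # Made-up sample values, used only to exercise the function.
    sample = {"Buffer_rate": 62.5, "Full_scans": 0.2,
              "Dead_lock": 0.0, "Lock_wait": 3.0}
    for finding in diagnose(sample):
        print(finding)
```

In practice, the cut-offs would be tuned per environment; the point of the sketch is the order of investigation, starting with buffer usage and moving on to scans and locking.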

Outputs of the test: One set of results for every RDS instance for SQL Server.

The measures made by this test are as follows:

| Measurement | Description | Measurement Unit | Interpretation |
| --- | --- | --- | --- |
| Average_qps | Indicates the rate at which this instance processes queries. | Queries/Sec | A high value is desired for this measure. A low value signifies slowness in query processing. Compare the value of this measure across RDS instances to know which instance is processing queries slowly. |
| Sql_transaction | Indicates the rate at which transactions are processed by this instance. | Transactions/Sec | A high value is desired for this measure. A low value signifies slowness in transaction processing. Compare the value of this measure across RDS instances to know which instance is processing transactions slowly. |
| Buffer_rate | Indicates the percentage of read requests to this instance that were serviced by the buffer cache. | Percent | Ideally, the value of this measure should be over 80%. A low value is indicative of poor cache usage. This in turn can cause queries to be served by directly accessing the disk, thereby degrading query performance and overall database health. To avoid this, the buffer cache should be sized adequately, so that it can store many objects in memory. This will reduce the likelihood of cache misses, thereby improving query performance significantly. |
| Full_scans | Indicates the rate at which this instance performs full table scans. | Scans/Sec | When figuring out how best to access data within SQL Server, the query optimizer takes the query that you submit and analyzes the objects involved. It tries to determine which indexes (much like the page numbers in a book) are available and whether they are useful to your query. It also tries to determine whether the criteria in the WHERE clause of the query allow it to use any of the indexes efficiently. When there is no other efficient way to access data within the tables involved in the query, the query processor processes all of the rows in the table one at a time to give you the results. This is referred to as a full table scan. It is the equivalent of having a book without page numbers and being told that somewhere in the book is a piece of information you need. The only way to find that information is to look at every page, one at a time. Evidently, such an operation is both time-consuming and resource-intensive, and will adversely impact the performance of the SQL Server instance. This is why full table scans have to be avoided. Ideally, therefore, the value of this measure should be very low. |
| Sql_compilation | Indicates the rate at which SQL queries are compiled by this instance. | Compilations/Sec | |
| Check_point | Indicates the rate at which checkpoints are written to disk by this instance. | Writes/Sec | A checkpoint creates a known good point from which the SQL Server Database Engine can start applying changes contained in the log during recovery after an unexpected shutdown or crash. For performance reasons, the Database Engine performs modifications to database pages in memory (in the buffer cache) and does not write these pages to disk after every change. Rather, the Database Engine periodically issues a checkpoint on each database. A checkpoint writes the current in-memory modified pages (known as dirty pages) and transaction log information from memory to disk, and also records the information in the transaction log. If the checkpoint writes modified pages to disk very slowly, then at any given point in time, the disk may not be fully updated with recent changes to data. If a power failure occurs at this juncture and wipes out the pages in memory, the changes will be lost. To avoid this, the checkpoint should write pages to disk at a steady pace. Ideally, therefore, the value of this measure should be high. |
| Total_login | Indicates the rate of logins to this SQL Server instance. | Logins/Sec | |
| Lock_timout | Indicates the number of locks held by this instance that timed out each second. | Locks/Sec | A lock wait timeout results when one user gets a lock on some data and holds it while another user tries to access it. If the first user does not unlock the data, the second one will time out after a while. The database will respond to the second user with an error message saying that their lock wait was too long. |
| Dead_lock | Indicates the rate at which deadlocks occurred on this instance. | Locks/Sec | A deadlock happens when multiple lock waits occur in such a manner that none of the users can do any further work. For example, the first user and the second user both lock some data. Then each of them tries to access the other's locked data. There is a cycle in the locking: user A is waiting on B, and B is waiting on A. Neither can proceed. Deadlocks delay work to no purpose, and are hence considered to be ‘unproductive’. This is why the value of this measure should be 0 or very low. |
| Lock_wait | Indicates the rate of lock waits on this instance. | Locks/Sec | A lock wait event occurs when a user requests a resource that is already locked by another user, forcing the former to wait until the latter releases the lock. Lock wait events on a database need to be minimal. If a lock is held on a resource for too long, all other requests will be denied access to that resource, thereby causing critical operations to fail. Moreover, if the number of lock waits grows over time, the queue of pending requests grows with it. A long request queue may not only waste valuable server resources, it may also choke the server instance, thereby significantly impacting the quality of the user experience with that instance. This is why the value of this measure should be 0 or very low for an instance. |
| Space_usage | Indicates the amount of database space that is used by this instance. | MB | Compare the value of this measure across SQL Server instances to identify the instance that is consuming space excessively. |
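
Several interpretations above recommend comparing a measure across instances. A minimal sketch of that comparison follows, assuming the measures have already been collected into a dictionary keyed by instance name. The helper names `lowest` and `highest`, the instance names, and all sample values are hypothetical and serve only to illustrate the comparison.

```python
from typing import Dict

# Instance name -> measure name -> latest value (assumed layout).
Readings = Dict[str, Dict[str, float]]


def lowest(readings: Readings, measure: str) -> str:
    """Name of the instance with the smallest value for `measure`."""
    return min(readings, key=lambda name: readings[name][measure])


def highest(readings: Readings, measure: str) -> str:
    """Name of the instance with the largest value for `measure`."""
    return max(readings, key=lambda name: readings[name][measure])


if __name__ == "__main__":
    # Made-up sample values for two hypothetical instances.
    readings = {
        "rds-a": {"Average_qps": 420.0, "Sql_transaction": 95.0, "Space_usage": 10240.0},
        "rds-b": {"Average_qps": 35.0, "Sql_transaction": 4.0, "Space_usage": 51200.0},
    }
    print("Slowest query processing:", lowest(readings, "Average_qps"))
    print("Slowest transaction processing:", lowest(readings, "Sql_transaction"))
    print("Largest space consumer:", highest(readings, "Space_usage"))
```

For Average_qps and Sql_transaction the instance with the lowest value is the one to investigate, while for Space_usage it is the instance with the highest value.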