eG Monitoring
 

Measures reported by RabMQQueSumTest

A queue is best defined as a buffer that stores messages. The basic architecture of a message queue is simple: client applications called producers create messages and deliver them to the broker (the message queue). Other applications, called consumers, connect to the queue and subscribe to the messages to be processed. An application can be a producer, a consumer, or both. Messages placed onto the queue are stored until a consumer retrieves them.

Since the contents of a queue are typically located on a single node in the cluster, failure of that node can deny consumers access to that queue and its contents. Administrators should therefore continuously track the status of the queues, identify queues that are down, and get the down queues up and running before users complain. Also, too many messages in a queue can cause contention for memory resources, which in turn can slow down message processing. To avoid this, administrators should track the length of each queue, promptly identify the queues that are consistently growing in length, and rapidly initiate measures to curb queue growth. The RabMQQueSumTest test helps administrators with all of the above!

This test auto-discovers the queues on a target node. For each queue, the test then reports the status of the queue, the total number of messages in the queue, and the type of messages (e.g., unacknowledged, published, confirmed) in each queue, so administrators can precisely pinpoint queues that are growing in length at an alarming rate, know what type of messages are in such queues, and accordingly decide how to control the growth in queue length. Additionally, the test reports on the memory usage of each queue, so you can easily assess the impact of queue length on memory. Moreover, the test also reports the rate at which the different types of messages in a queue are delivered to consumers, so that any bottleneck in delivery can be proactively detected and promptly fixed.

Outputs of the test: One set of results for each queue on the target node.

First-level descriptor: Node name

Second-level descriptor: Queue name

The measures made by this test are as follows:

Measurement Description Measurement Unit Interpretation
state Indicates the state of this queue.   The values that this measure can take and their corresponding numeric values are as follows:

Measure Value    Numeric Value
Running          0
Idle             1

Note:

This test reports the Measure Values listed in the table above to indicate the queue status. In the graph of this measure, however, the queue status is represented using the numeric equivalents.

This measure is not available for the Summary descriptor.
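
As a rough illustration of the mapping above, the following sketch converts a queue state into its numeric equivalent and tallies the running and idle counts of the kind the Summary descriptor reports. It assumes queue records with name, node and state fields, similar in shape to what the RabbitMQ management API returns for queues. The function names, the sample records, and the choice to count every state other than running as Idle are assumptions made for illustration; this is not a description of how the test itself is implemented.

```python
# Numeric equivalents from the table above.
STATE_TO_NUMERIC = {"Running": 0, "Idle": 1}


def measure_value(raw_state):
    """Map a raw queue state string to the Measure Value used by this test.

    Assumption: any state other than "running" is reported as Idle.
    """
    return "Running" if raw_state.lower() == "running" else "Idle"


def summarize_states(queues):
    """Return (running_count, idle_count, idle_queue_names) for queue records."""
    idle = [q["name"] for q in queues if measure_value(q["state"]) == "Idle"]
    return len(queues) - len(idle), len(idle), idle


# Hypothetical sample records; not taken from a real cluster.
sample_queues = [
    {"name": "orders", "node": "rabbit@node1", "state": "running"},
    {"name": "billing", "node": "rabbit@node2", "state": "down"},
    {"name": "audit", "node": "rabbit@node1", "state": "running"},
]

running, idle, idle_names = summarize_states(sample_queues)
numeric_states = {
    q["name"]: STATE_TO_NUMERIC[measure_value(q["state"])] for q in sample_queues
}
print(running, idle, idle_names, numeric_states)
```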

runningState Indicates the current count of running queues. Number This measure is reported only for the Summary descriptor.

Use the detailed diagnosis of this measure to know which queues in the cluster are running.
idleState Indicates the current count of idle/down queues. Number This measure is reported only for the Summary descriptor.

Use the detailed diagnosis of this measure to know which queues in the cluster are not running. Ideally, the value of this measure should be 0. If one or more queues go down frequently, you may want to consider mirroring the queues across multiple nodes in a cluster to ensure high availability of the queues.

Each mirrored queue consists of one master and one or more mirrors. The master is hosted on one node, commonly referred to as the master node. Each queue has its own master node. All operations for a given queue are first applied on the queue's master node and then propagated to the mirrors. These operations include enqueueing publishes, delivering messages to consumers, tracking acknowledgements from consumers, and so on.

Messages published to the queue are replicated to all mirrors. Consumers are connected to the master regardless of which node they connect to, with mirrors dropping messages that have been acknowledged at the master. Queue mirroring therefore enhances availability, but does not distribute load across nodes (all participating nodes each do all the work).

If the node that hosts the queue master fails, the oldest mirror will be promoted to the new master as long as it is synchronised. Unsynchronised mirrors can be promoted too, depending on queue mirroring parameters.
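
Mirroring of classic queues is typically enabled through a policy. The sketch below builds such a policy and prepares, without sending, a request to the RabbitMQ management HTTP API. The ha-mode, ha-params, ha-sync-mode and ha-promote-on-failure keys are RabbitMQ classic mirroring policy keys; the host, virtual host, policy name, queue name pattern and replica count are hypothetical values chosen for illustration. Classic queue mirroring is not available in every RabbitMQ release, so check whether your version supports it before relying on this.

```python
import json
from urllib.parse import quote
from urllib.request import Request


def build_mirroring_policy(pattern, replicas=2, promote_unsynced=False):
    """Build a classic mirrored queue policy body.

    The default values chosen here are illustrative assumptions.
    """
    return {
        "pattern": pattern,
        "apply-to": "queues",
        "priority": 0,
        "definition": {
            "ha-mode": "exactly",         # one master plus (replicas - 1) mirrors
            "ha-params": replicas,
            "ha-sync-mode": "automatic",  # new mirrors synchronise when they join
            # "when-synced" promotes only synchronised mirrors;
            # "always" also allows unsynchronised mirrors to be promoted.
            "ha-promote-on-failure": "always" if promote_unsynced else "when-synced",
        },
    }


def build_policy_request(base_url, vhost, name, policy):
    """Prepare (but do not send) a PUT request for the management API."""
    url = f"{base_url}/api/policies/{quote(vhost, safe='')}/{quote(name, safe='')}"
    return Request(
        url,
        data=json.dumps(policy).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical host, virtual host and policy name.
policy = build_mirroring_policy(r"^ha\.")
request = build_policy_request("http://localhost:15672", "/", "ha-two", policy)
print(request.get_method(), request.full_url)
```

Actually sending the request would also require management credentials, which are deliberately left out of this sketch.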
avgUnacknowledged Indicates the number of messages in this queue that are waiting for acknowledgement.

For the Summary descriptor, this measure will report the total number of messages across all queues that are waiting for acknowledgement.
Number A low value is desired for this measure, because all unacknowledged messages have to reside in RAM on the servers. If you have too many unacknowledged messages, you will run out of memory. An efficient way to limit unacknowledged messages is to limit how many messages your clients prefetch.

To know which queue attached to which node in the cluster has the maximum number of unacknowledged messages, use the detailed diagnosis of this measure.
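
To see why a prefetch limit keeps this measure low, consider the toy model below. It is a simplified simulation, not RabbitMQ code: the broker delivers messages until the consumer holds as many unacknowledged messages as its prefetch limit allows, and the consumer acknowledges one message per step. The function name and the message counts are assumptions for illustration. With a real client, the limit is normally set through the AMQP basic.qos method.

```python
from collections import deque


def peak_unacknowledged(message_count, prefetch_count):
    """Toy model: return the peak number of unacknowledged messages.

    prefetch_count == 0 models "no limit", following the AMQP convention.
    """
    ready = deque(range(message_count))
    unacked = deque()
    peak = 0
    while ready or unacked:
        # Broker side: deliver while the consumer is below its prefetch limit.
        while ready and (prefetch_count == 0 or len(unacked) < prefetch_count):
            unacked.append(ready.popleft())
        peak = max(peak, len(unacked))
        # Consumer side: finish and acknowledge the oldest delivery.
        unacked.popleft()
    return peak


unlimited = peak_unacknowledged(message_count=1000, prefetch_count=0)
limited = peak_unacknowledged(message_count=1000, prefetch_count=20)
print(unlimited, limited)
```

In this model, the unlimited consumer ends up holding every message unacknowledged at once, while the limited consumer never holds more than its prefetch count.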
avgTotal Indicates the total number of messages currently in this queue.

For the Summary descriptor, this measure indicates the total number of messages across all queues in the cluster.
Number The value of this measure indicates the current queue length.

Ideally, the value of this measure should be small for any queue, because short queues are the fastest. When a queue is empty and has consumers ready to receive messages, a message goes straight out to the consumer as soon as the queue receives it.

Many messages in a queue can put a heavy load on RAM. When this happens, RabbitMQ starts flushing (paging out) messages to disk in order to free up RAM, and queueing speeds deteriorate as a result.

Some common problems with long queues are as follows:

  • Small messages are embedded in the queue index

  • Long queues take a long time to sync between nodes

  • Starting a server with many messages is time-consuming

  • The RabbitMQ management interface collects and stores stats for all queues

There are many ways to limit queue size. For starters, you can limit the maximum length of a queue to a set number of messages, or a set number of bytes (the total of all message body lengths, ignoring message properties and any overheads), or both. When a maximum queue length or size is set and that maximum is reached, the default behaviour is to drop or dead-letter messages from the front of the queue (i.e., the oldest messages in the queue).
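
A minimal sketch of these limits follows. The x-max-length and x-max-length-bytes names are the optional queue arguments RabbitMQ uses for length and size limits; the helper names, the limit values, and the deque-based model of the default drop-from-the-front behaviour are assumptions for illustration only.

```python
from collections import deque


def length_limit_arguments(max_messages=None, max_bytes=None):
    """Build optional queue arguments that cap queue length and/or size."""
    arguments = {}
    if max_messages is not None:
        arguments["x-max-length"] = max_messages     # cap on message count
    if max_bytes is not None:
        arguments["x-max-length-bytes"] = max_bytes  # cap on total body bytes
    return arguments


def simulate_drop_head(messages, max_messages):
    """Toy model of the default overflow behaviour: the oldest messages are dropped."""
    # Appending to a full bounded deque discards items from the left (oldest) end.
    queue = deque(maxlen=max_messages)
    for message in messages:
        queue.append(message)
    return list(queue)


arguments = length_limit_arguments(max_messages=3, max_bytes=1_048_576)
remaining = simulate_drop_head(["m1", "m2", "m3", "m4", "m5"], max_messages=3)
print(arguments, remaining)
```

The arguments dictionary would be supplied when the queue is declared, or the equivalent limits could be applied through a policy.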

Queue size can also be limited using the Time-To-Live (TTL) extension. RabbitMQ allows you to set a TTL for both messages and queues.

When a TTL is set for a queue, any message that has been in the queue for longer than the configured TTL is said to be dead. The server guarantees that dead messages will not be delivered using basic.deliver (to a consumer) or included in a basic.get-ok response (for one-off fetch operations). Further, the server will try to remove messages at or shortly after their TTL-based expiry.

A TTL can be specified on a per-message basis, by setting the expiration field in the basic AMQP class when sending a basic.publish. The value of the expiration field describes the TTL period in milliseconds. The same constraints as for x-message-ttl apply. Since the expiration field must be a string, the broker will (only) accept the string representation of the number.
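
The two ways of setting a message TTL can be sketched as follows. The x-message-ttl argument and the string-valued expiration field come from the text above; the helper names, the validation rules and the 60-second value are assumptions for illustration.

```python
def per_message_expiration(ttl_ms):
    """Return the value for the per-message expiration field.

    The field must be a string, so the TTL in milliseconds is converted
    to its decimal string form.
    """
    if isinstance(ttl_ms, bool) or not isinstance(ttl_ms, int) or ttl_ms < 0:
        raise ValueError("TTL must be a non-negative integer number of milliseconds")
    return str(ttl_ms)


def queue_ttl_arguments(message_ttl_ms):
    """Return the queue argument that applies one TTL to every message in the queue."""
    per_message_expiration(message_ttl_ms)  # reuse the same validation
    return {"x-message-ttl": message_ttl_ms}


def is_dead(enqueued_at_ms, now_ms, ttl_ms):
    """A message is dead once it has been in the queue for longer than the TTL."""
    return now_ms - enqueued_at_ms > ttl_ms


expiration = per_message_expiration(60_000)  # string form, for the expiration field
arguments = queue_ttl_arguments(60_000)      # integer form, for the queue argument
print(expiration, arguments, is_dead(0, 61_000, 60_000))
```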
avgPublish Indicates the rate at which publishers are publishing messages to this queue.

For the Summary descriptor, this measure indicates the average rate at which publishers are publishing messages across queues.
Messages/Sec  
avgConfirm Indicates the rate at which the receipt of a message into this queue is confirmed to a publisher.

For the Summary descriptor, this measure indicates the average rate at which the receipt of messages across all queues is confirmed to a publisher.
Messages/Sec A ‘Publish Confirm’ is an acknowledgement sent by the cluster to a publisher, confirming the receipt of a message from that publisher. Publish Confirms have a performance impact, so the lower the value of this measure, the better. Keep in mind, however, that a Publish Confirm is required if the publisher needs at-least-once processing of messages.
avgDeliverNoAck Indicates the rate at which this queue delivers messages to consumers that use manual acknowledgements.

For the Summary descriptor, this measure indicates the average rate at which the queues in the cluster deliver messages to consumers that use manual acknowledgements.
Messages/Sec Messages in transit might get lost in the event of a connection failure, and such messages might need to be retransmitted. Acknowledgements let the server and clients know when to retransmit messages.

A manual acknowledgement is an ‘explicit’ acknowledgement that is received from the consumer. Manually sent acknowledgements can be positive or negative. Positive acknowledgements simply instruct RabbitMQ to record a message as delivered, so that it can be discarded. Negative acknowledgements with basic.reject have the same effect. The difference is primarily in the semantics: positive acknowledgements assume a message was successfully processed, while their negative counterparts suggest that a delivery wasn't processed but should still be deleted.

Whether positive or negative, manual acknowledgements deliver low throughput and should hence be avoided. A low value is therefore desired for this measure.
avgDeliver Indicates the rate at which this queue delivers messages to consumers that use automatic acknowledgements.

For the Summary descriptor, this measure indicates the average rate at which the queues in the cluster deliver messages to consumers that use automatic acknowledgements.
Messages/Sec  
avgConsumerAck Indicates the rate at which messages in this queue are being acknowledged by consumers.

For the Summary descriptor, this measure indicates the average rate at which the messages across queues are being acknowledged by consumers.
Messages/Sec If the avgGet measure registers an abnormally low value, check the value of this measure at around the same time to determine whether a delay by consumers in acknowledging messages caused the delivery delay.
avgRedeliver Indicates the rate at which this queue delivers messages with the ‘redelivered’ flag set.

For the Summary descriptor, this measure reports the average rate at which the queues in the cluster deliver messages with the ‘redelivered’ flag set.
Messages/Sec  
avgGetNoAck Indicates the rate at which this queue delivers messages requiring acknowledgement in response to basic.get.

For the Summary descriptor, this measure reports the average rate at which the queues in the cluster deliver messages requiring acknowledgement in response to basic.get.
Messages/Sec Compare the value of this measure with that of the avgGet measure to figure out what type of messages are being delivered much slower than the rest.
avgGet Indicates the rate at which messages not requiring acknowledgement are being delivered by this queue in response to basic.get.

For the Summary descriptor, this measure reports the average rate at which the queues in the cluster deliver messages not requiring acknowledgement in response to basic.get.
Messages/Sec Compare the value of this measure with that of the avgGetNoAck measure to figure out what type of messages are being delivered much slower than the rest.
avgReturnValue Indicates the rate at which this queue returned unroutable messages with the ‘mandatory’ flag set to ‘true’ to publishers.

For the Summary descriptor, this measure reports the average rate at which the queues in the cluster returned unroutable messages with the ‘mandatory’ flag set to ‘true’ to publishers.
Messages/Sec An unroutable message is a message without a destination; for example, a message sent to an exchange without any bound queue.

If the ‘mandatory’ flag is set to ‘true’, then the cluster returns an unroutable message to the producer with a ‘basic.return’ AMQP method.
avgDiskReads Indicates the rate at which this queue reads messages from disk.

For the Summary descriptor, this measure reports the average rate at which the queues in the cluster read messages from disk.
Messages/Sec A high value could indicate that messages are frequently read from the disk and not from the RAM. This could be owing to high memory pressure, which may have forced RabbitMQ to move messages from RAM to disk.
avgDiskWrites Indicates the rate at which this queue wrote messages to disk.

For the Summary descriptor, this measure reports the average rate at which the queues in the cluster wrote messages to disk.
Messages/Sec A high value for this measure could indicate any of the following:

  • Many messages have been published in such a way that they must be written to disk.

  • Very high memory pressure on RabbitMQ has caused the cluster to move the majority of messages from RAM to disk.

totalQueues Indicates the total number of queues currently in the cluster. Number This measure is reported only for the Summary descriptor.

Use the detailed diagnosis of this measure to know which queues exist in the target cluster.
memory Indicates the total memory used by this queue.

For the Summary descriptor, this measure reports the total memory used by all queues in the cluster.
MB Ideally, the value of this measure should be low. A consistent increase in this value could indicate excessive memory usage and could hint at potential memory contention. The key factor impacting memory usage is message queue length. Time-correlate the changes in the Messages in queue measure with those of the memory measure to figure out whether queue length is indeed increasing the memory pressure. If so, you may want to initiate measures to limit the queue size.
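
One simple way to time-correlate the two measures is to compute the Pearson correlation coefficient over samples taken at the same measurement times. The sketch below does this with hypothetical sample values. The 0.8 threshold is an arbitrary assumption rather than a product default, and a strong correlation only suggests, without proving, that queue length is driving memory usage.

```python
import math


def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 samples")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    if variance_x == 0 or variance_y == 0:
        return 0.0  # a flat series carries no correlation information
    return covariance / math.sqrt(variance_x * variance_y)


# Hypothetical samples taken at the same measurement times.
queue_length = [100, 250, 400, 800, 1600]  # messages in queue
memory_mb = [12.0, 15.5, 19.0, 28.0, 47.0]  # memory measure, in MB

r = pearson_correlation(queue_length, memory_mb)
if r > 0.8:  # assumed threshold
    print(f"Memory tracks queue length closely (r = {r:.2f})")
else:
    print(f"No strong link between queue length and memory (r = {r:.2f})")
```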