eG Monitoring
 

Measures reported by RabMQClusterTest

RabbitMQ server is written in the Erlang programming language. Each Erlang process has its own stack and heap, which are allocated in the same memory block and grow towards each other. When the stack and the heap meet, the garbage collector is triggered and memory is reclaimed. If the garbage collector does not reclaim enough memory, the heap will grow to accommodate more data. If heap growth is not controlled by efficient garbage collection, it can degrade the performance of the RabbitMQ node and, consequently, slow down cluster operations as well.

Using the RabMQNodeGCTest test, you can keep tabs on garbage collection activity on each node of a cluster and identify the node from which the least memory was reclaimed. When a cluster under-performs, you can use this test to figure out if the dip in cluster performance is owing to excessive heap growth on a node caused by inefficient garbage collection.

Outputs of the test: One set of results for each node in the monitored RabbitMQ cluster.

The measures made by this test are as follows:

Measurement Description Measurement Unit Interpretation
diffReady Indicates the number of messages that are available to be delivered now. Number Use the detailed diagnosis of this measure to receive an overview of the RabbitMQ cluster setup. The RabbitMQ version, management version, and Erlang version will be displayed as part of the detailed diagnostics.
diffUnacknowledged Indicates the number of messages for which the cluster is waiting for acknowledgement. Number A low value is desired for this measure, because all unacknowledged messages have to reside in RAM on the servers. If you have too many unacknowledged messages, you will run out of memory. An efficient way to limit unacknowledged messages is to limit how many messages your clients prefetch.

To know which queue attached to which node in the cluster has the maximum number of unacknowledged messages, use the detailed diagnosis of this measure.
diffTotal Indicates the total number of messages on the cluster currently. Number This is the sum of the values of the diffReady and diffUnacknowledged measures.
diffPublish Indicates the rate at which publishers are publishing messages on the server. Messages/Sec  
diffConfirm Indicates the rate at which the cluster confirms the receipt of a message to a publisher. Messages/Sec A ‘Publish Confirm’ is an acknowledgement sent by the cluster to a publisher, confirming the receipt of a message from that publisher. Publish Confirms have a performance impact, so the lower the value of this measure, the better. However, keep in mind that a Publish Confirm is required if the publisher needs at-least-once processing of messages.
diffDeliverNoAck Indicates the rate at which messages are delivered to consumers that use manual acknowledgements. Number Messages in transit might get lost in the event of a connection failure, and such messages might need to be retransmitted. Acknowledgements let the server and clients know when to retransmit messages.

A manual acknowledgement is an ‘explicit’ acknowledgement that is received from the consumer. Manually sent acknowledgements can be positive or negative. Positive acknowledgements simply instruct RabbitMQ to record a message as delivered and can be discarded. Negative acknowledgements with basic.reject have the same effect. The difference is primarily in the semantics: positive acknowledgements assume a message was successfully processed while their negative counterpart suggests that a delivery wasn't processed but still should be deleted.

Whether positive or negative, manual acknowledgements deliver low throughput and hence should be avoided. A low value is therefore desired for this measure.
diffDeliver Indicates the rate at which messages are delivered to consumers that use automatic acknowledgements. Messages/Sec  
diffConsumerAck Indicates the rate at which messages are being acknowledged by consumers. Messages/Sec If the diffGetNoAck measure registers an abnormally low value, you may want to check the value of this measure at around the same time to determine whether a delay by consumers in acknowledging messages caused the delivery delay.
diffRedeliver Indicates the rate at which messages with the ‘redelivered’ flag set are being delivered. Messages/Sec  
diffGet Indicates the rate at which messages not requiring acknowledgement are being delivered in response to basic.get. Messages/Sec Compare the value of this measure with that of the diffGetNoAck measure to figure out what type of messages are being delivered much slower than the rest.
diffGetNoAck Indicates the rate at which messages requiring acknowledgement are being delivered in response to basic.get. Messages/Sec Compare the value of this measure with that of the diffGet measure to figure out what type of messages are being delivered much slower than the rest.
diffReturnValue Indicates the rate at which unroutable messages with the ‘mandatory’ flag set to ‘true’ were sent to publishers. Messages/Sec An unroutable message is a message without a destination. For example, a message sent to an exchange without any bound queue.

If the ‘mandatory’ flag is set to ‘true’, then the cluster returns an unroutable message to the producer with a ‘basic.return’ AMQP method.
diffDiskReads Indicates the rate at which queues read messages from disk. Messages/Sec A high value could indicate that messages are frequently read from the disk and not from the RAM. This could be owing to high memory pressure, which may have forced RabbitMQ to move messages from RAM to disk.
diffDiskWrites Indicates the rate at which queues wrote messages to disk. Messages/Sec A high value for this measure could indicate any of the following:

  • Many messages have been published in such a way that they must be written to disk;

  • A very high memory pressure on RabbitMQ has caused the cluster to move the majority of messages from RAM to disk.

currConnections Indicates the number of connections for all virtual hosts that the current user has access to. Number Each connection uses about 100 KB of RAM (and even more if TLS is used). This means that if the value of the currConnections measure is over 1000, it can be a heavy burden on a RabbitMQ server. In the worst case, the server can crash because it runs out of memory. The AMQP protocol has a mechanism called channels that “multiplexes” a single TCP connection. It is recommended that each process create only one TCP connection and use multiple channels in that connection for different threads. It is also recommended that both connections and channels be kept to a minimum, because an unusually high value for the currConnections and currChannels measures can adversely impact the performance of the RabbitMQ management interface.
currChannels Indicates the total number of channels for all virtual hosts the current user has access to. Number
currExchanges Indicates the total number of exchanges for all virtual hosts the current user has access to. Number An exchange is responsible for the routing of the messages to the different queues. An exchange accepts messages from the producer application and routes them to message queues with the help of bindings and routing keys.
currQueues Indicates the total number of queues for all virtual hosts the current user has access to. Number A queue is a buffer that stores messages. Messages are published to exchanges, which distribute them to queues using rules called bindings.

Queues are single-threaded in RabbitMQ, and one queue can handle up to about 50k messages/s. You will achieve better throughput on a multi-core system if you have multiple queues and consumers. You will achieve optimal throughput if you have as many queues as cores on the underlying node(s). This means, ideally, the value of this measure should be equal to the number of cores on the monitored node.

If the value of this measure is over 1000, it is a cause for concern, because the RabbitMQ management interface keeps information about all queues, and this might slow down the server. CPU and RAM usage may also be negatively affected if you have too many queues (thousands of queues). The RabbitMQ management interface collects and calculates metrics for every queue, which uses some resources, and CPU and disk contention can occur if you have thousands upon thousands of active queues and consumers.
currConsumers Indicates the total number of consumers for all virtual hosts the current user has access to. Number A consumer is a user application that receives messages.

You will achieve better throughput on a multi-core system if the value of this measure is more than one. However, if there are a large number of consumers, CPU and disk contention can occur on the RabbitMQ management interface.
publish Indicates the total number of messages entering the server. Number  
confirm Indicates the total number of messages that the server is confirming to publishers. Number A ‘Publish Confirm’ is an acknowledgement sent by the cluster to a publisher, confirming the receipt of a message from that publisher. Publish Confirms have a performance impact, so the lower the value of this measure, the better. However, keep in mind that a Publish Confirm is required if the publisher needs at-least-once processing of messages.
deliverNoAck Indicates the total number of messages that this virtual host delivered to consumers that use manual acknowledgements. Number Messages in transit might get lost in the event of a connection failure, and such messages might need to be retransmitted. Acknowledgements let the server and clients know when to retransmit messages.

A manual acknowledgement is an ‘explicit’ acknowledgement that is received from the consumer. Manually sent acknowledgements can be positive or negative. Positive acknowledgements simply instruct RabbitMQ to record a message as delivered and can be discarded. Negative acknowledgements with basic.reject have the same effect. The difference is primarily in the semantics: positive acknowledgements assume a message was successfully processed while their negative counterpart suggests that a delivery wasn't processed but still should be deleted.

Whether positive or negative, manual acknowledgements deliver low throughput and hence should be avoided. A low value is therefore desired for this measure.
deliver Indicates the total number of messages that this virtual host delivered to consumers that use automatic acknowledgements. Number In automatic acknowledgement mode, a message is considered to be successfully delivered immediately after it is sent. This mode trades off higher throughput (as long as the consumers can keep up) for reduced safety of delivery and consumer processing. This mode is often referred to as “fire-and-forget”. Unlike with the manual acknowledgement model, if the consumer's TCP connection or channel is closed before successful delivery, the message sent by the server will be lost. Therefore, automatic message acknowledgement should be considered unsafe and not suitable for all workloads.

The value 0 is hence ideal for this measure.
consumerAck Indicates the total number of messages that are being acknowledged by consumers of this virtual host. Number  
redeliver Indicates the total number of messages that are being delivered by this virtual host, with the ‘redelivered’ flag set. Number If a message is delivered to a consumer and then requeued (because it was not acknowledged before the consumer connection dropped, for example), then RabbitMQ will set the ‘redelivered’ flag on it when it is delivered again (whether to the same consumer or a different one). This is a hint that a consumer may have seen this message before (although that is not guaranteed; the message may have made it out of the broker but not into a consumer before the connection dropped). Conversely, if the ‘redelivered’ flag is not set, then it is guaranteed that the message has not been seen before.
getNoAck Indicates the total number of messages not requiring acknowledgement that are being delivered in response to basic.get on this virtual host. Number  
get Indicates the total number of messages requiring acknowledgement that are being delivered in response to basic.get. Number For best performance and high throughput, the value of this measure should be low.
returnValue Indicates the rate at which this virtual host sent unroutable messages with the ‘mandatory’ flag set to ‘true’ to publishers. Messages/Sec An unroutable message is a message without a destination. For example, a message sent to an exchange without any bound queue.

If the ‘mandatory’ flag is set to ‘true’, then an unroutable message is returned to the producer with a ‘basic.return’ AMQP method.

To know which nodes in the cluster returned the maximum number of messages to publishers, use the detailed diagnosis of this measure.
diskReads Indicates the total number of messages read from disk on this virtual host. Number A high value could indicate that many messages are read from the disk and not from the RAM. This could be owing to high memory pressure, which may have forced RabbitMQ to move messages from RAM to disk.

If this measure reports an abnormally high value, then use the detailed diagnosis of this measure to know which nodes in the cluster are performing the maximum reads from the disk. Such nodes could be running out of memory.
diskWrites Indicates the total number of messages written to disk on this virtual host. Number A high value for this measure could indicate any of the following:

  • Many messages have been published in such a way that they must be written to disk;

  • A very high memory pressure on RabbitMQ has caused the cluster to move the majority of messages from RAM to disk.

If this measure reports an abnormally high value, then use the detailed diagnosis of this measure to know which nodes in the cluster are performing the maximum writes to the disk. Such nodes could be running out of memory.
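
A few of the relationships and limits in the table above can be checked with a short script. The following is a minimal sketch and not part of the eG test. It assumes the counts are read from the `/api/overview` endpoint of the RabbitMQ management plugin, whose `queue_totals` and `object_totals` objects carry the fields used here. The function name `summarize_overview`, the sample payload, and the mapping of these fields onto the measures of this test are illustrative assumptions. The limits of 1000 connections and 1000 queues, and the figure of about 100 KB of RAM per connection, are the ones quoted in the interpretations above.

```python
from typing import Any, Dict, List

# Figures quoted in the table above (currConnections and currQueues rows).
CONNECTION_LIMIT = 1000
QUEUE_LIMIT = 1000
KB_PER_CONNECTION = 100  # approximate; more if TLS is used


def summarize_overview(overview: Dict[str, Any]) -> Dict[str, Any]:
    """Derive a few figures from a management-API style overview payload.

    Hypothetical helper: the mapping from payload fields to the measures
    of this test is an assumption made for illustration only.
    """
    queue_totals = overview.get("queue_totals", {})
    object_totals = overview.get("object_totals", {})

    ready = queue_totals.get("messages_ready", 0)
    unacked = queue_totals.get("messages_unacknowledged", 0)
    connections = object_totals.get("connections", 0)
    queues = object_totals.get("queues", 0)

    warnings: List[str] = []
    if connections > CONNECTION_LIMIT:
        warnings.append("connections above %d" % CONNECTION_LIMIT)
    if queues > QUEUE_LIMIT:
        warnings.append("queues above %d" % QUEUE_LIMIT)

    return {
        # Total messages = ready + unacknowledged (see diffTotal).
        "total_messages": ready + unacked,
        # Rough lower bound on RAM held by connections, in KB.
        "connection_ram_kb": connections * KB_PER_CONNECTION,
        "warnings": warnings,
    }


# Illustrative payload; the values are made up for this example.
sample = {
    "queue_totals": {"messages_ready": 120, "messages_unacknowledged": 30},
    "object_totals": {"connections": 1500, "queues": 8},
}
summary = summarize_overview(sample)
print(summary)
```

The helper takes a plain dictionary, so it can be tried with saved output before being pointed at a live cluster. The RAM figure is only a rough lower bound, since the per-connection cost is approximate and rises when TLS is used.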