eG Monitoring
 

Measures reported by RabMQNodesTest

A RabbitMQ cluster is a logical grouping of one or several Erlang nodes, each running the RabbitMQ application and sharing users, virtual hosts, queues, exchanges, bindings, and runtime parameters.

A client can connect to any node and perform any operation. Nodes will route operations to the queue master node transparently to clients. In case of a node failure, clients will be able to reconnect to a different node, recover their topology and continue operation. Regardless of which node is serving client requests, at any point in time, administrators should be able to tell the operational state of each node in the cluster, so that the failed nodes can be identified.

Moreover, client connections, channels, and queues are distributed across cluster nodes. This means that all nodes in a cluster should be sized with adequate resources such as memory, bandwidth, disk space, file/socket handles, and Erlang processes. Administrators should be able to track the usage of these critical resources on each node, and pinpoint the node that is under-sized.

Additionally, administrators should observe the reads from and writes to the queue index journals, message store, and disk of each node to gauge the level of activity on each node and measure a node's ability to handle these activity levels.

The RabMQNodesTest test enables administrators to perform all the above! This test auto-discovers the nodes in a target cluster. For each node, the test then reports the state of that node, its uptime, and how its memory, file descriptors, socket descriptors, bandwidth resources and Erlang processes have been utilized. Nodes that are down and those that are running out of resources are revealed in the process. Furthermore, the test reports the rate at which reads, writes, seeks, and syncs were performed on the disk of each node, thus revealing the I/O processing ability of each node. The time taken by every node to perform these I/O operations is also reported, so that latent nodes can be identified.
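Per-node details like these are exposed by the RabbitMQ management plugin's HTTP API, whose GET /api/nodes response includes fields such as running and uptime for each node. The sketch below parses an abridged, illustrative sample of that response (the field names are real; the node names and values are made up) to derive the node state and uptime in days:

```python
import json

# Abridged, illustrative sample of a GET /api/nodes response from the
# RabbitMQ management plugin (field names are real; values are made up)
sample = json.loads("""
[
  {"name": "rabbit@node1", "running": true, "uptime": 432000000},
  {"name": "rabbit@node2", "running": false, "uptime": 0}
]
""")

for node in sample:
    # Map the operational state to the numeric values used in graphs
    status = 1 if node["running"] else 0       # Running = 1, Stopped = 0
    uptime_days = node["uptime"] / 86_400_000  # the API reports uptime in ms
    print(f'{node["name"]}: status={status}, uptime={uptime_days:.1f} days')
```

In a live deployment, the same JSON would be fetched from the management API endpoint with the monitoring user's credentials rather than embedded as a string.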

Outputs of the test: One set of results for each node in the monitored RabbitMQ Cluster.

The measures made by this test are as follows:

Measurement Description Measurement Unit Interpretation
status Indicates the current state of this node.   The values that this measure can take and their corresponding numeric values are as follows:

Measure Value Numeric Value
Running 1
Stopped 0

Note:

This test reports the Measure Values listed in the table above to indicate the current operational state of a node. In the graph of this measure however, the same will be represented using the numeric equivalents.
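The state-to-number translation described above can be expressed as a simple lookup; a minimal sketch:

```python
# Numeric equivalents used when graphing the status measure
STATUS_VALUES = {"Running": 1, "Stopped": 0}

def status_numeric(measure_value):
    """Translate a reported Measure Value to its graphed numeric value."""
    return STATUS_VALUES[measure_value]

print(status_numeric("Running"))  # 1
print(status_numeric("Stopped"))  # 0
```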

uptime Indicates the uptime of this node (in days). Days Compare the value of this measure across nodes; an abnormally low value points to a node that was restarted recently, while the highest value identifies the node that has been up the longest.
fdTotal Indicates the maximum number of file descriptors that this node can use. Number By default, a node can use up to a maximum of 1024 file descriptors.

A file descriptor (FD, less frequently fildes) is an abstract indicator (handle) used to access a file or other input/output resource, such as a pipe or network socket.
usedFD Indicates what percentage of the maximum number of file descriptors configured for this node is currently in use. Percent A value close to 100% is a cause for concern, as it indicates that the node is running out of file handles. If the value reaches 100%, then the node will block all incoming connections. To avoid this, you may want to consider increasing the maximum file descriptor configuration of the node.
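The percentage here is simply fd_used over fd_total; a minimal sketch of flagging nodes that are approaching descriptor exhaustion (the node names, counts, and the 90% alert threshold are all illustrative assumptions):

```python
# Hypothetical per-node descriptor counts, mirroring the management API's
# fd_used / fd_total fields (values are illustrative)
nodes = {
    "rabbit@node1": {"fd_used": 980, "fd_total": 1024},
    "rabbit@node2": {"fd_used": 120, "fd_total": 1024},
}

THRESHOLD_PCT = 90.0  # hypothetical alert threshold

for name, fd in nodes.items():
    used_pct = 100.0 * fd["fd_used"] / fd["fd_total"]
    if used_pct >= THRESHOLD_PCT:
        # A node that reaches 100% will block all incoming connections
        print(f"{name}: {used_pct:.1f}% of file descriptors in use - "
              f"consider raising the node's descriptor limit")
```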
fdUsed Indicates the number of file descriptors that have been utilized on this node. Number This count includes both file and socket descriptors.

A file descriptor (FD, less frequently fildes) is an abstract indicator (handle) used to access a file or other input/output resource, such as a pipe or network socket.

A socket is an abstraction of a communication endpoint. Just as they would use file descriptors to access a file, applications use socket descriptors to access sockets. Socket descriptors are implemented as file descriptors in the UNIX System.
socketTotal Indicates the maximum number of socket descriptors that this node can use. Number A socket is an abstraction of a communication endpoint. Just as they would use file descriptors to access a file, applications use socket descriptors to access sockets.
socketDescriptorsUsed Indicates the number of socket descriptors this node is using. Number  
usedSocketDescriptors Indicates what percentage of the maximum number of socket descriptors configured for this node is presently in use. Percent A value close to 100% is a cause for concern, as it indicates that the node is running out of socket descriptors. If a node exhausts socket descriptors, then that node will block all incoming connections. To avoid this, you may want to consider increasing the maximum socket descriptor configuration of the node.
processTotal Indicates the maximum number of Erlang processes configured for this node. Number  
erlangProcessUsed Indicates the number of Erlang processes that this node is currently using. Number  
usedErlangProcess Indicates what percentage of the maximum number of Erlang processes configured for this node is currently in use. Percent Queues, connections, and channels are the main components of RabbitMQ that consume processes. This means that the higher the number of queues, connections, and channels, the higher the usage of Erlang processes will be.

If the value of this measure is close to 100% for a node, it implies that that node is running out of Erlang processes. This could cause databases to hang and messages to start piling up on RabbitMQ. To avoid this, you may want to consider increasing the maximum number of Erlang processes configured for the node.
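The Erlang process ceiling is a VM flag (+P), which RabbitMQ can be given at startup via its environment configuration. A sketch for rabbitmq-env.conf, assuming the value 2097152 as a purely illustrative limit (size it to your actual queue, connection, and channel counts):

```shell
# rabbitmq-env.conf - pass the Erlang VM +P flag to raise the maximum
# number of Erlang processes for this node (2097152 is illustrative)
RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="+P 2097152"
```

The node must be restarted for the new limit to take effect, and the change should be mirrored in processTotal on the next measurement cycle.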
memoryTotal Indicates the maximum amount of memory this node can use. MB  
memoryUsed Indicates the amount of memory in use on this node. MB  
usedMemory Indicates the percentage of memory that this node is using. Percent A value close to 100% indicates that the node is running out of memory. If the situation is allowed to persist, then the node may soon exhaust its memory completely. This could bring messaging operations to a standstill.

At this juncture, you can start by looking at the common memory consumers on a node, which are as follows:

  • Connections

  • Channels

  • Queue masters, indices, and messages kept in memory

  • Queue mirrors, indices, and messages kept in memory

  • Binaries containing message bodies and metadata

  • Plugins

  • Mnesia tables and other ETS tables that keep an in-memory copy of their data

  • Memory used by code and atoms

If any of the afore-mentioned factors are large in number, then memory consumption too is likely to increase. In such a situation, to conserve memory, see if you can control the memory consumption of the aforesaid components by decreasing their count. For instance, see if you can optimize the number of channels your applications typically use and bring that number down.

Alternatively, you may want to consider resizing the memory allocation to the node.
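To see where a node's memory is actually going, RabbitMQ can report a per-category breakdown (for example via rabbitmq-diagnostics memory_breakdown). The sketch below ranks an abridged, illustrative breakdown by size to surface the top consumers from the list above (the category names follow RabbitMQ's breakdown; the byte values are made up):

```python
# Abridged, illustrative per-category memory breakdown for one node
# (category names follow RabbitMQ's memory breakdown; values are made up)
breakdown = {
    "connection_readers": 24_000_000,
    "connection_channels": 18_000_000,
    "queue_procs": 310_000_000,
    "binary": 150_000_000,
    "mnesia": 6_000_000,
    "plugins": 12_000_000,
    "code": 28_000_000,
    "atom": 1_500_000,
}

# Rank the categories to identify the biggest memory consumers first
for category, used in sorted(breakdown.items(),
                             key=lambda kv: kv[1], reverse=True):
    print(f"{category:>20}: {used / 1_048_576:8.1f} MB")
```

In this made-up sample, queue processes dominate, which would point the investigation toward queue lengths and messages kept in memory.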

mnesiaTransactionsRAM Indicates the rate at which RAM-only Mnesia transactions take place on this node. Transactions/Sec An Mnesia transaction is a mechanism by which a series of database operations can be executed as one functional block.

If the transaction is performed on data stored exclusively in memory, it is a RAM-only Mnesia transaction. An example of such a transaction is creation/deletion of transient queues.

If the transaction is performed on data stored on disk, it is a disk transaction. An example of such a transaction is creation/deletion of durable queues.
mnesiaTransactionsDisk Indicates the rate at which Mnesia transactions take place on this node's disk. Transactions/Sec
qiJournal Indicates the rate at which message information is written to queue index journals on this node. Messages/Sec Each record in a queue index journal represents a message being published to a queue, being delivered from a queue, and being acknowledged in a queue.

If the value of this measure keeps increasing consistently, it could indicate that one/more queues reside on the target node and that there is a high level of messaging activity on that node.
storeRead Indicates the rate at which messages are read from the message store on this node. Messages/Sec Messages (the body, and any properties and / or headers) can either be stored directly in the queue index, or written to the message store.

The message store is a key-value store for messages, shared among all queues in the server. There are technically two message stores (one for transient and one for persistent messages) but they are usually considered together as “the message store”.

If the value of the storeWrite measure increases consistently, it could indicate the presence of many lazy queues on the node with many messages. To accommodate all these messages in the message store, the node will have to be sized with sufficient disk space and file descriptors.
storeWrite Indicates the rate at which messages are written to the message store on this node. Messages/Sec
qiRead Indicates the rate at which segment files are read from the queue index on this node. Messages/Sec The queue index is responsible for maintaining knowledge about where a given message is in a queue, along with whether it has been delivered and acknowledged. There is therefore one queue index per queue.

If the values reported by these measures are consistently high for a node, it could mean that one/more queues reside on the node and many messages are stored in the queues.
qiWrite Indicates the rate at which segment files are written to the queue index on this node. Messages/Sec
readsIO Indicates the rate at which read operations are performed on the disk on this node. IOPS Ideally, the value of this measure should be low. A consistent increase in this value could indicate excessive memory usage, and could hint at a potential memory contention. The key factor impacting memory usage is message queue length. Time-correlate the changes in the Messages in queue measure with that of the usedMemory measure to figure out if the queue length is indeed increasing the memory pressure. In which case, you may want to initiate measures to limit the queue size.
writesIO Indicates the rate at which write operations are performed on the disk on this node. IOPS
seekIO Indicates the rate at which seek operations are performed on the disk on this node. IOPS A seek operation happens when the disk physically locates a piece of data on it, when reading/writing.
syncIO Indicates the rate at which the node invokes fsync() to ensure data is flushed to the disk. IOPS
readBandwidth Indicates the rate at which data is read by this node. MB/Sec These measures are good indicators of the bandwidth used by a node when reading/writing. You can compare the value of this measure across nodes to know which node is consuming maximum bandwidth.
writeBandwidth Indicates the rate at which data is written by this node. MB/Sec
readAverageTime Indicates the average time taken by this node to perform a read operation. Millisecs Compare the values of these measures across nodes to identify the most latent node. In the process, you can identify which type of I/O operations are taking the longest on each node - reads? writes? seeks? or syncs?
writeAverageTime Indicates the average time taken by this node to perform a write operation. Millisecs
seekAverageTime Indicates the average time taken by this node to perform a seek operation. Millisecs
syncAverageTime Indicates the average time taken by this node to complete an fsync(). Millisecs
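The cross-node comparison described above can be sketched as follows, using hypothetical per-node average times for the four I/O operation types (node names and millisecond values are illustrative):

```python
# Hypothetical per-node average I/O times in milliseconds, mirroring the
# readAverageTime / writeAverageTime / seekAverageTime / syncAverageTime
# measures (values are illustrative)
latencies = {
    "rabbit@node1": {"read": 0.4, "write": 1.2, "seek": 0.1, "sync": 7.5},
    "rabbit@node2": {"read": 0.3, "write": 0.9, "seek": 0.1, "sync": 2.1},
}

# Most latent node = highest combined average I/O time across operations
most_latent = max(latencies, key=lambda n: sum(latencies[n].values()))

# Slowest operation type on that node - reads? writes? seeks? or syncs?
slowest_op = max(latencies[most_latent], key=latencies[most_latent].get)

print(f"most latent node: {most_latent}, slowest operation: {slowest_op}")
```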