Measures reported by RabMQNodesTest
A RabbitMQ cluster is a logical grouping of one or several Erlang nodes, each running the RabbitMQ application and sharing users, virtual hosts, queues, exchanges, bindings, and runtime parameters.
A client can connect to any node and perform any operation. Nodes will route operations to the queue master node transparently to clients. In case of a node failure, clients will be able to reconnect to a different node, recover their topology and continue operation. Regardless of which node is serving client requests, at any point in time, administrators should be able to tell the operational state of each node in the cluster, so that the failed nodes can be identified.
Moreover, client connections, channels, and queues are distributed across cluster nodes. This means that all nodes in a cluster should be sized with adequate resources such as memory, bandwidth, disk space, file/socket handles, and Erlang processes. Administrators should be able to track the usage of these critical resources on each node, and pinpoint any node that is under-sized.
Additionally, administrators should observe the reads from and writes to the queue index journals, message store, and disk of each node to gauge the level of activity on each node and measure a node's ability to handle these activity levels.
The RabMQNodesTest test enables administrators to perform all the above! This test auto-discovers the nodes in a target cluster. For each node, the test then reports the state of that node, its uptime, and how its memory, file descriptors, socket descriptors, bandwidth resources and Erlang processes have been utilized. Nodes that are down and those that are running out of resources are revealed in the process. Furthermore, the test reports the rate at which reads, writes, seeks, and syncs were performed on the disk of each node, thus revealing the I/O processing ability of each node. The time taken by every node to perform these I/O operations is also reported, so that latent nodes can be identified.
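The node discovery and per-node checks described above can be sketched with a short script, assuming the RabbitMQ management plugin is enabled on its default port (15672); the `/api/nodes` endpoint and the field names `running`, `uptime`, `fd_used`, and `fd_total` are from the management HTTP API, while the host and credentials shown are placeholders:

```python
# Minimal sketch of the kind of per-node polling such a test performs.
# The HTTP fetch is commented out so the summarization logic can be
# shown standalone; uncomment it against a live broker with the
# management plugin enabled.
#
# import json, base64, urllib.request
# req = urllib.request.Request("http://localhost:15672/api/nodes")
# req.add_header("Authorization",
#                "Basic " + base64.b64encode(b"guest:guest").decode())
# nodes = json.load(urllib.request.urlopen(req))

def summarize_nodes(nodes):
    """Map each node name to its state, uptime (days), and FD usage (%)."""
    summary = {}
    for node in nodes:
        summary[node["name"]] = {
            # "running" is a boolean in the management API payload
            "status": "Running" if node.get("running") else "Stopped",
            # "uptime" is reported in milliseconds
            "uptime_days": node.get("uptime", 0) / (1000 * 60 * 60 * 24),
            "used_fd_pct": round(100.0 * node["fd_used"] / node["fd_total"], 1),
        }
    return summary

# Sample payload in the shape /api/nodes returns (heavily trimmed):
sample = [
    {"name": "rabbit@host1", "running": True,
     "uptime": 172800000, "fd_used": 512, "fd_total": 1024},
    {"name": "rabbit@host2", "running": False,
     "uptime": 0, "fd_used": 0, "fd_total": 1024},
]
print(summarize_nodes(sample))
```

Each element of the `/api/nodes` response corresponds to one cluster node, which is why the test reports one set of results per node.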
Outputs of the test: One set of results for each node in the monitored RabbitMQ cluster.
The measures made by this test are as follows:
| Measurement | Description | Measurement Unit | Interpretation |
| --- | --- | --- | --- |
| status | Indicates the current state of this node. | | The values that this measure can take and their corresponding numeric values are as follows: Running = 1; Stopped = 0.<br>Note: This test reports the Measure Values listed above to indicate the current operational state of a node. In the graph of this measure however, the same will be represented using the numeric equivalents. |
| uptime | Indicates the uptime of this node (in days). | Days | Compare the value of this measure across nodes to identify the node that has been up and running for the longest time, and the node that was restarted most recently. |
| fdTotal | Indicates the maximum number of file descriptors that this node can use. | Number | By default, a node can use up to a maximum of 1024 file descriptors.<br>A file descriptor (FD, less frequently fildes) is an abstract indicator (handle) used to access a file or other input/output resource, such as a pipe or network socket. |
| usedFD | Indicates what percentage of the maximum number of file descriptors configured for this node is currently in use. | Percent | A value close to 100% is a cause for concern, as it indicates that the node is running out of file handles. If the value reaches 100%, the node will block all incoming connections. To avoid this, you may want to consider increasing the maximum file descriptor configuration of the node. |
| fdUsed | Indicates the number of file descriptors that have been utilized on this node. | Number | This count includes both file and socket descriptors.<br>A socket is an abstraction of a communication endpoint. Just as they would use file descriptors to access a file, applications use socket descriptors to access sockets. Socket descriptors are implemented as file descriptors on UNIX systems. |
| socketTotal | Indicates the maximum number of socket descriptors that this node can use. | Number | |
| socketDescriptorsUsed | Indicates the number of socket descriptors this node is using. | Number | |
| usedSocketDescriptors | Indicates what percentage of the maximum number of socket descriptors configured for this node is presently in use. | Percent | A value close to 100% is a cause for concern, as it indicates that the node is running out of socket descriptors. If a node exhausts its socket descriptors, that node will block all incoming connections. To avoid this, you may want to consider increasing the maximum socket descriptor configuration of the node. |
| processTotal | Indicates the maximum number of Erlang processes configured for this node. | Number | |
| erlangProcessUsed | Indicates the number of Erlang processes that this node is currently using. | Number | |
| usedErlangProcess | Indicates what percentage of the maximum number of Erlang processes configured for this node is currently in use. | Percent | Queues, connections, and channels are the main RabbitMQ components that consume Erlang processes; the higher their number, the higher the usage of Erlang processes.<br>If the value of this measure is close to 100% for a node, that node is running out of Erlang processes. This could cause operations to hang and messages to start piling up on RabbitMQ. To avoid this, you may want to consider increasing the maximum number of Erlang processes configured for the node. |
| memoryTotal | Indicates the maximum amount of memory this node can use. | MB | |
| memoryUsed | Indicates the amount of memory in use on this node. | MB | |
| usedMemory | Indicates the percentage of memory that this node is using. | Percent | A value close to 100% indicates that the node is running out of memory. If the situation is allowed to persist, the node may soon exhaust its memory completely, bringing messaging operations to a standstill.<br>At this juncture, start by looking at the common memory consumers on a node: connections; channels; queue masters, indices, and messages kept in memory; queue mirrors, indices, and messages kept in memory; binaries containing message bodies and metadata; plugins; Mnesia tables and other ETS tables that keep an in-memory copy of their data; and memory used by code and atoms.<br>If any of the aforementioned components are large in number, memory consumption too is likely to be high. In such a situation, see if you can reduce memory consumption by decreasing their count; for instance, see if you can optimize the number of channels your applications typically use and bring that number down. Alternatively, you may want to consider resizing the memory allocation of the node. |
| mnesiaTransactionsRAM | Indicates the rate at which RAM-only Mnesia transactions take place on this node. | Transactions/Sec | An Mnesia transaction is a mechanism by which a series of database operations can be executed as one functional block.<br>If the transaction is performed on data stored exclusively in memory, it is a RAM-only Mnesia transaction; an example of such a transaction is the creation/deletion of transient queues.<br>If the transaction is performed on data stored on disk, it is a disk transaction; an example of such a transaction is the creation/deletion of durable queues. |
| mnesiaTransactionsDisk | Indicates the rate at which Mnesia transactions take place on this node's disk. | Transactions/Sec | See the interpretation of the mnesiaTransactionsRAM measure. |
| qiJournal | Indicates the rate at which message information is written to queue index journals on this node. | Messages/Sec | Each record in a queue index journal represents a message being published to a queue, delivered from a queue, or acknowledged in a queue.<br>If the value of this measure keeps increasing consistently, it could indicate that one/more queues reside on the target node and that there is a high level of messaging activity on that node. |
| storeRead | Indicates the rate at which messages are read from the message store on this node. | Messages/Sec | Messages (the body, and any properties and/or headers) can either be stored directly in the queue index, or written to the message store.<br>The message store is a key-value store for messages, shared among all queues in the server. There are technically two message stores (one for transient and one for persistent messages), but they are usually considered together as “the message store”.<br>If the value of the storeWrite measure increases consistently, it could indicate the presence of many lazy queues with many messages on the node. To accommodate all these messages in the message store, the node will have to be sized with sufficient disk space and file descriptors. |
| storeWrite | Indicates the rate at which messages are written to the message store on this node. | Messages/Sec | See the interpretation of the storeRead measure. |
| qiRead | Indicates the rate at which segment files are read from the queue index on this node. | Messages/Sec | The queue index maintains knowledge about where a given message is in a queue, along with whether it has been delivered and acknowledged; there is therefore one queue index per queue.<br>If the values reported by the qiRead and qiWrite measures are consistently high for a node, it could mean that one/more queues reside on the node and that many messages are stored in those queues. |
| qiWrite | Indicates the rate at which segment files are written to the queue index on this node. | Messages/Sec | See the interpretation of the qiRead measure. |
| readsIO | Indicates the rate at which read operations are performed on the disk of this node. | IOPS | Ideally, the value of this measure should be low. A consistent increase in this value could indicate excessive memory usage, and could hint at a potential memory contention. The key factor impacting memory usage is message queue length. Time-correlate the changes in the Messages in queue measure with those of the memory measures to figure out whether the queue length is indeed increasing the memory pressure; if so, you may want to initiate measures to limit the queue size. |
| writesIO | Indicates the rate at which write operations are performed on the disk of this node. | IOPS | |
| seekIO | Indicates the rate at which seek operations are performed on the disk of this node. A seek operation happens when the disk physically locates a piece of data on it, when reading/writing. | IOPS | |
| syncIO | Indicates the rate at which the node invokes fsync() to ensure data is flushed to the disk. | IOPS | |
| readBandwidth | Indicates the rate at which data is read by this node. | MB/Sec | The readBandwidth and writeBandwidth measures are good indicators of the bandwidth used by a node when reading/writing. Compare the value of this measure across nodes to know which node is consuming the maximum bandwidth. |
| writeBandwidth | Indicates the rate at which data is written by this node. | MB/Sec | See the interpretation of the readBandwidth measure. |
| readAverageTime | Indicates the average time taken by this node to perform a read operation. | Millisecs | Compare the values of the four average-time measures across nodes to identify the most latent node. In the process, you can identify which type of I/O operation takes the longest on each node: reads, writes, seeks, or syncs. |
| writeAverageTime | Indicates the average time taken by this node to perform a write operation. | Millisecs | |
| seekAverageTime | Indicates the average time taken by this node to perform a seek operation. | Millisecs | |
| syncAverageTime | Indicates the average time taken by this node to complete an fsync(). | Millisecs | |
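Several interpretations above suggest raising a node's limits. As a sketch of where those ceilings live (the values shown are illustrative, not recommendations; verify the keys and flags against the documentation for your RabbitMQ, Erlang, and OS versions): the file descriptor ceiling comes from the operating system, the memory ceiling from the node's high watermark, and the Erlang process ceiling from the emulator's `+P` flag.

```ini
# /etc/systemd/system/rabbitmq-server.service.d/limits.conf
# Raise the OS file-descriptor ceiling, from which fdTotal and
# socketTotal are derived.
[Service]
LimitNOFILE=65536

# /etc/rabbitmq/rabbitmq.conf
# Let the node use up to 60% of installed RAM (the default is 0.4)
# before it starts blocking publishers; this governs memoryTotal.
vm_memory_high_watermark.relative = 0.6

# /etc/rabbitmq/rabbitmq-env.conf
# Raise the Erlang process limit (processTotal) via the emulator's
# +P flag.
RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="+P 2097152"
```

A node must be restarted for the file descriptor and Erlang process limits to take effect, so plan such changes during a maintenance window.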
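The four "used…" percentage measures above are simple ratios of the corresponding used/total pairs. A minimal sketch of that computation, assuming the raw counts come from the management API's per-node JSON (`fd_used`/`fd_total`, `sockets_used`/`sockets_total`, `proc_used`/`proc_total`, and `mem_used`/`mem_limit` are that API's field names; the sample values are made up):

```python
def pct(used, total):
    """Used-to-total ratio as a percentage; guards against a zero total."""
    return round(100.0 * used / total, 2) if total else 0.0

def usage_percentages(node):
    """Derive the percentage measures from one node's raw counters."""
    return {
        "usedFD": pct(node["fd_used"], node["fd_total"]),
        "usedSocketDescriptors": pct(node["sockets_used"], node["sockets_total"]),
        "usedErlangProcess": pct(node["proc_used"], node["proc_total"]),
        "usedMemory": pct(node["mem_used"], node["mem_limit"]),
    }

# Illustrative counters only; mem values are in bytes in the API payload.
node = {"fd_used": 900, "fd_total": 1024,
        "sockets_used": 700, "sockets_total": 829,
        "proc_used": 400000, "proc_total": 1048576,
        "mem_used": 3 * 1024**3, "mem_limit": 4 * 1024**3}
print(usage_percentages(node))
```

A node whose `usedFD` hovers near 100% in such output is the one the interpretations above flag as at risk of blocking incoming connections.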