eG Monitoring
 

Measures reported by IgniteTcpDisTest

Apache Ignite is a distributed data system that relies on data distributed across nodes for storage, resilience, and scalability. Given this design, one of its key capabilities is discovering nodes and adding them to the cluster. The TCP Discovery SPI defines the network parameters of the default discovery mechanism, which uses the TCP/IP protocol to exchange discovery messages and is implemented in the TcpDiscoverySpi class. The properties of the discovery mechanism can be changed through the configuration.

Given the importance of the discovery mechanism, it is important to monitor this SPI and ensure that new nodes are added to the cluster quickly when they are introduced.

This test monitors the TCP Discovery SPI and collects key statistics that help administrators understand the state of the SPI and of inter-node communication, and take action if there is an issue.

Outputs of the test: One set of results for each Apache Ignite Server
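As an illustration of how one set of results per server might be consumed, the sketch below models a single sample and checks it against the interpretations given on this page. The field names mirror the measure names; the class, the function, the server labels, and the threshold values are hypothetical and are not part of the eG test itself. It also assumes that a spiState of true means "healthy".

```python
from dataclasses import dataclass


@dataclass
class DiscoverySample:
    """One set of results for a single Apache Ignite server."""
    server: str                    # hypothetical label for the monitored server
    spiState: bool                 # assumption: True means the SPI is healthy
    nodesFailed: int
    nodesJoined: int
    nodesLeft: int
    pendingMessageDiscarded: int
    pendingMessageRegistered: int


def snapshot_alerts(sample, max_failed=0, max_pending=100):
    """Return warnings for one sample; both thresholds are illustrative only."""
    alerts = []
    if not sample.spiState:
        alerts.append("SPI is not healthy: discovery and communication may stop")
    if sample.nodesFailed > max_failed:
        alerts.append("failed nodes present: cluster startup may slow down")
    pending = sample.pendingMessageDiscarded + sample.pendingMessageRegistered
    if pending > max_pending:
        alerts.append("many pending messages: cluster config may be out of date")
    return alerts


healthy = DiscoverySample("ignite-1", True, 0, 3, 0, 0, 2)
degraded = DiscoverySample("ignite-2", False, 2, 3, 2, 150, 40)
print(snapshot_alerts(healthy))         # prints []
print(len(snapshot_alerts(degraded)))   # prints 3
```

Suitable thresholds depend on the size and churn of the cluster, so the defaults here should be treated as placeholders.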

The measures made by this test are as follows:

Measurement | Description | Measurement Unit | Interpretation

spiState | Indicates the current state of the Service Provider Interface (SPI). | Boolean | If the SPI is not healthy, discovery and communication between the nodes will stop. Administrators need to ensure that the SPI is working properly before starting the cluster.

nodesFailed | Indicates the number of nodes that are in a failed state. | Number | If too many failed nodes are still listed in the cluster configuration, cluster startup may slow down.

nodesJoined | Indicates the number of nodes that have joined since the cluster was started. | Number | If nodes are able to join and integrate seamlessly into the cluster, it is a good sign that the SPI is working properly.

nodesLeft | Indicates the number of nodes that have left since the cluster was started. | Number | If too many nodes have left the cluster recently, they may need to be removed from the configuration; otherwise, inter-node communication will slow down.

pendingMessageDiscarded | Indicates the number of messages discarded because the target node could not be discovered. | Number | If too many pending messages are discarded, it means that some nodes have left the cluster but the cluster configuration has not been updated.

pendingMessageRegistered | Indicates the number of messages that are yet to be delivered to the target node. | Number | If too many pending messages have not yet been discarded, it means that some nodes have left the cluster but the cluster configuration has not been updated.

totalProcessedMessage | Indicates the number of messages processed per second through the discovery SPI. | Messages/Sec | If this rate declines over a range of measurements, investigate the cause.

totalReceivedMessage | Indicates the number of messages received per second through the discovery SPI. | Messages/Sec | A low value is desired for this measure.

messageWorkerQueueSize | Indicates the size of the queue of discovery messages that are waiting to be sent to other nodes. | MB | The worker queue size should be maintained at an optimal value.

avgMessageProcessingTime | Indicates the average time taken for a message to be processed by the system. | Seconds | Watch the trend: if the processing time rises over a range of measurements, it is a cause for concern.
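The interpretations of totalProcessedMessage and avgMessageProcessingTime depend on how the values move over a range of measurements rather than on any single reading. One possible way to check this, shown here only as a minimal sketch, is to fit a least-squares slope to recent values. The function names and sample values are hypothetical, and a real deployment would likely add a tolerance so that ordinary noise does not raise alerts.

```python
def slope(values):
    """Least-squares slope of values plotted against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        return 0.0  # not enough points to define a trend
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def trend_alerts(processed_rate, avg_processing_time):
    """Flag a falling message rate or a rising processing time."""
    alerts = []
    if slope(processed_rate) < 0:
        alerts.append("totalProcessedMessage is falling")
    if slope(avg_processing_time) > 0:
        alerts.append("avgMessageProcessingTime is rising")
    return alerts


# Hypothetical readings from four consecutive measurement periods.
rates = [120.0, 118.0, 110.0, 95.0]    # Messages/Sec
times = [0.02, 0.02, 0.03, 0.05]       # Seconds
print(trend_alerts(rates, times))
```

With the sample readings above, both conditions are flagged, because the rate declines and the processing time climbs across the window.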