eG Monitoring
 

Measures reported by IgniteClusMeTest

Each node is assigned one of two roles: server node or client node. Server nodes are the workhorses of the cluster; they cache data, execute compute tasks, and so on. Client nodes join the topology as regular nodes but do not store data. They are used to stream data into the cluster and to execute user queries.

A cluster is the core of Apache Ignite because it provides the key benefits of distributed computing, such as load balancing, resilience, failover, and high availability. If there is a problem with communication between cluster nodes, cluster job execution, and so on, Ignite cannot serve the application optimally. It is therefore important to monitor the Ignite cluster so that issues are highlighted before they affect cluster performance.

This test monitors the Ignite cluster and gathers key statistics related to jobs, CPU load, waiting time, and so on, which provide key insights into its health. These statistics help administrators improve cluster performance, identify issues and failures, and add the right infrastructure at the right time, before application performance is affected.

Outputs of the test: One set of results for each Apache Ignite Server

The measures made by this test are as follows:

| Measurement | Description | Measurement Unit | Interpretation |
| --- | --- | --- | --- |
| activeBaselineNodes | Indicates the total number of nodes that are currently in the cluster baseline topology. | Number | The capacity of the cluster to hold data depends on the number of nodes and on the node configuration. If too few nodes are available, the data requirements might not be met. |
| averageActiveJobs | Indicates the average number of active jobs executing at any given time on each node. | Number | The number of active jobs determines how much of the cluster's CPU capacity is occupied. Ensure that the number of active jobs is optimal and that enough CPU capacity is available for new jobs. |
| averageCancelledJobs | Indicates the average number of jobs cancelled on each node. | Number | |
| averageCpuLoad | Indicates the CPU load averaged over all metrics kept in history across all nodes. | Percentage | The number of active jobs determines how much of the cluster's CPU capacity is occupied. Ensure that the number of active jobs is optimal and that enough CPU capacity is available for new jobs, so that new jobs do not have to wait too long. |
| averageJobExecuteTime | Indicates the average time a job takes to execute on any node in the cluster. | Seconds | This is an important metric for understanding how many jobs can be scheduled and for predicting job timings. |
| averageJobWaitTime | Indicates the average time a job waits in the queue before it is picked up for execution. | Seconds | If the wait time is too high, more compute capacity is required; adding new nodes can help. |
| averageRejectedJobs | Indicates the average number of jobs rejected across all nodes during collision resolution operations. | Number | |
| averageWaitingJobs | Indicates the average number of jobs waiting per node across all nodes in the cluster. | Number | If the number of waiting jobs is too high, more compute capacity is required; adding new nodes can help. |
| receivedMessagesCount | Indicates the rate at which messages are received by nodes in the cluster. | Messages/Sec | |
| sentMessagesCount | Indicates the rate at which messages are sent by nodes in the cluster. | Messages/Sec | If the rate declines over a number of measurements, the communication between nodes needs to be improved. |
| idleTimePercentage | Indicates the percentage of time that any given node is idle rather than executing jobs. | Percentage | If idle time is too high, the system has more capacity than required. Administrators can consider removing nodes and using them for other purposes. |
| totalServerNodes | Indicates the total number of server nodes in the cluster. | Number | |
| totalClientNodes | Indicates the total number of client nodes in the cluster. | Number | |
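
The interpretation guidance above can be turned into simple automated checks. The following Python sketch is illustrative only and is not part of the eG product or of Apache Ignite. The threshold constants and the function names `capacity_hints` and `is_declining` are hypothetical, and the documentation gives no numeric limits, so the values shown are placeholders to be tuned per environment. The sketch assumes one set of measures is available as a dictionary keyed by measure name.

```python
from typing import Dict, List

# Hypothetical thresholds: the documentation states no numeric limits,
# so these values are placeholders and must be tuned per environment.
MAX_JOB_WAIT_SECONDS = 5.0
MAX_WAITING_JOBS = 10.0
MAX_IDLE_PERCENT = 80.0


def capacity_hints(measures: Dict[str, float]) -> List[str]:
    """Return plain-text capacity hints for one set of measures."""
    hints = []
    if measures.get("averageJobWaitTime", 0.0) > MAX_JOB_WAIT_SECONDS:
        hints.append("High job wait time: consider adding nodes.")
    if measures.get("averageWaitingJobs", 0.0) > MAX_WAITING_JOBS:
        hints.append("Many waiting jobs: consider adding nodes.")
    if measures.get("idleTimePercentage", 0.0) > MAX_IDLE_PERCENT:
        hints.append("High idle time: the cluster may have spare nodes.")
    return hints


def is_declining(samples: List[float], min_points: int = 3) -> bool:
    """Return True if the last min_points samples fall strictly one after another."""
    if len(samples) < min_points:
        return False
    recent = samples[-min_points:]
    return all(later < earlier for earlier, later in zip(recent, recent[1:]))


if __name__ == "__main__":
    # Example snapshot with made-up values.
    snapshot = {
        "averageJobWaitTime": 7.2,
        "averageWaitingJobs": 3.0,
        "idleTimePercentage": 12.0,
    }
    for hint in capacity_hints(snapshot):
        print(hint)
    # Made-up sentMessagesCount history, oldest first.
    print(is_declining([120.0, 110.0, 95.0, 80.0]))
```

Here `is_declining` could be applied to successive sentMessagesCount values to flag a falling message rate. Requiring several consecutive drops, rather than a single one, is a design choice intended to avoid reacting to one noisy measurement.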