eG Monitoring
 

Measures reported by KuberNodeTest

A node is a worker machine in Kubernetes. A node may be a VM or physical machine, depending on the cluster. Each node contains the services necessary to run pods and is managed by the master components. The services on a node include the container runtime, kubelet and kube-proxy.

A node's status contains information such as the addresses (hostname, external IP address, internal IP address of the node), conditions describing the status of running nodes, the total resource capacity of the node and the usable (allocatable) capacity, and general information pertaining to the node (eg., kernel version, Kubernetes version etc.).

Nodes are automatically managed by the Node controller. If a node is unreachable beyond a configured duration, then the node controller automatically deletes all the Pods on that node. However, sometimes, manual administration/management of nodes may become necessary. For instance, administrators may have to manually delete unreachable node objects, if the node controller is unable to do so. Likewise, if a node is to be rebooted, then the administrator will have to manually mark that node as “unschedulable”, so that new Pods do not get scheduled to that node.

While the Node controller manages the node ‘condition’, the Kubernetes scheduler manages Pod placements by automatically comparing the resource requirements of the containers in the Pods with the total and allocatable resource capacity of the nodes, and scheduling Pods on those nodes that fit their resource profile. Sometimes, a node may run Pods that oversubscribe to the node's resources - i.e., the sum of limits of the containers on the node may exceed the total resource capacity of the node. In an overcommitted environment, it is possible that the Pods on the node will attempt to use more compute resources than are available at any given point in time. If this happens, it can degrade the performance of containerized applications, as a single Pod may hog the node's resources. Administrators may hence want to be promptly alerted to a resource overcommitment, so they can quickly identify which Pod is responsible for it and determine how resource allocations and usage priorities can be tweaked to ensure performance does not suffer. Additionally, administrators may also want to track resource usage across containers on a node, so they can proactively isolate a potential resource contention and initiate pre-emptive action. The Kube Cluster Nodes test provides this visibility.

The test auto-discovers the nodes in a Kubernetes cluster and clearly distinguishes between the master nodes and the workers. The test then monitors the condition of each node and points administrators to those nodes whose condition is ‘unhealthy’ or that have been marked as ‘unschedulable’. Additionally, the test reports the total CPU and memory capacity of every node, tracks the sum of resource requests/limits of the containers on each node, and pinpoints those nodes where containers have oversubscribed to the node's capacity. Detailed diagnostics of the test lead administrators to the exact Pods that have oversubscribed to the node's resources. With the help of this information, administrators may decide to resize containers or reset resource usage priorities of containers, so that cluster performance is not compromised. Furthermore, the test reveals the percentage of a node's resources that are being utilized by the containers, thereby warning administrators of a probable contention for resources on a node.

Outputs of the test: One set of results for each node in the Kubernetes cluster being monitored.

The measures made by this test are as follows:

Measurement Description Measurement Unit Interpretation
Nodes_state Indicates whether/not this node is running.   The values that this measure can report and their corresponding numeric values are listed in the table below:

Numeric Value Measure Value
1 Running
0 Not Running
2 Unknown

Note:

By default, this measure reports the Measure Values discussed above to indicate the state of a node. In the graph of this measure however, the same is represented using the numeric equivalents only.

If this measure reports the value Not Running or Unknown for a node, then you can use the detailed diagnosis of this measure to know the reason for the abnormal status.
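The relationship between the numeric values plotted in graphs and the Measure Values shown elsewhere can be sketched as a simple lookup. This is only an illustration of the table above; the names `NodeState` and `label_for` are hypothetical and are not part of the eG product.

```python
from enum import IntEnum


class NodeState(IntEnum):
    """Numeric values for Nodes_state, taken from the table above."""
    NOT_RUNNING = 0
    RUNNING = 1
    UNKNOWN = 2


# Measure Values as they appear in the table.
LABELS = {
    NodeState.RUNNING: "Running",
    NodeState.NOT_RUNNING: "Not Running",
    NodeState.UNKNOWN: "Unknown",
}


def label_for(numeric_value: int) -> str:
    """Translate a graphed numeric value back to its Measure Value."""
    return LABELS[NodeState(numeric_value)]
```

A graph point of 0 would therefore correspond to Not Running, which is the case where the detailed diagnosis is worth consulting.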

Unschedulable_state Indicates whether/not this node is unschedulable.   By default, healthy nodes with a Ready status are marked as schedulable, meaning that new pods are allowed for placement on the node. Manually marking a node as unschedulable blocks any new pods from being scheduled on the node. Typically, nodes from which Pods need to be migrated/evacuated are candidates for being marked as ‘unschedulable’. Sometimes, nodes that have been unhealthy for a long time are also set as ‘unschedulable’.

The values that this measure can report and their corresponding numeric values are listed in the table below:

Numeric Value Measure Value
1 Yes
0 No
2 Unknown

Note:

By default, this measure reports the Measure Values discussed above to indicate whether/not a node has been manually set as unschedulable. In the graph of this measure however, the same is represented using the numeric equivalents only.
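In Kubernetes, a node that has been marked unschedulable carries `spec.unschedulable: true` in its Node object. The sketch below shows one way the Yes/No/Unknown value could be derived from a Node object that has already been fetched and parsed into a dictionary. The function name is hypothetical, and reporting Unknown when the `spec` section is missing is an assumption made for this example, not documented eG behavior.

```python
def unschedulable_state(node: dict) -> str:
    """Derive a Yes/No/Unknown value from a parsed Node object."""
    spec = node.get("spec")
    if spec is None:
        # Assumption: without a spec section the state cannot be determined.
        return "Unknown"
    # The field is normally absent on schedulable nodes, so default to False.
    return "Yes" if spec.get("unschedulable", False) else "No"
```

For example, a node that has been cordoned for a reboot would yield Yes, while an ordinary node whose `spec` lacks the field would yield No.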

Maintenance_mode Indicates whether/not this node is in the maintenance mode.   By putting a node into maintenance mode, all existing workloads will be restarted on other nodes to ensure availability, and no new workloads will be started on the node. Maintenance mode allows you to perform operations such as security updates or rebooting machines without the loss of availability.

The values that this measure can report and their corresponding numeric values are listed in the table below:

Numeric Value Measure Value
1 Enabled
0 Disabled

Note:

By default, this measure reports the Measure Values discussed above to indicate whether/not a node is in the maintenance mode. In the graph of this measure however, the same is represented using the numeric equivalents only.

Node_age Indicates how old this node is.   The value of this measure is expressed in number of days, hours, and minutes.

Use the detailed diagnosis of this measure to know more about a particular node.
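A node's age can be computed from the `metadata.creationTimestamp` field of its Node object, which Kubernetes records as a UTC timestamp such as `2024-01-01T02:15:00Z`. The following is a minimal sketch of the days/hours/minutes breakdown; the function name and the output wording are assumptions for illustration and may differ from what the test displays.

```python
from datetime import datetime, timezone


def node_age(creation_timestamp: str, now: datetime) -> str:
    """Express the time since node creation in days, hours, and minutes."""
    created = datetime.strptime(
        creation_timestamp, "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    # Work in whole minutes, discarding leftover seconds.
    total_minutes = int((now - created).total_seconds() // 60)
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    return f"{days} days, {hours} hours, {minutes} minutes"
```

Passing `now` explicitly keeps the function deterministic; in practice it would be `datetime.now(timezone.utc)`.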

Network_state Indicates whether/not the network of this node is unavailable (i.e., not correctly configured).   The values that this measure can report and their corresponding numeric values are listed in the table below:

Numeric Value Measure Value
1 Yes
0 No
2 Unknown

Note:

By default, this measure reports the Measure Values discussed above to indicate the availability of a node's network. In the graph of this measure however, the same is represented using the numeric equivalents only.

If this measure reports the value Yes for a node - i.e., if the network of a node is indeed unavailable - then you can use the detailed diagnosis of this measure to figure out the reason for the unavailability.

OutOfDisk_state Indicates whether/not there is insufficient free disk space on this node for adding new Pods.   The values that this measure can report and their corresponding numeric values are listed in the table below:

Numeric Value Measure Value
1 Yes
0 No
2 Unknown

Note:

By default, this measure reports the Measure Values discussed above to indicate whether/not a node has run out of disk space. In the graph of this measure however, the same is represented using the numeric equivalents only.

If this measure reports the value Yes for a node - i.e., if the node has indeed run out of disk space - then you can use the detailed diagnosis of this measure to figure out the reason for the shortage.

MemoryPressure_state Indicates whether/not this node is running low on memory.   The values that this measure can report and their corresponding numeric values are listed in the table below:

Numeric Value Measure Value
1 Yes
0 No
2 Unknown

Note:

By default, this measure reports the Measure Values discussed above to indicate whether/not a node has sufficient memory. In the graph of this measure however, the same is represented using the numeric equivalents only.

If this measure reports the value Yes for a node - i.e., if the node is indeed running low on memory - then you can use the detailed diagnosis of this measure to figure out the reason for the memory pressure.

Disk_pressure_state Indicates whether/not this node's disk capacity is low.   The values that this measure can report and their corresponding numeric values are listed in the table below:

Numeric Value Measure Value
1 Yes
0 No
2 Unknown

Note:

By default, this measure reports the Measure Values discussed above to indicate whether/not a node is low on disk capacity. In the graph of this measure however, the same is represented using the numeric equivalents only.

If this measure reports the value Yes for a node - i.e., if the node is indeed low on disk capacity - then you can use the detailed diagnosis of this measure to figure out the reason for the disk pressure.

PID_pressure_state Indicates whether/not too many processes are running on the node.   The values that this measure can report and their corresponding numeric values are listed in the table below:

Numeric Value Measure Value
1 Yes
0 No
2 Unknown

Note:

By default, this measure reports the Measure Values discussed above to indicate whether/not a node is under PID pressure. In the graph of this measure however, the same is represented using the numeric equivalents only.

If this measure reports the value Yes for a node - i.e., if too many processes are indeed running on the node - then you can use the detailed diagnosis of this measure to figure out the reason for the PID pressure.

Ready_state Indicates whether/not a node is healthy and ready to accept Pods.   This measure reports the value Yes, if a node is healthy and is ready to accept Pods. The value No is reported if a node is not healthy and is not accepting Pods. The value Unknown is reported if the node controller has not heard from the node in the last node-monitor-grace-period (default is 40 seconds).

The values that this measure can report and their corresponding numeric values are listed in the table below:

Numeric Value Measure Value
1 Yes
0 No
2 Unknown

Note:

By default, this measure reports the Measure Values discussed above to indicate whether/not a node is ready. In the graph of this measure however, the same is represented using the numeric equivalents only.

If this measure reports the value No or Unknown for a node - i.e., if the node is not ready to accept Pods - then you can use the detailed diagnosis of this measure to figure out the reason.
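The condition-based measures above (Ready, memory pressure, disk pressure, PID pressure, network) correspond to entries in the `status.conditions` list of a Kubernetes Node object, each of which has a `type` and a `status` of "True", "False", or "Unknown". A minimal sketch of turning one condition into a Yes/No/Unknown value follows. The function name is hypothetical, and treating a missing condition as Unknown is an assumption made for this example.

```python
def condition_state(node: dict, condition_type: str) -> str:
    """Map a node condition's status to Yes, No, or Unknown."""
    conditions = node.get("status", {}).get("conditions", [])
    for condition in conditions:
        if condition.get("type") == condition_type:
            # Kubernetes reports condition status as the strings
            # "True", "False", or "Unknown".
            return {"True": "Yes", "False": "No"}.get(
                condition.get("status"), "Unknown"
            )
    # Assumption: a condition that is not reported at all counts as Unknown.
    return "Unknown"
```

Note that for Ready a value of Yes is the healthy state, whereas for the pressure conditions Yes signals a problem.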

CPU_capacity Indicates the total CPU capacity of this node, in terms of the number of CPU cores it supports. Number  
Memory_capacity Indicates the total memory capacity of this node. GB  
Pods_capacity Indicates the maximum number of Pods that can be scheduled on this node. Number  
Running_pods Indicates the number of Pods currently running on this node. Number If the value of this measure for a node is equal to or is growing closer to the value of the Pods capacity measure, it indicates that the node has exhausted or is about to exhaust its Pod capacity.

You can use the detailed diagnosis of this measure to know which Pods are running on the node and which containers are running within each Pod.

Pods_percent Indicates the percentage of the Pod capacity of this node that is currently being utilized. Percent The formula used to compute the value of this measure is as follows:

(Running pods/Pods capacity)*100

A value equal to or close to 100% indicates that the node has exhausted or is about to exhaust its Pod capacity. In such circumstances, you may want to consider increasing the Pod capacity of the node or freeing the node of unused/inactive Pods.
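As a worked example of the formula above, the sketch below computes the Pod capacity utilization of a node. The function name is hypothetical, and the sample figures are illustrative only.

```python
def pods_percent(running_pods: int, pods_capacity: int) -> float:
    """(Running pods / Pods capacity) * 100, as given in the formula above."""
    if pods_capacity <= 0:
        raise ValueError("pods_capacity must be a positive number")
    return running_pods / pods_capacity * 100
```

A node with a Pod capacity of 110 that is running 55 Pods would report 50 percent, leaving room for 55 more Pods before the capacity is exhausted.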

Total_container Indicates the total number of containers running on this node. Number To know which containers are running on the node, use the detailed diagnosis of this measure.
Total_millicpu Indicates the CPU capacity of this node. Millicpu  
CPU_limits Indicates the total amount of CPU resources that containers on this node are allowed to use. Millicpu The value of this measure is the sum of CPU limits set for the individual containers across all the Pods running on this node.

If the value of this measure is greater than the node's CPU capacity expressed in millicpu (see the Total_millicpu measure), it could mean that one/more Pods have oversubscribed to the node's CPU capacity.

CPU_requests Indicates the minimum amount of CPU resources guaranteed to all the containers on this node. Millicpu The value of this measure is the sum of CPU requests configured for the individual containers across all the Pods running on this node.
Memory_limits Indicates the total amount of memory resources that containers on this node are allowed to use. GB The value of this measure is the sum of memory limits set for the individual containers across all the Pods running on this node.
Memory_request Indicates the minimum amount of memory resources guaranteed to all the containers on this node. GB The value of this measure is the sum of memory requests configured for the individual containers across all the Pods running on this node.
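The limits and requests measures above are described as sums over the individual containers of all Pods on a node. A minimal sketch of that summation for CPU is shown below, assuming the Pods have already been fetched as parsed Pod objects. Kubernetes expresses CPU quantities either in millicpu with an `m` suffix (for example `500m`) or in cores (for example `1` or `0.5`). The function names are hypothetical, and whether the test counts containers without a configured value as zero is an assumption.

```python
def cpu_to_millicpu(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity such as '500m' or '0.5' to millicpu."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def sum_cpu(pods: list, field: str) -> int:
    """Sum CPU 'limits' or 'requests' over every container in every Pod."""
    total = 0
    for pod in pods:
        for container in pod.get("spec", {}).get("containers", []):
            value = container.get("resources", {}).get(field, {}).get("cpu")
            # Assumption: containers with no value set contribute nothing.
            if value is not None:
                total += cpu_to_millicpu(value)
    return total
```

Calling `sum_cpu(pods, "limits")` and `sum_cpu(pods, "requests")` on the same list of Pods gives the two totals that the limit and request measures describe.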
Percent_CPU_limits Indicates what percentage of the CPU capacity of this node is allocated as CPU limits to containers. In other words, this is the percentage of a node's CPU capacity that the containers on that node are allowed to use. Percent The formula used for computing this measure is as follows:

(CPU limits/CPU capacity)*100

If the value of this measure exceeds 100%, it means that the node is overcommitted. In other words, it means that the Pods on the node have been allowed to use more resources than the node's capacity. In such a situation, you may want to look up the detailed diagnostics of this measure to identify the Pods that are contributing to the overcommitment.
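The formula and the 100% threshold described above can be illustrated with a short sketch. Both values are assumed to be expressed in the same unit (millicpu here), and the function names are hypothetical.

```python
def percent_cpu_limits(cpu_limits_millicpu: float, capacity_millicpu: float) -> float:
    """(CPU limits / CPU capacity) * 100, with both values in millicpu."""
    if capacity_millicpu <= 0:
        raise ValueError("capacity must be positive")
    return cpu_limits_millicpu / capacity_millicpu * 100


def cpu_overcommitted(cpu_limits_millicpu: float, capacity_millicpu: float) -> bool:
    """True only when the limits exceed 100% of capacity, per the text above."""
    return percent_cpu_limits(cpu_limits_millicpu, capacity_millicpu) > 100
```

For instance, a 4-core node (4000 millicpu) whose containers have CPU limits totaling 6000 millicpu would be at 150 percent, meaning the containers are collectively allowed to use half again as much CPU as the node actually has.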

Percent_Memory_limits Indicates what percentage of the memory capacity of this node is allocated as memory limits to containers. In other words, this is the percentage of a node's memory capacity that the containers on that node are allowed to use. Percent The formula used for computing this measure is as follows:

(Memory limits/Memory capacity)*100

If the value of this measure exceeds 100%, it means that the node is overcommitted. In other words, it means that the Pods on the node have been allowed to use more resources than the node's capacity. In such a situation, you may want to look up the detailed diagnostics of this measure to identify the Pods that are contributing to the overcommitment.
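Memory limits in Kubernetes are written as quantities with suffixes such as `Mi` and `Gi` (powers of two) or `M` and `G` (powers of ten), so they need to be normalized before the percentage above can be computed. The sketch below handles a common subset of suffixes and works in bytes rather than GB. The function names are hypothetical, and the suffix table is deliberately incomplete.

```python
# Subset of Kubernetes memory suffixes; larger units are omitted for brevity.
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def memory_to_bytes(quantity: str) -> int:
    """Convert a quantity such as '512Mi' or '1G' to bytes."""
    # Check two-letter suffixes first so 'Mi' is not mistaken for 'M'.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * _SUFFIXES[suffix])
    return int(quantity)  # a plain number is already in bytes


def percent_memory_limits(limits: list, capacity: str) -> float:
    """(Memory limits / Memory capacity) * 100 for a list of container limits."""
    total = sum(memory_to_bytes(limit) for limit in limits)
    return total / memory_to_bytes(capacity) * 100
```

With container limits of 6Gi and 4Gi on a node with 8Gi of memory, the result is 125 percent, which by the rule above marks the node as overcommitted.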

Percent_CPU_request Indicates what percentage of the total CPU capacity of this node is set as CPU requests for the containers on that node. In other words, this is the percentage of a node's CPU capacity that the containers on that node are guaranteed to receive. Percent The formula used for computing this measure is as follows:

(CPU requests/CPU capacity)*100

Compare the value of this measure across nodes to know which node has been guaranteed the maximum CPU resources. You can even use the detailed diagnosis of this measure to identify the specific Pods in that node with the maximum CPU requests.

Percent_Memory_request Indicates what percentage of the total memory capacity of this node is set as memory requests for the containers on that node. In other words, this is the percentage of a node's memory capacity that the containers on that node are guaranteed to receive. Percent The formula used for computing this measure is as follows:

(Memory requests/Memory capacity)*100

Compare the value of this measure across nodes to know which node has been guaranteed the maximum memory resources. You can even use the detailed diagnosis of this measure to identify the specific Pods in that node with the maximum memory requests.

CPU_over_commit Indicates whether/not this node is overcommitted in terms of CPU resources. Percent If the Percent_CPU_limits measure reports a value greater than 100% for a node, then this measure will report the value True for that node. This implies that the node's CPU resources are overcommitted. On the other hand, if the Percent_CPU_limits measure of a node reports a value less than 100%, then this measure will report the value False for that node. This implies that the node's CPU resources are not overcommitted.

The values that this measure reports and their corresponding numeric values are detailed in the table below:

Numeric Value Measure Value
1 True
0 False

Note:

By default, this measure reports the Measure Values discussed above to indicate whether/not a node's CPU resources are overcommitted. In the graph of this measure however, the same is represented using the numeric equivalents only.

In an overcommitted environment, it is possible that the Pods on the node will attempt to use more compute resource than is available at any given point in time. To know which Pods are using more resources than the node's capacity, use the detailed diagnosis of this measure.

When an overcommitment occurs, the node must give priority to one Pod over another. The facility used to make this decision is referred to as a Quality of Service (QoS) Class. By assigning a QoS class to each container, administrators can make sure that the performance of mission-critical applications does not suffer owing to insufficient resources.

For each compute resource, a container is classified into one of three QoS classes with decreasing order of priority:

  • Priority 1 (highest) - Guaranteed - If limits and optionally requests are set (not equal to 0) for all resources and they are equal, then the container is classified as Guaranteed. Guaranteed containers are considered top priority, and are guaranteed to only be terminated if they exceed their limits, or if the system is under resource pressure and there are no lower priority containers that can be evicted.

  • Priority 2 - Burstable - If requests and optionally limits are set (not equal to 0) for all resources, and they are not equal, then the container is classified as Burstable. Burstable containers under resource pressure are more likely to be terminated once they exceed their requests and no other BestEffort containers exist.

  • Priority 3 (lowest) - BestEffort - If requests and limits are not set for any of the resources, then the container is classified as BestEffort. BestEffort containers are treated with the lowest priority. Processes in these containers are first to be terminated if the system runs out of resources.

Administrators can also control the level of overcommit and manage container density on nodes. For this, masters can be configured to override the ratio between request and limit set on developer containers. In conjunction with a per-project LimitRange specifying limits and defaults, this adjusts the container limit and request to achieve the desired level of overcommit.
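The three QoS classes described above can be expressed as a small classification function. The sketch follows the rules as this document states them, applied to one container's requests and limits for CPU and memory. It assumes the values have already been normalized to numbers in a common unit, and it treats a value of 0 as not set. The function name and the choice of CPU and memory as the resource list are assumptions for illustration.

```python
RESOURCES = ("cpu", "memory")


def qos_class(requests: dict, limits: dict) -> str:
    """Classify a container as Guaranteed, Burstable, or BestEffort."""
    # Treat zero values as 'not set', matching '(not equal to 0)' in the text.
    requests = {name: value for name, value in requests.items() if value}
    limits = {name: value for name, value in limits.items() if value}

    if not requests and not limits:
        return "BestEffort"

    # Guaranteed: a limit is set for every resource, and the request
    # (if one is given) equals that limit.
    guaranteed = all(
        limits.get(name) is not None
        and requests.get(name, limits[name]) == limits[name]
        for name in RESOURCES
    )
    return "Guaranteed" if guaranteed else "Burstable"
```

A container with only limits set is classified as Guaranteed here because requests are optional for that class, while a container whose CPU request is lower than its CPU limit falls into Burstable.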

Memory_over_commit Indicates whether/not this node is overcommitted in terms of memory resources. Percent If the Percent_Memory_limits measure reports a value greater than 100% for a node, then this measure will report the value True for that node. This implies that the node's memory resources are overcommitted. On the other hand, if the Percent_Memory_limits measure of a node reports a value less than 100%, then this measure will report the value False for that node. This implies that the node's memory resources are not overcommitted.

The values that this measure reports and their corresponding numeric values are detailed in the table below:

Numeric Value Measure Value
1 True
0 False

Note:

By default, this measure reports the Measure Values discussed above to indicate whether/not a node's memory resources are overcommitted. In the graph of this measure however, the same is represented using the numeric equivalents only.

In an overcommitted environment, it is possible that the Pods on the node will attempt to use more compute resource than is available at any given point in time. To know which Pods are using more resources than the node's capacity, use the detailed diagnosis of this measure.

When an overcommitment occurs, the node must give priority to one Pod over another. The facility used to make this decision is referred to as a Quality of Service (QoS) Class. By assigning a QoS class to each container, administrators can make sure that the performance of mission-critical applications does not suffer owing to insufficient resources.

For each compute resource, a container is classified into one of three QoS classes with decreasing order of priority:

  • Priority 1 (highest) - Guaranteed - If limits and optionally requests are set (not equal to 0) for all resources and they are equal, then the container is classified as Guaranteed. Guaranteed containers are considered top priority, and are guaranteed to only be terminated if they exceed their limits, or if the system is under resource pressure and there are no lower priority containers that can be evicted.

  • Priority 2 - Burstable - If requests and optionally limits are set (not equal to 0) for all resources, and they are not equal, then the container is classified as Burstable. Burstable containers under resource pressure are more likely to be terminated once they exceed their requests and no other BestEffort containers exist.

  • Priority 3 (lowest) - BestEffort - If requests and limits are not set for any of the resources, then the container is classified as BestEffort. BestEffort containers are treated with the lowest priority. Processes in these containers are first to be terminated if the system runs out of resources.

Administrators can also control the level of overcommit and manage container density on nodes. For this, masters can be configured to override the ratio between request and limit set on developer containers. In conjunction with a per-project LimitRange specifying limits and defaults, this adjusts the container limit and request to achieve the desired level of overcommit.

Total_image Indicates the total number of images on this node. Number Use the detailed diagnosis of this measure to know which images are on the node.
Used_image Indicates the total number of images currently used by the containers on this node. Number To view the used images, use the detailed diagnosis of this measure.
NotUsed_Image Indicates the number of images on this node that are not currently used by any container. Number To view the unused images, use the detailed diagnosis of this measure.
Image_size Indicates the total size of images on this node. GB  
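The relationship between the total, used, and unused image counts above can be sketched with set operations. The inputs are assumed to be the image names present on the node and the image names referenced by its running containers; the function name and the sample image names in the usage note are hypothetical.

```python
def image_counts(node_images: set, container_images: list) -> dict:
    """Count total, used, and unused images on a node."""
    # Several containers may share one image, so de-duplicate first.
    used = node_images & set(container_images)
    return {
        "Total_image": len(node_images),
        "Used_image": len(used),
        "NotUsed_Image": len(node_images - used),
    }
```

For a node holding `nginx:1.25`, `redis:7`, and `busybox:1.36`, with two containers both running `nginx:1.25`, the counts would be 3 total, 1 used, and 2 unused.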
NodeType Indicates the node type.   A node can be a Master node or a Worker node in a cluster. A cluster has at least one worker node and at least one master node. The worker node(s) host the pods that are the components of the application. The master node(s) manage the worker nodes and the pods in the cluster. Multiple master nodes are used to provide a cluster with failover and high availability.

If a node is the master node in a cluster, then this measure will report the value Master. For a worker node, this measure will report the value Worker.

The numeric values that correspond to these measure values are as follows:

Numeric Value Measure Value
1 Master
2 Worker

Note:

By default, this measure reports the Measure Values discussed above to indicate the node type. In the graph of this measure however, the same is indicated using the numeric equivalents only.
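How the test itself distinguishes masters from workers is not stated here. One common convention in Kubernetes clusters is that master nodes carry the label `node-role.kubernetes.io/control-plane` (or the older `node-role.kubernetes.io/master`). The sketch below classifies a parsed Node object on that basis; it is an assumption for illustration, not a description of the eG implementation, and the function name is hypothetical.

```python
MASTER_LABELS = (
    "node-role.kubernetes.io/control-plane",
    "node-role.kubernetes.io/master",  # older label name
)

# Numeric equivalents from the table above.
NODE_TYPE_VALUES = {"Master": 1, "Worker": 2}


def node_type(node: dict) -> str:
    """Return 'Master' if a master role label is present, else 'Worker'."""
    labels = node.get("metadata", {}).get("labels", {})
    if any(label in labels for label in MASTER_LABELS):
        return "Master"
    return "Worker"
```

`NODE_TYPE_VALUES[node_type(node)]` then gives the numeric value that would appear in the graph of this measure.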

MilliCpu_usage Indicates the amount of CPU resources used by this node. Millicpu Ideally, the value of this measure should be much lower than the value of the Total_millicpu measure. If the value of this measure is equal to or is rapidly approaching the value of the Total_millicpu measure, it means that the node is running out of CPU resources.
Cpu_usage Indicates the percentage of CPU resources utilized by this node. Percent A value close to 100% is indicative of excessive CPU usage by a node, and hints at a potential CPU contention on the node.

A value greater than 100% implies that one/more Pods have probably over-subscribed to the node's capacity.

To know which Pod on the node is contributing to the contention/overcommitment, use the detailed diagnosis of this measure.

Memory_usage Indicates the amount of memory resources used by this node. GB Ideally, the value of this measure should be much lower than the value of the Memory capacity measure. If the value of this measure is equal to or is rapidly approaching the value of the Memory capacity measure, it means that the node is running out of memory resources.
Memory_utilization Indicates the percentage of memory resources utilized by this node. Percent A value close to 100% is indicative of excessive memory usage by a node, and hints at a potential memory contention on the node.

A value greater than 100% implies that one/more Pods have probably over-subscribed to the node's capacity.

To know which Pod on the node is contributing to the contention/overcommitment, use the detailed diagnosis of this measure.