eG Monitoring
 

Measures reported by KuberPodTest

Pods are the smallest deployable units of computing that can be created and managed in Kubernetes. A Pod (as in a pod of whales or pea pod) is a group of one or more containers (such as Docker containers), with shared storage/network, and a specification for how to run the containers. A Pod’s contents are always co-located and co-scheduled, and run in a shared context.

Pods are created, assigned a unique ID (UID), and scheduled to nodes where they remain until termination (according to restart policy) or deletion. If a Node dies, the Pods scheduled to that node are scheduled for deletion after a timeout period. At any given point in time, an administrator needs to know which phase of its lifecycle a Pod is in, so that they can promptly detect Pod failures or undue slowness in Pod creation and rapidly investigate them. This is necessary because, if a Pod fails, the cluster's actual state may go out of sync with its desired state.

Once a Pod is assigned to a node by the scheduler, the kubelet starts creating containers using the container runtime. Alongside the status of Pods, an administrator also needs to keep track of the status of containers at all times, as container failures impact the availability and performance of the containerized applications. This way, administrators can detect and resolve issues in containerized applications before end-users notice.

Typically, when Pods run containers, they use the CPU and memory resources on the node to which they are scheduled. By default, a Pod in Kubernetes will run with no limits on CPU and memory. This means that a single Pod can end up hogging the resources of the node! To avoid this, administrators can control the amount of CPU and memory resources each container in a Pod can use by setting resource requests and limits in the Pod configuration file. The requests and limits of a Pod are the sums of the requests and limits of all containers in that Pod. This means that if the per-container limits are not prudently set, then you could have Pods that over-subscribe to the node's capacity. Also, if containers are not sized according to their actual usage, the performance of the containerized applications can suffer. This is why it is imperative that administrators track the actual resource usage of Pods, proactively detect potential resource contentions, and tweak usage limits and/or priorities to prevent such contentions. The Pods by Namespace test helps administrators perform all of the above!

This test auto-discovers the Pods in each Namespace, and reports the status of each Pod and that of the containers in every Pod. This leads administrators to Pods and containers in an abnormal state. Additionally, the test reports the resource requests and limits for each Pod, the resource capacity of the Node to which each Pod is scheduled, and actual resource utilization. In the process, the test accurately pinpoints those Pods that are over-subscribing to the node's capacity and those Pods that may potentially cause a contention for resources on the node. Since the test also reveals the QoS priority setting of each Pod, administrators can also figure out if a change in priority can help prevent probable resource contentions/overcommitment.

Outputs of the test: One set of results for each Pod in every namespace in the Kubernetes cluster being monitored.

The measures made by this test are as follows:

Measurement Description Measurement Unit Interpretation
Status Indicates where this Pod is in its lifecycle.   A Pod can be in one of the following phases in its lifecycle:
  • Pending: The Pod has been accepted by the Kubernetes system, but one or more of the Container images has not been created. This includes time before being scheduled as well as time spent downloading images over the network, which could take a while.

  • Running: The Pod has been bound to a node, and all of the Containers have been created. At least one Container is still running, or is in the process of starting or restarting.

  • Succeeded: All Containers in the Pod have terminated in success, and will not be restarted.

  • Failed: All Containers in the Pod have terminated, and at least one Container has terminated in failure. That is, the Container either exited with non-zero status or was terminated by the system.

  • Unknown: For some reason the state of the Pod could not be obtained, typically due to an error in communicating with the host of the Pod.

  • CrashLoopBackOff: A Pod is starting, crashing, starting again, and then crashing again.

The numeric values that correspond to this are detailed in the table below:

Numeric Value Measure Value
1 Running
2 Succeeded
3 Completed
4 Failed
5 Pending
6 CrashLoopBackOff
7 Unknown

Note:

By default, this measure reports the Measure Values discussed above to indicate the state of a Pod. In the graph of this measure however, the state is indicated using the numeric equivalents only.

Use the detailed diagnosis of this measure to know which containers are in the Pod, the images used by the containers, and the reason for the status.
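As a rough illustration of the note above, the numeric values plotted in the graph can be translated back to state names with a small lookup table. The Python sketch below simply mirrors the table; the names `STATUS_BY_VALUE` and `status_label` are hypothetical and are not part of eG or Kubernetes.

```python
# Numeric values and state names copied from the Status table above.
STATUS_BY_VALUE = {
    1: "Running",
    2: "Succeeded",
    3: "Completed",
    4: "Failed",
    5: "Pending",
    6: "CrashLoopBackOff",
    7: "Unknown",
}


def status_label(value):
    """Return the state name for a numeric graph value (hypothetical helper)."""
    return STATUS_BY_VALUE.get(value, "Unrecognized value")


print(status_label(4))
```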

Pods_age Indicates how old this Pod is.   The value of this measure is expressed in number of days, hours, and minutes.

Use the detailed diagnosis of this measure to know which node a Pod is scheduled to, the IP address of the Pod, and the images used by the containers in the Pod.

Termination_period Shows the optional duration in seconds the Pod needs to terminate gracefully. Seconds Because Pods represent running processes on nodes in the cluster, it is important to allow those processes to gracefully terminate when they are no longer needed (vs being violently killed with a KILL signal and having no chance to clean up). Users should be able to request deletion and know when processes terminate, but also be able to ensure that deletes eventually complete. When a user requests deletion of a Pod, the system records the intended grace period before the Pod is allowed to be forcefully killed, and a TERM signal is sent to the main process in each container. Once the grace period has expired, the KILL signal is sent to those processes, and the Pod is then deleted from the API server. The default grace period is 30 seconds.

The kubectl delete command supports the --grace-period= option which allows a user to override the default and specify their own value. The value 0 force deletes the Pod. You must specify an additional flag --force along with --grace-period=0 in order to perform force deletions.
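The flag rules described above can be sketched as a small Python helper that assembles the command line. This is only an illustration: the function name `build_delete_command`, the Pod name `web-0`, and the namespace `demo` are hypothetical, and the helper only builds the argument list without running kubectl.

```python
def build_delete_command(pod, namespace, grace_period=None, force=False):
    """Build a `kubectl delete pod` argument list (hypothetical helper)."""
    if grace_period is not None and grace_period < 0:
        raise ValueError("grace_period must be zero or a positive number of seconds")
    if grace_period == 0 and not force:
        # Per the text above, --grace-period=0 must be paired with --force.
        raise ValueError("--grace-period=0 requires force=True")
    args = ["kubectl", "delete", "pod", pod, "--namespace", namespace]
    if grace_period is not None:
        args.append(f"--grace-period={grace_period}")
    if force:
        args.append("--force")
    return args


# Override the default 30-second grace period with 60 seconds.
print(" ".join(build_delete_command("web-0", "demo", grace_period=60)))
# Force deletion: grace period 0 together with --force.
print(" ".join(build_delete_command("web-0", "demo", grace_period=0, force=True)))
```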

QoS_class Indicates the Quality of Service (QoS) classification assigned to this Pod based on its resource requirements.   Kubernetes provides different levels of Quality of Service to pods depending on what they request and what limits are set for them. Pods that need to stay up and perform consistently well can request guaranteed resources, while pods with less exacting requirements can use resources with less/no guarantee.

For each resource, Kubernetes divides Pods into 3 QoS classes: Guaranteed, Burstable, and Best-Effort, in decreasing order of priority.

  • Guaranteed: Pods are considered top-priority and are guaranteed to not be killed until they exceed their limits. If limits and optionally requests (not equal to 0) are set for all resources across all containers and they are equal, then the pod is classified as Guaranteed.

  • Burstable: Pods have some form of minimal resource guarantee, but can use more resources when available. Under system memory pressure, these containers are more likely to be killed once they exceed their requests and no Best-Effort pods exist. If requests and optionally limits are set (not equal to 0) for one or more resources across one or more containers, and they are not equal, then the pod is classified as Burstable.

  • Best-Effort: Pods will be treated as lowest priority. Processes in these pods are the first to get killed if the system runs out of memory. These containers can use any amount of free memory in the node though. If requests and limits are not set for all of the resources, across all containers, then the pod is classified as Best-Effort.
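The three rules above can be approximated in a few lines of Python. This is a simplified sketch, not the classification code Kubernetes itself uses: it assumes each container is a dictionary with optional `requests` and `limits` entries for `cpu` and `memory`, expressed as plain numbers, and it treats an unset request as equal to the limit. The function name `classify_qos` is hypothetical.

```python
RESOURCES = ("cpu", "memory")


def classify_qos(containers):
    """Classify a Pod from its containers' requests and limits (simplified sketch)."""
    any_set = False
    all_guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in RESOURCES:
            limit = limits.get(resource, 0)
            # Assumption: an unset request falls back to the limit.
            request = requests.get(resource, 0) or limit
            if request or limit:
                any_set = True
            if not limit or request != limit:
                all_guaranteed = False
    if not any_set:
        return "Best-Effort"
    return "Guaranteed" if all_guaranteed else "Burstable"


print(classify_qos([{"limits": {"cpu": 500, "memory": 128}}]))
print(classify_qos([{"requests": {"cpu": 250}, "limits": {"cpu": 500}}]))
print(classify_qos([{}]))
```

With these inputs, the first Pod has equal requests and limits for every resource, the second has a request lower than its limit, and the third sets nothing at all.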

This test reports one of the above 3 QoS classes as the value of this measure. The numeric values that correspond to these measure values are as follows:

Numeric Value Measure Value
1 Guaranteed
2 Burstable
3 Best Effort

Note:

By default, this measure reports the Measure Values discussed above to indicate the QoS class of a Pod. In the graph of this measure however, the same is indicated using the numeric equivalents only.

Restart_policy Indicates the restart policy of all containers within this Pod.   This measure reports one of the following values:
  • Always: This means that the container will be restarted even if it exited with a zero exit code (i.e. successfully). This is useful when you do not care why the container exited, you just want to make sure that it is always running (e.g. a web server). This is the default.

  • OnFailure: This means that the container will only be restarted if it exited with a non-zero exit code (i.e. something went wrong). This is useful when you want to accomplish a certain task with the pod and ensure that it completes successfully; if it does not, it will be restarted until it does.

  • Never: This means that the container will not be restarted regardless of why it exited.
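The behaviour of the three policies can be summarised as a decision on the container's exit code. The Python sketch below is illustrative only; `should_restart` is a hypothetical name, not a Kubernetes or eG function.

```python
def should_restart(policy, exit_code):
    """Decide whether an exited container would be restarted under a policy."""
    if policy == "Always":
        # Restarted even after a successful (zero) exit.
        return True
    if policy == "OnFailure":
        # Restarted only when the exit code signals a failure.
        return exit_code != 0
    if policy == "Never":
        return False
    raise ValueError(f"unknown restart policy: {policy}")


for policy in ("Always", "OnFailure", "Never"):
    print(policy, should_restart(policy, 0), should_restart(policy, 1))
```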

The numeric values that correspond to these measure values are as follows:

Numeric Value Measure Value
1 Always
2 OnFailure
3 Never

Note:

By default, this measure reports the Measure Values discussed above to indicate the restart policy of the containers in a Pod. In the graph of this measure however, the same is indicated using the numeric equivalents only.

Initialized_state Indicates whether/not the init containers (if any) in this Pod have started successfully.   Init containers are specialized containers that run before app containers in a Pod. Init containers can contain utilities or setup scripts not present in an app image.

The values that this measure reports and their corresponding numeric values are detailed in the table below:

Numeric Value Measure Value
1 Yes
0 No
2 Unknown

Note:

By default, this measure reports the Measure Values discussed above to indicate the status of Init containers. In the graph of this measure however, the same is indicated using the numeric equivalents only.

If this measure reports the value No or Unknown for a Pod, then you can use the detailed diagnosis of this measure to figure out the reason for the same.

Ready_state Indicates whether/not this Pod is ready.   If a Pod is in the Ready state, it means that the Pod is able to serve requests and should be added to the load balancing pools of all matching Services.

The values that this measure reports and their corresponding numeric values are detailed in the table below:

Numeric Value Measure Value
1 Yes
0 No
2 Unknown

Note:

By default, this measure reports the Measure Values discussed above to indicate the Ready state of a Pod. In the graph of this measure however, the same is indicated using the numeric equivalents only.

Container_state Indicates whether/not all containers in this Pod are ready.   If a container is in the Ready state, it means that the container is ready to service requests.

The values that this measure reports and their corresponding numeric values are detailed in the table below:

Numeric Value Measure Value
1 Yes
0 No
2 Unknown

Note:

By default, this measure reports the Measure Values discussed above to indicate whether/not the containers in a Pod are ready. In the graph of this measure however, the same is indicated using the numeric equivalents only.

PodSchedule_state Indicates whether/not this Pod has been scheduled to a node.   The values that this measure reports and their corresponding numeric values are detailed in the table below:

Numeric Value Measure Value
1 Yes
0 No
2 Unknown

Note:

By default, this measure reports the Measure Values discussed above to indicate whether/not a Pod has been scheduled to a node. In the graph of this measure however, the same is indicated using the numeric equivalents only.

If this measure reports the value No for a Pod - i.e., if a Pod is not scheduled to a node - then you can use the detailed diagnosis of this measure to figure out the reason for the anomaly.
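The Yes/No/Unknown measures above resemble the Pod conditions that Kubernetes exposes under `status.conditions`, where each condition has a `type` and a `status` of "True", "False", or "Unknown". The Python sketch below assumes, without confirmation from this document, that each measure corresponds to the condition type shown in `MEASURE_TO_CONDITION`, and that a missing condition should be treated as Unknown. All names in the sketch are hypothetical.

```python
# Numeric equivalents from the tables above: Yes = 1, No = 0, Unknown = 2.
CONDITION_VALUE = {"True": 1, "False": 0, "Unknown": 2}

# Assumed pairing of measures with Kubernetes Pod condition types.
MEASURE_TO_CONDITION = {
    "Initialized_state": "Initialized",
    "Ready_state": "Ready",
    "Container_state": "ContainersReady",
    "PodSchedule_state": "PodScheduled",
}


def condition_measures(conditions):
    """Map a list of Pod conditions to the numeric values used in the graphs."""
    by_type = {c["type"]: c["status"] for c in conditions}
    return {
        measure: CONDITION_VALUE.get(by_type.get(cond_type, "Unknown"), 2)
        for measure, cond_type in MEASURE_TO_CONDITION.items()
    }


sample = [
    {"type": "PodScheduled", "status": "True"},
    {"type": "Initialized", "status": "True"},
    {"type": "ContainersReady", "status": "False"},
    {"type": "Ready", "status": "False"},
]
print(condition_measures(sample))
```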

Container_count Indicates the count of containers in this Pod. Number  
Volumemount_count Indicates the count of volumes mounted in this Pod. Number  
InitContainer_count Indicates the total number of init containers (if any) in this Pod. Number Init containers are specialized containers that run before app containers in a Pod. Init containers can contain utilities or setup scripts not present in an app image.
Priority Indicates the priority class assigned to this Pod.   You can assign pods a priority class, which is a non-namespaced object that defines a mapping from a name to the integer value of the priority. The higher the value, the higher the priority.

A priority class object can take any 32-bit integer value smaller than or equal to 1000000000 (one billion). Reserve numbers larger than one billion for critical pods that should not be preempted or evicted.

There are two reserved priority classes for critical system pods to have guaranteed scheduling.

  • System-node-critical: This priority class has a value of 2000001000 and is used for all pods that should never be evicted from a node.

  • System-cluster-critical: This priority class has a value of 2000000000 (two billion) and is used with pods that are important for the cluster. Pods with this priority class can be evicted from a node in certain circumstances. For example, pods configured with the system-node-critical priority class can take priority. However, this priority class does ensure guaranteed scheduling.

This test reports one of the above two priority classes as the value of this measure. The numeric values that correspond to these measure values are as follows:

Numeric Value Measure Value
1 System-cluster-critical
2 System-node-critical

Note:

By default, this measure reports the Measure Values discussed above to indicate the priority class assigned to a Pod. In the graph of this measure however, the same is indicated using the numeric equivalents only.

Running_state Indicates the count of running containers in this Pod. Number If a container is in the Running state, it indicates that the container is executing without any issues.

Use the detailed diagnosis of this measure to know which containers in a Pod are in the Running state.

Terminate_state Indicates the count of containers in this Pod that are in a Terminated state. Number If a container is in the Terminated state, it means that the container completed its execution and has stopped running. A container enters this state when it has successfully completed execution or when it has failed for some reason.

If the containers in a Pod entered this state because they have failed, then use the detailed diagnosis of this measure to know which are those containers, why the failure occurred, and the exit code.

Waiting_state Indicates the count of containers in this Pod that are in a Waiting state. Number Waiting is the default state of a container. If a container is not in either the Running or the Terminated state, it is in the Waiting state. A container in the Waiting state still runs its required operations, like pulling images, applying Secrets, etc.

Use the detailed diagnosis of this measure to know which containers are in the Waiting state and why.
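The three container-state counts above can be derived from the per-container `state` entries that Kubernetes reports under `status.containerStatuses`, where each entry carries one of the keys `running`, `terminated`, or `waiting`. The Python sketch below is a minimal illustration of that counting; the function name and the sample data are hypothetical, and this is not necessarily how the test itself gathers these values.

```python
def count_container_states(container_statuses):
    """Count containers per state from a list of container status entries."""
    counts = {"running": 0, "terminated": 0, "waiting": 0}
    for status in container_statuses:
        state = status.get("state") or {}
        if "running" in state:
            counts["running"] += 1
        elif "terminated" in state:
            counts["terminated"] += 1
        else:
            # Waiting is the default when a container is neither Running nor Terminated.
            counts["waiting"] += 1
    return counts


sample_statuses = [
    {"name": "app", "state": {"running": {}}},
    {"name": "job", "state": {"terminated": {"exitCode": 1, "reason": "Error"}}},
    {"name": "sidecar", "state": {"waiting": {"reason": "ContainerCreating"}}},
]
print(count_container_states(sample_statuses))
```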

Container_uptime Indicates the total time for which the containers in this Pod were up and running. Seconds  
Total_restart_count Indicates the number of times the containers in this Pod have been restarted. Number Use the detailed diagnosis of this measure to identify the containers that were restarted and to determine the number of times each container was restarted. Frequently restarted containers can thus be isolated.
CPU_requests Indicates the minimum CPU resources guaranteed to this Pod. Millicpu This is the sum of CPU requests configured for all containers in a Pod.

A request is the amount of that resource that the system will guarantee to the Pod.

CPU_limits Indicates the maximum amount of CPU resources that this Pod can use. Millicpu This is the sum of CPU limits set for all containers in a Pod.

A limit is the maximum amount that the system will allow the Pod to use.
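To illustrate how per-container values add up to the Pod-level CPU requests and limits, the Python sketch below sums them in millicpu. It assumes container specifications shaped like the `resources` section of a Pod configuration file, with CPU written either in millicpu (for example "250m") or in cores (for example "0.5"). Unset values are simply counted as 0 here, and the helper names are hypothetical.

```python
def cpu_to_millicpu(quantity):
    """Convert a CPU quantity such as "250m", "0.5" or "2" to millicpu."""
    text = str(quantity).strip()
    if text.endswith("m"):
        return int(text[:-1])
    return int(round(float(text) * 1000))


def pod_cpu_totals(containers):
    """Return (total requests, total limits) in millicpu for a Pod's containers."""
    total_requests = 0
    total_limits = 0
    for container in containers:
        resources = container.get("resources", {})
        total_requests += cpu_to_millicpu(resources.get("requests", {}).get("cpu", 0))
        total_limits += cpu_to_millicpu(resources.get("limits", {}).get("cpu", 0))
    return total_requests, total_limits


containers = [
    {"resources": {"requests": {"cpu": "250m"}, "limits": {"cpu": "500m"}}},
    {"resources": {"requests": {"cpu": "0.5"}, "limits": {"cpu": "1"}}},
]
print(pod_cpu_totals(containers))
```

In this example the two containers request 250 and 500 millicpu, and are limited to 500 and 1000 millicpu, so the Pod totals are 750 and 1500 millicpu.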

CPU_capacity Indicates the total number of CPU cores available to the node to which this Pod is scheduled. Number  
Total_millicpu Indicates the CPU capacity of the node to which this Pod is scheduled. Millicpu  
Percent_cpu_limits Indicates what percentage of the capacity of the node is allocated as CPU limits to containers in this Pod. In other words, this is the percentage of a node's CPU capacity that the containers on this Pod are allowed to use. Percent The formula used for computing this measure is as follows:

(CPU limits/CPU capacity on node)*100

If the value of this measure exceeds 100%, it means that the Pod is oversubscribing to the node's capacity. In other words, it means that the Pod has been allowed to use more resources than the node's capacity.
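The formula and the 100% threshold above can be worked through with a short Python sketch. The function names and the sample figures (a node with 4000 millicpu) are hypothetical; the same arithmetic would apply to the request and memory percentages described in this section.

```python
def percent_of_capacity(amount, capacity):
    """Return amount as a percentage of the node's capacity (same units)."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return amount / capacity * 100


def is_oversubscribed(limit, capacity):
    """True when the Pod's limit exceeds the node's capacity."""
    return percent_of_capacity(limit, capacity) > 100


node_capacity = 4000  # millicpu, assumed for illustration
print(percent_of_capacity(1500, node_capacity))
print(is_oversubscribed(1500, node_capacity))
print(is_oversubscribed(6000, node_capacity))
```

A Pod limited to 1500 millicpu on this node is allowed 37.5% of its capacity, while a Pod limited to 6000 millicpu would be oversubscribing at 150%.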

Percent_cpu_request Indicates what percentage of the total CPU capacity of the node is set as CPU requests for the containers on this Pod. In other words, this is the percentage of a node's CPU capacity that the containers on this Pod are guaranteed to receive. Percent The formula used for computing this measure is as follows:

(CPU requests/CPU capacity on node)*100

Compare the value of this measure across Pods to know which Pod has been guaranteed the maximum CPU resources.

MilliCpu_usage Indicates the amount of CPU resources used by this Pod. Millicpu Ideally, the value of this measure should be much lower than the value of the CPU capacity on node measure. If the value of this measure is equal to or is rapidly approaching the value of the CPU capacity on node measure, it means that the Pod is over-utilizing the CPU resources of the node.
Cpu_usage Indicates the percentage of CPU resources utilized by this Pod. Percent A value close to 100% is indicative of excessive CPU usage by a Pod, and hints at a potential CPU contention on the node.

A value greater than 100% implies that the Pod has probably over-subscribed to the node's capacity.

NonCpu_limit Indicates the number of containers in this Pod for which CPU limits are not set. Number If limit is not set, then it defaults to 0 (unbounded).
NonCpu_request Indicates the number of containers in this Pod for which CPU requests are not set. Number In the case that request is not set for a container, it defaults to limit.
Memory_request Indicates the minimum memory resources guaranteed to this Pod. GB This is the sum of memory requests configured for all containers in a Pod.

A request is the amount of that resource that the system will guarantee to the Pod.

Memory_limits Indicates the maximum amount of memory resources that this Pod can use. GB This is the sum of memory limits set for all containers in a Pod.

A limit is the maximum amount that the system will allow the Pod to use.

Memory_capacity Indicates the memory capacity of the node to which this Pod is scheduled. GB  
Percent_memory_limits Indicates what percentage of the memory capacity of the node is allocated as memory limits to containers in this Pod. In other words, this is the percentage of a node's memory capacity that the containers on this Pod are allowed to use. Percent The formula used for computing this measure is as follows:

(Memory limits/Memory capacity on node)*100

If the value of this measure exceeds 100%, it means that the Pod is oversubscribing to the node's capacity. In other words, it means that the Pod has been allowed to use more resources than the node's capacity.

Percent_memory_request Indicates what percentage of the total memory capacity of the node is set as memory requests for the containers on this Pod. In other words, this is the percentage of a node's memory capacity that the containers on this Pod are guaranteed to receive. Percent The formula used for computing this measure is as follows:

(Memory requests/Memory capacity on node)*100

Compare the value of this measure across Pods to know which Pod has been guaranteed the maximum memory resources.

Memory_usage Indicates the amount of memory resources used by this Pod. GB Ideally, the value of this measure should be much lower than the memory capacity of the node (see the Memory_capacity measure). If the value of this measure is equal to or is rapidly approaching the memory capacity of the node, it means that the Pod is over-utilizing the memory resources of the node.

Memory_utilization Indicates the percentage of memory resources utilized by this Pod. Percent A value close to 100% is indicative of excessive memory usage by a Pod, and hints at a potential memory contention on the node.

A value greater than 100% implies that the Pod has probably over-subscribed to the node's capacity.

NonMemory_limit Indicates the number of containers in this Pod for which memory limits are not set. Number If limit is not set, then it defaults to 0 (unbounded).
NonMemory_request Indicates the number of containers in this Pod for which memory requests are not set. Number In the case that request is not set for a container, it defaults to limit.