Measures reported by KuberJobTest
A Job creates one or more Pods and ensures that a specified number of them successfully terminate. As Pods successfully terminate, the Job tracks how many Pods completed their tasks successfully. When a specified number of successful completions is reached, the task (i.e., the Job) is complete.
Jobs are useful for large computation and batch-oriented tasks. Jobs can be used to support parallel execution of Pods. You can use a Job to run independent but related work items in parallel: sending emails, rendering frames, transcoding files, scanning database keys, etc.
In the real world, failure of such tasks can degrade the performance of business-critical applications managed by the Kubernetes system. Likewise, delays in Job execution can significantly delay the delivery of key business services that overlay the Kubernetes cluster. To ensure peak application/service performance at all times, it is imperative that administrators track the status and duration of each Job that is run on Kubernetes, promptly capture Job failures and slowness, rapidly determine the reason why a Job failed, and swiftly fix it. This is where the KuberJobTest test helps!
This test auto-discovers the namespaces configured in the Kubernetes system, and for each namespace, reports the count of Jobs in different operational states. In the process, the test brings failed and slow Jobs to light. The detailed diagnostics of the test describe the failed and slow Jobs and provide the reason why each Job failed. Administrators can use this information to effectively troubleshoot the failure. Additionally, the test reports the status of Pods created by the Jobs, and alerts administrators if any Job resulted in Pod failures.
Outputs of the test: One set of results for each namespace in the Kubernetes cluster being monitored.
The measures made by this test are as follows:
Completed_job
Description: Indicates the number of Jobs in this namespace that have completed execution.
Measurement Unit: Number
Interpretation: A non-parallel Job is one that creates only one Pod. Such a Job is said to have completed if that Pod terminates successfully. On the other hand, a parallel Job is one that creates multiple Pods. In the case of such Jobs, you need to specify the desired number of completions using the completions field in your Job specification. A parallel Job is said to have completed only if the desired number of Pods terminate successfully. A high value is desired for this measure.
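For instance, a parallel Job manifest might set the completions field (and, optionally, parallelism) as sketched below. This is an illustrative example only; the Job name, image, and command are placeholders.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-parallel-job   # placeholder name
spec:
  completions: 5    # the Job is Complete once 5 Pods terminate successfully
  parallelism: 2    # run at most 2 Pods at a time
  template:
    spec:
      containers:
      - name: worker
        image: busybox   # placeholder image
        command: ["sh", "-c", "echo processing one work item"]
      restartPolicy: Never
```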
Failed_job
Description: Indicates the number of Jobs in this namespace that failed.
Measurement Unit: Number
Interpretation: A Job is said to have failed if the specified number of Pods could not complete their tasks.

By default, a Job runs uninterrupted unless a Pod fails (restartPolicy=Never) or a Container exits in error (restartPolicy=OnFailure), at which point the Job retries Pod creation. However, there are situations where you want to fail a Job after some number of retries (for example, due to a logical error in its configuration). To do so, set .spec.backoffLimit to the number of retries allowed before the Job is considered failed. The back-off limit is 6 by default. Once .spec.backoffLimit has been reached, the Job is marked as failed and any running Pods are terminated.

Another way to fail a Job is by setting an active deadline: set the .spec.activeDeadlineSeconds field of the Job to a number of seconds. The activeDeadlineSeconds applies to the duration of the Job as a whole, no matter how many Pods are created. Once a Job reaches activeDeadlineSeconds, all of its running Pods are terminated and the Job status becomes type: Failed with reason: DeadlineExceeded.

Note that a Job's .spec.activeDeadlineSeconds takes precedence over its .spec.backoffLimit. Therefore, a Job that is retrying one or more failed Pods will not deploy additional Pods once it reaches the time limit specified by activeDeadlineSeconds, even if the backoffLimit is not yet reached.

Ideally, the value of this measure should be 0. If the measure reports a non-zero value, use the detailed diagnosis of this measure to know which Jobs failed and why.
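The two failure controls described above can be combined in a single Job specification, as in this minimal sketch (the Job name, image, and intentionally failing command are placeholders for illustration):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-bounded-job   # placeholder name
spec:
  backoffLimit: 4              # mark the Job Failed after 4 retries (default is 6)
  activeDeadlineSeconds: 600   # fail the Job and kill its Pods after 10 minutes
  template:
    spec:
      containers:
      - name: worker
        image: busybox   # placeholder image
        command: ["sh", "-c", "exit 1"]   # always exits in error, to exercise the retry limit
      restartPolicy: Never
```

Because activeDeadlineSeconds takes precedence, this Job is terminated at the 10-minute mark even if fewer than 4 retries have occurred.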
Running_pods
Description: Indicates the number of Pods created by Jobs in this namespace, which are currently in the Running state.
Measurement Unit: Number
Interpretation: If a Pod is in the Running state, it means that the Pod has been bound to a node, and all of the Containers have been created. At least one Container is still running, or is in the process of starting or restarting.
Failed_pods
Description: Indicates the number of Pods created by Jobs in this namespace, which are currently in the Failed state.
Measurement Unit: Number
Interpretation: If a Pod is in the Failed state, it means that all Containers in the Pod have terminated, and at least one Container has terminated in failure. That is, the Container either exited with a non-zero status or was terminated by the system.
Succeeded_pods
Description: Indicates the number of Pods created by Jobs in this namespace, which are currently in the Succeeded state.
Measurement Unit: Number
Interpretation: If a Pod is in the Succeeded state, it means that all Containers in the Pod have terminated in success, and will not be restarted.
Longest_job
Description: Indicates the number of Jobs in this namespace that have been running for a duration greater than the value of the JOB AGE SECONDS parameter.
Measurement Unit: Number
Interpretation: Ideally, the value of this measure should be 0. If this measure reports a non-zero value, use the detailed diagnosis of this measure to know which Jobs have been executing for a long time.
Active_cronJob
Description: Indicates the number of CronJobs that are currently active in this namespace.
Measurement Unit: Number
Interpretation: A CronJob creates Jobs on a time-based schedule. One CronJob object is like one line of a crontab (cron table) file. It runs a Job periodically on a given schedule, written in Cron format.
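A CronJob manifest might look like the following sketch. The name, image, and command are placeholders; the schedule field uses standard Cron format.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: example-nightly-report   # placeholder name
spec:
  schedule: "0 2 * * *"   # Cron format: run at 02:00 every day
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: report
            image: busybox   # placeholder image
            command: ["sh", "-c", "echo generating report"]
          restartPolicy: OnFailure
```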