eG Monitoring
 

Measures reported by AzrBatAccTest

Use Azure Batch to run large-scale parallel and high-performance computing (HPC) batch jobs efficiently in Azure. Azure Batch creates and manages a pool of compute nodes (virtual machines), installs the applications you want to run, and schedules jobs to run on the nodes.

The steps in a common Batch workflow, in which a client application or hosted service uses Batch to run a parallel workload, are given below:

  1. Upload input files and the applications to process those files to your Azure Storage account.

  2. Create a Batch pool of compute nodes (VMs) in your Batch account, a job to run the workload on the pool, and tasks in the job. When you add tasks to a job, the Batch service automatically schedules the tasks for execution on the compute nodes in the pool.

  3. Each task then downloads the input files that it needs to process, and the application that should process the files, to the assigned node. When the downloads from Azure Storage complete, the task executes on the assigned node.

  4. Monitor task execution by querying Batch.

  5. As tasks complete, they upload their result data to Azure Storage.

  6. When your monitoring detects that the tasks in your job have completed, your client application or service can download the output data for further processing.
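Steps 4 and 6 above amount to a polling loop. The following is a minimal sketch in Python, assuming a hypothetical `fetch_task_states` callable that returns the current state of every task in the job; in a real client, that callable would wrap a query to the Batch service. The state names (`active`, `preparing`, `running`, `completed`) follow the Batch task lifecycle, while the function names, the polling interval, and the poll limit are illustrative assumptions and not part of any eG or Azure API.

```python
import time
from typing import Callable, Iterable


def job_is_complete(task_states: Iterable[str]) -> bool:
    """Return True when the job has at least one task and every task is 'completed'."""
    states = list(task_states)
    return bool(states) and all(s == "completed" for s in states)


def wait_for_job(fetch_task_states: Callable[[], Iterable[str]],
                 poll_seconds: float = 30.0,
                 max_polls: int = 120) -> bool:
    """Poll task states (step 4) until all tasks complete.

    fetch_task_states is a hypothetical callable supplied by the caller.
    Returns True once every task is completed (step 6 can then download
    the output), or False if max_polls is exhausted first.
    """
    for _ in range(max_polls):
        if job_is_complete(fetch_task_states()):
            return True
        time.sleep(poll_seconds)
    return False
```

Note that a task in the `completed` state has finished running but has not necessarily succeeded, so a real client would also inspect exit codes before using the output.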

Because it can quickly and efficiently process large volumes of data, Azure Batch is well suited to building SaaS applications or client apps that require large-scale execution, e.g., VFX and 3D image rendering, media transcoding, and software testing. However, if a task fails, if one or more compute nodes become unusable and exit the pool, or if the Batch service itself encounters errors, the pace of job processing will drop. This, in turn, can degrade the performance of the dependent SaaS/client apps, ultimately affecting the user experience with those apps.

Also, your batch workloads are often processed according to default limits and quotas pre-defined at the subscription / batch account level. Therefore, before designing or scaling up workloads, it is important to know the current quota definitions and whether they suit your needs. Improperly set quotas/limits can impede batch job processing.

To ensure that Azure Batch provides robust processing services at all times, you need to constantly track the status of the service and that of the tasks and compute nodes it manages, detect abnormalities on the fly, and resolve them before the user experience is impacted. This is where the AzrBatAccTest helps!

This test monitors the Azure Batch Service for each resource group in the target subscription, and reports the current status of the service. Alerts are sent out if the service is in an abnormal state. The test also tracks the progress of tasks and the status of compute nodes in the pool, and notifies administrators if there are failures. By shedding light on current/potential snags in batch processing, the test prompts administrators to initiate corrective/pre-emptive action immediately, so that the dependent applications do not slow down. Additionally, the test reports the count of compute nodes and cores in use, so you can quickly determine whether the quotas for them are being exceeded. This way, the test prompts you to increase/decrease the quotas/limits, so that they match your business workload and processing requirements.
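Comparing the reported core and node counts against the configured quotas is simple arithmetic. The sketch below is illustrative only: the `QuotaUsage` class, the `quota_alerts` function, and the 80% warning threshold are assumptions made for this example, and the numbers are made up. In practice, the usage figures would come from the test's measures and the limits from the quota settings shown in its detailed diagnosis.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class QuotaUsage:
    """One quota paired with its current usage (hypothetical structure)."""
    name: str
    used: int
    limit: int

    @property
    def percent_used(self) -> float:
        # A limit of 0 with any usage is treated as fully consumed.
        if self.limit <= 0:
            return 100.0 if self.used > 0 else 0.0
        return 100.0 * self.used / self.limit


def quota_alerts(usages: Iterable[QuotaUsage],
                 warn_at_percent: float = 80.0) -> List[str]:
    """Return the names of quotas whose usage is at or above the threshold."""
    return [u.name for u in usages if u.percent_used >= warn_at_percent]


# Example with made-up numbers: 18 of 20 cores and 5 of 20 nodes in use.
usages = [QuotaUsage("cores", 18, 20), QuotaUsage("nodes", 5, 20)]
print(quota_alerts(usages))
```

With these sample values, only the core quota crosses the 80% threshold, which would suggest requesting a higher core quota before scaling the workload further.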

Outputs of the test: One set of results for the Batch service of each resource group in the target subscription

The measures made by this test are as follows:

Measurement Description Measurement Unit Interpretation
Status Indicates the current status of this batch service.   The values reported by this measure and their numeric equivalents are listed in the table below:

Measure Value Numeric Value
Succeeded 1
Updating 2
Error 3


Note:

By default, this measure reports the Measure Values listed in the table above to indicate the current status of the batch service. In the graph of this measure, however, the status is represented by its numeric equivalent only.

Use the detailed diagnosis of this measure to know all about the batch service. The details displayed as part of detailed diagnostics include the pool allocation mode (batch service or user subscription mode), and the quota settings.
Core_cnt Indicates the total number of cores used by this batch service. Number  
Tot_node_cnt Indicates the total number of compute nodes used by this batch service. Number  
Tot_Low_Prirt_core_cnt Indicates the number of low-priority nodes used by this batch service. Number Azure Batch offers low-priority virtual compute nodes (VMs) to reduce the cost of Batch workloads. Low-priority VMs make new types of Batch workloads possible by enabling a large amount of compute power to be used for a very low cost.

Low-priority VMs take advantage of surplus capacity in Azure. When you specify low-priority VMs in your pools, Azure Batch can use this surplus, when available.

The tradeoff for using low-priority VMs is that those VMs may not always be available to be allocated, or may be preempted at any time, depending on available capacity. For this reason, low-priority VMs are most suitable for batch and asynchronous processing workloads where the job completion time is flexible and the work is distributed across many VMs.
Creating Indicates the number of compute nodes in the pools managed by this batch service that are in the Creating state currently. Number Nodes that are in the Creating state are Azure-allocated VMs that have not yet started to join a pool.
Starting Indicates the number of compute nodes in the pools managed by this batch service that are in the Starting state currently. Number A node on which the Batch service is starting is said to be in the Starting state.
Waitng_for_strt_tsk Indicates the number of compute nodes in the pools managed by this batch service that are in the WaitingForStartTask state currently. Number A node is said to be in the WaitingForStartTask state if the start task has started running on that node, but waitForSuccess is set and the start task has not completed.
Strt_tsk_faild Indicates the number of compute nodes in the pools managed by this batch service that are in the StartTaskFailed state currently. Number A node is said to be in the StartTaskFailed state if the start task failed on that node after exhausting all retries, and waitForSuccess is set on the start task. Such a node is not usable for running tasks. Ideally therefore, the value of this measure should be 0.
Idle Indicates the number of compute nodes in the pools managed by this batch service that are in the Idle state currently. Number An available compute node that is not currently running a task is said to be in an Idle state.
Offline_status Indicates the number of compute nodes in the pools managed by this batch service that are in the Offline state currently. Number An offline node is one that Batch cannot use to schedule new tasks.
Rebooting Indicates the number of compute nodes in the pools managed by this batch service that are in the Rebooting state currently. Number A node that is restarting is a Rebooting node.
Reimaging Indicates the number of compute nodes in the pools managed by this batch service that are in the Reimaging state currently. Number A Reimaging node is one on which the operating system is being reinstalled.
Running Indicates the number of compute nodes in the pools managed by this batch service that are in the Running state currently. Number A Running node is one that is running one or more tasks (other than the start task).
Leaving_pool Indicates the number of compute nodes in the pools managed by this batch service that are in the LeavingPool state currently. Number A node assumes the LeavingPool state if it is leaving the pool, either because the user explicitly removed it or because the pool is resizing or autoscaling down.
Unusable Indicates the number of compute nodes in the pools managed by this batch service that are in the Unusable state currently. Number A node that cannot be used for task execution because of errors switches to the Unusable state. Ideally therefore, the value of this measure should be 0.
Preempted Indicates the number of compute nodes in the pools managed by this batch service that are in the Preempted state currently. Number A Preempted node is a low-priority node that was removed from the pool because Azure reclaimed the VM. A preempted node can be reinitialized when replacement low-priority VM capacity is available.
Tsk_strt_evt Indicates the number of ‘task start events’ that were emitted by this batch service. Number A ‘Task start’ event is emitted once a task has been scheduled to start on a compute node by the scheduler. Note that if the task is retried or requeued this event will be emitted again for the same task, but the retry count and system task version will be updated accordingly.
Tsk_cmplt_evt Indicates the number of ‘task complete events’ that were emitted by this batch service. Number A ‘Task complete’ event is emitted once a task is completed, regardless of the exit code. This event can be used to determine the duration of a task, where the task ran, and whether it was retried.
Tsk_fail_evt Indicates the number of ‘task fail events’ that were emitted by this batch service. Number The ‘Task fail’ event is emitted when a task completes with a failure. Currently all nonzero exit codes are considered failures. This event will be emitted in addition to a task complete event and can be used to detect when a task has failed. Ideally therefore, the value of this measure should be 0.
Pl_creat_evt Indicates the number of ‘pool create events’ that were emitted by this batch service. Number The ‘Pool create’ event is emitted once a pool has been created.
Pl_resz_strt_evt Indicates the number of ‘pool resize start events’ that were emitted by this batch service. Number The ‘Pool resize start’ event is emitted when a pool resize has started. Such an event is typically triggered if the target size of the pool is greater than 0 compute nodes.
Pl_resz_cmplt_evt Indicates the number of ‘pool resize complete events’ that were emitted by this batch service. Number The ‘Pool resize complete’ event is emitted when a pool resize has completed or failed.
Pl_dlt_strt_evt Indicates the number of ‘pool delete start events’ that were emitted by this batch service. Number The ‘Pool delete start’ event is emitted when a pool delete operation has started.
Pl_dlt_cmplt_evt Indicates the number of ‘pool delete complete events’ that were emitted by this batch service. Number The ‘Pool delete complete’ event is emitted when a pool delete operation has completed.
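The interpretation guidance in the table above can be condensed into a small check. The following is a minimal sketch, assuming the measures for one batch service have already been collected into a plain dictionary keyed by measure name; the `evaluate` function and the dictionary layout are hypothetical and are not part of the eG product. The status-to-number mapping and the "ideally 0" measures are taken from the table. Treating only `Error` as a problem status is an assumption of this example; whether `Updating` should also raise an alert is a local decision.

```python
from typing import Dict, List, Union

# Numeric equivalents of the Status measure, as listed in the table.
STATUS_TO_NUMERIC = {"Succeeded": 1, "Updating": 2, "Error": 3}

# Measures whose value, per the Interpretation column, should ideally be 0.
IDEAL_ZERO_MEASURES = ("Strt_tsk_faild", "Unusable", "Tsk_fail_evt")


def evaluate(measures: Dict[str, Union[str, int]]) -> List[str]:
    """Return human-readable findings for one batch service (hypothetical helper)."""
    findings = []
    status = measures.get("Status")
    if status == "Error":  # assumption: only Error is flagged
        findings.append(
            f"Status is {status} (numeric value {STATUS_TO_NUMERIC[status]})"
        )
    for name in IDEAL_ZERO_MEASURES:
        value = measures.get(name, 0)
        if value > 0:
            findings.append(f"{name} = {value}, expected 0")
    return findings


# Example with made-up values.
sample = {"Status": "Error", "Strt_tsk_faild": 2, "Unusable": 0, "Tsk_fail_evt": 1}
for finding in evaluate(sample):
    print(finding)
```

An empty result means none of the conditions the table singles out as undesirable are currently present.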