eG Monitoring
 

Measures reported by AWSEMapReduceTest

Amazon EMR is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data. Additionally, you can use Amazon EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.

The central component of Amazon EMR is the cluster. A cluster is a collection of Amazon Elastic Compute Cloud (Amazon EC2) instances. Each instance in the cluster is called a node. Each node has a role within the cluster, referred to as the node type. Amazon EMR also installs different software components on each node type (master node, core node, task node), giving each node a role in a distributed application like Apache Hadoop.

Amazon EMR service architecture consists of several layers, each of which provides certain capabilities and functionality to the cluster.

  • Storage layer: The storage layer includes the different file systems that are used with your cluster. There are several storage options, such as the Hadoop Distributed File System (HDFS), the EMR File System (EMRFS), and the local file system.

  • Cluster resource management layer: The resource management layer is responsible for managing cluster resources and scheduling the jobs for processing data.

  • Data processing frameworks layer: The data processing framework layer is the engine used to process and analyze data. The main processing frameworks available for Amazon EMR are Hadoop MapReduce and Spark.

  • Applications and programs: Amazon EMR supports many applications, such as Hive, Pig, and the Spark Streaming library, to provide various capabilities.

When a cluster is launched, you choose the frameworks and applications to install for your data processing needs. Once the chosen applications are installed, you define the work to be done by the cluster. This can be done by writing Map and Reduce functions, by using the EMR console, the EMR API, or the AWS CLI, by using the Hadoop API, or through the interface provided by Hive or Pig. The cluster then processes the data. To enable data processing in the cluster, you can either submit jobs or queries directly to the applications that are installed on your cluster, or run steps in the cluster. Once data processing is complete, the cluster automatically shuts down (unless auto-termination is disabled).
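The step-submission route mentioned above can be illustrated with a short sketch. The function below builds a Hadoop streaming step definition of the kind passed to the EMR API (for example, through boto3's add_job_flow_steps call); the bucket and script names are hypothetical placeholders, and no request is actually sent:

```python
# Sketch only: builds a Hadoop streaming step payload; nothing is
# submitted to AWS here. All S3 URIs below are hypothetical.
def streaming_step(name, mapper, reducer, input_uri, output_uri):
    args = [
        "hadoop-streaming",
        "-files", f"{mapper},{reducer}",   # ship the scripts to the cluster
        "-mapper", mapper.rsplit("/", 1)[-1],
        "-reducer", reducer.rsplit("/", 1)[-1],
        "-input", input_uri,
        "-output", output_uri,
    ]
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }

step = streaming_step(
    "wordcount",
    "s3://example-bucket/mapper.py",
    "s3://example-bucket/reducer.py",
    "s3://example-bucket/input/",
    "s3://example-bucket/output/",
)
print(step["Name"])
```

A payload like this would be the `Steps` entry in an add_job_flow_steps request against a running cluster.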

If data processing is slow in a cluster, administrators should be able to quickly and accurately tell what could be causing the slowness - map tasks? reduce tasks? storage contention at the cluster? under-utilization of task nodes and core nodes? To proactively detect processing bottlenecks in a cluster and isolate the root cause of a bottleneck, administrators should periodically run the AWSEMapReduceTest test.

This test automatically discovers the clusters on Amazon EMR and tracks the status of each cluster and the progress of cluster activities. In the process, the test turns the spotlight on clusters that are processing data slowly, and provides useful pointers to what could be slowing down processing.

Optionally, you can configure this test to report metrics for every job that a cluster runs. This enables administrators to identify the jobs that could be slowing down data processing, and to determine whether those slow jobs are performing map tasks or reduce tasks.

Outputs of the test: One set of results for each cluster / job.

First-level descriptor: AWS Region

Second-level descriptor: JobFlowId (Cluster ID) or JobId, depending upon the option chosen for the EMR FILTER NAME parameter of this test.

The measures made by this test are as follows:

Measurement Description Measurement Unit Interpretation
Is_idle Indicates whether/not this cluster is currently idle.   This measure is reported only for a cluster - i.e., only if the EMR FILTER NAME parameter is set to ‘JobFlowId’.

This measure reports the value Yes if the cluster is idle, and No if it is not idle. A cluster is said to be idle if it is no longer performing work, but is still alive and accruing charges.

The numeric values that correspond to these measure values are listed in the table below:

Measure Value Numeric Value
Yes 1
No 0

Note:

By default, this measure reports the above-mentioned Measure Values to indicate whether/not a cluster is idle. In the graph of this measure however, the same is indicated using the numeric equivalents.
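The Yes/No-to-numeric mapping in the table above is what the graphs use; a minimal sketch:

```python
# Numeric equivalents used when graphing the Is_idle measure,
# per the table above: Yes -> 1, No -> 0.
IS_IDLE_NUMERIC = {"Yes": 1, "No": 0}

def graph_value(measure_value: str) -> int:
    # Translate the reported Measure Value into its graphed equivalent.
    return IS_IDLE_NUMERIC[measure_value]

print(graph_value("Yes"), graph_value("No"))  # 1 0
```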

Jobs_failed Indicates the number of jobs that have failed in this cluster. Number This measure is reported only for a cluster - i.e., only if the EMR FILTER NAME parameter is set to ‘JobFlowId’.

Ideally, the value of this measure should be 0.
Jobs_running Indicates the number of jobs in this cluster that are currently running. Number This measure is reported only for a cluster - i.e., only if the EMR FILTER NAME parameter is set to ‘JobFlowId’.
Map_task_run By default, this measure reports the total number of map tasks that are currently running in this cluster.

If the EMR FILTER NAME parameter is set to JobId, then this measure will report the number of map tasks that are currently running for this job.
Number Hadoop MapReduce is one of the main data processing frameworks available for Amazon EMR. Hadoop MapReduce is an open-source programming model for distributed computing. It simplifies the process of writing parallel distributed applications by handling all of the logic, while you provide the Map and Reduce functions. The Map function maps data to sets of key-value pairs called intermediate results. To put it simply, the Map procedure performs filtering and sorting (such as sorting students by first name into queues, one queue for each name).
Map_task_remain By default, this measure reports the total number of map tasks that are still to be run in this cluster.

If the EMR FILTER NAME parameter is set to JobId, then this measure will report the number of map tasks that are still to be run for this job.
Number A remaining map task is one that is not in any of the following states: Running, Killed, or Completed.

A low value is desired for this measure. If the value of this measure is abnormally high for a cluster, you may want to change the EMR FILTER NAME of this test to JobId and see which job is still to run many map tasks. Such jobs could be slowing down the cluster.
Map_slots_open Indicates the unused map task capacity of this cluster. Number This measure is reported only for a cluster - i.e., only if the EMR FILTER NAME parameter is set to ‘JobFlowId’.

This is calculated as the maximum number of map tasks for a given cluster, less the total number of map tasks currently running in that cluster.

A high value indicates that the cluster can take more load than it currently handles. A very low value indicates that the cluster is already under duress, as it is running too many map tasks. The cluster may not be able to run any more map tasks until the ones currently running are either killed or completed. This can choke the cluster and slow down processing. Under such circumstances, you may want to increase the maximum number of map tasks the cluster can run, so as to increase its capacity.
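The Map_slots_open arithmetic described above is simply the configured maximum less the tasks currently running; a sketch (the figures are made up):

```python
def slots_open(max_map_tasks: int, running_map_tasks: int) -> int:
    # Unused map-task capacity: configured maximum less tasks running now.
    return max_map_tasks - running_map_tasks

# A cluster with a maximum of 40 map tasks and 33 running has 7 open slots.
print(slots_open(40, 33))  # 7
```

The same subtraction applies to Reduce_slots_open, with reduce-task figures in place of map-task figures.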
Reduce_Task_run By default, this measure reports the total number of reduce tasks that are currently running in this cluster.

If the EMR FILTER NAME parameter is set to JobId, then this measure will report the number of reduce tasks that are currently running for this job.
Number Hadoop MapReduce is one of the main data processing frameworks available for Amazon EMR. Hadoop MapReduce is an open-source programming model for distributed computing. It simplifies the process of writing parallel distributed applications by handling all of the logic, while you provide the Map and Reduce functions. The Reduce function combines the intermediate results (i.e., the key-value pairs) that are provided by the Map function, applies additional algorithms, and produces the final output. To put it simply, while the Map procedure performs filtering and sorting (such as sorting students by first name into queues, one queue for each name), the Reduce procedure performs a summary operation (such as counting the number of students in each queue, yielding name frequencies).
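The student analogy used above (Map sorts students into per-name queues; Reduce counts each queue) can be sketched as a small, runnable example; the roster is made up:

```python
from collections import defaultdict

def map_step(students):
    # Map: emit one intermediate (name, 1) pair per student.
    for student in students:
        yield (student, 1)

def shuffle(pairs):
    # Shuffle: group intermediate pairs into one "queue" per first name.
    queues = defaultdict(list)
    for name, one in pairs:
        queues[name].append(one)
    return queues

def reduce_step(queues):
    # Reduce: count each queue, yielding name frequencies.
    return {name: sum(ones) for name, ones in queues.items()}

roster = ["Ana", "Ben", "Ana", "Chi", "Ben", "Ana"]
print(reduce_step(shuffle(map_step(roster))))  # {'Ana': 3, 'Ben': 2, 'Chi': 1}
```

In a real MapReduce job, the framework performs the shuffle and runs the map and reduce phases across many nodes; only the two functions are user-supplied.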
Reduce_Task_remain By default, this measure reports the total number of reduce tasks that are still to be run in this cluster.

If the EMR FILTER NAME parameter is set to JobId, then this measure will report the number of reduce tasks that are still to be run for this job.
Number A remaining reduce task is one that is not in any of the following states: Running, Killed, or Completed.

A low value is desired for this measure. If the value of this measure is abnormally high for a cluster, you may want to change the EMR FILTER NAME of this test to JobId and see which job is still to run many reduce tasks. Such jobs could be slowing down the cluster.
Reduce_slots_open Indicates the unused reduce task capacity of this cluster. Number This measure is reported only for a cluster - i.e., only if the EMR FILTER NAME parameter is set to ‘JobFlowId’.

This is calculated as the maximum number of reduce tasks for a given cluster, less the total number of reduce tasks currently running in that cluster.

A high value indicates that the cluster can take more load than it currently handles. A very low value indicates that the cluster is already under duress, as it is running too many reduce tasks. The cluster may not be able to run any more reduce tasks until the ones currently running are either killed or completed. This can choke the cluster and slow down processing. Under such circumstances, you may want to increase the maximum number of reduce tasks the cluster can run, so as to increase its capacity.
Core_node_pending Indicates the number of core nodes in this cluster waiting to be assigned. Number This measure is reported only for a cluster - i.e., only if the EMR FILTER NAME parameter is set to ‘JobFlowId’.

A core node is a slave node with software components that run tasks and store data in the Hadoop Distributed File System (HDFS) on your cluster.

All the core nodes requested may not be immediately available. Until a core node is assigned, tasks cannot be run on it. Therefore, a high value for this measure is often indicative of many pending tasks/requests. This can also slow down data processing in a cluster.
Core_node_running Indicates the number of core nodes in this cluster that are working currently. Number This measure is reported only for a cluster - i.e., only if the EMR FILTER NAME parameter is set to ‘JobFlowId’.
Live_datanode Indicates the percentage of data nodes that are receiving work from Hadoop. Percent This measure is reported only for a cluster - i.e., only if the EMR FILTER NAME parameter is set to ‘JobFlowId’.

Core nodes run the Data Node daemon to coordinate data storage as part of the Hadoop Distributed File System (HDFS). They also run the Task Tracker daemon and perform other parallel computation tasks on data that installed applications require. For example, a core node runs YARN NodeManager daemons, Hadoop MapReduce tasks, and Spark executors.

From the value of this measure, you can infer how many of the Running core nodes in the cluster are running Hadoop tasks and how many are engaged in other non-Hadoop tasks.
TaskNodes_pending Indicates the number of task nodes in this cluster that are currently waiting to be assigned. Number This measure is reported only for a cluster - i.e., only if the EMR FILTER NAME parameter is set to ‘JobFlowId’.

A task node is a slave node with software components that only run tasks. Task nodes are optional.

All the task nodes requested may not be immediately available. Until a task node is assigned, tasks cannot be run on it. Therefore, a high value for this measure is often indicative of many pending tasks/requests. This can also slow down data processing in a cluster.
TaskNodes_running Indicates the number of task nodes in this cluster that are currently running. Number  
Live_task_trackers Indicates the percentage of task trackers that are functional. Percent This measure is reported only for a cluster - i.e., only if the EMR FILTER NAME parameter is set to ‘JobFlowId’.
S3_bytesRead Indicates the amount of data that this cluster read from Amazon S3. KB These measures are reported only for a cluster - i.e., only if the EMR FILTER NAME parameter is set to ‘JobFlowId’.

Amazon S3 can be used as the file system for a cluster. EMRFS (EMR File System) is an implementation of HDFS used for reading and writing regular files from Amazon EMR directly to Amazon S3.

These measures indicate the I/O generated by a cluster when reading from and writing into Amazon S3, using EMRFS. Compare the value of this measure across clusters to know which cluster is generating the maximum I/O.
S3_bytesWritten Indicates the amount of data written to Amazon S3. KB
HDFS_byte_read Indicates the amount of data read from HDFS (Hadoop Distributed File System) by this cluster. KB This measure is reported only for a cluster - i.e., only if the EMR FILTER NAME parameter is set to ‘JobFlowId’.
HDFS_byte_writen Indicates the amount of data written into HDFS (Hadoop Distributed File System) by this cluster. KB This measure is reported only for a cluster - i.e., only if the EMR FILTER NAME parameter is set to ‘JobFlowId’.

HDFS can be used as a file system for an Amazon EMR cluster. HDFS is a distributed, scalable file system for Hadoop. HDFS distributes the data it stores across instances in the cluster, storing multiple copies of data on different instances to ensure that no data is lost if an individual instance fails. HDFS is ephemeral storage that is reclaimed when you terminate a cluster. HDFS is useful for caching intermediate results during MapReduce processing or for workloads that have significant random I/O.

These measures indicate the I/O generated by a cluster when reading from and writing into HDFS. Compare the value of this measure across clusters to know which cluster is generating the maximum I/O.
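The cross-cluster comparison suggested above amounts to taking a maximum over the reported values; a sketch (the cluster IDs and KB figures are made up):

```python
# Hypothetical per-cluster HDFS_byte_read values, in KB.
hdfs_read_kb = {"j-ANALYTICS": 5_242_880, "j-ETL": 1_048_576, "j-ADHOC": 204_800}

# The cluster generating the maximum read I/O.
busiest = max(hdfs_read_kb, key=hdfs_read_kb.get)
print(busiest, hdfs_read_kb[busiest] / 1024 / 1024, "GB")  # j-ANALYTICS 5.0 GB
```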
HDFS_utilization Indicates the percentage of HDFS storage that this cluster is currently utilizing. Percent This measure is reported only for a cluster - i.e., only if the EMR FILTER NAME parameter is set to ‘JobFlowId’.
Missing_blocks Indicates the number of blocks in which the HDFS storage used by this cluster has no replicas. Number This measure is reported only for a cluster - i.e., only if the EMR FILTER NAME parameter is set to ‘JobFlowId’.
Total_load Indicates the total number of concurrent data transfers made by this cluster. Number This measure is reported only for a cluster - i.e., only if the EMR FILTER NAME parameter is set to ‘JobFlowId’.
HB_backup_failed Indicates whether/not the last backup failed for this cluster.   This measure is reported only for HBase clusters.

HBase is an open source, non-relational, distributed database developed as part of the Apache Software Foundation's Hadoop project.

With HBase on Amazon EMR, you can also back up your HBase data directly to Amazon Simple Storage Service (Amazon S3). If this backup fails for an HBase cluster, then this measure will report the value Yes. If the backup is successful, this measure will report the value No.

The numeric values that correspond to these measure values are listed in the table below:

Measure Value Numeric Value
Yes 1
No 0

Note:

By default, this measure reports the above-mentioned Measure Values to indicate whether/not an HBase backup failed. In the graph of this measure however, the same is indicated using the numeric equivalents only.

MostRecnt_back_durtn Indicates the amount of time it took the previous backup of this cluster to complete. Mins This measure is reported only for HBase clusters.

This metric is set regardless of whether the last completed backup succeeded or failed. While a backup is ongoing, this metric returns the number of minutes elapsed since that backup started. A very high value for this measure can therefore indicate a backup delay or a failure.
HBTS_suces_backup Indicates the number of minutes that have elapsed since the last successful HBase backup of this cluster started. Mins This measure is reported only for HBase clusters.