eG Monitoring
 

The Inside View System Dashboard

Although the Measures tab page of the System Dashboard provides all the metrics collected from within a particular VM, these metrics indicate only the current health of the VM and help you identify only its current issues. To obtain a true picture of the health of a VM, however, knowledge of its current status alone might not suffice; in addition, you would have to do the following:

  • Analyze the overall performance of a VM over time to deduce performance trends and past problems;
  • Closely observe the changes in the current and historical usage of each of the virtual resources (CPU/memory/disk), so as to deduce usage patterns and potential resource contentions;
  • Study the network traffic handled and bandwidth used by the VM over time, so that sudden/steady increases in traffic/bandwidth can be promptly detected;
  • Determine the uptime of a VM by observing how long the VM has been up and running and how frequently it was rebooted during a specific time window.

To enable you to perform this analysis, eG Enterprise offers dedicated System and Network dashboards for every guest operating system on an ESX server. The sections that follow will help you understand each of these dashboards clearly.

The System Dashboard

Clicking on the System tab page takes you to the System Dashboard.

The System Dashboard, using raw and graphically represented data, helps you understand how the critical resources of the guest operating system are currently being utilized, and how usage has changed over time. Besides enabling you to promptly detect over-utilization of resources, this dashboard also helps you accurately determine when such an unhealthy usage trend began and what caused it: is it owing to the improper allocation of virtual resources, or to the execution of one or more resource-intensive processes on the VM? The contents of the dashboard vary based on the Subsystem chosen. By default, the System Dashboard provides an overview of VM performance; accordingly, the Overview option is chosen from the Subsystem list by default. To zoom into the usage of a particular resource allocated to the VM, pick the corresponding option from the Subsystem list (i.e., CPU/Disk/Memory). On the other hand, if you want to analyze the uptime of the VM, choose Uptime as the Subsystem. Each of these subsystems is dealt with in detail in the sections that follow.

Overview

To receive a macro view of the health of a VM, select Overview from the Subsystem list. The System Overview dashboard then appears.

The contents of the Overview dashboard are as follows:

  • The Current System Alerts section indicates the number and type of issues currently encountered by the guest operating system. To know more about these issues, move your mouse pointer over a specific cell in the Distribution section; this reveals the details of the current alarms of that type.
  • Clicking on a particular problem's details in the System Overview dashboard leads you to the problem layer, test, and measure.
  • While the Current System Alerts section focuses on current problems, the History of Events section provides a snapshot of problem situations in the past. By default, the bar chart in this section indicates the number and type (i.e., priority) of problems that affected VM operations during the last 24 hours. To know what these issues were, click on a particular bar in the bar chart. The EVENT HISTORY page then appears, providing elaborate details of the problems of that priority.
  • Click on the Back to Dashboard link at the top right corner of the EVENT HISTORY page to return to the dashboard.
  • Once back in the dashboard, shift your focus to the Current System Status section. The pie chart in this section helps you quickly determine how healthy the guest operating system currently is. The slices in this pie chart indicate the percentage of measures that are currently in various states (normal/unknown/major/minor/critical). Clicking on a slice leads you, yet again, to the EVENT HISTORY page.
  • To know the current configuration of the VM, refer to the System Configuration section. This section will appear only if the eG license enables the Configuration Management module. If not, then a Current Layerwise Health section will appear instead. This section displays a bar chart that indicates the percentage of system-related measures mapped to the Virtual Servers layer that are in various states currently; this indicates how problem-prone the VM currently is.
  • To be promptly alerted to any change in the state of critical system-related parameters, refer to the Key Performance Indicators section. This section displays a pre-configured list of key resource usage metrics that govern the overall health of the VM. For every measure, the current value reported by that measure and the measure state will be available. Clicking on a measure name here will take you to the layer model page, which reveals the problem layer and the test that reports the measure.
  • Below the Key Performance Indicators section, the dashboard provides a variety of measure graphs and top-n bar charts that facilitate effective analysis of both past and current performance. The graphs section begins with a series of measure graphs that reveal the time-of-day variations in virtual resource usage during the last 1 hour (by default). In the event of excessive resource usage by the VM, these graphs will help you determine whether the increase in resource consumption happened suddenly or over time. This way, you can accurately figure out when a resource contention actually originated, and then investigate further as to what caused it. Some of these measure graphs may be plotted for multiple descriptors, and may hence appear cluttered. To view these graphs clearly and to analyze their performance implications better, click on the corresponding graph to enlarge it. In the enlarged mode, you can even change the Timeline of the graph, so as to understand how performance has varied over a longer period of time. Similarly, you can change the dimension of the enlarged graph from 3D to 2D, if need be.
  • After the measure graphs, top-n bar charts are available in the dashboard. These bar charts reveal the most CPU-intensive, memory-intensive, and I/O-intensive processes that are currently executing on the VM, thus taking you a step closer to the root cause of any resource contention that currently exists on the VM. Sometimes, you might want to investigate resource usage issues that occurred in the past, in an effort to zero in on the process(es) that caused them. To achieve this, you need to alter the timeline of the bar charts. For this purpose, click on the required bar chart to enlarge it. The enlarged bar chart allows you to change the Timeline, so that you can easily view the top resource-intensive processes that executed on the VM during the specified time period, and quickly narrow down the root cause.
  • You can even access the detailed diagnosis information that corresponds to your enlarged graph by clicking on the button at the top right corner of the enlarged top-n bar chart.
  • The detailed diagnosis page will then appear, revealing the detailed diagnosis information.
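
The Current System Status pie chart described above reports the percentage of measures in each state. The arithmetic behind such a chart can be sketched in Python as follows. This is only an illustration, not eG Enterprise code: the function name `state_distribution` and the sample data are hypothetical, and the state names are simply the ones listed in the text.

```python
from collections import Counter

# State names as listed in the text; the ordering here is an assumption.
STATES = ("normal", "unknown", "minor", "major", "critical")


def state_distribution(measure_states):
    """Return the percentage of measures in each state, rounded to 1 decimal."""
    total = len(measure_states)
    if total == 0:
        # No measures reported: every slice is empty.
        return {state: 0.0 for state in STATES}
    counts = Counter(measure_states)
    return {state: round(100.0 * counts[state] / total, 1) for state in STATES}


# Illustrative data only: eight measures and their current states.
sample = ["normal"] * 5 + ["minor", "major", "critical"]
for state, pct in state_distribution(sample).items():
    print(f"{state:9s} {pct:5.1f}%")
```

With this sample, five of the eight measures are normal, so the normal slice would cover 62.5% of the pie.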

CPU

While the Overview subsystem provides you with quick insights into the overall resource usage situation on the VM, to know how the individual resources are utilized, you would have to select the corresponding option from the Subsystem list. Select CPU as the Subsystem if you want to focus only on the current and historical CPU usage of the VM, and promptly detect potential CPU bottlenecks. Upon selecting this option, the dashboard of the CPU subsystem appears.

  • The CPU Usage Summary section of this dashboard lists the processors supported by your VM and the percentage of CPU utilized by each processor. In the event of excessive CPU usage by the VM, you can use this section to quickly identify the processor that is contributing to this problem condition.
  • Below this section, a collection of measure graphs and top-n bar charts is available to facilitate effective analysis of CPU usage over time. Using the measure graphs displayed here, you can figure out how CPU was utilized by the VM during the last 1 hour (by default). In the event of unusually high CPU consumption by the VM, these measure graphs will help you understand whether this increase in CPU usage is just a sudden, one-off incident, a phenomenon that occurred frequently during the last hour (by default), or an anomaly that aggravated over time.
  • To change the timeline of the measure graphs, so that usage analysis can be performed over longer time periods, click on the measure graph of interest to you. This will enlarge the measure graph. In the magnified mode, you can modify the Timeline of the graph and its dimension (3D/2D).
  • Besides the measure graphs, the System Dashboard of the CPU Subsystem displays a Top CPU consuming processes table, and a CPU usage by top processes graph. The Top CPU consuming processes table reveals the top CPU consumers on the VM during the last 1 hour (by default). Besides, the table also indicates how high and how low the CPU usage of each process has scaled during the same hour. This way, you can understand whether the high CPU usage of a process was just a sudden spike or a consistent phenomenon.
  • The CPU Usage by Top Processes chart graphically compares the CPU usage of the processes that were executing on the VM during the last 1 hour (by default), and enables the quick and easy identification of the process that is the top CPU consumer. Clicking on this chart enlarges it, enabling you to perform the comparative analysis more effectively.
  • You can even click on the icon in the enlarged CPU usage by top processes graph to view the detailed diagnosis page, which reveals the top CPU-consuming processes that were executing on the VM during the specified Timeline.
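
The Top CPU consuming processes table described above pairs an average with the lowest and highest usage of each process, which is what lets you tell a sudden spike from consistently high usage. A minimal Python sketch of how such a table could be derived from raw samples is shown below. The function `top_cpu_consumers`, the process names, and the sample values are all hypothetical and are not taken from eG Enterprise.

```python
from statistics import mean


def top_cpu_consumers(samples, n=3):
    """Summarize per-process CPU samples.

    samples: dict mapping a process name to a list of CPU usage percentages.
    Returns up to n rows of (name, avg, min, max), highest average first.
    """
    rows = [
        (name, round(mean(values), 1), min(values), max(values))
        for name, values in samples.items()
        if values  # skip processes with no samples
    ]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:n]


# Made-up samples for one hour of polling.
samples = {
    "java": [70.0, 72.0, 71.0, 69.0],   # consistently high
    "backup": [5.0, 4.0, 95.0, 6.0],    # low on average, with one spike
    "sshd": [1.0, 1.0, 2.0, 1.0],
}
for name, avg, low, high in top_cpu_consumers(samples):
    print(f"{name:8s} avg={avg:5.1f} min={low:5.1f} max={high:5.1f}")
```

In this made-up data, `java` has a narrow min-max range around a high average (consistent load), whereas `backup` has a low average but a high maximum (a one-off spike).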

Memory

Be proactively alerted to current and probable memory contentions with the help of the Memory Dashboard that appears when the Memory option is chosen as the Subsystem.

The contents of this dashboard are as follows:

  • Using the Memory Usage Summary section, you can instantly figure out whether the VM in question is currently experiencing a memory crunch or not.
  • If a memory contention is noticed, then the graphs provided in the dashboard will enable you to figure out when the memory drain began, and what caused it. The graphs section typically begins with measure graphs which indicate how the VM has been using its allocated memory resources over the past 1 hour (by default). In the event of excessive memory usage by the VM, you can use these graphs to figure out whether the VM suddenly became memory-hungry, or has been consuming memory excessively for a while. If the memory usage has been consistently high, then you might want to go further back in time - i.e., beyond the default 1 hour period - to determine when exactly the memory depletion began. For this, you will have to change the timeline of that measure graph that will aid your investigation. To alter the timeline of a graph, click on that graph to enlarge it, and then modify its Timeline.
  • On the other hand, if the measure graphs indicate that memory consumption increased only in the last hour, then you might want to know whether any memory-intensive processes were executing on the VM during the same period. To know which processes were executing and which one of them is the leading memory consumer, you can use either the Top Memory Consuming Processes table in the dashboard or the Memory usage by top processes graph. The table primarily reveals how much memory has been utilized, on average, by each process on the VM during the last 1 hour (by default); in addition, it also displays the minimum and maximum percentage of memory utilized by every process in the same 1 hour period. Besides enabling you to accurately identify the most memory-intensive process on the VM, this table also helps you isolate those processes that have displayed erratic memory usage trends in the last 1 hour. The graph, on the other hand, aids you in visually comparing the memory usage of the processes on the VM during the last hour, and instantly identifying the process that is consuming the most memory.
  • If too many processes are executing on the VM, the graph is bound to appear cluttered. You can, therefore, click on the graph to expand it and view the plotted values clearly.
  • You can, if need be, alter the Timeline of the enlarged graph.
  • You can even click on the icon in the enlarged graph to view the detailed diagnosis page, which lists the Process ID (PID) of the top-10 memory-consuming processes that were executing on the VM during the specified Timeline, and the memory used by each process.
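
The distinction drawn above, between a VM that suddenly became memory-hungry and one that has been consuming memory excessively for a while, can be illustrated with a small Python sketch. This is a rough heuristic under stated assumptions, not the logic used by the dashboard: the function `classify_usage`, the 80% threshold, and the "at least half of the samples" cutoff are all hypothetical choices.

```python
def classify_usage(values, threshold=80.0):
    """Classify a series of memory usage percentages (oldest first).

    Returns "normal" if no sample reaches the threshold, "sustained" if at
    least half of the samples do (assumed cutoff), and "sudden" otherwise.
    """
    if not values:
        return "normal"
    high = [v >= threshold for v in values]
    if not any(high):
        return "normal"
    if sum(high) >= len(values) / 2:
        return "sustained"
    return "sudden"


# Made-up hourly series.
print(classify_usage([40, 42, 41, 95, 43, 40]))   # one isolated peak
print(classify_usage([85, 88, 90, 87, 91, 89]))   # high throughout
```

A series classified as sustained is the case in which, as noted above, you would extend the graph Timeline beyond the default hour to find when the depletion began.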

Disk

To closely observe disk usage and disk activity on a VM and to proactively identify partitions that are running out of space, select Disk from the Subsystem list. The Disk Dashboard then appears.

The contents of this dashboard are as follows:

  • From the Disk Usage Summary section of the Disk Subsystem dashboard, you not only understand how many disk partitions are operating on the VM, but also receive real-time updates on the space usage per partition; this way, you can quickly isolate those partitions that are currently running out of space.
  • In addition to the usage summary, graphs on usage and disk activity are also available in the dashboard. The measure graphs in the Disk Subsystem dashboard depict the time-of-day variations in disk usage and disk activity over a default period of 1 hour. Using these graphs, the following can be ascertained:
    • How has disk space been utilized across the virtual disks over the last 1 hour? Is any disk exhibiting excessive usage trends? If so, which disk partition is it, and when during the last hour did this trend begin?
    • How busy were the virtual disks during the last hour? Which disk was the busiest?
    • Are all virtual disks able to process requests quickly, or are too many I/O requests currently enqueued for a disk? If so, which disk is it?
    • Was any sudden/steady increase noticed in the length of the I/O request queue of a disk during the past hour?
  • However, analysis of an hour's data might not be enough at all times. For instance, the free space on a virtual disk might have decreased gradually, but steadily over a time span of say 1 week. This anomaly might not be evident in a 1-hour graph. You need to plot a graph for a week to figure this out. Therefore, to facilitate usage analysis over longer time windows, eG Enterprise allows you to change the timeline of the measure graphs. For this, simply click on the measure graph of interest to you to enlarge it. In the magnified mode, you not only get to alter the graph Timeline, but can also study the graph more clearly.
  • If the default measure graphs reveal sporadic spikes or a consistent rise in disk activity during the last hour, then it is imperative that you determine the reason. One of the key reasons for an inexplicable rise in disk activity is the execution of one or more I/O-intensive processes on the VM during the default period (i.e., 1 hour). To know what these processes are and to identify the most I/O-intensive process, refer to the Disk activity - Top processes table or the Disk busy by top processes graph in the Disk Subsystem dashboard. The table lists those processes on the VM that have, on average, generated high I/O activity on the virtual disks over the last 1 hour (by default). In addition, the table also indicates the minimum and maximum rate of disk activity generated by each of the top processes during the same 1 hour period. This not only helps you accurately identify the most I/O-intensive process on the VM, but also enables you to ascertain whether disk activity has varied dramatically or marginally during the last hour.
  • The Disk busy by top processes graph visually represents the Avg disk activity data available in the Disk activity - Top processes table, and thus aids you in quickly identifying the process that has generated the maximum disk I/O. To view this graph clearly, click on it to enlarge it.
  • You can change the Timeline of the enlarged graph. To view the detailed diagnosis information that has been plotted in the graph, click on the icon in the enlarged Disk busy by top processes graph. The detailed diagnosis page then appears, providing the detailed measures of disk activity for the time period chosen in the enlarged graph.
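
The point made above about a week-long graph, that a gradual but steady fall in free space may be invisible in a 1-hour window, can be illustrated with a simple least-squares trend line in Python. This is a sketch under stated assumptions rather than anything eG Enterprise is documented to compute: the functions `free_space_trend` and `hours_until_full`, the linear-decline assumption, and the sample data are all hypothetical.

```python
def free_space_trend(samples):
    """Least-squares slope of free space over time.

    samples: list of (hours_since_start, free_gb) pairs.
    Returns the slope in GB per hour (negative when space is shrinking).
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_f = sum(f for _, f in samples) / n
    num = sum((t - mean_t) * (f - mean_f) for t, f in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must cover more than one point in time")
    return num / den


def hours_until_full(samples):
    """Estimate hours until free space reaches zero, assuming the linear
    trend continues. Returns None if free space is not shrinking."""
    slope = free_space_trend(samples)
    if slope >= 0:
        return None
    latest_free = samples[-1][1]
    return latest_free / -slope


# Made-up data: one sample per day for a week, losing 2 GB per day.
week = [(day * 24, 50.0 - 2.0 * day) for day in range(8)]
print(f"trend: {free_space_trend(week):.3f} GB/hour")
print(f"estimated hours until full: {hours_until_full(week):.0f}")
```

Over any single hour in this data the change is under 0.1 GB, which is why the decline only becomes evident once the Timeline is extended to a week.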

Uptime

If Uptime is chosen as the Subsystem, the resulting dashboard enables you to promptly detect unscheduled reboots and determine how long the VM has been down.

The contents of this dashboard are as follows:

  • The System Uptime section indicates the total time for which the VM has been up since its last reboot.
  • The measure graphs that appear below indicate how long the system has been up during the last hour (by default), whether the reboots scheduled for the system occurred during the last hour, and how long the system was up during the last measurement period. Using these measure graphs, you can accurately detect reboot failures and prolonged VM downtime.
  • To alter the timeline of a graph, click on it to enlarge it, and then change the Timeline in the enlarged mode.
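
One common way to spot reboots from uptime data is to look for a drop in the uptime counter between two successive polls. The Python sketch below illustrates that idea only; it is not a description of how eG Enterprise detects reboots. The function `detect_reboots`, the fixed polling interval, and the downtime approximation are assumptions made for the example.

```python
def detect_reboots(uptime_samples, interval_secs):
    """Find reboots in a series of uptime readings.

    uptime_samples: uptime in seconds at successive polls, oldest first.
    interval_secs: assumed fixed time between polls.
    A reboot is assumed whenever uptime decreases between two polls.
    Returns a list of (sample_index, estimated_downtime_secs).
    """
    reboots = []
    for i in range(1, len(uptime_samples)):
        prev, curr = uptime_samples[i - 1], uptime_samples[i]
        if curr < prev:
            # The VM has been up for `curr` seconds of this interval; the
            # remainder is treated as downtime (a rough approximation).
            downtime = max(interval_secs - curr, 0)
            reboots.append((i, downtime))
    return reboots


# Made-up readings taken every 300 seconds; the counter resets before poll 3.
polls = [3000, 3300, 3600, 120, 420]
print(detect_reboots(polls, 300))
```

Here the uptime falls from 3600 to 120 seconds at the fourth poll, so one reboot is reported, with roughly 180 seconds of the 300-second interval counted as downtime.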