eG Monitoring
 

The Inside View Network Dashboard

Though the Measures tab page of the System Dashboard provides all the metrics collected from within a particular VM, these metrics indicate only the current health of the VM, and help you identify only the issues currently affecting it. To obtain a true picture of the health of a VM, however, knowledge of its current status alone might not suffice; in addition, you would have to do the following:

  • Analyze the overall performance of a VM over time to deduce performance trends and past problems;
  • Closely observe the changes in the current and historical usage of each of the virtual resources (CPU/memory/disk), so as to deduce usage patterns and potential resource contentions;
  • Study the network traffic handled and bandwidth used by the VM over time, so that sudden/steady increases in traffic/bandwidth can be promptly detected;
  • Determine the uptime of a VM by observing how long the VM has been up and running and how frequently it was rebooted during a specific time window.

To enable you to perform this analysis, eG Enterprise offers dedicated System and Network dashboards for every guest operating system on an ESX server. The sections that follow will help you understand each of these dashboards clearly.

The Network Dashboard

The Network Dashboard provides you with an overview of the network and TCP traffic to and from a VM, and in the process reveals the following:
  • Is network load balanced across all the interfaces supported by the VM, or is any interface currently experiencing excessive incoming/outgoing traffic?
  • Which network interface is utilizing the maximum bandwidth and why?
  • Are too many TCP connections currently alive on the VM?
  • Were any connections dropped?
  • Did any TCP connection attempt fail?
  • Were there too many TCP retransmits?

Like the System dashboard, the contents of the Network dashboard are primarily governed by the choice of Subsystem. By default, Overview is chosen as the network Subsystem. If need be, you can choose the Network or Tcp Subsystems instead. The sections that follow focus on each of these Subsystems.
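
The first two questions in the list above (whether load is balanced across interfaces, and which interface uses the most bandwidth) boil down to comparing each interface's share of the total traffic. The following is only a minimal sketch of that comparison; the function names, the interface names, and the sample byte counts are all hypothetical and are not part of eG Enterprise.

```python
def traffic_share(bytes_per_interface):
    """Return each interface's share of the total traffic, as a percentage.

    bytes_per_interface: dict mapping an interface name to the traffic
    (incoming + outgoing) it handled during the measurement period.
    """
    total = sum(bytes_per_interface.values())
    if total == 0:
        # No traffic at all: every interface has a 0% share.
        return {name: 0.0 for name in bytes_per_interface}
    return {name: 100.0 * value / total
            for name, value in bytes_per_interface.items()}


def busiest_interface(bytes_per_interface):
    """Return the name of the interface that handled the most traffic."""
    return max(bytes_per_interface, key=bytes_per_interface.get)


# Hypothetical sample: traffic (in KB) per interface over the last hour.
sample = {"eth0": 750, "eth1": 200, "eth2": 50}
shares = traffic_share(sample)
busiest = busiest_interface(sample)
print(shares, busiest)
```

A share that is far above the others (eth0 in this sample) suggests that the load is not evenly balanced and that the interface deserves a closer look.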

Overview

In the Overview mode, the dashboard provides an all-round view of both the network and TCP health of the VM, enabling you to ascertain in a single glance:
    • The current status of the network and TCP connections to the VM;
    • Issues (if any) currently affecting network/TCP performance;
    • How network traffic, bandwidth usage, and overall TCP activity on the VM varied over the last 1 hour (by default), and what the peak points/bottleneck areas were.

The contents of the Overview dashboard are discussed below:

  • The Current Network Alerts section reveals the number and type of network-related issues currently affecting VM health. By moving your mouse pointer over a valid number against Distribution, you can view the details of the current network alarms of the corresponding severity (not shown in the Network Overview Dashboard page). Clicking on any alarm displayed in this section will lead you to the exact test and measure that reported the problem represented by the alarm.
  • Once back in the Overview dashboard, you can get a fair idea of how problem-prone the VM has been in the past by taking a look at the History of Events section (not shown in the Network Overview Dashboard page). For every alarm priority, this section indicates the number of network/TCP-related problems that the VM experienced during the last 24 hours. Alongside this chart, you will also find a table that reveals the average, maximum, and minimum duration for which a problem remained unresolved during the last 24 hours; from this information, you can infer how long your administrative staff took to resolve issues, and thus judge their efficiency better. For more details on the historical issues, click on a bar in the History of Events bar chart. The History of network events page will then appear, providing the complete event history.
  • Below the History of Events section is the Current Network Status pie chart. This pie chart is a good indicator of the current network health of the VM, as it reveals the percentage of network-related measures that are currently in an abnormal state. Clicking on a slice in this pie chart will lead you to the History of network events page again.
  • Similarly, the Current Layerwise Health bar graph indicates how healthy the network-related measures reported by the Virtual Servers layer of the VM currently are. If this bar chart and the Current Network Status pie chart reveal poor network health, you might want to know what is causing it; the Key Performance Indicators section reveals just that. By displaying the current state and the minimum, maximum, and average values of certain critical factors that influence network performance, the Key Performance Indicators section enables you to determine whether any of these factors could have contributed to the problem at hand. To know how any of these factors is currently performing, click on that key performance indicator in this section. A page detailing that key performance indicator will then appear.
  • For historically analyzing network/TCP health and for detecting exactly when and why bottlenecks surfaced, use the measure graphs provided by the dashboard. These graphs, by default, track the variations in network performance over the last 1 hour. In the event of a network issue with the VM, you can use these graphs to determine whether the issue occurred suddenly or is only the climax of a consistent deterioration in network performance.
  • Click on a graph to enlarge it. In the magnified mode, you can even change the Timeline of the graph, so as to obtain deeper insights into the past performance of the VM. In the case of graphs that plot values for multiple descriptors, you can pick a TOP-N or LAST-N option from the Show list and thus focus on the best/worst performers alone. You can even change the dimension of the graph from the default 3D to 2D.
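
To make the duration table of the History of Events section more concrete, the sketch below shows one way the average, maximum, and minimum problem durations over a 24-hour window could be derived from a list of alarm records. It is an illustration under stated assumptions, not eG Enterprise's actual computation: the record layout (start and end timestamps, with no end for a problem that is still open), the function name, and the sample data are all hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean


def duration_summary(events, now, window=timedelta(hours=24)):
    """Summarize problem durations (in minutes) within the given window.

    events: list of (start, end) datetime pairs; end is None if the
            problem is still unresolved (assumed record layout).
    now:    the reference time at which the summary is computed.
    """
    cutoff = now - window
    durations = []
    for start, end in events:
        end = end or now          # open problems are measured up to 'now'
        if end < cutoff:
            continue              # resolved before the window began
        durations.append((end - start).total_seconds() / 60.0)
    if not durations:
        return None
    return {
        "count": len(durations),
        "avg": mean(durations),
        "max": max(durations),
        "min": min(durations),
    }


# Hypothetical alarm history.
now = datetime(2024, 1, 2, 12, 0)
events = [
    (datetime(2024, 1, 2, 9, 0), datetime(2024, 1, 2, 9, 30)),     # 30 min
    (datetime(2024, 1, 2, 10, 0), datetime(2024, 1, 2, 11, 30)),   # 90 min
    (datetime(2024, 1, 2, 11, 0), None),                           # open, 60 min so far
    (datetime(2023, 12, 31, 8, 0), datetime(2023, 12, 31, 8, 10)), # outside window
]
summary = duration_summary(events, now)
print(summary)
```

A large gap between the maximum and the average duration usually points to a few long-running problems rather than uniformly slow resolution.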

Network

To focus on network performance, select Network as the Subsystem. The dashboard of the Network Subsystem will then appear, displaying time-of-day graphs that enable you to perform the following with ease:

  • Track the variations in network traffic to and from the VM over the default period of 1 hour;
  • Isolate network delays by observing the changes in the length of the output queue during the last 1 hour (by default);
  • Detect when during the last hour (by default) packet errors occurred and why.

You can click on any of the graphs here to enlarge it, and can even alter the Timeline of that graph in its magnified mode.

Tcp

Selecting Tcp as the Subsystem displays time-of-day graphs on the TCP connectivity of the chosen VM; these graphs, which are plotted for a default period of 1 hour, reveal the following:

  • How has the TCP load on the VM varied during the last 1 hour (by default)? Were there sporadic surges in load during the given period, and if so, when did these spikes occur? If not, was a more consistent increase in load noticed during the last hour?
  • Were too many TCP connections suddenly dropped from the listen queue? Could it be because of a TCP overload on the VM?
  • Did any TCP connections fail during the default period? If so, when?
  • Was the TCP retransmit ratio optimal during the last 1 hour?

You can click on any of the graphs here to enlarge it, and can even alter the Timeline of that graph in its magnified mode.
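
For readers unfamiliar with the retransmit ratio mentioned above, the sketch below shows a common way to derive it: the segments retransmitted during an interval, expressed as a percentage of the segments sent in the same interval. The exact formula eG Enterprise uses is not stated here, so this is an assumption; the counter names and sample values are hypothetical as well.

```python
def retransmit_ratio(prev, curr):
    """Return retransmitted segments as a percentage of segments sent.

    prev, curr: dicts holding cumulative TCP counters sampled at the
    start and end of the interval (hypothetical key names).
    """
    sent = curr["segments_sent"] - prev["segments_sent"]
    retransmitted = (curr["segments_retransmitted"]
                     - prev["segments_retransmitted"])
    if sent <= 0:
        # Nothing was sent in the interval (or the counters were reset).
        return 0.0
    return 100.0 * retransmitted / sent


# Hypothetical counter samples taken one measurement period apart.
previous = {"segments_sent": 10000, "segments_retransmitted": 100}
current = {"segments_sent": 12000, "segments_retransmitted": 150}
ratio = retransmit_ratio(previous, current)
print(ratio)
```

Working from counter differences rather than cumulative totals keeps the ratio specific to the interval being plotted, so a recent burst of retransmissions is not diluted by older traffic.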