Configuring Thresholds
Thresholds govern the state of a measure. A threshold is characterized by an upper and/or lower limit of performance for the chosen measure. Whenever the threshold is violated, the state of the corresponding measure becomes ‘abnormal’.
The CONFIGURE THRESHOLDS page appears when you click on a measure from the
- DEFAULT THRESHOLDS page
- SPECIFIC THRESHOLDS page
(or)
- AGENT - THRESHOLDS page
To configure thresholds, follow the steps below:
- Choosing a configuration type - Default/Specific
- Selecting a thresholding policy
- Configuring multiple thresholds
- Picking the alarm policy
Let us now discuss each of the steps mentioned above in detail.
- Choosing a threshold configuration type - Default/Specific
The first step towards configuring thresholds is to pick a threshold configuration type: default thresholds or specific thresholds. Selecting either the Default option or the Specific option from the Thresholds menu sequence enables the administrator to configure the thresholds.
To know more about the default thresholds, click here.
To know more about specific thresholds, click here.
To know more about group thresholds, click here.
- Selecting a threshold policy
The next step towards configuring the thresholds is to select an appropriate threshold policy. The state of a measurement is determined by the threshold settings that eG Enterprise applies to it. eG Enterprise supports the following threshold policies:
- Static thresholding
- Automatic thresholding
- Auto-Static thresholding
- None
- Static thresholding policy
For many metrics, thresholds can be set statically. For instance, based on the service level expectations and agreements, IT managers can set thresholds for metrics such as network availability, CPU usage, and latency. Application availability and response time can also be handled in the same manner. For example, availability should be 100% whenever the metric is measured. If not, a violation should be detected. Likewise, a network latency of several seconds is usually an indicator of a problem, no matter what time of day the measurement is made at.
To enable administrators to set static baselines for time-invariant measures such as the ones discussed above, the eG Enterprise system includes the static thresholding policy.
To illustrate how static thresholding works, consider the example of the CPU utilization of a host. The CPU utilization measure should never exceed a prescribed limit. Therefore, static threshold limits have to be explicitly defined for the CPU utilization measure. Clicking the measure graph depicts the static threshold values of the CPU utilization measure along with its actual values.
To set the static threshold for a chosen measure in the CONFIGURE THRESHOLDS page that appears, select the Specific Values option from the list box against the Static option and provide the threshold values of your choice. Refer to the sections below to know more about how to set the static thresholds.
- Automatic thresholding policy
In infrastructures where a metric varies with time, a static threshold value cannot serve as a reliable basis for judging performance. For example, consider a web server hosting a web site. The number of TCP connections to the web site could be rather high on a particular day and low on another. Similarly, it could be high during the working hours and low during the nights. In such situations where measurement values change with the time of the day, it is very difficult to set accurate maximum and minimum limits manually. In such cases, the threshold value for this metric also has to be time variant.
Even when a metric is not time variant, its value may change from one server to another. For example, a high-end datacenter server may be able to handle hundreds of users, whereas a low-end standard server may be able to handle only a few tens of users. In such cases too, it is extremely laborious and time consuming to determine what the normal values are for each and every server.
To handle such situations, eG Enterprise includes an automatic thresholding capability. eG Enterprise uses tried and tested statistical quality control techniques to analyze the past values of each metric and to automatically set its upper and lower bounds. In this approach, for example, the threshold values for a metric between 9am-10am tomorrow are based on the value of the metric for the same time period over the past days (the number of days to be looked at in the past is configurable).
Note:
In the MANAGER_SETTINGS section of the file <EG INSTALL DIR>/manager/config/eg_db.ini, a variable “ThresholdCheckPeriod” exists. The value of this variable defines how far back the manager will check for past history when computing automatic thresholds for a measurement. The default value of this variable is 14 days (i.e., 2 weeks). You can change this value, if required.
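As a rough illustration of the note above, the following Python sketch reads the look-back period from an INI-style text. The section and variable names come from the note; the exact layout of eg_db.ini, the sample text, and the helper name threshold_check_period are assumptions made for the example.

```python
import configparser

# Hypothetical excerpt. Only the section name and the variable name are taken
# from the note above; the file layout and value format are assumed.
SAMPLE_EG_DB_INI = """
[MANAGER_SETTINGS]
ThresholdCheckPeriod=14
"""


def threshold_check_period(ini_text: str, default_days: int = 14) -> int:
    """Return the look-back period (in days) used for automatic thresholds."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep the original case of option names
    parser.read_string(ini_text)
    # Fall back to the documented default of 14 days if the entry is missing.
    return parser.getint(
        "MANAGER_SETTINGS", "ThresholdCheckPeriod", fallback=default_days
    )


print(threshold_check_period(SAMPLE_EG_DB_INI))
```

Always take a backup of the real file before editing it by hand.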
With eG's auto-thresholding capability, like the metric value, the threshold also is time varying. Whenever a deviation from this auto baseline (threshold) is detected, an alert is triggered. Since the baseline is set automatically, using this technique ensures that administrators are informed of problems well before they become critical enough to impact the end user experience.
Automatic thresholding is ideal for time varying metrics such as number of requests to a web server, the workload on a database server, queue lengths of requests waiting for processing, etc.
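The idea of a time-varying baseline can be sketched as follows. This is not eG Enterprise's actual algorithm, which is described here only as "statistical quality control techniques". The sketch assumes a generic control-chart rule (mean plus or minus k standard deviations) applied to the values seen in the same hour on past days; the function name, the data layout and the factor k are all hypothetical.

```python
from statistics import mean, pstdev


def hourly_baseline(history, hour, k=3.0):
    """Compute (lower, upper) bounds for one hour of the day.

    history: list of per-day dicts mapping hour (0-23) to the measured value,
             e.g. one dict for each of the past 14 days.
    hour:    the hour of day for which bounds are wanted.
    k:       width of the band in standard deviations (an assumed parameter).
    """
    samples = [day[hour] for day in history if hour in day]
    if len(samples) < 2:
        return None  # not enough history to compute a baseline
    centre = mean(samples)
    spread = pstdev(samples)
    return (centre - k * spread, centre + k * spread)


# Connections seen between 9am and 10am on four past days (made-up values).
past_days = [{9: 100}, {9: 110}, {9: 90}, {9: 100}]
print(hourly_baseline(past_days, 9, k=2.0))
```

Because the bounds are computed per hour, a value that is normal at 9am can still be flagged at 3am, which is the behaviour described above for time-varying metrics.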
The measure graphs provided by eG Enterprise's monitor interface bring out the differences between static and automatic thresholding more clearly. The graph in Figure 8 depicts the threshold limits that were automatically assigned to the Current connections measure. Notice that the statistical data is very periodic and the threshold that is automatically computed by eG Enterprise follows the same pattern as the measurement values.
To set the automatic threshold for a chosen measure in the CONFIGURE THRESHOLDS page that appears, select the Specific Values option from the list box against the Automatic option and set the desired values by adjusting the sensitivity sliders. Refer to the sections below to know more about the computation of the automatic threshold values.
- Auto-static combination threshold policy
Automatic thresholds are ideal for metrics that are time variant. Often, the same metric may vary significantly from one server to another and from time to time. Consider a staging environment with a web server. Typically, there is no load on the web server and the automatic threshold is set accordingly. When someone logs in, the threshold will be breached and an alert may be raised by the system. This is a false alert because one user logging in does not signify a situation of interest to an IT manager. This scenario shows that while automatic thresholding reduces the effort involved in configuring the monitoring tool (because IT managers do not have to configure thresholds for every metric and server), it does not eliminate false alerts.
Therefore, eG Enterprise allows IT managers to use a combination of static and automatic thresholds. A static threshold applied along with an automatic threshold provides a realistic boundary that has to be crossed before an alert is to be triggered.
You can set the auto-static combination for a measure by picking the Specific Values option from the list box against both the Static and Automatic options in the CONFIGURE THRESHOLDS page.
- None
If the threshold policy for a measurement is None, the eG agent will stop tracking the state of this measurement (i.e., the agent will continue to collect values for this measurement but will not generate any alarms relating to it).
Using the above-mentioned thresholding policies, you can set either the default or specific thresholds for the chosen test.
- Specifying thresholds
The next step towards configuring the thresholds is to specify the appropriate threshold values in the CONFIGURE THRESHOLDS page using the threshold policies discussed in the previous section. eG Enterprise can generate alerts when the value of a measure is either too low or too high. To achieve this, eG Enterprise supports multiple thresholds, using which you can set a Minimum threshold, a Maximum threshold, or both for the measure.
Setting a Minimum/Maximum threshold involves providing the threshold values according to the thresholding policies mentioned in the section discussed above. You can choose any one of the configuration types i.e., Default/Specific and choose the test for which thresholds have to be configured based on the thresholding policy of your choice.
- Specifying the thresholds using the static thresholding policy
The eG Enterprise system offers three levels of thresholds that correspond to the three alarm priorities - Critical, Major, and Minor. The user can specify up to three maximum and/or three minimum thresholds in the Maximum Threshold and/or Minimum Threshold section of the CONFIGURE THRESHOLDS page. To specify the threshold values using the static thresholding policy, select Specific values from the drop down list against the Static option in the Maximum Threshold / Minimum Threshold sections. The Critical, Major and Minor text boxes will then appear, enabling you to provide the threshold values of your choice. While the maximum thresholds are to be provided in descending order, minimum thresholds have to be specified in ascending order.
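The ordering rule can be checked with a few lines of Python. This is only a sketch: the function name is hypothetical, and it assumes that the order is strict (no two levels share a value) and that levels left empty are simply skipped.

```python
def validate_threshold_order(critical, major, minor, kind):
    """Check the Critical/Major/Minor ordering for one threshold section.

    kind: "max" expects descending values (Critical > Major > Minor),
          "min" expects ascending values (Critical < Major < Minor).
    Levels passed as None are treated as not set and are ignored.
    """
    levels = [v for v in (critical, major, minor) if v is not None]
    pairs = list(zip(levels, levels[1:]))
    if kind == "max":
        return all(a > b for a, b in pairs)
    if kind == "min":
        return all(a < b for a, b in pairs)
    raise ValueError("kind must be 'max' or 'min'")


print(validate_threshold_order(99, 90, 80, "max"))  # descending maximums
print(validate_threshold_order(30, 50, 70, "min"))  # ascending minimums
```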
Note:
It is not mandatory to set both the maximum and minimum thresholds for all the measures of the chosen test. Depending upon the nature of the measure, you can decide which type of threshold is more suitable and apply it. For example, take the case of the Free physical memory measure. If the value of this measure falls below a specific value, then an alert must be generated indicating that free memory is low. In such situations, setting a minimum threshold would be more meaningful. Alternatively, if the CPU usage of a web server is to be monitored, a high value of this measure is a cause for concern. In this case, a maximum threshold is required to generate appropriate alerts. If you are monitoring the session-wise information of a server, then you can set both the maximum and minimum threshold limits.
Let us now discuss how to configure thresholds for the Disk busy measure of the Disk Activity test.
By default, the Disk busy measure of the Disk Activity test reports the percentage of elapsed time during which the disk is busy processing requests (i.e., reads or writes). The user can set a single maximum threshold of, say, 90 (in the Critical, Major or Minor text box), and expect to be alerted when the percentage of time the disk is busy crosses 90%. Alternatively, the user can set multiple maximum thresholds, thereby instructing the eG Enterprise system to send different types of alerts at various levels of disk usage - in other words, the user can instruct the eG Enterprise system to trigger a Minor alert if the percentage of time the disk is busy crosses 80%, a Major alert if it crosses 90%, and a Critical alert if it crosses 99%.
Suppose the maximum threshold has already been set for the Disk busy measure using the static thresholding policy. To override such thresholds, provide the values of your choice in the Critical/Major/Minor text boxes and click the Update button in this page. For example, to set the values to 90/85/80 instead of the existing 99/90/80, overwrite the text boxes with the new values. Since a minimum threshold is not required for the Disk busy measure, you can leave the Static drop down list in the Minimum Threshold section set to None.
Multiple levels of threshold settings allow proactive alarms to be generated when a metric is slightly out of conformance, and a severe alarm to be generated when the problem worsens. This provides ample opportunity to the user to identify and attack a problem early in its life cycle.
According to this specification, if the Maximum threshold of 90 is violated, then a Critical priority alarm will be generated. This is indicative of a critical issue with the host. Similarly, if the value of this measure crosses the Maximum threshold of 85, then a Major priority alarm will be generated. This is indicative of the existence of a major issue with the host. Likewise, a value beyond the Maximum threshold of 80 will result in a Minor priority alarm.
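The mapping from a measured value to an alarm priority in this example can be sketched as below. The function name is hypothetical, and the sketch assumes that "crossing" a maximum threshold means being strictly greater than it.

```python
def max_threshold_priority(value, critical=None, major=None, minor=None):
    """Return the priority of the highest maximum threshold that is violated.

    Levels left as None are treated as not configured. Returns None when the
    value does not violate any configured maximum threshold.
    """
    for name, limit in (("Critical", critical), ("Major", major), ("Minor", minor)):
        if limit is not None and value > limit:
            return name
    return None


# Disk busy thresholds of 90/85/80 from the example above.
for busy in (95, 87, 82, 75):
    print(busy, max_threshold_priority(busy, critical=90, major=85, minor=80))
```

Checking the Critical limit first ensures that a value above all three limits is reported at the most severe level only.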
Note:
If the maximum/minimum threshold for a measure is set to None, that threshold is not computed for the measure. If all the thresholds for a measure are set to None, no thresholds are computed for it. This is why, when you revisit the DEFAULT THRESHOLDS page to view or modify the threshold specifications of the measures, you will find that the measure appears in the Measures without threshold section.
- Specifying the thresholds using the Automatic thresholding policy
As discussed earlier, eG Enterprise uses tried and tested statistical quality control techniques to analyze the past values of each metric and automatically set its upper and lower bounds.
Even when thresholds are set automatically, an IT manager may want to choose a leniency factor for the thresholds. For example, an IT manager may want to allow for an additional 10% tolerance on the norm. To accommodate such requests, eG Enterprise allows administrators to set a value using the “sensitivity slider” for automatic thresholds. To automatically compute minimum/maximum thresholds for a measure, choose the Specific Values option from the drop down list against the Automatic option. The sensitivity slider will then appear, using which you can set a value for each alarm priority, i.e., a desired percentage of tolerance on the automatically computed threshold value.
The automatic threshold thus set is represented as a percentage of ‘auto’. For example, if you have set a 10% tolerance on the Minimum threshold of the Free physical memory measure to generate a Critical alert, then the value is represented as 10% of auto. This indicates that a 10% tolerance has been applied to the automatically computed threshold value.
To see why this is useful, consider the Free physical memory measure, which indicates the amount of free memory available on a server. Assume that on one of the managed servers, the free memory is known to decrease consistently and then grow back up (e.g., because the operating system frees memory periodically). In such a scenario, the free memory threshold will be violated often (since the value decreases consistently), and this will result in a number of false alerts. In such a situation, the eG administrator can set the threshold by moving the sensitivity slider to the desired value - for example, if the minimum threshold is set to 30, it implies that the administrator has introduced a 30% tolerance on the automatically computed threshold value. That is, alerts are generated only if the free memory is 30% lower than the normal value.
This capability allows administrators to fine-tune eG's automatic thresholding capability to suit their specific requirements.
As with static thresholds, multiple automatic threshold values can be set - one for each alarm priority. For example, say the administrators want to be alerted to the erosion of Free physical memory on a target server at various stages: a proactive minor alert if the free memory is 70% lower than normal, a major alert for a 50% reduction in free memory, and a critical alert for an alarming 30% depletion of the memory resources. To ensure this, you can move the “sensitivity slider” against each of the alarm priority labels (i.e., Critical, Major and Minor) to 30, 50 and 70 respectively in the Automatic option of the Minimum Threshold section.
When you click the Update button in the CONFIGURE THRESHOLDS page, these values will be reflected as 30% of auto, 50% of auto and 70% of auto respectively. The variable auto indicates that the threshold value is calculated from past history (by default, the past 14 days); a tolerance of, for example, 70% is then applied to the auto-computed value to generate a Minor alert.
- Specifying the thresholds using the Auto-static threshold combination policy
As explained in the previous section, an IT manager can configure a static maximum and an automatic maximum threshold for a metric. eG Enterprise compares the actual measurement value with the higher of the two maximum thresholds, and generates an alert only when the higher threshold is violated. In the example of the staging web server, the IT manager can set a static maximum of 100 requests in a measurement period (or a similar number representing a reasonable load). Once this is done, an alert will be generated only if the actual load exceeds 100 requests in a measurement period, even if the auto-computed threshold is less than 100. If the auto-computed threshold is greater than 100, this value is used as the actual threshold.
Consider the case of a Citrix MetaFrame server where you need to monitor the number of sessions that are active over a period of time. In such a case, automatic thresholds will be set by default. When a single session becomes active on the server, a false alert may be generated. To avoid such false alerts, providing a static threshold along with the automatic threshold would be more meaningful. In this case, provide the Static and Automatic threshold values for the Active Sessions measure in the CONFIGURE THRESHOLDS page.
Once you click on the Update button in this page, the thresholds for that measure will be updated as per your requirement. When a Maximum static and a Maximum automatic threshold are set at the same time, the threshold is displayed as max(static threshold value, automatic threshold value). In our example, the values are specified as max(15, 400% of auto) for a critical alert to be generated. Here, for an alert to be generated, the value of the measure must exceed the higher of the static threshold value and the automatic threshold value. Similarly, if a static minimum and an automatic minimum threshold are specified, eG Enterprise will generate alarms only when the current value falls below the lower of the two threshold settings.
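The combination rule described above (the higher of the two maximums, the lower of the two minimums) can be sketched as follows. The function names are hypothetical, and the automatic values are assumed to be already computed, including any sensitivity setting; how eG Enterprise derives them is outside this sketch.

```python
def effective_max(static_max=None, auto_max=None):
    """Higher of the static and automatic maximum thresholds (None if neither is set)."""
    values = [v for v in (static_max, auto_max) if v is not None]
    return max(values) if values else None


def effective_min(static_min=None, auto_min=None):
    """Lower of the static and automatic minimum thresholds (None if neither is set)."""
    values = [v for v in (static_min, auto_min) if v is not None]
    return min(values) if values else None


def is_violation(value, static_max=None, auto_max=None, static_min=None, auto_min=None):
    """True if the value exceeds the effective maximum or falls below the effective minimum."""
    upper = effective_max(static_max, auto_max)
    lower = effective_min(static_min, auto_min)
    return (upper is not None and value > upper) or (lower is not None and value < lower)


# Staging web server: static maximum of 100 requests, auto-computed maximum of 12.
print(effective_max(static_max=100, auto_max=12))     # the static value wins
print(is_violation(40, static_max=100, auto_max=12))  # 40 requests is not an alert
print(effective_max(static_max=100, auto_max=250))    # the auto value wins
```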
- Selecting an Alarm Policy
The final step to configure the thresholds for the measure is to select a suitable alarm policy. This alarm policy specification indicates when alarms should be generated by the eG manager. Just choose an alarm policy from the Policy list box of the Alarm Policy section in the CONFIGURE THRESHOLDS page before updating the thresholds for the chosen measure. The priority that will be assigned to such an alarm depends upon the threshold configuration and its corresponding alarm policy specification. By default, the following rules are applied when determining the alarm priority, if the number of violations in a time window matches the alarm policy specification (e.g., 4 threshold violations out of 6 consecutive measurements):
- If all violations are critical, then the alarm priority is critical
- If all violations are major, then the alarm priority is major
- If all the violations are minor, then the alarm priority is minor
- If the number of critical violations is greater than the number of major, and the number of critical violations is greater than the number of minor violations, then the alarm priority is critical
- If the number of major violations is greater than or equal to the number of critical violations, and the number of major violations is greater than the number of minor violations, then the alarm priority is major
- In all other cases, the alarm priority is minor
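The priority rules listed above can be expressed compactly in Python. This is a sketch with hypothetical function names; it assumes that an alarm policy of "4 out of 6" means at least 4 violations among the 6 most recent measurements, and that normal measurements are recorded as None.

```python
from collections import Counter


def alarm_priority(violations):
    """Apply the default rules to a list of 'critical'/'major'/'minor' violations."""
    if not violations:
        return None  # nothing was violated, so there is no alarm
    counts = Counter(violations)
    critical, major, minor = counts["critical"], counts["major"], counts["minor"]
    if critical > major and critical > minor:
        return "critical"
    if major >= critical and major > minor:
        return "major"
    return "minor"  # all other cases


def evaluate_window(states, required=4, window=6):
    """Return the alarm priority for the latest window, or None if no alarm is due.

    states: measurement states, oldest first; None marks a normal measurement.
    """
    recent = states[-window:]
    violations = [s for s in recent if s is not None]
    if len(violations) < required:
        return None
    return alarm_priority(violations)


print(evaluate_window(["critical", "critical", "major", None, "critical", None]))
print(evaluate_window([None, "minor", "minor", None, "minor"]))
```

The first three rules in the list (all critical, all major, all minor) fall out of the two count comparisons, so they need no separate branches.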
Using the above-discussed procedure, you can configure both the default and specific thresholds for the tests.
concept of "monitoring by exception". Unlike a majority of monitoring systems that tend to centralize the monitoring intelligence in a central manager(s), the eG architecture decentralizes the monitoring intelligence. Soon after it makes a measurement, the eG agent compares the measurement result with historical thresholds to determine if the measurement is within norms. If it is not, the agent reports the status to the appropriate manager(s). Thus the eG manager(s) does not have to compute the state of each and every measurement being made by the agents.
The Default/Specific option in the Thresholds sub-menu of the Alerts menu brings the user to this page, which enables the user to configure the threshold values. Here, the type of the component for which the administrator wants to configure the thresholds can be selected from the Component type list box. This list box lists only those component-types in the target environment that have been managed by the eG Enterprise system. The tests that map to the selected component type then appear. By default, only the Enabled tests pertaining to the selected component are listed here. The Enabled tests list is further split into Enabled Tests for <the_chosen_application> and Enabled Tests for Operating System. While all the active application-level tests will be displayed in the first section, the second section will display all the active host-level tests. To view the disabled tests, click on the Disabled tests link at the top of the list of enabled tests in this page. This will lead you straight to the list of disabled tests.
For each of the enabled tests, thresholds can be configured. To configure thresholds for a disabled test, first enable the test by clicking on the check box preceding the test name in this page, and then click on the Update button.
Default threshold values for the measurements pertaining to a particular test can be configured via the Default Thresholds button beside the test.
Moreover, if you configure Default Thresholds for an application-level test, then such thresholds will govern the state of that test for the chosen application-type alone - for instance, if you configure Default Thresholds for say, the IIS Application Pools test mapped to the IIS web server component-type, then, these thresholds will automatically apply to the IIS Application Pools test of all managed components of type IIS Web alone. On the contrary, if you override the Default Thresholds of an active host-level test, then these thresholds will automatically apply to all managed component-types to which that test is mapped. For instance, say you select IIS Web as the Component type from the threshold configuration page and proceed to configure Default Thresholds for the Disk Space test in the Enabled Tests for Operating System (i.e., active host-level tests) section of that page. Also, assume that you have additionally managed an Oracle Database and a Microsoft SQL server in your environment. Now, since the Disk Space test is mapped to the Oracle Database and Microsoft SQL components as well, the Default Thresholds that you set for the Disk Space test of the IIS Web server type, will automatically apply to the Disk Space tests of the Oracle Database and Microsoft SQL server types as well.
The Specific Thresholds button, on the other hand, allows the user to configure the thresholds for individual components.
Note:
While monitoring large environments, some tests executed by the eG agent report statistics on hundreds of descriptors. For example, the UserProfileTest reports the profile size of each and every Citrix or Terminal server user of a server. Likewise, the WinSvcStatusTest reports on the availability of each and every service of a Windows system. For such tests, storing the threshold values for each hour for each descriptor can result in significant disk space usage in the eG database. To enable administrators to optimize database usage for tests that do not use the automatic threshold computation (i.e., relative thresholding) capability, eG Enterprise offers the FIXED THRESHOLD CONFIGURATION page. Once one or more tests are set to use Fixed Thresholds using this page, the eG manager retrieves thresholds for such tests from their configuration files, and does not store any thresholds for these tests in the eG database. As a tradeoff, the thresholds for these tests apply to all the servers being managed, and cannot be set specifically for every server or for each descriptor. Hence, the Specific Thresholds button for these tests is disabled in the eG administrative interface. Moreover, when configuring the default threshold for these tests, the threshold policy has to be either "absolute" or "none".