eG Monitoring
 

Measures reported by UCSCsMemErrorTest

A DIMM, or dual in-line memory module, comprises a series of dynamic random-access memory integrated circuits. The Cisco UCS Manager may include multiple DIMMs that serve as the main source of memory for the blade servers hosted on the Cisco UCS Manager. The functioning of the blade servers depends heavily on the DIMMs, so when errors are detected on the DIMMs, the blade servers are the first to be affected. Errors on the DIMMs may occur for the following reasons:

  • Third-party DIMMs that are not certified by Cisco are in use;

  • The DIMM is not oriented correctly in the slot;

  • The DIMM is reported as unrecognized, inoperable, degraded, or overheating.

The memory errors encountered by the Cisco UCS Manager are classified as follows:

  • Correctable and Uncorrectable Errors

  • Detected and Undetected Errors

  • Hard and Soft Errors

If left unattended, these errors may cause some of the virtual servers hosted on the blade servers of the Cisco UCS Manager to fail and, in the worst case, may cause the blade servers themselves to fail. To avoid such failures, it is necessary to monitor the errors detected on the Cisco UCS Manager and rectify them before end users start complaining that the blade servers are inaccessible. The UCSCsMemErrorTest test helps in this regard.

This test continuously tracks the memory errors occurring in the DIMMs of the Cisco UCS Manager and reports the number of memory errors that occurred during various time slots. By regularly analyzing the metrics reported by this test, administrators can determine exactly when the error rate was high and troubleshoot memory issues more effectively.
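
To illustrate what a count per time slot means, the following is a minimal sketch in Python. It is not the way the test itself collects its data, which this page does not describe. The function `count_errors_per_window`, the `WINDOWS` mapping, and the sample timestamps are hypothetical; only the measure names and their durations come from the measures of this test.

```python
from datetime import datetime, timedelta

# Measure names and durations as listed for this test; the mapping itself
# is an illustrative assumption, not part of the product.
WINDOWS = {
    "Errors1min": timedelta(minutes=1),
    "Errors15mins": timedelta(minutes=15),
    "Errors1hour": timedelta(hours=1),
    "Errors1Day": timedelta(days=1),
    "Errors1week": timedelta(weeks=1),
    "Errors2weeks": timedelta(weeks=2),
}


def count_errors_per_window(error_times, now):
    """Count, for each window, the error timestamps that fall inside it."""
    return {
        name: sum(1 for t in error_times if now - span < t <= now)
        for name, span in WINDOWS.items()
    }


# Hypothetical error timestamps, expressed relative to a fixed "now".
now = datetime(2024, 1, 15, 12, 0, 0)
error_times = [
    now - timedelta(seconds=30),
    now - timedelta(minutes=10),
    now - timedelta(hours=5),
    now - timedelta(days=10),
]

counts = count_errors_per_window(error_times, now)
print(counts)
```

Each window includes all shorter windows, so an error counted in Errors1min is also counted in every longer slot. Comparing adjacent slots therefore shows roughly when the errors occurred.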

Outputs of the test: One set of results for the Cisco UCS Manager that is being monitored.

The measures made by this test are as follows:

Measurement | Description | Measurement Unit | Interpretation
Errors1min | Indicates the number of errors that occurred during the last minute. | Number | Ideally, the value of this measure should be zero. A gradual or sudden increase in the value of this measure is a cause for concern.
Errors15mins | Indicates the number of errors that occurred during the last 15 minutes. | Number |
Errors1hour | Indicates the number of errors that occurred during the last 1 hour. | Number |
Errors1Day | Indicates the number of errors that occurred during the last 1 day. | Number |
Errors1week | Indicates the number of errors that occurred during the last 1 week. | Number |
Errors2weeks | Indicates the number of errors that occurred during the last 2 weeks. | Number |
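
The interpretation given for Errors1min (zero is ideal; an increase is a cause for concern) can be sketched as a simple check over successive readings of a measure. This is only an assumption about how an administrator might apply that guidance. The function `assess_error_measure` and its return labels are hypothetical and are not part of the eG product.

```python
def assess_error_measure(samples):
    """Classify successive readings of one error measure (oldest first)."""
    if not samples:
        return "no data"
    latest = samples[-1]
    if latest == 0:
        # Ideal state according to the interpretation of the measure.
        return "ok"
    if len(samples) >= 2 and latest > samples[-2]:
        # The value rose since the previous reading: a cause for concern.
        return "increasing"
    # Errors are present but did not rise since the previous reading.
    return "non-zero"


print(assess_error_measure([0, 0, 0]))
print(assess_error_measure([0, 1, 3]))
print(assess_error_measure([2, 2]))
```

The sketch compares only the last two readings. A real alerting rule would likely use thresholds configured in the monitoring tool rather than a fixed comparison.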