eG Monitoring
 

Measures reported by OraExaCelStsTest

Storage cells are configured on the network and are managed by the Oracle Exadata System Software CellCLI utility. Storage servers contain cell-based utilities and processes from Oracle Exadata System Software, including:

  • Cell Server (CELLSRV) - This is the primary component of the Oracle Exadata System Software running in the storage server, which provides the majority of the storage server services. CELLSRV services database requests for disk I/O and provides the advanced SQL offload capabilities.

  • Offload Server (CELLOFLSRV) - This is a helper process to the Cell Server that processes offload requests from a specific database version. These processes allow the storage server to respond to requests from multiple database versions residing on the same or multiple database servers.

  • Management Server (MS) - The primary interface to administer, manage, and query the status of the storage server. It works in cooperation with the Cell Control Command-Line Interface (CellCLI) and processes most of the commands from CellCLI.

  • Restart Server (RS) - Monitors the heartbeat with the MS and the CELLSRV processes, and restarts the servers if they fail to respond within the allowable heartbeat period.

If any of the cell-based utilities is unavailable, offline, or stopped, the storage cell may slow down, resulting in poor I/O processing. A sudden hardware failure or a rise in the temperature of the storage cell may also cause the cell to malfunction. To avoid such problems and to keep the storage cell functioning at peak efficiency, administrators need to monitor its performance continuously. This is where the OraExaCelStsTest test helps!

This test monitors the status of the storage cell. It also monitors the cell-based utilities of the storage cell and reports those that are offline or stopped. Failures of hardware components (power supply, fan) are proactively detected and reported. The physical memory and CPU utilization of the cell server and management server help administrators identify the server that is consuming too many resources.
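This page does not say how the test collects the status values. As a rough illustration only, the sketch below assumes the values are available as "attribute: value" lines, similar to what the CellCLI command LIST CELL DETAIL prints, and picks out the cell services that are not running. The function names and the sample text are hypothetical and are not part of the eG agent.

```python
def parse_cell_detail(text):
    """Parse 'attribute: value' lines into a dict (assumed input format)."""
    attrs = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as "3:45" stay intact.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            attrs[key.strip()] = value.strip()
    return attrs


def stopped_services(attrs):
    """Return the names of the cell services whose status is not 'running'."""
    services = {"cellsrvStatus": "CELLSRV", "msStatus": "MS", "rsStatus": "RS"}
    return [name for attr, name in services.items()
            if attrs.get(attr, "").lower() != "running"]


# Illustrative sample only; not captured from a real storage cell.
SAMPLE = """\
         name:                   cell01
         status:                 online
         fanStatus:              normal
         upTime:                 12 days, 3:45
         cellsrvStatus:          running
         msStatus:               stopped
         rsStatus:               running
"""

attrs = parse_cell_detail(SAMPLE)
print(stopped_services(attrs))  # ['MS']
```

A missing attribute is treated as "not running" here, which errs on the side of raising an alert; a real collector might prefer to report it as unknown.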

Outputs of the test: One set of results for the target Oracle Exadata Storage Server that is being monitored.

The measures made by this test are as follows:

Measurement Description Measurement Unit Interpretation
cellStatus Indicates the current status of the storage cell or target storage server.  

The table below indicates the values that this measure can report and their corresponding numeric equivalents:

Measure value Numeric Value
Offline 0
Online 100

Note:

By default, this measure reports the above-mentioned Measure Values while indicating the current status of the storage cell. However, in the graph of this measure, the status of the storage cell will be represented using the corresponding numeric equivalents only - i.e., 0 or 100.
fanStatus Indicates the current status of the fan operating in the storage cell.  

The table below indicates the values that this measure can report and their corresponding numeric equivalents:

Measure value Numeric Value
Normal 100
Warning 90
Critical 50

Note:

By default, this measure reports the above-mentioned Measure Values while indicating the current status of the fan. However, in the graph of this measure, the status of the fan will be represented using the corresponding numeric equivalents mentioned in the table above.
temperatureStatus Indicates the current temperature status of the storage cell.  

The table below indicates the values that this measure can report and their corresponding numeric equivalents:

Measure value Numeric Value
Normal 100
Warning 90
Critical 50

Note:

By default, this measure reports the above-mentioned Measure Values while indicating the current temperature status of the storage cell. However, in the graph of this measure, the temperature status will be represented using the corresponding numeric equivalents mentioned in the table above.
powerStatus Indicates the current power status of the storage cell.  

The table below indicates the values that this measure can report and their corresponding numeric equivalents:

Measure value Numeric Value
Normal 100
Warning 90
Critical 50

Note:

By default, this measure reports the above-mentioned Measure Values while indicating the current power status of the storage cell. However, in the graph of this measure, the power status will be represented using the corresponding numeric equivalents mentioned in the table above.
statusOfCS Indicates the current status of the cell server in the storage cell.  

The table below indicates the values that this measure can report and their corresponding numeric equivalents:

Measure value Numeric Value
Stopped 0
Running 100

Note:

By default, this measure reports the above-mentioned Measure Values while indicating the current status of the cell server. However, in the graph of this measure, the status of the cell server will be represented using the corresponding numeric equivalents mentioned in the table above.
statusOfMS Indicates the current status of the management server in the storage cell.  

The table below indicates the values that this measure can report and their corresponding numeric equivalents:

Measure value Numeric Value
Stopped 0
Running 100

Note:

By default, this measure reports the above-mentioned Measure Values while indicating the current status of the management server. However, in the graph of this measure, the status of the management server will be represented using the corresponding numeric equivalents mentioned in the table above.
statusOfRS Indicates the current status of the restart server in the storage cell.

The table below indicates the values that this measure can report and their corresponding numeric equivalents:

Measure value Numeric Value
Stopped 0
Running 100

Note:

By default, this measure reports the above-mentioned Measure Values while indicating the current status of the restart server. However, in the graph of this measure, the status of the restart server will be represented using the corresponding numeric equivalents mentioned in the table above.
locatorLEDStatus Indicates the current status of the Locator LED.  

The table below indicates the values that this measure can report and their corresponding numeric equivalents:

Measure value Numeric Value
Off 0
On 100

Note:

By default, this measure reports the above-mentioned Measure Values while indicating the current status of the Locator LED. However, in the graph of this measure, the status of the Locator LED will be represented using the corresponding numeric equivalents mentioned in the table above.
uptime Indicates the total time that the storage cell has been up since its last reboot. Mins

Administrators may wish to be alerted if a storage cell has been running without a reboot for a very long period. Setting a threshold for this metric allows administrators to detect such conditions.

uptimeSinceLastMeasure The time period that the storage cell has been up since the last time this test ran. Secs

If the storage cell has not been rebooted during the last measurement period and the agent has been running continuously, this value will be equal to the measurement period. If the storage cell was rebooted during the last measurement period, this value will be less than the measurement period of the test. For example, if the measurement period is 300 seconds and the storage cell was rebooted 120 seconds ago, this metric will report a value of 120 seconds. The accuracy of this metric depends on the measurement period: the smaller the measurement period, the greater the accuracy.
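The arithmetic described for uptimeSinceLastMeasure can be sketched as follows. This is a minimal illustration, assuming the agent has been running continuously and that the cell's uptime is known in seconds; the function name is hypothetical and does not come from the eG agent.

```python
def uptime_since_last_measure(uptime_secs, period_secs):
    """Seconds the cell has been up within the last measurement period.

    If the cell has been up longer than the period, the whole period counts;
    otherwise only the time since the reboot counts.
    """
    return min(uptime_secs, period_secs)


# Rebooted 120 seconds ago, measurement period of 300 seconds.
print(uptime_since_last_measure(120, 300))    # 120
# Up for a full day, measurement period of 300 seconds.
print(uptime_since_last_measure(86400, 300))  # 300
```

A reported value smaller than the measurement period therefore points to a reboot within that period.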

isRestarted Indicates whether or not the storage cell was restarted.

The table below indicates the values that this measure can report and their corresponding numeric equivalents:

Measure value Numeric Value
No 0
Yes 1

Note:

By default, this measure reports the above-mentioned Measure Values while indicating whether or not the storage cell was restarted. However, in the graph of this measure, the restart status will be represented using the corresponding numeric equivalents only.
batteryChargeOnDC Indicates the percentage of battery charge on the disk controller. Percent

A sudden or gradual decrease in the value of this measure indicates that the battery of the disk controller is draining and may need to be recharged or replaced.

temperatureOnDC Indicates the current temperature of the disk controller. Celsius

The temperature of the disk controller should always remain within the admissible range. A sudden or gradual increase in temperature can overheat the disk controller and eventually cause the storage server to malfunction.

cellTemperature Indicates the current temperature of the storage cell. Celsius

Ideally, the value of this measure should be within the admissible range. A sudden or gradual increase in the value of this measure can overheat the storage cell and eventually cause it to malfunction.

physicalMemUtil Indicates the overall percentage of physical memory utilized by the storage cell. Percent

A value close to 100 is a cause of concern and warrants further investigation.

physicalMemUtilByCS Indicates the percentage of physical memory utilized by the cell server. Percent

A high value for this measure indicates that the cell server is consuming too much physical memory.

physicalMemUtilByMS Indicates the percentage of physical memory utilized by the management server. Percent

A high value for this measure indicates that the management server is consuming too much physical memory.

swapMemoryUsage Indicates the percentage of swap memory utilized by the storage cell. Percent

 

vMemoryUtilByCS Indicates the amount of virtual memory utilized by the cell server. GB

 

totalMemoryUtilByMS Indicates the total amount of memory utilized by the management server. GB

 

cpuUtil Indicates the percentage of CPU utilized by the storage cell. Percent

A value close to 100 is a cause of concern.

cpuUtilByCS Indicates the percentage of CPU utilized by the cell server. Percent

A high value for this measure indicates that the cell server is consuming too many CPU resources.

cpuUtilByMS Indicates the percentage of CPU utilized by the management server. Percent

A high value for this measure indicates that the management server is consuming too many CPU resources.
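The status measures listed above report text values, while their graphs use numeric equivalents. The sketch below collects those pairs in one lookup table so a graphed number can be translated back to its Measure Value. The values are copied from the tables on this page; the dictionary and function names are hypothetical and are not part of the product.

```python
# Three-level health scale shared by the fan, temperature, and power measures.
HEALTH = {"Normal": 100, "Warning": 90, "Critical": 50}
# Two-level scale shared by the cell, management, and restart server measures.
SERVICE = {"Stopped": 0, "Running": 100}

NUMERIC_EQUIVALENTS = {
    "cellStatus": {"Offline": 0, "Online": 100},
    "fanStatus": HEALTH,
    "temperatureStatus": HEALTH,
    "powerStatus": HEALTH,
    "statusOfCS": SERVICE,
    "statusOfMS": SERVICE,
    "statusOfRS": SERVICE,
    "locatorLEDStatus": {"Off": 0, "On": 100},
    "isRestarted": {"No": 0, "Yes": 1},
}


def numeric_value(measure, value):
    """Return the number graphed for a reported Measure Value."""
    return NUMERIC_EQUIVALENTS[measure][value]


def measure_value(measure, number):
    """Return the Measure Value that a graphed number stands for."""
    for label, equivalent in NUMERIC_EQUIVALENTS[measure].items():
        if equivalent == number:
            return label
    raise ValueError(f"{number} is not a valid value for {measure}")


print(numeric_value("fanStatus", "Warning"))  # 90
print(measure_value("statusOfMS", 0))         # Stopped
```

Note that isRestarted uses 0 and 1, unlike the other status measures, which use a 0 to 100 scale.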