eG Monitoring
 

Measures reported by NAUSDAggrTest

To support the differing security, backup, performance, and data sharing needs of your users, you group the physical data storage resources on your storage system into one or more aggregates. These aggregates provide storage to the volume or volumes that they contain. Each aggregate has its own RAID configuration, plex structure, and set of assigned disks or array LUNs.

You must periodically monitor the state, I/O activity, processing power, and space usage of each aggregate configured on your storage system, so that probable space contentions and I/O overloads are detected rapidly and failed, inconsistent, or busy aggregates are identified easily. In addition, to accurately pinpoint failed checksum storage, problematic RAID groups, or plex resynchronization issues in an aggregate, the key components of each aggregate, such as RAID groups, plex structures, and checksum disks, should also be monitored from time to time. The NetApp Aggregates test provides all these performance insights.

This test auto-discovers the aggregates configured on a storage system, and periodically reports the following:

  • What is the current state of each aggregate?
  • Which are the busy aggregates?
  • Is any aggregate running short of storage space?
  • Is I/O load uniformly distributed across all aggregates, or is any aggregate overloaded with read-write requests?
  • What is the current status of the checksum storage in each aggregate?
  • What is the current status of the plex structures in each aggregate?
  • Are the RAID groups in an aggregate in a normal state?
  • Did any aggregate experience issues during plex resynchronization?

The measures made by this test are as follows:

Measurement Description Measurement Unit Interpretation
Num_of_Aggr Indicates the number of busy aggregates in this storage system. Number This measure is applicable only to the Busy Aggregates descriptor.

The detailed diagnosis capability of this measure, if enabled, lists the name and the transfer rate of each busy aggregate, i.e., the rate at which the aggregate services data transfers.

State Indicates the current state of this aggregate.   The values that this measure can report and their corresponding numeric values are listed in the table below. A brief description of each Measure Value is also provided, where available:

Measure Value Numeric Value Description
Creating 1  
Online 2 Read and write access to volumes hosted on this aggregate is allowed.
Restricted 3 Some operations, such as parity reconstruction, are allowed, but data access is not allowed.
Iron Restricted 4 A WAFL consistency check is being performed on the aggregate.
Partial 5 At least one disk was found for the aggregate, but two or more disks are missing.
Offline 6 No access to the aggregate is allowed.
Failed 7  
Unknown 8  

Note:

By default, this measure reports the above-mentioned Measure Values while indicating the current status of this aggregate. However, the graph of this measure will be represented using the corresponding numeric equivalents i.e., 1 to 8.
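
Because the graph of this measure plots numeric equivalents, a small lookup can translate plotted values back into state names. The following is only an illustrative sketch: the dictionary mirrors the table above, while the helper name `describe_state` and its fallback text are assumptions made for this example and are not part of the eG product.

```python
# Numeric-to-label mapping copied from the State table above.
AGGREGATE_STATES = {
    1: "Creating",
    2: "Online",
    3: "Restricted",
    4: "Iron Restricted",
    5: "Partial",
    6: "Offline",
    7: "Failed",
    8: "Unknown",
}


def describe_state(numeric_value: int) -> str:
    """Return the state label for a value plotted in the State graph.

    Hypothetical helper: values outside 1..8 are not defined by the table,
    so they are reported as unrecognized rather than guessed.
    """
    return AGGREGATE_STATES.get(numeric_value, f"Unrecognized ({numeric_value})")


print(describe_state(2))  # prints Online
print(describe_state(6))  # prints Offline
```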

Is_in_consistent Indicates whether/not this aggregate is inconsistent.   One reason an aggregate is marked as inconsistent or corrupted is that the Lost write protection feature has detected an issue. Lost write protection is a feature of Data ONTAP that runs on each WAFL read. Data is checked against block checksum information (WAFL context) and RAID parity data. If an issue is detected, there are two possible outcomes:

  • The drive containing the data is failed.
  • The aggregate containing the data is marked inconsistent.

If an aggregate is marked inconsistent, WAFL iron must be used to return the aggregate to a consistent state.

This measure reports the value Yes if the aggregate is inconsistent and the value No if the aggregate is consistent. The numeric values that correspond to these values are detailed in the table below:

Measure Value Numeric Value
Yes 1
No 2

Note:

By default, this measure reports the above-mentioned Measure Values while indicating whether/not this aggregate is inconsistent. However, in the graph of this measure, the inconsistent state of an aggregate will be represented using the corresponding numeric equivalents i.e., 1 or 2.

Mirror_Status Indicates the current mirror status of this aggregate.   The values that this measure can report and their corresponding numeric values have been listed in the table below. A brief description for a few Measure Values is also provided:

Measure Value Numeric Value Description
Unmirrored 1 The aggregate is not mirrored. Unmirrored aggregates have only one plex (copy of their data), which contains all of the RAID groups belonging to that aggregate.
Mirrored 2 The aggregate is mirrored. Mirrored aggregates have two plexes (copies of their data), which use the SyncMirror functionality to duplicate the data to provide redundancy.
Mirror Resynchronizing 3 One of the mirrored aggregate's plexes is being resynchronized.
Un Initialized 4  
CP Count Check In Progress 5 A WAFL consistency check is in progress.
Needs CP Count Check 6 A WAFL consistency check needs to be performed on the aggregate.
Mirror Degraded 7 The aggregate is mirrored and one of its plexes is offline or resynchronizing.
Invalid 8 The aggregate contains no volumes and none can be added. Typically this happens only after an aborted aggr copy operation.
Failed 9  
Limbo 10  

Note:

By default, this measure reports the above-mentioned Measure Values while indicating the current mirror status of this aggregate in this storage system. However, in the graph of this measure, the mirror status will be represented using the corresponding numeric equivalents - i.e., 1 to 10.

Raid_status Indicates whether/not the RAID of this aggregate is currently in an abnormal state.   This measure reports the value Yes if the RAID of this aggregate is in an abnormal state and the value No if it is in a normal state. The numeric values that correspond to these values are detailed in the table below:

Measure Value Numeric Value
Yes 1
No 2

Note:

By default, this measure reports the above-mentioned Measure Values while indicating whether the RAID of this aggregate is in an abnormal state. However, the graph of this measure will be represented using the corresponding numeric equivalents i.e., 1 or 2.

Check_sum_status Indicates the current checksum status of this aggregate.   The values that this measure can report and their corresponding numeric values have been listed in the table below.

Measure Value Numeric Value
Active 1
Off 2
Reverting 3
None 4
Unknown 5
Initializing 6
Reinitializing 7
Reinitialized 8
Upgrading Phase1 9
Upgrading Phase2 10

Note:

By default, this measure reports the above-mentioned Measure Values while indicating the current checksum status of this aggregate in this storage system. However, the graph of this measure will be represented using the corresponding numeric equivalents i.e., 1 to 10.

Are_plexes_offline Indicates whether/not the plexes in this aggregate are currently offline.   A plex is a collection of one or more RAID groups that together provide the storage for one or more WAFL® file system volumes. Data ONTAP uses plexes as the unit of RAID-level mirroring when the SyncMirror® feature is enabled. All RAID groups in one plex are of the same level, but may have a different number of disks.

This measure reports the value Yes if the plexes in this aggregate are currently offline and the value No if the plexes are not offline. The numeric values that correspond to these values are detailed in the table below:

Measure Value Numeric Value
No 1
Yes 2

Note:

By default, this measure reports the above-mentioned Measure Values while indicating whether the plexes in this aggregate are currently offline. However, the graph of this measure will be represented using the corresponding numeric equivalents i.e., 1 or 2.

Are_Plexes_Resyncing Indicates whether/not the plexes of this aggregate are currently being resynchronized.   Plex resynchronization is a process that ensures two plexes of a mirrored aggregate have exactly the same data. When plexes are unsynchronized, one plex contains data that is more up to date than that of the other plex. Plex resynchronization updates the out-of-date plex so that both plexes are identical.

Data ONTAP resynchronizes the two plexes of a mirrored aggregate if one of the following situations occurs:

  • One of the plexes was taken offline and then brought online later.
  • You add a plex to an unmirrored aggregate.

This measure reports the value Yes if the plexes in this aggregate are currently resyncing and the value No if the plexes are not resyncing. The numeric values that correspond to the above-mentioned values are detailed in the table below:

Measure Value Numeric Value
No 1
Yes 2

Note:

By default, this measure reports the above-mentioned Measure Values while indicating whether the plexes in this aggregate are currently being resynchronized. However, the graph of this measure will be represented using the corresponding numeric equivalents i.e., 1 or 2.
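
Note that the Yes/No measures described so far do not share one numeric encoding: Is_in_consistent and Raid_status map Yes to 1, whereas Are_plexes_offline and Are_Plexes_Resyncing map Yes to 2. When reading graphed values, it may therefore help to decode them per measure. The following is a minimal sketch under that assumption; the encodings are copied from the tables above, while the helper name `decode_yes_no` is hypothetical and not part of the eG product.

```python
# Yes/No encodings copied from the tables in this section. The two plex
# measures use the opposite numbering from the first two measures.
YES_NO_ENCODINGS = {
    "Is_in_consistent": {1: "Yes", 2: "No"},
    "Raid_status": {1: "Yes", 2: "No"},
    "Are_plexes_offline": {1: "No", 2: "Yes"},
    "Are_Plexes_Resyncing": {1: "No", 2: "Yes"},
}


def decode_yes_no(measure: str, numeric_value: int) -> str:
    """Translate a graphed value into Yes or No for the given measure.

    Hypothetical helper: raises KeyError for a measure or value that the
    tables above do not define.
    """
    return YES_NO_ENCODINGS[measure][numeric_value]


print(decode_yes_no("Raid_status", 1))         # prints Yes
print(decode_yes_no("Are_plexes_offline", 1))  # prints No
```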

Total_size Indicates the total usable size of this aggregate. MB The size of this aggregate excludes the WAFL reserve and the aggregate snapshot reserve. This measure will report a value of 0 if the aggregate is restricted or offline.
Used_size Indicates the amount of space that is currently used in this aggregate. MB This measure will report a value of 0 if the aggregate is not usable, i.e., offline.
size_percentage_used Indicates the percentage of space that is currently used in this aggregate. Percent A value close to 100% is an indication of space constraint in the aggregate.
Total_files Indicates the total number of files in this aggregate. Number  
Used_files Indicates the total number of files that are currently stored in this aggregate. Number  
Total_transfers Indicates the rate at which the transfers are serviced by this aggregate. Ops/sec  
User_reads Indicates the rate at which user read requests are serviced by this aggregate. Ops/sec A consistent decrease in the value of this measure could indicate a bottleneck when processing read requests. Compare the value of this measure across aggregates to know which aggregates service read requests slowly.
User_writes Indicates the rate at which user write requests are serviced by this aggregate. Ops/sec A consistent decrease in the value of this measure could indicate a bottleneck when processing write requests. Compare the value of this measure across aggregates to know which aggregates service write requests slowly.
Cp_reads Indicates the rate at which user read requests are serviced during a Consistency Point (CP) operation in this aggregate. Ops/sec A consistent decrease in the value of this measure could indicate that CP operations are slowing down the processing of read requests.
User_read_blocks Indicates the rate at which blocks are read from this aggregate upon a user request. Ops/sec A consistent decrease in the value of this measure could indicate a bottleneck when processing read requests. Compare the value of this measure across aggregates to know which aggregates service block read requests slowly.
User_write_blocks Indicates the rate at which blocks are written to this aggregate upon a user request. Ops/sec A consistent decrease in the value of this measure could indicate a bottleneck when processing write requests. Compare the value of this measure across aggregates to know which aggregates service block write requests slowly.
Cp_read_blocks Indicates the rate at which blocks are read from this aggregate during a Consistency Point (CP) operation upon a user request. Ops/sec A consistent decrease in the value of this measure could indicate that CP operations are slowing down the processing of read requests.
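
Several of the rate measures above are interpreted by looking for a "consistent decrease" and by comparing values across aggregates. The test does not define how such a decrease should be detected, so the following is only a sketch of one possible interpretation: a strictly falling run of samples. The function name, the `min_samples` threshold, the aggregate names, and the sample values are all assumptions made for illustration.

```python
from typing import Sequence


def is_consistently_decreasing(samples: Sequence[float], min_samples: int = 4) -> bool:
    """Return True if every sample is lower than the one before it.

    `min_samples` is an assumed threshold; shorter series are treated as
    insufficient evidence of a trend.
    """
    if len(samples) < min_samples:
        return False
    return all(later < earlier for earlier, later in zip(samples, samples[1:]))


# Hypothetical User_reads samples (Ops/sec) for two aggregates, oldest first.
user_reads = {
    "aggr0": [420.0, 380.0, 310.0, 250.0],
    "aggr1": [400.0, 410.0, 395.0, 405.0],
}

# Compare across aggregates: keep only those whose rate keeps falling.
declining = [name for name, values in user_reads.items()
             if is_consistently_decreasing(values)]
print(declining)  # prints ['aggr0']
```

A strict comparison is deliberately conservative; a real deployment might instead tolerate small fluctuations or compare against a moving average.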