eG Monitoring
 

Measures reported by AzrRecvrySrvTest

A Recovery Services vault is a storage entity in Azure that houses data. The data is typically copies of data, or configuration information for virtual machines (VMs), workloads, servers, or workstations. You can use Recovery Services vaults to hold backup data for various Azure services such as IaaS VMs (Linux or Windows) and Azure SQL databases. The vault also stores recovery points created over time and backup policies associated with protected virtual machines. Recovery Services vaults support System Center DPM, Windows Server, Azure Backup Server, and more.

If backup jobs keep failing or take too long to complete, then, when disaster strikes, a recent backup will not be available in the vault to enable seamless recovery. This can result in the loss of critical business/configuration information. Administrators should therefore keep a close watch on the progress of backup jobs, and rapidly detect job delays and failures. Proactive detection of potential job failures is also essential, as it can help avert such irredeemable data losses.

Azure Backup automatically handles storage for the vault. It is important to know the storage replication type set for the vault, and how much of the redundant storage space is consumed by the backed up data. This will help administrators assess the storage requirement of the backups. Without this usage insight, there could come a time when there is not enough space in Azure storage for backups. In such situations, there is bound to be significant data loss.

To avoid backup failures, latencies, and storage space contentions in a Recovery Services vault, administrators can periodically run the AzrRecvrySrvTest. This test monitors all the Recovery Services vaults configured for every resource group of a target Azure subscription. For each vault, the test monitors the status of that vault, and alerts administrators if any errors/abnormalities are noticed in the vault. Additionally, the test notifies administrators if backup/recovery jobs fail, and also if VMs/protected items in any vault are in a Critical/Warning state. The test also measures the storage space consumed by each vault in local and geo-redundant storage. In the process, the test points you to vaults that may be over-utilizing redundant storage. Finally, the test draws administrator attention to backup jobs with critical issues, so that administrators can quickly troubleshoot the issues and avert backup job failures.

Outputs of the test: One set of results for every Recovery Services vault configured for each resource group of the target subscription

The measures made by this test are as follows:

Measurement Description Measurement Unit Interpretation
Rec_status Indicates the current status of this recovery services vault.   The values reported by this measure and its numeric equivalents are mentioned in the table below:

Measure Value Numeric Value
Succeeded 1
Updating 2
Error 3


Note:

By default, this measure reports the Measure Values listed in the table above to indicate the current status of the recovery services vault. In the graph of this measure, however, the status is represented using the numeric equivalents only.

Use the detailed diagnosis of this measure to know the location, tier, and recovery service type of the vault.
Bkup_mngmt_srvr Indicates the number of backup management servers available in this vault. Number  
Bkup_itm Indicates the number of items backed up in this vault. Number  
IaaSVM_cnt Indicates the number of VMs in this vault. Number  
Prtctd_itm_crtcl Indicates the number of protected items in this vault that are in Critical state. Number These measures represent the replication health of protected items - i.e., items that are replication-enabled - in the vault. If an item is in the Critical state, it implies that one or more critical replication error symptoms have been detected in that item. These error symptoms are typically indicators that replication is stuck, or is not progressing as fast as the data change rate.

If an item is in the Warning state, it implies that one or more warning symptoms that might impact replication are detected in that item.

Ideally therefore, the value of these measures should be 0.
Prtctd_itm_wrng Indicates the number of protected items in this vault that are in Warning state. Number
IaaSVM_crtcl Indicates the number of VMs in this vault that are in Critical state. Number These measures represent the replication health of VMs in the vault.

If a VM is in the Critical state, it implies that one or more critical replication error symptoms have been detected in that VM. These error symptoms are typically indicators that replication is stuck, or is not progressing as fast as the data change rate.

If a VM is in the Warning state, it implies that one or more warning symptoms that might impact replication are detected in that VM.

Ideally therefore, the value of these measures should be 0.
IaaSVM_wrng Indicates the number of VMs in this vault that are in Warning state. Number
MAB_prtctd_itm Indicates the number of files and folders backed up to this vault. Number  
DPM_prtctd_itm Indicates the number of data protection managers registered with this vault. Number System Center Data Protection Manager (DPM) is a robust enterprise backup and recovery system that contributes to your BCDR strategy by facilitating the backup and recovery of enterprise data.

DPM running on a physical server or on-premises VM can back up data to a Recovery Services vault in Azure, in addition to disk and tape backup. You can also deploy DPM on an Azure VM, and back up data either to Azure disks attached to the VM or to a Recovery Services vault.
Azr_bkup_srvr_itm Indicates the number of backup servers in this vault. Number  
In_prgrs_jobs Indicates the number of backup jobs that are in progress in this vault. Number If the value of this measure grows consistently, it could imply that the vault is taking longer than usual to process backup jobs. This could warrant an investigation.
Faild_jobs_cnt Indicates the number of backup jobs in this vault that failed. Number Ideally, the value of this measure should be 0.
GRS_storg_usg Indicates the amount of space that has been used by this vault in geo-redundant storage in the cloud. MB Geo-redundant storage (GRS) copies your data synchronously three times within a single physical location in the primary region using LRS. It then copies your data asynchronously to a single physical location in a secondary region that is hundreds of miles away from the primary region.

Compare the value of this measure with that of the Cloud - LRS measure to know which type of redundant storage is excessively utilized by the vault.
LRS_storg_usg Indicates the amount of space that has been used by this vault in locally redundant storage in the cloud. MB Locally redundant storage (LRS) replicates your data three times within a single data center in the primary region.
Mngd_instns Indicates the number of managed instances in this vault. Number  
GRS_dedup_storg_usg Indicates the amount of data that has been deduplicated from the geo-redundant storage used by this vault. MB Data Deduplication, often called Dedup for short, is a feature that can help reduce the impact of redundant data on storage costs. When enabled, Data Deduplication optimizes free space on a volume by examining the data on the volume for duplicated portions. Duplicated portions of the volume's dataset are stored once and are (optionally) compressed for additional savings.

If the values of these measures are low, while the values of the Cloud - GRS and Cloud - LRS measures are consistently growing, it could mean that not enough data has been deduplicated.
LRS_dedup_storg_usg Indicates the amount of data that has been deduplicated from the locally-redundant storage used by this vault. MB
Usd_dsk_sz Indicates the amount of disk space used by the backup engine. MB A high value is indicative of excessive disk space usage by the backup engine.
Prtctd_itm Indicates the number of replicated items in this vault. Number  
Rcvry_pln Indicates the number of recovery plans in this vault. Number A recovery plan gathers machines into recovery groups for the purpose of failover. A recovery plan helps you to define a systematic recovery process, by creating small independent units that you can fail over. A unit typically represents an app in your environment.

A recovery plan defines how machines fail over, and the sequence in which they start after failover. Recovery plans can be used for both failover to and failback from Azure.
Srvr_helth_crtcl Indicates the number of unhealthy servers in this vault. Number Ideally, the value of this measure should be 0.
Dprctd_srvr Indicates the number of servers registered with this vault that have updates available. Number If this measure reports a non-zero value, it could mean that one or more servers in the vault are missing some important updates. In such a case, it would be wise to update the servers without delay, as outdated servers can cause backup/recovery failures.
Unspprtd_srvr Indicates the number of unsupported servers in this vault. Number  
Spprtd_srvr Indicates the number of supported servers in this vault. Number  
Events Indicates the number of events generated during recovery jobs in this vault. Number  
Faild_jobs Indicates the number of recovery jobs in this vault that failed. Number Ideally, the value of this measure should be 0.
InPrgrs_jobs Indicates the number of recovery jobs that are in progress in this vault. Number If the value of this measure grows consistently, it could imply that the vault is taking longer than usual to process recovery jobs. This could warrant an investigation.
Suspnd_jobs Indicates the number of recovery jobs in this vault that are waiting for input. Number  
Rgstrd_srvr_cnt Indicates the number of servers registered with this vault. Number  
Rcvry_srvc_prvdr_auth_typ Indicates the number of authentication types provided by this vault. Number  
Rplctng_prtctd_itm Indicates the number of protected items in this vault that are replicating currently. Number You perform a failover as part of your business continuity and disaster recovery (BCDR) strategy.

As a first step in your BCDR strategy, you replicate your on-premises items to Azure on an ongoing basis. Users access workloads and apps running on the on-premises sources.

If the need arises, for example if there's an outage on-premises, you fail the replicating items over to Azure.
Faild_ovr_prtctd_itm Indicates the number of items that were failed over to this vault. Number
TstFailOvr_applcbl_prtctd_itm Indicates the number of items in this vault that were failed over for test failover. Number You run a test failover to validate your replication and disaster recovery strategy, without any data loss or downtime. A test failover does not impact ongoing replication, or your production environment. You can run a test failover on a specific virtual machine (VM), or on a recovery plan containing multiple VMs.
HyperV_to_Azr_prtctd_itm Indicates the number of HyperV VMs replicated to this vault. Number  
VMM_to_Azr_prtctd_itm Indicates the number of VMM VMs replicated to this vault. Number  
Vmware_to_Azr_prtctd_itm Indicates the number of VMware VMs replicated to this vault. Number  
Azr_to_Azr_prtctd_itm Indicates the number of Azure VMs replicated to this vault. Number  
Crtcl Indicates the number of backup/recovery jobs in this vault that led to the generation of a Critical alert. Number In principle, any backup or recovery failure (scheduled or user-triggered) generates an alert that is shown as a Critical alert. Destructive operations, such as deleting a backup, are also shown as Critical alerts.

Ideally therefore, the value of this measure should be 0.
Wrng Indicates the number of backup/recovery jobs in this vault that led to the generation of a Warning alert. Number If a backup/recovery operation succeeds but with a few warnings, those warnings are listed as Warning alerts.

Ideally therefore, the value of this measure should be 0.
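
The interpretation guidance in the table above can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the eG product or any eG API: the measure names and the Rec_status numeric equivalents are taken from the table, but the dictionary-based sample format, the function names, and the rule that "growing consistently" means at least three strictly increasing readings are all assumptions made for the example.

```python
from typing import Dict, List, Sequence

# Numeric equivalents of Rec_status, as listed in the measures table.
REC_STATUS = {1: "Succeeded", 2: "Updating", 3: "Error"}

# Measures whose Interpretation says the value should ideally be 0.
ZERO_IDEAL = (
    "Prtctd_itm_crtcl", "Prtctd_itm_wrng", "IaaSVM_crtcl", "IaaSVM_wrng",
    "Faild_jobs_cnt", "Srvr_helth_crtcl", "Faild_jobs", "Crtcl", "Wrng",
)


def decode_status(value: int) -> str:
    """Translate the numeric value shown in graphs back to a status name."""
    return REC_STATUS.get(value, "Unknown")


def nonzero_measures(sample: Dict[str, float]) -> List[str]:
    """Return the zero-ideal measures that report a value above 0."""
    return [name for name in ZERO_IDEAL if sample.get(name, 0) > 0]


def grows_consistently(history: Sequence[float]) -> bool:
    """Assumed rule: at least 3 readings, each strictly above the previous."""
    return len(history) >= 3 and all(
        later > earlier for earlier, later in zip(history, history[1:])
    )


def heavier_storage(sample: Dict[str, float]) -> str:
    """Compare GRS_storg_usg with LRS_storg_usg (both reported in MB)."""
    grs = sample.get("GRS_storg_usg", 0)
    lrs = sample.get("LRS_storg_usg", 0)
    if grs == lrs:
        return "equal"
    return "GRS" if grs > lrs else "LRS"


# Hypothetical readings for one vault, invented purely for illustration.
sample = {
    "Rec_status": 3,
    "Faild_jobs_cnt": 2,
    "Crtcl": 0,
    "GRS_storg_usg": 5120,
    "LRS_storg_usg": 1024,
}
print(decode_status(sample["Rec_status"]))  # Error
print(nonzero_measures(sample))             # ['Faild_jobs_cnt']
print(grows_consistently([1, 3, 4, 7]))     # True (e.g. In_prgrs_jobs history)
print(heavier_storage(sample))              # GRS
```

A stricter or looser definition of consistent growth (for example, a moving average) may suit a given environment better; the three-reading rule here is just the simplest choice that matches the wording in the table.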