Measures reported by RedisReplicaTest
At the base of Redis replication is a leader-follower (master-slave) model that is simple to use and configure: it allows replica Redis instances to be exact copies of master instances. The replica automatically reconnects to the master every time the link breaks, and attempts to be an exact copy of it regardless of what happens to the master.
This system works using three main mechanisms:
When a master and a replica instance are well-connected, the master keeps the replica updated by sending it a stream of commands. This stream replicates the effects of everything that changes the dataset on the master side: client writes, keys expiring or being evicted, and any other action that modifies the master dataset.
When the link between the master and the replica breaks, because of network issues or because a timeout is detected by the master or the replica, the replica reconnects and attempts a partial resynchronization: it tries to obtain only the part of the command stream that it missed during the disconnection.
When a partial resynchronization is not possible, the replica will ask for a full resynchronization. This will involve a more complex process in which the master needs to create a snapshot of all its data, send it to the replica, and then continue sending the stream of commands as the dataset changes.
Every Redis master has a replication ID: a large pseudo-random string that marks a given history of the dataset. Each master also keeps an offset that increments for every byte of replication stream that it produces to be sent to replicas, in order to update the replicas with the new changes modifying the dataset. The replication offset is incremented even if no replica is actually connected, so every pair of replication ID and offset identifies an exact version of the dataset of a master.
When the master-replica link goes down for some reason, replicas automatically reconnect with their masters and typically continue the replication process without requiring a full resynchronization. This works because the master keeps an in-memory backlog of the replication stream. When replicas connect to masters, they use the PSYNC command to send their old master replication ID and the offset they have processed so far. If the master's replication ID is still the same, and the specified offset is available in the replication backlog on the master side, then replication resumes from the point where it left off. However, if there is not enough backlog in the master buffers, or if the replica refers to a history (replication ID) that is no longer known, then a full resynchronization happens: in this case the replica gets a full copy of the dataset, from scratch.
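The decision between partial and full resynchronization described above can be sketched as a small function. This is a simplified illustration, not Redis source code: the `MasterState` class, its field names, and the `resync_mode` function are hypothetical, and the model tracks only a single replication ID and the byte range currently held in the backlog.

```python
from dataclasses import dataclass


@dataclass
class MasterState:
    """Toy model (hypothetical) of the master-side state consulted on PSYNC."""
    replication_id: str        # current replication ID of the master
    backlog_first_offset: int  # offset of the oldest byte still in the backlog
    backlog_histlen: int       # number of bytes currently held in the backlog


def resync_mode(master: MasterState, replica_repl_id: str, replica_offset: int) -> str:
    """Return "partial" or "full" for a replica that has processed replica_offset bytes."""
    # A different replication ID means the replica followed a different history.
    if replica_repl_id != master.replication_id:
        return "full"

    # The replica needs the byte right after the last one it processed.
    next_needed = replica_offset + 1
    backlog_end = master.backlog_first_offset + master.backlog_histlen

    # Partial resync is possible only if that byte is still in the backlog
    # (or the replica is already fully caught up).
    if master.backlog_first_offset <= next_needed <= backlog_end:
        return "partial"
    return "full"
```

For example, with a backlog that starts at offset 1001 and holds 500 bytes, a replica that stopped at offset 1200 can resume partially, while one that stopped at offset 500 has fallen out of the backlog and needs a full copy.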
In a full master-slave synchronization, the master starts a background saving process in order to produce an RDB file. At the same time it starts to buffer all new write commands received from the clients. When the background saving is complete, the master transfers the database file to the replica, which saves it on disk, and then loads it into memory. The master will then send all buffered commands to the replica. This is done as a stream of commands.
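The sequence of steps in a full synchronization can be illustrated with a toy simulation. The sketch below assumes the dataset is a plain dictionary and every write is a simple key/value assignment; the function name `full_resync` and its parameters are hypothetical and stand in for the RDB snapshot and command stream described above.

```python
def full_resync(master_data: dict, writes_during_save: list) -> dict:
    """Simulate a full resynchronization and return the replica's final dataset."""
    # 1. Background save: the snapshot captures the dataset at this instant.
    snapshot = dict(master_data)

    # 2. While the save runs, the master keeps applying and buffering new writes.
    buffered = []
    for key, value in writes_during_save:
        master_data[key] = value
        buffered.append((key, value))

    # 3. The replica receives the snapshot and loads it into memory.
    replica_data = dict(snapshot)

    # 4. The master streams the buffered commands; the replica replays them in order.
    for key, value in buffered:
        replica_data[key] = value

    return replica_data
```

After the buffered writes are replayed, the replica's dataset matches the master's, which is the point of buffering writes while the snapshot is being produced.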
The success of any replication system rests on how quickly slaves reconnect with the master when the link goes down, and how rapidly data is synchronized between the master and the slaves. If slaves take too long to reconnect with the master after losing contact, or if the replication backlog on the master side is not sized with enough memory to hold the replication stream, then the replication process will be sluggish. To avoid this, administrators should continuously monitor the steps in the replication process, proactively identify pain points, and promptly initiate measures to eliminate them, so that the datasets on the master and slave sides are in sync at all times. This is where the RedisReplicaTest helps!
This test first determines whether the target server is a master or a slave in the replication process. For a master, the test reports the number of slaves connected to that master and the master's replication offset. The test also monitors the usage of the master's replication backlog, and alerts administrators if the backlog is not sized commensurate with its usage. If the target server is a slave, then the test reports the details of the master to which the slave connects. The health of the master-slave link is periodically checked, and link failures (if any) are brought to the immediate attention of administrators. Alerts are also sent out if the slave has not reconnected with the master long after losing communication with it. With the help of these metrics, administrators can quickly spot anomalies in the replication process and initiate measures to resolve them.
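The measure names reported by this test coincide with fields in the output of the Redis `INFO replication` command. Whether the test itself collects its data through that command is an assumption; the sketch below only shows how such output could be parsed and summarized by role. The helper names and the sample values are hypothetical.

```python
def parse_replication_info(text: str) -> dict:
    """Parse 'key:value' lines into a dict, skipping blanks and '#' section headers."""
    fields = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def describe_role(fields: dict) -> str:
    """Summarize the fields that matter for the server's role."""
    role = fields.get("role", "unknown")
    if role == "master":
        slaves = fields.get("connected_slaves", "0")
        offset = fields.get("master_repl_offset", "?")
        return f"master: {slaves} connected slave(s), replication offset {offset}"
    if role == "slave":
        master = f"{fields.get('master_host', '?')}:{fields.get('master_port', '?')}"
        status = fields.get("master_link_status", "unknown")
        summary = f"slave of {master}, link {status}"
        if status == "down" and "master_link_down_since_seconds" in fields:
            summary += f" for {fields['master_link_down_since_seconds']}s"
        return summary
    return f"role: {role}"


# Illustrative sample only; the values are made up.
SAMPLE = """\
# Replication
role:slave
master_host:10.0.0.5
master_port:6379
master_link_status:down
master_link_down_since_seconds:42
"""

print(describe_role(parse_replication_info(SAMPLE)))
```

In practice the text would come from a live server rather than a hard-coded string; the parsing logic stays the same.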
Outputs of the test: One set of results for the target Redis server
The measures made by this test are as follows:
| Measurement |
Description |
Measurement Unit |
Interpretation |
| role |
Indicates the role of this server. |
|
The values that this measure can report and their corresponding numeric values are listed in the table below:
| Measure Value |
Numeric Value |
| Sentinel |
0 |
| Master |
1 |
| Slave |
2 |
Note:
This measure reports the Measure Values listed in the table above to indicate the role of the target server. The graph of this measure, however, represents the role using the numeric equivalents only.
|
| connected_slaves |
Indicates the number of slaves connected. |
Number |
Use the detailed diagnosis of this measure to know which slaves are connected to the master. |
| master_host |
Indicates the IP address/host name of the master. |
|
This measure will report a value only if the role measure reports the value Slave.
|
| master_port |
Indicates the port number of the master. |
Number |
This measure will report a value only if the role measure reports the value Slave. |
| master_link_status |
Indicates whether or not the slave is able to connect to the master. |
Number |
This measure will report a value only if the role measure reports the value Slave.
The values that this measure can report and their corresponding numeric values are listed in the table below:
| Measure Value |
Numeric Value |
| Down |
0 |
| Up |
1 |
Note:
This measure reports the Measure Values listed in the table above to indicate whether the master-slave link is up or down. The graph of this measure, however, represents the link status using the numeric equivalents only. |
| master_link_down_since_seconds |
Indicates how long it has been since the master link went down. |
Seconds |
This measure will report a value only if the link between the master and slave is down. Ideally, the value of this measure should be low. |
| master_last_io_seconds_ago |
Indicates how long it has been since this slave last contacted the master. |
Seconds |
This measure will report a value only if the role measure reports the value Slave.
Ideally, the value of this measure should be lower than the value of the master_link_down_since_seconds measure. |
| master_sync_in_progress |
Indicates whether or not this slave is syncing with the master. |
|
This measure will report a value only if the role measure reports the value Slave. The values that this measure can report and their corresponding numeric values are listed in the table below:
| Measure Value |
Numeric Value |
| Yes |
0 |
| No |
1 |
Note:
This measure reports the Measure Values listed in the table above to indicate whether the slave is syncing with the master. The graph of this measure, however, represents the sync status using the numeric equivalents only.
|
| master_sync_left_bytes |
Indicates the amount of data that is yet to be synchronized. |
MB |
The lower the value, the better the replication performance. This measure will be reported only if a SYNC operation is in progress. |
| master_repl_offset |
Indicates the master's replication offset. |
Number |
This measure will report a value only if the role measure reports the value Master. |
| slave_repl_offset |
Indicates the slave's replication offset. |
Number |
This measure will report a value only if the role measure reports the value Slave. The value of this measure should be lower than the value of the master_repl_offset measure for partial synchronization to occur. |
| repl_backlog_active |
Indicates whether or not the replication backlog is active. |
|
The values that this measure can report and their corresponding numeric values are listed in the table below:
| Measure Value |
Numeric Value |
| Yes |
0 |
| No |
1 |
Note:
This measure reports the Measure Values listed in the table above to indicate whether or not the replication backlog is active. The graph of this measure, however, represents the backlog status using the numeric equivalents only.
|
| repl_backlog_first_byte_offset |
Indicates the master offset of the replication backlog buffer. |
Number |
|
| repl_backlog_size |
Indicates the size of the replication backlog buffer. |
MB |
|
| repl_backlog_histlen |
Indicates the size in bytes of the data in the replication backlog buffer. |
Number |
If the value of this measure is close to the value of the repl_backlog_size measure, it implies that the backlog is fast running out of space to accommodate the replication stream. You may want to increase the size of the replication backlog to avoid this. |
|
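The comparison between repl_backlog_histlen and repl_backlog_size described in the table can be expressed as a usage percentage. The following is a minimal sketch, assuming both values are supplied in bytes (the table lists different units for the two measures, so a conversion may be needed first). The function names and the 90% threshold are hypothetical choices for illustration, not settings of the test.

```python
def backlog_usage_percent(histlen_bytes: int, backlog_size_bytes: int) -> float:
    """Return how much of the replication backlog is occupied, as a percentage."""
    if backlog_size_bytes <= 0:
        raise ValueError("backlog size must be positive")
    return 100.0 * histlen_bytes / backlog_size_bytes


def backlog_nearly_full(histlen_bytes: int, backlog_size_bytes: int,
                        threshold_percent: float = 90.0) -> bool:
    """Flag the backlog when usage reaches the (assumed) threshold."""
    return backlog_usage_percent(histlen_bytes, backlog_size_bytes) >= threshold_percent


# Example: a 1 MiB backlog that currently holds 512 KiB of replication stream.
print(backlog_usage_percent(512 * 1024, 1024 * 1024))  # 50.0
```

In Redis, the backlog size is controlled by the `repl-backlog-size` configuration directive, which is the setting to revisit if the backlog repeatedly runs close to full.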