What Causes Replication Lag?
Replication lag occurs when the slaves (or secondaries) cannot keep up with the updates occurring on the master (or primary). Unapplied changes accumulate in the slaves' relay logs, and the version of the database on the slaves falls increasingly behind that of the master.
To work out what's causing the lag, you must determine which replication thread is getting backed up. Replication relies on three threads per master/slave connection: one is created on the master and two are created on the slave.
- The Slave I/O Thread. When you issue START SLAVE on a slave server, the slave creates this thread which connects to the master and requests a copy of the master's binary log.
- The Binlog Dump Thread. When the slave connects to the master, the master uses this thread to send the slave the contents of its binary log.
- The Slave SQL Thread. The slave creates this SQL (or applier) thread to read the events recorded in its relay log and apply them.
Replication lag is caused when either the I/O Thread or SQL Thread cannot cope with the demands placed upon it.
If the I/O thread is struggling, the network connection between the master and its slaves is probably too slow. Consider enabling slave_compressed_protocol to compress the replication traffic, or speak to your network administrator.
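As a minimal sketch (assuming a server version that still supports the slave_compressed_protocol variable, which is deprecated in recent releases), you could enable compression and restart the replication threads so that the I/O thread reconnects with the new setting:

slave> STOP SLAVE;
slave> SET GLOBAL slave_compressed_protocol = ON;
slave> START SLAVE;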
If it's the SQL thread then your problem is probably due to poorly-optimized queries that are taking the slave too long to apply. There may be long-running transactions or too much I/O activity. Having no primary key on the slave tables when using the ROW or MIXED replication format is also a common cause of lag on this thread: check that both your master and slave versions of tables have a primary key.
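To find candidate tables, you can run a query such as the following sketch against information_schema on both the master and the slave (the list of system schemas to exclude is an assumption; adjust it to your setup):

slave> SELECT t.table_schema, t.table_name
       FROM information_schema.tables t
       LEFT JOIN information_schema.table_constraints c
              ON c.table_schema = t.table_schema
             AND c.table_name = t.table_name
             AND c.constraint_type = 'PRIMARY KEY'
       WHERE t.table_type = 'BASE TABLE'
         AND t.table_schema NOT IN ('mysql', 'sys', 'information_schema',
                                    'performance_schema')
         AND c.constraint_name IS NULL;

Any row returned is a table with no primary key and therefore a candidate for slow row-based apply on the slave.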
But how do you know which thread is affected? This depends on whether you are using binary log file names and positions or GTIDs.
If you are using binary log file names and positions, then executing SHOW MASTER STATUS and SHOW SLAVE STATUS gives you all you need to work out which thread is causing the problems:
master> SHOW MASTER STATUS;
+---------------+-----------+
| File          | Position  |...
+---------------+-----------+
| binlog.000002 | 121752008 |...
+---------------+-----------+
slave> SHOW SLAVE STATUS\G
********************* 1. row *********************
Slave_IO_State: Waiting for master to send event
Master_Host: 127.0.0.1
Master_User: replication
Master_Port: 22808
Connect_Retry: 60
Master_Log_File: binlog.000002
Read_Master_Log_Pos: 121409852
Relay_Log_File: relaylog.000002
Relay_Log_Pos: 119819329
Relay_Master_Log_File: binlog.000002
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
...
Exec_Master_Log_Pos: 120003004
Relay_Log_Space: 121226377
You can calculate any I/O thread lag by subtracting the value of Read_Master_Log_Pos from Position.
Bytes behind master = Position - Read_Master_Log_Pos
= 121752008 - 121409852
= 342156 bytes
Note that this assumes the slave's I/O thread is reading the same binary log file that the master is currently writing to (Master_Log_File matches File). If the slave has fallen one or more files behind, the calculation must also account for the sizes of the intervening binary log files.
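For example, with purely hypothetical figures, suppose the master has rotated to binlog.000003 and SHOW MASTER STATUS reports Position 500000, while the slave is still reading binlog.000002, which was closed at 1073741824 bytes:
Bytes behind master = (size of binlog.000002 - Read_Master_Log_Pos) + Position
= (1073741824 - 121409852) + 500000
= 952831972 bytes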
The following calculation tells you if the slave SQL thread is lagging behind the replication master:
Bytes behind master = Position - Exec_Master_Log_Pos
= 121752008 - 120003004
= 1749004 bytes
To see how far the slave's SQL thread is lagging behind its own I/O thread (bytes received but not yet applied), use this calculation:
Bytes behind I/O thread = Read_Master_Log_Pos - Exec_Master_Log_Pos
= 121409852 - 120003004
= 1406848 bytes
The Seconds_Behind_Master column in the output of SHOW SLAVE STATUS also gives you a rough idea of the extent of the delay. Don't rely on it too heavily, though, because it doesn't account for long-running transactions or for operations performed on the master that are not required on the slave. Its value is best monitored over a period of time rather than read in isolation.
If your replication setup uses GTIDs, then you'll use the following information to diagnose replication lag:
- Executed GTIDs. Shows which transactions have been executed on the instance. Available on both master and slave. You can also read the full set from the global gtid_executed system variable.
- Retrieved GTIDs. Shows which transactions the I/O thread has received. Available only on the slave.
- Purged GTIDs. Shows which transactions have been purged from the binary log after completion. You're only interested in the transactions purged by the slave in this context. (The statements sketched after this list show where each of these values lives.)
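You can read each of these values directly on the servers; the Retrieved_Gtid_Set and Executed_Gtid_Set fields appear in the slave's SHOW SLAVE STATUS output, and gtid_executed and gtid_purged are global system variables on both servers:

master> SELECT @@global.gtid_executed;
slave> SHOW SLAVE STATUS\G
slave> SELECT @@global.gtid_executed, @@global.gtid_purged;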
Once you have the GTIDs you can use the GTID_SUBTRACT() function to calculate the difference in the slave's view of the world from that of the master. For example, the following query on the slave shows the GTIDs read from the binary log that have not yet been applied (SQL thread lag):
slave> SELECT GTID_SUBTRACT('96985d6f-2ebc-11e7-84df-08002715584a:5-133',
'96985d6f-2ebc-11e7-84df-08002715584a:26-132') AS MissingGTIDs;
+-----------------------------------------------+
| MissingGTIDs                                  |
+-----------------------------------------------+
| 96985d6f-2ebc-11e7-84df-08002715584a:5-25:133 |
+-----------------------------------------------+
1 row in set (0.00 sec)
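Rather than pasting in literal GTID sets, you can feed GTID_SUBTRACT() the live values. This is a sketch assuming MySQL 5.7 or later, where performance_schema.replication_connection_status exposes the received_transaction_set column (one row per replication channel):

slave> SELECT GTID_SUBTRACT(received_transaction_set, @@global.gtid_executed)
              AS NotYetApplied
       FROM performance_schema.replication_connection_status;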
From the official documentation: the lag is caused by a large transaction running on the slave, which blocks the other transactions queued behind it; that is why you see large spikes in Seconds_Behind_Master. It also explains why the mk-heartbeat tool is more useful here.
It's not a reporting error; that's just how MySQL replication works. It transfers the binary log over, then runs the queries. If one of the queries takes a long time to run, then all the other queries clog up behind it until that one completes. That's why you see the spikes. The fact that mk-heartbeat shows a low average just means that it's not a general overload problem, just a few big queries (or, less likely, an occasional monster load spike on the slave).
Instantaneous "seconds behind master" is a fairly useless figure (except when you want to know Right Now how far behind you are). The mk-heartbeat stats are far better for getting an idea of how overloaded your replication is -- anything higher than about 2-3 seconds on average over a dozen or so pings and you're pooched.
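mk-heartbeat (now pt-heartbeat in Percona Toolkit) measures lag by replicating a timestamp and comparing it with the slave's clock rather than trusting Seconds_Behind_Master. A rough sketch of the same idea in plain SQL, using a hypothetical test.heartbeat table (the real tool also handles clock skew and the regular updates for you):

master> CREATE TABLE test.heartbeat (id INT PRIMARY KEY, ts TIMESTAMP(6) NOT NULL);
master> INSERT INTO test.heartbeat VALUES (1, NOW(6));

-- Re-run this on the master every second or so (from a scheduled event or cron):
master> UPDATE test.heartbeat SET ts = NOW(6) WHERE id = 1;

-- On the slave, the approximate replication delay in seconds is:
slave> SELECT TIMESTAMPDIFF(MICROSECOND, ts, NOW(6)) / 1000000 AS lag_seconds
       FROM test.heartbeat
       WHERE id = 1;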