背景

环境

  • 数据库版本:5.6.24-log
  • 主从架构:一主两从搭配mha

操作步骤

  1. 开发要求truncate一张表,在删除之前需要备份下数据;
  2. 在从库上使用mysqldump命令做逻辑备份,备份sql:
mysqldump -uXXX -p osdc osdc_XXX > /tmp/osdc_info.sql;
  1. 这台从库与主库的同步出现中断,报错为:
Slave_IO_Running: Yes
Slave_SQL_Running: No
Last_Errno: 1756
Last_Error: ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details).

ERROR NO 是1756,而且只是 Slave_SQL_Running 停了。

处理方式

  1. 停止同步:
stop slave;

2 打开同步:

start slave;

重新启停同步服务后,恢复正常。

分析

之前形形色色的主从问题遇到过很多,但是 1756 这个错误还是第一次遇到,所以事后还是很有兴趣去深入分析下这个问题。

1. 详细报错

查看数据库的 error log ,这个报错的具体输出如下:

2020-08-18 22:04:19 208840 [Warning] Slave SQL: Worker 2 failed executing transaction 'fe0ed3ca-cacc-11e9-8d49-6c92bfcf73b6:1377629720' at master log mysql-bin.003536, end_log_pos 861741231
; Could not execute Update_rows event on table osdc.osdc_XXX; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205; handler error HA_ERR_LOCK_WAIT_TIMEOUT; the event's
master log mysql-bin.003536, end_log_pos 861741231, Error_code: 1205

2020-08-18 22:04:19 208840 [Note] Worker 2 is exiting: killed 0, error 1, running_status 1

2020-08-18 22:04:19 208840 [ERROR] Slave SQL: ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency aut
omatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for detail
s). Error_code: 1756

错误日志里有个 Lock wait timeout exceeded ,这个我们先记下,稍后再说。
先找到 master 事务的 gtid,直接去主库的 binlog 里找下这个 gtid 所对应的 sql:

SET @@SESSION.GTID_NEXT= 'fe0ed3ca-cacc-11e9-8d49-6c92bfcf73b6:1377629720'/*!*/;

# at 861740635

#200818 21:58:42 server id 31  end_log_pos 861740715 CRC32 0xa9ab0dd2   Query   thread_id=9403182       exec_time=0     error_code=0
SET TIMESTAMP=1597759122/*!*/;
BEGIN
/*!*/;

# at 861740715

#200818 21:58:42 server id 31  end_log_pos 861740840 CRC32 0xbd8ba613   Rows_query

# update osdc_XXX
#         set modify_time=now(),status='04'
#         where id=9298XXXXX and status='01'

可以看到,这时候主库对 osdc_XXX 表进行了 update 操作。

2. 问题分析

  1. 首先,mysqldump 不加任何参数去执行,会对备份的表加表级锁。
  2. 这时候从库需要执行从主库同步过来的 update 语句,因为 mysqldump 表锁的存在,该语句会处理等待状态。
  3. update 等待超时,这个错误我们在上面的 error log 里可以看到(Lock wait timeout exceeded)。
  4. 因为在5.6.26版本下,在使用 MTS 的情况下,slave_transaction_retries 是不生效的。所以当 update
    语句超时后,它没有重试机制,导致整个 SQL threads 都停掉了(这也就是为什么 Slave_IO_Running 显示YES,而
    Slave_SQL_Running 显示NO)。

在 rpl_slave.cc 里可以看到:

----
/* MTS technical limitation no support of trans retry */
if (mi->rli->opt_slave_parallel_workers != 0 && slave_trans_retries != 0
----

这个 bug 很早就由 Yoshinori Matsunobu 提在 https://bugs.mysql.com/bug.php?id=68465 了,官方也采纳了,并于 MySQL 5.7.5 修复掉了该问题。

Yoshinori Matsunobu 就是大名鼎鼎的 MHA 的作者!

3. 问题复现

相关参数

slave_parallel_workers:3

[root@localhost][(none)]> show variables like '%slave_parallel_workers%';
+------------------------+-------+
| Variable_name          | Value |
+------------------------+-------+
| slave_parallel_workers | 5     |
+------------------------+-------+

innodb_lock_wait_timeout:50

[root@localhost][(none)]> show variables like '%innodb_lock_wait_timeout%';
+--------------------------+-------+
| Variable_name            | Value |
+--------------------------+-------+
| innodb_lock_wait_timeout | 50    |
+--------------------------+-------+

slave_transaction_retries:10

[root@localhost][none]> show variables like '%slave_transaction_retries%';
+---------------------------+-------+
| Variable_name             | Value |
+---------------------------+-------+
| slave_transaction_retries | 10    |
+---------------------------+-------+

测试数据:

[root@localhost][bug]> select * from repl_test;
+------+-------+
| id   | info  |
+------+-------+
|    1 | tom   |
|    2 | jerry |
+------+-------+

从库加锁操作

从库:

[root@localhost][bug]> begin;update repl_test set info='all';
Query OK, 0 rows affected (0.00 sec)

Query OK, 2 rows affected (0.01 sec)
Rows matched: 2  Changed: 2  Warnings: 0

显性开启事务,让它持续持有 InnoDB lock

主库更新操作

主库:

[root@localhost][bug]> update repl_test set info='all' where id=1;commit;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

Query OK, 0 rows affected (0.00 sec)

手工更新 repl_test 表的一条数据,此时从库的复制线程会等待这条sql的应用,并处于 system lock 状态:

从库:

[root@localhost][(none)]> show processlist;

+----+-----------------+-----------+------+---------+------+-----------------------------------------------------------------------------+--------------------------------------------+
| Id | User            | Host      | db   | Command | Time | State                                                                       | Info                                       |
+----+-----------------+-----------+------+---------+------+-----------------------------------------------------------------------------+--------------------------------------------+
|  6 | event_scheduler | localhost | NULL | Daemon  |  545 | Waiting on empty queue                                                      | NULL                                       |
|  9 | root            | localhost | bug  | Sleep   |   10 |                                                                             | NULL                                       |
| 15 | system user     |           | NULL | Connect |   44 | Waiting for master to send event                                            | NULL                                       |
| 16 | system user     |           | NULL | Connect |    5 | Slave has read all relay log; waiting for the slave I/O thread to update it | NULL                                       |
| 17 | system user     |           | NULL | Connect |   44 | Waiting for an event from Coordinator                                       | NULL                                       |
| 18 | system user     |           | NULL | Connect |   44 | Waiting for an event from Coordinator                                       | NULL                                       |
| 19 | system user     |           | NULL | Connect |    5 | System lock                                                                 | update repl_test set info='all' where id=1 |
| 20 | root            | localhost | NULL | Query   |    0 | init                                                                        | show processlist                           |
+----+-----------------+-----------+------+---------+------+-----------------------------------------------------------------------------+--------------------------------------------+
8 rows in set (0.04 sec)

errorlog

错误日志开始打印错误信息:

2020-08-20 11:01:00 22391 [Warning] Slave SQL: Worker 2 failed executing transaction '' at master log mysql-bin.000272, end_log_pos 654587717; Could not execute Update_rows event on table bug.repl_test; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205; handler error HA_ERR_LOCK_WAIT_TIMEOUT; the event's master log FIRST, end_log_pos 654587717, Error_code: 1205
2020-08-20 11:01:00 22391 [ERROR] Slave SQL: ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: 1756
2020-08-20 11:01:00 22391 [Note] Error reading relay log event: slave SQL thread was killed

报的错与当时的内容一致。

4. 总结

  1. mysqldump 即使在从库中执行,也必须加上 --single-transcation 等参数,直接执行会上表锁,成本大大。
  2. 部分mysql工具,如 navicat 直接使用它自带的导出功能,也会锁住全表。所以尽量不要使用工具去处理导出工作。
mysql> show open tables where in_use >0 ;
+----------+-----------+--------+-------------+
| Database | Table     | In_use | Name_locked |
+----------+-----------+--------+-------------+
| osdc     | osdc_XXX |      1 |           0 |
+----------+-----------+--------+-------------+
1 row in set (0.00 sec)
  1. 5.6 的主从的坑已经踩了很多了,大多是由于自身的bug,而且在 5.6 下根本体现不出 MTS 的优势。把 5.6 升级至 5.7 或 8.0 是非常有必要的,之前测试的 8.0 的复制稳定性和性能的提升非常大,推荐直接升级至 8.0 版本。