常见错误代码:

1864:从库slave_pending_jobs_size_max参数值设置太小导致

1032:从库找不到对应的行数据

1062:从库行数据已存在,主键冲突

一、Last_Errno: 1864

报错信息: 

Last_Errno: 1864
Last_Error: Cannot schedule event Rows_query, relay-log name /home/mysql/db4306/slave-relay-bin.025121, position 323175613 to Worker thread because its size 34112901 exceeds 16777216 of slave_pending_jobs_size_max.

原因:

slave_pending_jobs_size_max参数设置的值太小导致。也就是当前worker线程正在处理的event所需要的总内存大小,超过了slave_pending_jobs_size_max参数设置的值。

官网说明:

slave_pending_jobs_size_max

For multithreaded replicas, this variable sets the maximum amount of memory (in bytes) available to worker queues holding events not yet applied. Setting this variable has no effect on replicas for which multithreading is not enabled. Setting this variable has no immediate effect. The state of the variable applies on all subsequent START SLAVE commands.
The minimum possible value for this variable is 1024; the default is 16MB. The maximum possible value is 18446744073709551615 (16 exabytes). Values that are not exact multiples of 1024 are rounded down to the next-highest multiple of 1024 prior to being stored.
The value of this variable is a soft limit and can be set to match the normal workload. If an unusually large event exceeds this size, the transaction is held until all the worker threads have empty queues, and then processed. All subsequent transactions are held until the large transaction has been completed.

意思是:

在多线程复制中,该参数用于限制尚未应用的事件的工作队列的最大内存量(以byte为单位)。而对未启用多线程的复制是没有影响的(即slave_parallel_workers=0)。修改此参数后不会立即生效,需要重启slave。

该参数默认值是16777216(即16M),取值范围是1KB~16EB。

此变量的值是一个软限制,可以设置为与正常工作负载相匹配。如果异常大的事件超过此大小,则事务将被搁置,直到所有工作线程都有空队列,然后再进行处理。如果内存充足或延迟较大时,可以适当调大(这个值要比主库的max_allowed_packet大)。
 

解决: 

mysql> show variables like '%max_allowed_packet%';
+--------------------------+------------+
| Variable_name            | Value      |
+--------------------------+------------+
| max_allowed_packet       | 1073741824 |
| slave_max_allowed_packet | 1073741824 |
+--------------------------+------------+

设置从库slave_pending_jobs_size_max的大小,需要大于主库max_allowed_packet的大小。

stop slave;
set global slave_pending_jobs_size_max=1073741824;
start slave;

二、Last_Errno: 1032

模拟报错:

1、先创建一张表
CREATE TABLE IF NOT EXISTS myt1( id INT UNSIGNED AUTO_INCREMENT, name VARCHAR(255) , age INT , sex VARCHAR(255),PRIMARY KEY (id) );
INSERT INTO myt1(name,age,sex) VALUES ('zhansan',43,'M');
INSERT INTO myt1(name,age,sex) VALUES ('lisi',32,'M');
INSERT INTO myt1(name,age,sex) VALUES ('chenxia',27,'F');
INSERT INTO myt1(name,age,sex) VALUES ('wangwu',24,'M');
INSERT INTO myt1(name,age,sex) VALUES ('lijing',35,'M');
INSERT INTO myt1(name,age,sex) VALUES ('zhaoqian',22,'F');
INSERT INTO myt1(name,age,sex) VALUES ('zhouzuan',55,'M');

2、先在slave上删除一条数据,再到master上修改该条数据
slave库:delete from myt1 where id=3;
master库:update myt1 set sex="Man" where id=3;

报错信息:

Slave_IO_Running: Yes
 Slave_SQL_Running: No
			......
        Last_Errno: 1032
        Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executingtransaction '974060d4-5809-11ec-a945-78aa82c76790:26096' at master log mysql-bin.000009, end_log_pos 603513. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
            ......
Retrieved_Gtid_Set: 97399be4-5809-11ec-93cf-78aa82c785a0:1-21,974060d4-5809-11ec-a945-78aa82c76790:7074-26907
 Executed_Gtid_Set: 97399be4-5809-11ec-93cf-78aa82c785a0:1-20,974060d4-5809-11ec-a945-78aa82c76790:1-26095,db53b93e-580d-11ec-aae7-78aa82c783e0:1-41

原因:

由于DELETE操作所涉及的行记录在slave中未找到引起的。

若用户在slave库上误删了某条数据时,由于slave操作不会同步到master,就会导致slave上少了这条数据,而master上还存在这条数据。

master上只要不对这条数据执行修改操作,slave库状态都是正常的,而一旦主库对该条记录进行修改,从库就会报1032错误。
 

解决:临时方法(跳过报错)

1)基于传统点位复制的方式: 跳过指定的事件数

sql_slave_skip_counter (参数说明)

stop slave;
 set global sql_slave_skip_counter=1;
 start slave;

若是使用gtid方式的主从复制,则会出现以下报错(需改用set gtid_next方式):

mysql> set global sql_slave_skip_counter=1;
mysql> ERROR 1858 (HY000): sql_slave_skip_counter can not be set when the server is running with @@GLOBAL.GTID_MODE = ON. Instead, for each transaction that you want to skip, generate an empty transaction with the same GTID as the transaction

2)基于GTID复制的方式:手动指定跳过报错的事务

stop slave sql_thread;
 set gtid_next='974060d4-5809-11ec-a945-78aa82c76790:26096';        # 跳过报错的GTID
 begin;
 commit;
 set gtid_next='automatic';
 start slave sql_thread;

gtid_next方式只是跳过1个事件,如果是在master上批量update了很多条slave不存在的记录数据时,每一条update语句都会对应一个事件,那就要执行很多次上面跳过操作,比较麻烦。(gtid_purged方式能跳过多个事务,具体怎么用,还未测试过)

另外还有一个问题是,如果这个事务中存在多个SQL,这种方式会使得该事务中的其他正常的SQL也会被跳过,无法应用到slave上。

3)设置slave_exec_mode参数,改为幂等模式

设置全局变量slave_exec_mode为IDEMPOTENT,它会忽略1032和1062的错误,对同一事务内的其他SQL执行无影响。

slave_exec_mode (参数说明)

stop slave;
 set global slave_exec_mode="idempotent";         # 默认值是STRICT
 start slave;
 set global slave_exec_mode="strict";                   # slave正常启动后要恢复原来的值

解决:永久方法(到slave上修复数据)

根据报错中提示的end_log_pos位置或GTID,在master的binlog日志中,找到对应的数据,然后再到slave上处理并修复引起异常的数据。

比如,按照上面的报错提示,用mysqlbinlog命令查看master的mysql-bin.000009文件,定位到 974060d4-5809-11ec-a945-78aa82c76790:26096 事件(或end_log_pos为603513位置),可看到是由于在master上执行了update操作引起slave的报错。

[root@localhost /home/mysql/data3020/mysql] $ mysqlbinlog --no-defaults --base64-output=decode-rows -v -v mysql-bin.000009 | grep -B2 -A50 974060d4-5809-11ec-a945-78aa82c76790:26096
# at 603240
#211210  9:49:35 server id 2749362147  end_log_pos 603305 CRC32 0x9d0e1a98  GTID [commit=no]
SET @@SESSION.GTID_NEXT= '974060d4-5809-11ec-a945-78aa82c76790:26096'/*!*/;
# at 603305
#211210  9:49:35 server id 2749362147  end_log_pos 603379 CRC32 0xcd395b74  Query thread_id=85785 exec_time=0 error_code=0
SET TIMESTAMP=1639100975/*!*/;
SET @@session.sql_mode=0/*!*/;
/*!\C latin1 *//*!*/;
SET @@session.character_set_client=8,@@session.collation_connection=8,@@session.collation_server=33/*!*/;
BEGIN
/*!*/;
# at 603379
#211210  9:49:35 server id 2749362147  end_log_pos 603435 CRC32 0x72fa8f8d  Table_map: `testdb`.`myt1` mapped to number 125
# at 603435
#211210  9:49:35 server id 2749362147  end_log_pos 603513 CRC32 0x66c8021f  Update_rows: table id 125 flags: STMT_END_F
### UPDATE `testdb`.`myt1`
### WHERE
###   @1=3 /* INT meta=0 nullable=0 is_null=0 */
###   @2='chenxia' /* VARSTRING(765) meta=765 nullable=1 is_null=0 */
###   @3=27 /* INT meta=0 nullable=1 is_null=0 */
###   @4='F' /* VARSTRING(765) meta=765 nullable=1 is_null=0 */
### SET
###   @1=3 /* INT meta=0 nullable=0 is_null=0 */
###   @2='chenxia' /* VARSTRING(765) meta=765 nullable=1 is_null=0 */
###   @3=27 /* INT meta=0 nullable=1 is_null=0 */
###   @4='Man' /* VARSTRING(765) meta=765 nullable=1 is_null=0 */
# at 603513
#211210  9:49:35 server id 2749362147  end_log_pos 603544 CRC32 0x878b1c46  Xid = 1337914
COMMIT/*!*/;

解决方式就是,在master的binlog日志中找到delete的那条数据后,再做反转(变成insert),将误删的数据恢复到slave上即可,如下:

slave库上执行:INSERT INTO myt1(id,name,age,sex) VALUES(3,'chenxia',27,'F');

三、Last_Errno: 1062

 模拟报错:

1、先创建一张表
CREATE TABLE IF NOT EXISTS myt1( id INT UNSIGNED AUTO_INCREMENT, name VARCHAR(255) , age INT , sex VARCHAR(255),PRIMARY KEY (id) );
INSERT INTO myt1(name,age,sex) VALUES ('zhansan',43,'M');
INSERT INTO myt1(name,age,sex) VALUES ('lisi',32,'M');
INSERT INTO myt1(name,age,sex) VALUES ('chenxia',27,'F');
INSERT INTO myt1(name,age,sex) VALUES ('wangwu',24,'M');
INSERT INTO myt1(name,age,sex) VALUES ('lijing',35,'M');
INSERT INTO myt1(name,age,sex) VALUES ('zhaoqian',22,'F');
INSERT INTO myt1(name,age,sex) VALUES ('zhouzuan',55,'M');

2、先在slave上插入一条数据,再到master上插入一个相同自增id的数据
slave库:INSERT INTO myt1(id,name,age,sex) VALUES(20,'hahahah',33,'M');
master库:INSERT INTO myt1(id,name,age,sex) VALUES(20,'hahahah',33,'M');

报错信息:

Slave_IO_Running: Yes
 Slave_SQL_Running: No
			......
        Last_Errno: 1062
        Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executingtransaction '974060d4-5809-11ec-a945-78aa82c76790:28807' at master log mysql-bin.000010, end_log_pos 557694. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
            ......
Retrieved_Gtid_Set: 97399be4-5809-11ec-93cf-78aa82c785a0:1-24,974060d4-5809-11ec-a945-78aa82c76790:7074-28807
 Executed_Gtid_Set: 97399be4-5809-11ec-93cf-78aa82c785a0:1-24,974060d4-5809-11ec-a945-78aa82c76790:1-28806,db53b93e-580d-11ec-aae7-78aa82c783e0:1-47

原因:

由于INSERT操作所涉及的主键行记录在slave中已存在,引起的PRIMARY重复导致。

比如若用户在slave库上手动新增了一条数据时,且表的primary key字段是自增ID,这时再到master上插入同一个自增id的数据时,就会导致slave库报1062错误。

解决:

由于slave上数据已存在,直接跳过该事务即可。

stop slave sql_thread;
 set gtid_next='974060d4-5809-11ec-a945-78aa82c76790:28807';
 begin;
 commit;
 set gtid_next='automatic';
 start slave sql_thread;
 show slave status\G

stop slave;
 set global slave_exec_mode="idempotent";
 start slave;
 set global slave_exec_mode="strict";

四、Last_IO_Errno : 1595

报错信息:

Slave_IO_Running : No
Slave_SQL_Running : Yes
Last_IO_Errno : 1595
Last_IO_Error : Relay log write failure: could not queue event from master

解决:

stop slave;
start slave;