近期工作中监控到一台MySQL Slave报了1146的SQL_THREAD ERROR,提示有一张表不存在。当时一脸懵逼,这个报错在我们的MySQL数据中心并没有出现过,而且只是一台MySQL实例突然出现的个体状况。
首先登录Slave去“ SHOW SLAVE STATUS ”,发现报错的语句是一句执行建表操作的DDL,而丢失的表正是该DDL要创建的表。即:在执行创建某表格的操作时报了该表格不存在的error。哇喔,二脸懵逼。
然后切到具体的库里面查看,该表格的确是存在的,可以SHOW出来,但是对该表执行一句简单的DQL,却还是报了1146参照完整性错误。三脸懵逼。
好吧,一定是碰到了“僵尸表”,不死不活的。还是去看看表文件吧。切到该库的datadir,grep一下该表,发现只有一个frm表定义文件。按照之前创建该表的DDL,是张InnoDB表,应该还有一个ibd表空间文件才对啊!结合之前的三次懵逼,猜想是不是该表在创建过程中只写完frm文件,还没来得及创建ibd就GG了?
于是赶紧去error log查看日志,筛选出与之相关的记录。
2018-01-11T02:16:38.857968Z mysqld_safe Number of processes running now: 0
2018-01-11T02:16:38.886278Z mysqld_safe mysqld restarted
……………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………
2018-01-11T02:16:48.590391Z 0 [Note] InnoDB: Database was not shutdown normally!
2018-01-11T02:16:48.590399Z 0 [Note] InnoDB: Starting crash recovery.
2018-01-11T02:16:48.985033Z 0 [Note] InnoDB: Starting an apply batch of log records to the database...
2018-01-11T02:16:51.354054Z 0 [Note] InnoDB: Apply batch completed
2018-01-11T02:16:51.354116Z 0 [Note] InnoDB: Last MySQL binlog file position 0 22727779, file name mysql-bin.000561
2018-01-11T02:17:20.284069Z 0 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1"
2018-01-11T02:17:20.284111Z 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2018-01-11T02:17:20.284226Z 0 [Note] InnoDB: Setting file '/app/mysql/data/ibtmp1' size to 12 MB. Physically writing t
he file full; Please wait ...2018-01-11T02:17:20.407973Z 0 [Note] InnoDB: File '/app/mysql/data/ibtmp1' size is now 12 MB.
2018-01-11T02:17:20.409348Z 0 [Note] InnoDB: 96 redo rollback segment(s) found. 96 redo rollback segment(s) are active
.2018-01-11T02:17:20.409367Z 0 [Note] InnoDB: 32 non-redo rollback segment(s) are active.
2018-01-11T02:17:20.409670Z 0 [Note] InnoDB: Waiting for purge to start
2018-01-11T02:17:20.459812Z 0 [Note] InnoDB: 5.7.16 started; log sequence number 77317291155
2018-01-11T02:17:20.459811Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 39910ms. The settings might not b
e optimal. (flushed=0 and evicted=0, during the time.)2018-01-11T02:17:20.460831Z 0 [Note] InnoDB: Loading buffer pool(s) from /app/mysql/data/ib_buffer_pool
2018-01-11T02:17:20.461277Z 0 [Note] Plugin 'FEDERATED' is disabled.
2018-01-11T02:17:20.509425Z 0 [Note] Recovering after a crash using /app/mysql/logs/mysql-bin
2018-01-11T02:17:20.624436Z 0 [Note] InnoDB: Buffer pool(s) load completed at 180111 10:17:20
2018-01-11T02:17:20.713438Z 0 [Note] Starting crash recovery...
2018-01-11T02:17:20.713548Z 0 [Note] Crash recovery finished.
……………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………
2018-01-11T02:17:21.455677Z 1 [ERROR] Slave SQL for channel '': Error 'Table '…….……' doesn't exist
没错,原来如此,嗖嘎,果然竟然是mysqld出现了崩溃重启的情况,导致该表在创建到一半的时候歇菜了。对MySQL而言这就尴尬了,该表既存在又不存在。
既然搞清楚了状况和原因,现在就是着手修复这个1146的SQL THREAD复制报错。既然这张表是个“半成品”,就索性干死它重新创建吧。
只导表定义不导数据);最后,“ START SLAVE ”启动复制。
简单执行完上面的修复操作后,监控了下,没有报错产生了。之后的工作就是找找mysqld崩溃重启的原因了,有时间再检查和修复Slave的数据一致性。