A bug we hit in production that crashes MySQL outright

mb5fd86a050ef28, 2021-03-05 20:55:49

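© Copyright belongs to the author. This is an original work by 51CTO blog author mb5fd86a050ef28. Contact the author for permission to repost; unauthorized reposting may lead to legal action.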

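A brief account:

Step 1: The machine went down for no apparent reason.

Step 2: After a reboot it would not come back up.

Step 3: We swapped hardware, and once the CPU was replaced the machine booted.

Step 4: I happily started the database and logged in. Everything looked fine.

Step 5: This database had been the master, and when the machine went down an automatic master-slave failover took place. So I set out to verify the data: whether the new master and the old master were consistent, and whether the new master had lost anything. The check showed that the new master had lost no data when it took over. All good.

Step 6: Since no rows were missing and none were extra, I decided to attach this database directly as a slave of the new master, and ran the CHANGE MASTER command.

Step 7: I ran START SLAVE, and mysqld died instantly and restarted automatically. I have spent many years as a DBA doing front-line MySQL operations (grunt work), and even so I have rarely seen a machine crash knock a MySQL database out like this.

Then I looked at the mysqld error log.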

2018-09-06T18:33:47.475065+08:00 5 [Warning] Slave SQL for channel '': If a crash happens this configuration does not guarantee that the relay log info will be consistent, Error_code: 0
2018-09-06T18:33:47.475172+08:00 5 [Note] Slave SQL thread for channel '' initialized, starting replication in log 'mysql-bin.000005' at position 1063042206, relay log '/mysqldata/myinst1/binlog/relay-log.000002' position: 425121
2018-09-06T18:33:47.478127+08:00 5 [Note] Slave for channel '': MTS Recovery has completed at relay log /mysqldata/myinst1/binlog/relay-log.000002, position 473555 master log mysql-bin.000005, position 1063090640.
2018-09-06T18:33:52.516543+08:00 6 [Warning] Timeout waiting for reply of binlog (file: mysql-bin.000015, pos: 7595), semi-sync up to file , position 0.
2018-09-06T18:33:52.516580+08:00 6 [Note] Semi-sync replication switched OFF.
2018-09-06 18:33:52 0x7fbce99d4700  InnoDB: Assertion failure in thread 140449349977856 in file fut0lst.ic line 85
InnoDB: Failing assertion: addr.page == FIL_NULL || addr.boffset >= FIL_PAGE_DATA
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
10:33:52 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
Attempting to collect some information that could help diagnose the problem.
As this is a crash and something is definitely wrong, the information
collection process might fail.

key_buffer_size=268435456
read_buffer_size=8388608
max_used_connections=1
max_threads=2000
thread_count=23
connection_count=1
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 20768831 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

That is a lot of log output, but only two lines matter:

InnoDB: Assertion failure in thread 140449349977856 in file fut0lst.ic line 85

InnoDB: Failing assertion: addr.page == FIL_NULL || addr.boffset >= FIL_PAGE_DATA

Following "in file fut0lst.ic line 85" leads to this function:

/********************************************************************//**
Reads a file address.
@return file address */
UNIV_INLINE
fil_addr_t
flst_read_addr(
/*===========*/
    const fil_faddr_t* faddr, /*!< in: pointer to file faddress */
    mtr_t* mtr) /*!< in: mini-transaction handle */
{
    fil_addr_t addr;

    ut_ad(faddr && mtr);

    addr.page = mtr_read_ulint(faddr + FIL_ADDR_PAGE, MLOG_4BYTES, mtr);
    addr.boffset = mtr_read_ulint(faddr + FIL_ADDR_BYTE, MLOG_2BYTES,
                                  mtr);
    ut_a(addr.page == FIL_NULL || addr.boffset >= FIL_PAGE_DATA);
    ut_a(ut_align_offset(faddr, UNIV_PAGE_SIZE) >= FIL_PAGE_DATA);
    return(addr);
}

The failure is at ut_a(addr.page == FIL_NULL || addr.boffset >= FIL_PAGE_DATA): the addr that was read back does not satisfy this condition.
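To make the condition concrete, here is a minimal standalone sketch of the same check. It is not InnoDB code: FileAddr and addr_is_sane are hypothetical names made up for illustration, and the two constants are written out with the values I believe the 5.7 headers use (FIL_NULL = 0xFFFFFFFF, FIL_PAGE_DATA = 38). Unlike ut_a(), it returns a bool rather than aborting.

```cpp
#include <cstdint>

// Constant values as I understand them from the InnoDB 5.7 headers (assumption);
// FileAddr and addr_is_sane are hypothetical names invented for this sketch.
constexpr std::uint32_t FIL_NULL = 0xFFFFFFFFu;  // "no page" marker
constexpr std::uint16_t FIL_PAGE_DATA = 38;      // first byte after the page header

struct FileAddr {
    std::uint32_t page;     // page number, or FIL_NULL
    std::uint16_t boffset;  // byte offset within that page
};

// Same condition as the failing ut_a(), but returns a bool instead of aborting.
bool addr_is_sane(const FileAddr& a) {
    return a.page == FIL_NULL || a.boffset >= FIL_PAGE_DATA;
}
```

Under this check a null address such as {FIL_NULL, 0} passes, an address like {5, 38} passes, and {5, 0} fails, because an offset below 38 would point into the page header rather than into page data.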

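Why did the file address read back differ from what was expected? Possibly the server crash broke this consistency. But where exactly?

Let's keep reading the code.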

/********************************************************************//**
Writes a file address. */
UNIV_INLINE
void
flst_write_addr(
/*============*/
    fil_faddr_t* faddr, /*!< in: pointer to file faddress */
    fil_addr_t addr, /*!< in: file address */
    mtr_t* mtr) /*!< in: mini-transaction handle */
{
    ut_ad(faddr && mtr);
    ut_ad(mtr_memo_contains_page_flagged(mtr, faddr,
                                         MTR_MEMO_PAGE_X_FIX
                                         | MTR_MEMO_PAGE_SX_FIX));

    ut_a(addr.page == FIL_NULL || addr.boffset >= FIL_PAGE_DATA);
    ut_a(ut_align_offset(faddr, UNIV_PAGE_SIZE) >= FIL_PAGE_DATA);

    mlog_write_ulint(faddr + FIL_ADDR_PAGE, addr.page, MLOG_4BYTES, mtr);
    mlog_write_ulint(faddr + FIL_ADDR_BYTE, addr.boffset,
                     MLOG_2BYTES, mtr);
}

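My reading is that the problem lies in the last two statements of this function. Suppose the server went down right after

mlog_write_ulint(faddr + FIL_ADDR_PAGE, addr.page, MLOG_4BYTES, mtr);

had executed, but before the next statement ran. The information recorded at faddr would then be incomplete: the page number is new while the byte offset is stale. The later read fails the check ut_a(addr.page == FIL_NULL || addr.boffset >= FIL_PAGE_DATA), and mysqld crashes.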
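The half-written state can be reproduced in a toy model. The sketch below is only an illustration of the hypothesis, not InnoDB code: it writes straight into a 6-byte buffer, whereas the real mlog_write_ulint works inside a mini-transaction and also emits redo log records, which this model ignores. The helper names (FileAddr, write_addr, load_addr and so on) and the crash_midway flag are hypothetical, the field offsets 0 and 4 and the big-endian layout are my understanding of the on-disk format, and the starting value {FIL_NULL, 0} is an assumption about what an empty pointer looks like.

```cpp
#include <cassert>
#include <cstdint>

// Values as I understand them from the InnoDB 5.7 headers (assumption).
constexpr std::uint32_t FIL_NULL = 0xFFFFFFFFu;
constexpr std::uint16_t FIL_PAGE_DATA = 38;
constexpr int FIL_ADDR_PAGE = 0;  // 4-byte page number
constexpr int FIL_ADDR_BYTE = 4;  // 2-byte offset within the page
constexpr int FIL_ADDR_SIZE = 6;

struct FileAddr {  // hypothetical stand-in for fil_addr_t
    std::uint32_t page;
    std::uint16_t boffset;
};

bool addr_is_sane(const FileAddr& a) {
    return a.page == FIL_NULL || a.boffset >= FIL_PAGE_DATA;
}

// Big-endian stores, one per field.
void store_be32(std::uint8_t* p, std::uint32_t v) {
    p[0] = static_cast<std::uint8_t>(v >> 24);
    p[1] = static_cast<std::uint8_t>(v >> 16);
    p[2] = static_cast<std::uint8_t>(v >> 8);
    p[3] = static_cast<std::uint8_t>(v);
}

void store_be16(std::uint8_t* p, std::uint16_t v) {
    p[0] = static_cast<std::uint8_t>(v >> 8);
    p[1] = static_cast<std::uint8_t>(v);
}

// Reads both fields back from the buffer without checking them.
FileAddr load_addr(const std::uint8_t* faddr) {
    const std::uint8_t* p = faddr + FIL_ADDR_PAGE;
    const std::uint8_t* b = faddr + FIL_ADDR_BYTE;
    FileAddr a;
    a.page = (std::uint32_t{p[0]} << 24) | (std::uint32_t{p[1]} << 16) |
             (std::uint32_t{p[2]} << 8) | std::uint32_t{p[3]};
    a.boffset = static_cast<std::uint16_t>((b[0] << 8) | b[1]);
    return a;
}

// Two separate stores, like the two mlog_write_ulint calls.
// crash_midway simulates the machine dying between them.
void write_addr(std::uint8_t* faddr, FileAddr addr, bool crash_midway) {
    assert(addr_is_sane(addr));  // the writer only accepts valid addresses
    store_be32(faddr + FIL_ADDR_PAGE, addr.page);
    if (crash_midway) {
        return;  // simulated crash: boffset keeps its old value
    }
    store_be16(faddr + FIL_ADDR_BYTE, addr.boffset);
}
```

Starting from a buffer that holds {FIL_NULL, 0}, calling write_addr(buf, FileAddr{7, 120}, true) leaves page 7 with offset 0 in the buffer. The value passed to the writer was valid, yet load_addr(buf) now returns an address that addr_is_sane rejects, which is the shape of failure the assertion in flst_read_addr reports.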

If there had been no slave and no backup, how would you have handled this? Let me know what you think!
