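I recently ran into a rather strange problem. While everyone was taking their midday nap, my phone suddenly went off. To avoid waking the others I picked it up and checked the monitoring alert: the database was down. This was a database server that had been running for a long time. I logged in to the server and tried to restart MySQL, but it failed with the error (Starting MySQL..... ERROR! The server quit without updating PID file (/usr/local/mysql/data/BigData_ZT_PY_92.pid).). I started going through the error log and other troubleshooting steps, and while I was doing that another monitoring alert arrived: "xxx host has just been restarted". I tried to ping the host and got no reply. I was stunned: the server had rebooted itself for no apparent reason, and it went on to reboot several more times. In the end I contacted the data center staff and asked them to attach a monitor and see what was going on.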

After a lot of effort the machine finally came back up, and we started investigating. The error log showed:


InnoDB: End of page dump

2018-05-23 21:10:08 7f6786710700 InnoDB: uncompressed page, stored checksum in field1 2222046951, calculated checksums for field1: crc32 2624418990, innodb 1255280539, none 3735928559, stored checksum in field2 1914065653, calculated checksums for field2: crc32 2624418990, innodb 3045085343, none 3735928559, page LSN 555 2748030571, low 4 bytes of LSN at page end 2748030571, page number (if stored to page already) 84692, space id (if created with >= MySQL-4.1.1 and stored already) 2618

InnoDB: Page may be an index page where index id is 8005

InnoDB: Database page corruption on disk or a failed

InnoDB: file read of page 84692.

InnoDB: You may have to recover from a backup.

InnoDB: It is also possible that your operating

InnoDB: system has corrupted its own file cache

InnoDB: and rebooting your computer removes the

InnoDB: error.

InnoDB: If the corrupt page is an index page

InnoDB: you can also try to fix the corruption

InnoDB: by dumping, dropping, and reimporting

InnoDB: the corrupt table. You can use CHECK

InnoDB: TABLE to scan your table for corruption.

InnoDB: See also http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html

InnoDB: about forcing recovery.

InnoDB: Ending processing because of a corrupt database page.

2018-05-23 21:10:08 7f6786710700  InnoDB: Assertion failure in thread 140082613913344 in file buf0buf.cc line 4201

InnoDB: We intentionally generate a memory trap.

InnoDB: Submit a detailed bug report to http://bugs.mysql.com.

InnoDB: If you get repeated assertion failures or crashes, even

InnoDB: immediately after the mysqld startup, there may be

InnoDB: corruption in the InnoDB tablespace. Please refer to

InnoDB: http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html

InnoDB: about forcing recovery.

13:10:08 UTC - mysqld got signal 6 ;

This could be because you hit a bug. It is also possible that this binary

or one of the libraries it was linked against is corrupt, improperly built,

or misconfigured. This error can also be caused by malfunctioning hardware.

We will try our best to scrape up some info that will hopefully help

diagnose the problem, but since we have already crashed,

something is definitely wrong and this may fail.


key_buffer_size=8388608

read_buffer_size=131072

max_used_connections=0

max_threads=1024

thread_count=0

connection_count=0

It is possible that mysqld could use up to

key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 415416 K  bytes of memory

Hope that's ok; if not, decrease some variables in the equation.


Thread pointer: 0x0

Attempting backtrace. You can use the following information to find out

where mysqld died. If you see no messages after this, something went

terribly wrong...

stack_bottom = 0 thread_stack 0x40000

/usr/local/mysql/bin/mysqld(my_print_stacktrace+0x2c)[0x8f339c]

/usr/local/mysql/bin/mysqld(handle_fatal_signal+0x364)[0x66e3e4]

/lib64/libpthread.so.0(+0xf5e0)[0x7f6b9c5b45e0]

/lib64/libc.so.6(gsignal+0x37)[0x7f6b9b3ba1f7]

/lib64/libc.so.6(abort+0x148)[0x7f6b9b3bb8e8]

/usr/local/mysql/bin/mysqld[0xa9c5c5]

/usr/local/mysql/bin/mysqld[0xadecd6]

/usr/local/mysql/bin/mysqld[0xa400c8]

/lib64/libpthread.so.0(+0x7e25)[0x7f6b9c5ace25]

/lib64/libc.so.6(clone+0x6d)[0x7f6b9b47d34d]

The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains

information that should help you find out what is causing the crash.

180523 21:10:09 mysqld_safe mysqld from pid file /usr/local/mysql/data/BigData_ZT_PY_92.pid ended

180523 21:44:59 mysqld_safe Starting mysqld daemon with databases from /usr/local/mysql/data

2018-05-23 21:44:59 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details).


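The log above gives a clue: InnoDB hit a corrupt page (page 84692 in space id 2618) and aborted, which I took to be happening while it was rolling back during startup. After some reading, the likely cause was that the database's binary files on disk had been corrupted.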
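To make the checksum line in the log easier to read, here is a minimal Python sketch that pulls out the page number, the space id, and whether the stored field1 checksum matches any of the calculated ones. The sample string is the log line above with its wrapped pieces joined (the joined value `1255280539` is my assumption about how the line wrapped), and `summarize_checksum_line` is a hypothetical helper name. The comparison is a simplification, not InnoDB's actual validation logic.

```python
import re

# Checksum line from the error log above, with the wrapped pieces joined.
LOG_LINE = (
    "2018-05-23 21:10:08 7f6786710700 InnoDB: uncompressed page, "
    "stored checksum in field1 2222046951, calculated checksums for field1: "
    "crc32 2624418990, innodb 1255280539, none 3735928559, "
    "stored checksum in field2 1914065653, calculated checksums for field2: "
    "crc32 2624418990, innodb 3045085343, none 3735928559, "
    "page LSN 555 2748030571, low 4 bytes of LSN at page end 2748030571, "
    "page number (if stored to page already) 84692, "
    "space id (if created with >= MySQL-4.1.1 and stored already) 2618"
)

PATTERN = re.compile(
    r"stored checksum in field1 (?P<stored>\d+), "
    r"calculated checksums for field1: crc32 (?P<crc32>\d+), "
    r"innodb (?P<innodb>\d+), none (?P<none>\d+).*?"
    r"page number \(if stored to page already\) (?P<page>\d+), "
    r"space id \(.*?\) (?P<space>\d+)"
)


def summarize_checksum_line(line):
    """Return page number, space id and a simplified field1 checksum verdict."""
    match = PATTERN.search(line)
    if match is None:
        return None
    stored = int(match.group("stored"))
    calculated = {int(match.group(k)) for k in ("crc32", "innodb", "none")}
    return {
        "page": int(match.group("page")),
        "space_id": int(match.group("space")),
        # Simplification: a healthy page should match one of the algorithms.
        "field1_matches": stored in calculated,
    }


print(summarize_checksum_line(LOG_LINE))
```

For this log line the stored value matches none of the three calculated values, which is consistent with the "Database page corruption on disk or a failed file read" message that follows it.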

I then decided to use forced InnoDB recovery.



Here is how it is used:

[mysqld]


innodb_force_recovery = 1


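Warning

Only set innodb_force_recovery to a value greater than 0 in an emergency, so that you can start InnoDB and dump your tables. Before doing so, make sure you have a backup copy of your database in case you need to recreate it. Values of 4 or greater can permanently corrupt data files. Only use an innodb_force_recovery setting of 4 or greater on a production server instance after you have successfully tested the setting on a separate physical copy of your database. When forcing InnoDB recovery, always start with innodb_force_recovery=1 and increase the value only as necessary.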

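innodb_force_recovery is 0 by default (normal startup without forced recovery). The permissible nonzero values for innodb_force_recovery are 1 to 6. A larger value includes the functionality of the smaller values. For example, a value of 3 includes all of the functionality of values 1 and 2.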


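If you can dump your tables with an innodb_force_recovery value of 3 or less, you are relatively safe: only some data on the corrupt individual pages is lost. A value of 4 or greater is considered dangerous because data files can be permanently corrupted. A value of 6 is considered drastic because database pages are left in an obsolete state, which in turn may introduce more corruption into B-trees and other database structures.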

 

As a safety measure, InnoDB prevents INSERT, UPDATE, or DELETE operations when innodb_force_recovery is greater than 0. As of MySQL 5.6.15, an innodb_force_recovery setting of 4 or greater places InnoDB in read-only mode.
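The "start at 1 and only go higher if you must" rule can be sketched as a small loop. This is only an illustration: `first_working_level` and the `try_start` callback are hypothetical names, and the callback stands in for whatever you actually do to set the value in my.cnf and attempt a start. The default cap of 3 reflects the warning above that values of 4 or greater can permanently corrupt data files.

```python
def first_working_level(try_start, max_level=3):
    """Return the lowest innodb_force_recovery level at which try_start succeeds.

    try_start is a hypothetical callback: it should apply the given level,
    attempt to start mysqld, and return True if the server came up.
    Returns None if no level up to max_level works.
    """
    if not 1 <= max_level <= 6:
        raise ValueError("max_level must be between 1 and 6")
    for level in range(1, max_level + 1):
        if try_start(level):
            return level
    return None


# Stand-in callback for illustration: pretend the server only starts at 2+.
print(first_working_level(lambda level: level >= 2))
```

Raising `max_level` above 3 is a deliberate, explicit choice in this sketch, which mirrors the advice to test such settings on a separate copy of the database first.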

1 (SRV_FORCE_IGNORE_CORRUPT)

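Lets the server run even if it detects a corrupt page. Tries to make SELECT * FROM tbl_name jump over corrupt index records and pages, which helps in dumping tables.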


2 (SRV_FORCE_NO_BACKGROUND)

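Prevents the master thread and any purge threads from running. If a crash would occur during the purge operation, this recovery value prevents it.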


3 (SRV_FORCE_NO_TRX_UNDO)

Does not run transaction rollbacks after crash recovery.


4 (SRV_FORCE_NO_IBUF_MERGE)

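Prevents insert buffer merge operations. If they would cause a crash, does not do them. Does not calculate table statistics. This value can permanently corrupt data files. After using this value, be prepared to drop and recreate all secondary indexes. As of MySQL 5.6.15, sets InnoDB to read-only.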


5 (SRV_FORCE_NO_UNDO_LOG_SCAN)

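Does not look at undo logs when starting the database: InnoDB treats even incomplete transactions as committed. This value can permanently corrupt data files. As of MySQL 5.6.15, sets InnoDB to read-only.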


6 (SRV_FORCE_NO_LOG_REDO)

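Does not do the redo log roll-forward in connection with recovery. This value can permanently corrupt data files. Database pages are left in an obsolete state, which in turn may introduce more corruption into B-trees and other database structures. As of MySQL 5.6.15, sets InnoDB to read-only.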
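Because each value includes everything the lower values do, the list above can be summed up as a lookup table. The sketch below is only an illustration; the constant names come from the list, while `active_flags` and `can_corrupt_data_files` are hypothetical helper names.

```python
FORCE_RECOVERY_LEVELS = {
    1: "SRV_FORCE_IGNORE_CORRUPT",
    2: "SRV_FORCE_NO_BACKGROUND",
    3: "SRV_FORCE_NO_TRX_UNDO",
    4: "SRV_FORCE_NO_IBUF_MERGE",
    5: "SRV_FORCE_NO_UNDO_LOG_SCAN",
    6: "SRV_FORCE_NO_LOG_REDO",
}


def active_flags(level):
    """List every behaviour in effect at a level (higher levels are cumulative)."""
    if level == 0:
        return []  # normal startup, no forced recovery
    if level not in FORCE_RECOVERY_LEVELS:
        raise ValueError("innodb_force_recovery must be between 0 and 6")
    return [FORCE_RECOVERY_LEVELS[n] for n in range(1, level + 1)]


def can_corrupt_data_files(level):
    """Values of 4 or greater are the ones described as able to corrupt data files."""
    return level >= 4


print(active_flags(3), can_corrupt_data_files(3))
```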


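You can SELECT from tables to dump them. With an innodb_force_recovery value of 3 or less you can DROP or CREATE tables. As of MySQL 5.6.27, DROP TABLE is also supported with an innodb_force_recovery value greater than 3.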


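If you know that a given table is causing a crash on rollback, you can drop it. If you encounter a runaway rollback caused by a failing mass import or ALTER TABLE, you can kill the mysqld process and set innodb_force_recovery to 3 to bring the database up without the rollback, and then DROP the table that is causing the runaway rollback.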


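If corruption within the table data prevents you from dumping the entire table contents, a query with an ORDER BY primary_key DESC clause might be able to dump the portion of the table after the corrupted part.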


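If a high innodb_force_recovery value is required to start InnoDB, there may be corrupted data structures that could cause complex queries (queries containing WHERE, ORDER BY, or other clauses) to fail. In this case, you may only be able to run basic SELECT * FROM t queries.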
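The two dump strategies described above (a plain full scan, and a reverse scan by primary key to reach the rows after a corrupt region) can be sketched as query strings. `salvage_queries` is a hypothetical helper, and the table and column names in the usage line are placeholders, not names from this server.

```python
def quote_identifier(name):
    """Quote a MySQL identifier with backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def salvage_queries(table, primary_key):
    """Build the basic full-scan query and the reverse primary-key scan."""
    table_sql = quote_identifier(table)
    forward = f"SELECT * FROM {table_sql}"
    backward = f"SELECT * FROM {table_sql} ORDER BY {quote_identifier(primary_key)} DESC"
    return forward, backward


# Placeholder names for illustration only.
for query in salvage_queries("orders", "id"):
    print(query)
```

The forward query reads until it reaches the damaged pages; the reverse query approaches the same damage from the other end, so between the two you may recover rows on both sides of it.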




Then start the database:

[root@databases ~]# /etc/init.d/mysql start


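Once the database was up, I logged in and ran show slave status\G; and saw that the replica had not come up. I then commented out innodb_force_recovery = 1 in /etc/my.cnf and restarted the database, after which everything was fine.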
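Forgetting to remove the setting leaves InnoDB refusing writes, so it is worth making the "comment it out again" step mechanical. Below is a minimal sketch that works on the text of an option file; `comment_out_force_recovery` is a hypothetical helper, and it only handles plain `key = value` lines. The server still has to be restarted afterwards for the change to take effect.

```python
def comment_out_force_recovery(cnf_text):
    """Comment out any active innodb_force_recovery line in option-file text."""
    result = []
    for line in cnf_text.splitlines():
        key = line.split("=", 1)[0].strip()
        # Option files accept both the underscore and the dash spelling.
        if key in ("innodb_force_recovery", "innodb-force-recovery"):
            line = "# " + line
        result.append(line)
    return "\n".join(result) + "\n"


sample = "[mysqld]\ninnodb_force_recovery = 1\nport = 3306\n"
print(comment_out_force_recovery(sample))
```

Lines that are already commented out are left alone, so running it twice gives the same result as running it once.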


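Later investigation suggested that a server hardware fault had probably brought the database down, and may also have corrupted the database's binary files on disk.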

In addition, the /etc/my.cnf configuration file contained this setting:

innodb_flush_log_at_trx_commit = 1  

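# Key parameter. 0: the log is written and flushed to disk about once per second, so a database crash can lose about 1 second of transactions. 1: the log is written and flushed to disk at every transaction commit (with autocommit, that means after every SQL statement); I/O overhead is high because each commit waits for the log write, so throughput is lower. 2: the log is written only to the operating system cache at commit and flushed to disk about once per second; this is very efficient, and transactions are lost only if the server itself (operating system or power) fails.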


With a value of 1, I/O performance on this host is very poor, so this host can only run with a value of 2.
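The trade-off between the three values can be written down as a small decision function. This is a simplified sketch of the behaviour described in the comment above, assuming the storage honours flush requests; `may_lose_committed_transactions` and the failure labels are hypothetical names.

```python
def may_lose_committed_transactions(setting, failure):
    """Can committed transactions be lost for this setting and failure type?

    setting: value of innodb_flush_log_at_trx_commit (0, 1 or 2).
    failure: "mysqld_crash" (only the mysqld process dies) or
             "os_crash" (operating system crash or power loss).
    """
    if setting not in (0, 1, 2):
        raise ValueError("setting must be 0, 1 or 2")
    if failure not in ("mysqld_crash", "os_crash"):
        raise ValueError("unknown failure type")
    if setting == 1:
        return False  # log is flushed to disk at every commit
    if setting == 0:
        return True   # up to about a second of commits may not be written yet
    # setting == 2: the log reached the OS cache at commit, which survives
    # a mysqld crash but not an operating system crash or power loss.
    return failure == "os_crash"


for value in (0, 1, 2):
    print(value, may_lose_committed_transactions(value, "os_crash"))
```

On a host like this one, which had just rebooted itself several times, the `os_crash` case is exactly the risk that a value of 2 accepts in exchange for better I/O performance.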