1. 查看慢sql执行计划 

 Explain 慢sql,查看执行计划,有索引,扫描一百多万行(客户反映之前有三百多万行),有临时内存表数据存储,有排序, 执行时间1.7秒。不是太慢的sql。

 2.Show processlist 

 查看实时进程,没有停留太久的线程,资源宽裕。不是问题发生时间段的进程情况,无法判断。 

 3.查看错误日志 

2021-06-10T10:36:50.342681+08:00 742 [ERROR] Event Scheduler: [cccc@%][vvvv.mmdegggg] Lock wait timeout exceeded; try restarting transaction
2021-06-10T10:36:50.342728+08:00 742 [Note] Event Scheduler: [cccc@%].[vvvv.mmdegggg] event execution failed.
2021-06-10T10:46:51.193861+08:00 771 [ERROR] Event Scheduler: [cccc@%][vvvv.mmdegggg] Lock wait timeout exceeded; try restarting transaction
2021-06-10T10:46:51.193904+08:00 771 [Note] Event Scheduler: [cccc@%].[vvvv.mmdegggg] event execution failed.
2021-06-10T10:54:46.201843+08:00 776 [ERROR] Event Scheduler: [cccc@%][vvvv.mmdegggg] Query execution was interrupted
2021-06-10T10:54:46.201897+08:00 776 [ERROR] Event Scheduler: [cccc@%][vvvv.mmdegggg] Got error 168 from storage engine
2021-06-10T10:54:46.201904+08:00 776 [Note] Event Scheduler: [cccc@%].[vvvv.mmdegggg] event execution failed.
2021-06-10T11:01:51.063760+08:00 831 [ERROR] Event Scheduler: [cccc@%][vvvv.mmdegggg] Lock wait timeout exceeded; try restarting transaction
2021-06-10T11:01:51.063809+08:00 831 [Note] Event Scheduler: [cccc@%].[vvvv.mmdegggg] event execution failed.
2021-06-10T11:11:51.226952+08:00 894 [ERROR] Event Scheduler: [cccc@%][vvvv.mmdegggg] Lock wait timeout exceeded; try restarting transaction
2021-06-10T11:11:51.226995+08:00 894 [Note] Event Scheduler: [cccc@%].[vvvv.mmdegggg] event execution failed. 。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。
2021-06-10T11:31:55.680890+08:00 0 [Note] Giving 205 client threads a chance to die gracefully
2021-06-10T11:31:55.684323+08:00 0 [Note] Shutting down slave threads
2021-06-10T11:31:57.686861+08:00 0 [Note] Forcefully disconnecting 109 remaining clients

 事发时间段内,记录有大量锁等待占用,资源阻塞,锁超时。

『故障分析』 

 定时任务Event Scheduler: [cccc@%].[vvvv.mmdegggg]遇到请求的资源被锁定,发生锁等待,等待时间(innodb_lock_wait_timeout值)超过后,事务中断回滚。事务等待过程中,会阻塞后续其他请求的资源。而定时任务多次调用,多次拥堵。直到11:31 mysql 被重启,冲突资源释放,后续日志再没看到发生锁等待。 

『故障规避』 

 查看定时任务mmdegggg的脚本,看看调用的是哪些sql;查看问题发生时间前业务或者人为执行了什么跑批事务或者大事务,大表ddl等,查看问题发生时间段的慢日志里面出现的sql,分析资源占用冲突关系。 

『紧急处理』

 遇到锁等待,紧急操作,就是找到阻塞的进程id,kill id,释放资源。 

 1.查看当时mysql线程会话情况 

SELECT * FROM INFORMATION_SCHEMA.PROCESSLIST where time>10 and COMMAND<>'Binlog Dump' order by TIME desc limit 20;

 超长执行时间的sql或者事务,都会持续占用资源。若是发现字段state里面有lock wait,meta data lock,这类的,就是有锁等待发生。 

 2.查看INNODB_LOCK_WAITS 表

SELECT * FROM INFORMATION_SCHEMA.INNODB_LOCK_WAITS;

 可以看到当时阻塞事务的id, blocking_trx_id即是对应的事务id 

 3.查看INNODB_LOCK 表 

SELECT * FROM INFORMATION_SCHEMA.INNODB_LOCK;

 可以查出详细的锁信息 

 4.查看INNODB_TRX 表 

SELECT * FROM INFORMATION_SCHEMA.INNODB_TRX where trx_id = "上面获得的blocking_trx_id";

 可以查出详细的事务信息,trx_mysql_thread_id即是对应的线程ID 

 5.kill 线程

select * from information_schema.`PROCESSLIST` where ID = 

上面获得的trx_mysql_thread_id 可以查到线程的详细信息,可以通过 kill xxx来终止线程从而强行释放锁

 kill trx_mysql_thread_id;