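Problem description
In one past project, the Ceph cluster frequently reported slow requests. The problem was unusual: initial troubleshooting suggested that a few specific disks were involved, so those slow disks were removed and replaced with new ones. The slow requests persisted, however, and they were not tied to any particular OSD. There was one more symptom: every Saturday at around 11:00 a.m. the impact on the business was especially severe, and slow requests occurred more frequently.
Common commands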
./storcli64 /c0 show alilog | grep -i patrol
./storcli64 /c0 show pr
./storcli64 ? | grep -i patrol
./storcli64 /cx stop patrolread
./storcli64 /cx start patrolread
./storcli64 /c0 show cc
Query with MegaCli64:
[root@node-4 storcli]# /opt/MegaRAID/MegaCli/MegaCli64 -AdpPR Info -aALL
./storcli64 /c0 show pr
Controller = 0
Status = Success
Description = None
Controller Properties :
=====================
---------------------------------------------
Ctrl_Prop Value
---------------------------------------------
PR Mode Auto
PR Execution Delay 168 hours
PR iterations completed 85
PR Next Start time 04/10/2021, 03:00:00
PR on SSD Disabled
PR on EPD Disabled
PR Current State Stopped
---------------------------------------------
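The two scheduling fields in this output already hint at the weekly pattern. The following is a minimal Python sketch, not part of the original investigation, that converts the delay into days and works out the weekday of the next run. The helper names are hypothetical, and the MM/DD/YYYY date order is an assumption based on the dates in the controller logs.

```python
from datetime import datetime, timedelta

# Values copied from the `show pr` output above.
PR_EXECUTION_DELAY = "168 hours"
PR_NEXT_START = "04/10/2021, 03:00:00"


def parse_delay(text):
    """Turn a string such as '168 hours' into a timedelta."""
    hours, unit = text.split()
    if unit != "hours":
        raise ValueError(f"unexpected unit: {unit!r}")
    return timedelta(hours=int(hours))


def parse_next_start(text):
    # Assumption: the controller prints dates as MM/DD/YYYY.
    return datetime.strptime(text, "%m/%d/%Y, %H:%M:%S")


delay = parse_delay(PR_EXECUTION_DELAY)
next_start = parse_next_start(PR_NEXT_START)

print(delay.days)                  # 7, so PR repeats weekly
print(next_start.weekday())        # 5, which is Saturday (Monday is 0)
print(next_start + delay)          # the run after the next one
```

A 168-hour delay with a Saturday 03:00 start means every run begins early on Saturday morning, which lines up with the weekly symptom described at the top.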
View PR-related logs
PR: Patrol Read
2798: 19-07-06,03:00:00 Info:Patrol Read started
2799: 19-07-06,03:39:13 BG Work:Patrol Read progress on PD 13(e0x08/s10) is 10.11%(2268s)
2800: 19-07-06,03:39:19 BG Work:Patrol Read progress on PD 0f(e0x08/s4) is 10.11%(2274s)
......
2887: 19-07-06,10:37:55 BG Work:Patrol Read progress on PD 0a(e0x08/s2) is 90.84%(26382s)
2888: 19-07-06,11:22:07 BG Work:Patrol Read progress on PD 10(e0x08/s8) is 96.90%(28926s)
2889: 19-07-06,11:51:59 Info:Patrol Read complete
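To relate a PR run to the time of day when slow requests peak, the start and end of the run can be pulled out of the log. The sketch below is illustrative only: the function name and the regular expression are hypothetical, the sample is a subset of the lines above, and the timestamp format is assumed to be YY-MM-DD,HH:MM:SS.

```python
import re
from datetime import datetime

# A few lines copied from the PR log above.
PR_LOG = """\
2798: 19-07-06,03:00:00 Info:Patrol Read started
2888: 19-07-06,11:22:07 BG Work:Patrol Read progress on PD 10(e0x08/s8) is 96.90%(28926s)
2889: 19-07-06,11:51:59 Info:Patrol Read complete
"""

# Sequence number, timestamp, then the message text.
LINE_RE = re.compile(r"^\d+:\s+(\d{2}-\d{2}-\d{2},\d{2}:\d{2}:\d{2})\s+(.*)$")


def pr_window(log_text):
    """Return (start, end) of the first complete Patrol Read run in the log."""
    start = end = None
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        # Assumption: timestamps are YY-MM-DD,HH:MM:SS.
        ts = datetime.strptime(m.group(1), "%y-%m-%d,%H:%M:%S")
        message = m.group(2)
        if message.endswith("Patrol Read started") and start is None:
            start = ts
        elif message.endswith("Patrol Read complete") and start is not None:
            end = ts
            break
    return start, end


start, end = pr_window(PR_LOG)
print(start.weekday())  # 5, which is Saturday (Monday is 0)
print(end - start)      # 8:51:59
```

For this run the window is Saturday 03:00:00 to 11:51:59, so the final part of the Patrol Read overlaps the 11:00 period in which the business impact was reported.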
View CC-related logs
CC: Consistency Check
03/06/21 3:00:00: C0:EVT#16096-03/06/21 3:00:00: 66=Consistency Check started on VD 00/0
03/06/21 3:00:00: C0:EVT#16097-03/06/21 3:00:00: 66=Consistency Check started on VD 01/1
03/06/21 3:00:00: C0:ld sync: all LDs sync'd
03/06/21 3:00:16: C0:EVT#16098-03/06/21 3:00:16: 65=Consistency Check progress on VD 00/0 is 1.99%(16s)
03/06/21 3:00:32: C0:EVT#16099-03/06/21 3:00:32: 65=Consistency Check progress on VD 00/0 is 3.99%(32s)
03/06/21 3:00:55: C0:EVT#16100-03/06/21 3:00:55: 65=Consistency Check progress on VD 00/0 is 6.99%(55s)
......
03/08/21 5:45:59: C0:EVT#16244-03/08/21 5:45:59: 65=Consistency Check progress on VD 01/1 is 97.94%(182757s)
03/08/21 6:13:46: C0:EVT#16245-03/08/21 6:13:46: 65=Consistency Check progress on VD 01/1 is 98.94%(184424s)
03/08/21 6:40:31: C0:EVT#16246-03/08/21 6:40:31: 65=Consistency Check progress on VD 01/1 is 99.94%(186029s)
03/08/21 6:41:52: C0:EVT#16247-03/08/21 6:41:52: 58=Consistency Check done on VD 01/1
03/08/21 6:41:52: C0:ccScheduleSetNextStartTime: RTC_TimeStamp=27d883b0, nextStartTime=27dee730
03/08/21 6:41:52: C0:Next cc scheduled to start at 03/13/21 3:00:00
03/08/21 6:41:52: C0:CC Schedule cycle complete
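The same kind of check works for the CC log, which uses a different line format. This is a rough sketch under stated assumptions: `cc_durations` and the regular expression are hypothetical helpers, dates are read as MM/DD/YY, and only VD 01/1 is used because its "started" and "done" events both appear in the excerpt above.

```python
import re
from datetime import datetime

# Lines for VD 01/1 copied from the CC log above.
CC_LOG = """\
03/06/21 3:00:00: C0:EVT#16097-03/06/21 3:00:00: 66=Consistency Check started on VD 01/1
03/08/21 6:40:31: C0:EVT#16246-03/08/21 6:40:31: 65=Consistency Check progress on VD 01/1 is 99.94%(186029s)
03/08/21 6:41:52: C0:EVT#16247-03/08/21 6:41:52: 58=Consistency Check done on VD 01/1
"""

# Leading timestamp, then a "started" or "done" event and the VD it refers to.
EVENT_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{2}) +(\d{1,2}:\d{2}:\d{2}): .*"
    r"Consistency Check (started|done) on VD (\d+/\d+)"
)


def cc_durations(log_text):
    """Map each VD to the time between its 'started' and 'done' events."""
    started, durations = {}, {}
    for line in log_text.splitlines():
        m = EVENT_RE.match(line)
        if not m:
            continue  # progress lines and scheduler messages are skipped
        date_part, time_part, event, vd = m.groups()
        # Assumption: dates are MM/DD/YY.
        ts = datetime.strptime(f"{date_part} {time_part}", "%m/%d/%y %H:%M:%S")
        if event == "started":
            started[vd] = ts
        elif vd in started:
            durations[vd] = ts - started.pop(vd)
    return durations


durations = cc_durations(CC_LOG)
print(durations["01/1"])  # 2 days, 3:41:52
```

On this controller the check on VD 01/1 started on a Saturday at 03:00 and ran for a little over two days, so CC is another background task worth correlating with latency when investigating weekend slowdowns.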
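Conclusion
The problem is related to the RAID controller model and to the PR operation. PR runs once a week, and each run lasts a long time and severely degrades disk read performance: await becomes very high, and slow requests can eventually appear.
1. Nodes with the SAS-3 3108 controller did not show slow requests; nodes with the SAS-3 3508 controller did.
2. Both models run PR. According to the logs, however, a PR run on SAS-3 3108 nodes takes about a month to finish because business I/O is given priority, whereas on SAS-3 3508 nodes PR executes immediately and finishes in about 1 to 3 days, which puts far more load on the SAS-3 3508 nodes.
3. On SAS-3 3508 nodes, PR affects Hitachi disks heavily and Seagate disks only slightly. When PR is not running, however, the Seagate disks perform worse; their performance is lower than that of the Hitachi disks to begin with.
4. After PR was disabled, the problem improved.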
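The last point can be scripted across several controllers. Below is a cautious Python sketch, not the exact procedure used in this project. The `show`, `stop` and `start` actions come from the command list earlier in this post; the `set patrolread=off` form is an assumption about StorCLI syntax and should be confirmed with `./storcli64 ? | grep -i patrol` before use. The function names and the dry-run default are hypothetical.

```python
import subprocess

STORCLI = "./storcli64"  # same relative path as in the commands above

ACTIONS = {
    "show": ["show", "pr"],                # from the command list above
    "stop": ["stop", "patrolread"],        # from the command list above
    "start": ["start", "patrolread"],      # from the command list above
    "disable": ["set", "patrolread=off"],  # ASSUMPTION: verify on your StorCLI build
}


def patrolread_cmd(controller, action):
    """Build the argv list for one Patrol Read action on one controller."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    return [STORCLI, f"/c{controller}", *ACTIONS[action]]


def run(controller, action, dry_run=True):
    """Return the command line in dry-run mode, otherwise execute it."""
    cmd = patrolread_cmd(controller, action)
    if dry_run:
        return " ".join(cmd)
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


print(run(0, "stop"))     # ./storcli64 /c0 stop patrolread
print(run(0, "disable"))  # ./storcli64 /c0 set patrolread=off
```

Keeping dry-run as the default makes it easy to review the exact commands first. Note that `stop` only interrupts the current run, while disabling PR also gives up its background media checks, so that trade-off should be weighed per environment.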