I currently run Docker 1.5 on CentOS 7.0. One of the servers recently began hanging at irregular intervals, and the logs pointed to a kernel bug. The error output is shown below:




May 11 03:43:08 ip-10-10-29-201 kernel: BUG: soft lockup - CPU#4 stuck for 22s! [handler20:1542]
May 11 03:43:08 ip-10-10-29-201 kernel: Modules linked in: iptable_nat nf_nat_ipv4 iptable_filter ip_tables binfmt_misc ipmi_si vfat fat usb_storage mpt3sas mpt2sas raid_class scsi_transport_sas mptctl mptbase dell_rbu tcp_diag inet_diag veth bridge stp llc dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio loop dm_mod openvswitch vxlan ip_tunnel gre libcrc32c xt_nat ipt_MASQUERADE xt_addrtype nf_nat xt_limit ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_multiport xt_conntrack sg nf_conntrack ipmi_devintf iTCO_wdt iTCO_vendor_support dcdbas coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr sb_edac edac_core ses enclosure ipmi_msghandler tg3 wmi acpi_power_meter ptp pps_core mei_me mei ntb lpc_ich mperf mfd_core shpchp ext4
May 11 03:43:08 ip-10-10-29-201 kernel: mbcache jbd2 sr_mod cdrom sd_mod crc_t10dif crct10dif_common mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper ttm ahci drm libahci libata i2c_core megaraid_sas [last unloaded: ip_tables]
May 11 03:43:08 ip-10-10-29-201 kernel: CPU: 4 PID: 1542 Comm: handler20 Tainted: G        W   --------------   3.10.0-123.el7.x86_64 #1
May 11 03:43:08 ip-10-10-29-201 kernel: Hardware name: Dell Inc. PowerEdge R720/0X6FFV, BIOS 1.6.0 03/07/2013
May 11 03:43:08 ip-10-10-29-201 kernel: task: ffff880418adf1c0 ti: ffff8800c8d08000 task.ti: ffff8800c8d08000
May 11 03:43:08 ip-10-10-29-201 kernel: RIP: 0010:[<ffffffff815e90e7>]  [<ffffffff815e90e7>] _raw_spin_lock+0x37/0x50
May 11 03:43:08 ip-10-10-29-201 kernel: RSP: 0018:ffff88041fc43ac8  EFLAGS: 00000206
May 11 03:43:08 ip-10-10-29-201 kernel: RAX: 000000000000108b RBX: 0000000000000000 RCX: 0000000000000000
May 11 03:43:08 ip-10-10-29-201 kernel: RDX: 0000000000000002 RSI: 0000000000000002 RDI: ffff88081609c318
May 11 03:43:08 ip-10-10-29-201 kernel: RBP: ffff88041fc43ac8 R08: ffff8801049856d8 R09: ffff88041fc43a00
May 11 03:43:08 ip-10-10-29-201 kernel: R10: 0000000000000000 R11: 00000000e1bec8f9 R12: ffff88041fc43a38
May 11 03:43:08 ip-10-10-29-201 kernel: R13: ffffffff815f2d9d R14: ffff88041fc43ac8 R15: ffff88081609c300
May 11 03:43:08 ip-10-10-29-201 kernel: FS:  00007fb082b8b700(0000) GS:ffff88041fc40000(0000) knlGS:0000000000000000
May 11 03:43:08 ip-10-10-29-201 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 11 03:43:08 ip-10-10-29-201 kernel: CR2: 00007f2a743e6000 CR3: 00000008183c9000 CR4: 00000000000407e0
May 11 03:43:08 ip-10-10-29-201 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
May 11 03:43:08 ip-10-10-29-201 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
May 11 03:43:08 ip-10-10-29-201 kernel: Stack:
May 11 03:43:08 ip-10-10-29-201 kernel: ffff88041fc43af8 ffffffffa042429f ffff88003714be00 ffffe8fbefc41540
May 11 03:43:08 ip-10-10-29-201 kernel: ffff880419070e80 ffff88041fc43b30 ffff88041fc43be0 ffffffffa04239a4
May 11 03:43:08 ip-10-10-29-201 kernel: 00000001b9ec8070 ffff88003714be00 ffff88041fc43b28 0000000000000246
May 11 03:43:08 ip-10-10-29-201 kernel: Call Trace:
May 11 03:43:08 ip-10-10-29-201 kernel: <IRQ>
May 11 03:43:08 ip-10-10-29-201 kernel:
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffffa042429f>] ovs_flow_stats_update+0x4f/0xd0 [openvswitch]
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffffa04239a4>] ovs_dp_process_received_packet+0x84/0x120 [openvswitch]
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffffa042a01a>] ovs_vport_receive+0x2a/0x30 [openvswitch]
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffffa042b4cd>] vxlan_rcv+0x6d/0x90 [openvswitch]
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffffa037b228>] vxlan_udp_encap_recv+0xb8/0x130 [vxlan]
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffff81538bc2>] udp_queue_rcv_skb+0x162/0x3d0
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffff815394bd>] __udp4_lib_rcv+0x19d/0x690
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffff815094d0>] ? ip_rcv_finish+0x350/0x350
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffff815399ca>] udp_rcv+0x1a/0x20
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffff81509584>] ip_local_deliver_finish+0xb4/0x1f0
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffff81509858>] ip_local_deliver+0x48/0x80
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffff815091fd>] ip_rcv_finish+0x7d/0x350
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffff81509ac4>] ip_rcv+0x234/0x380
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffff814cfdb6>] __netif_receive_skb_core+0x676/0x870
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffff814cffc8>] __netif_receive_skb+0x18/0x60
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffff814d0b7e>] process_backlog+0xae/0x180
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffff814d041a>] net_rx_action+0x15a/0x250
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffff81067047>] __do_softirq+0xf7/0x290
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffff815f3a5c>] call_softirq+0x1c/0x30
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffff81014d25>] do_softirq+0x55/0x90
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffff810673e5>] irq_exit+0x115/0x120
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffff815f4358>] do_IRQ+0x58/0xf0
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffff815e94ad>] common_interrupt+0x6d/0x6d
May 11 03:43:08 ip-10-10-29-201 kernel: <EOI>
May 11 03:43:08 ip-10-10-29-201 kernel:
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffffa0424465>] ? ovs_flow_stats_get+0x145/0x180 [openvswitch]
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffffa0424453>] ? ovs_flow_stats_get+0x133/0x180 [openvswitch]
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffffa04217b7>] ovs_flow_cmd_fill_info+0x1c7/0x320 [openvswitch]
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffffa0421c5c>] ovs_flow_cmd_build_info.constprop.25+0x6c/0xa0 [openvswitch]
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffffa0422155>] ovs_flow_cmd_new_or_set+0x4c5/0x520 [openvswitch]
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffff8108ec58>] ? __wake_up_common+0x58/0x90
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffff814ffcd8>] genl_family_rcv_msg+0x258/0x3d0
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffff814ffe50>] ? genl_family_rcv_msg+0x3d0/0x3d0
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffff814ffee1>] genl_rcv_msg+0x91/0xd0
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffff814fdf99>] netlink_rcv_skb+0xa9/0xc0
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffff814fe4c8>] genl_rcv+0x28/0x40
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffff814fd5bd>] netlink_unicast+0xed/0x1b0
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffff814fd9a7>] netlink_sendmsg+0x327/0x760
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffff814fa874>] ? netlink_rcv_wake+0x44/0x60
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffff814fb92b>] ? netlink_recvmsg+0x1cb/0x3e0
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffff814b79b0>] sock_sendmsg+0xb0/0xf0
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffff814b807f>] ? sock_recvmsg+0xbf/0x100
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffff8109b23e>] ? task_scan_min+0x3e/0x60
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffff815e908b>] ? _raw_spin_unlock_bh+0x1b/0x40
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffff814b7de9>] ___sys_sendmsg+0x3a9/0x3c0
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffff811f7fa9>] ? ep_scan_ready_list.isra.9+0x1b9/0x1f0
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffff811f8123>] ? ep_poll+0x123/0x370
May 11 03:43:08 ip-10-10-29-201 kernel: [<ffffffff81079af3>] ? getrusage+0x43/0x70
May 11 03:43:09 ip-10-10-29-201 kernel: [<ffffffff814b8cd1>] __sys_sendmsg+0x51/0x90
May 11 03:43:09 ip-10-10-29-201 kernel: [<ffffffff814b8d22>] SyS_sendmsg+0x12/0x20
May 11 03:43:09 ip-10-10-29-201 kernel: [<ffffffff815f2119>] system_call_fastpath+0x16/0x1b
May 11 03:43:09 ip-10-10-29-201 kernel: Code: 02 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 02 5d c3 83 e2 fe 0f b7 f2 b8 00 80 00 00 eb 0c 0f 1f 44 00 00 f3 90 83 e8 01 74 0a <0f> b7 0f 66 39 ca 75 f1 5d c3 66 66 66 90 66 66 90 eb da 66 0f


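To gauge how often the lockup occurs before and after any fix, the log can be scanned for the `BUG: soft lockup` marker seen above. The following is a minimal sketch; the function names are illustrative, and the `/var/log/messages` path in the usage comment is an assumption about where kernel messages are written on this system.

```shell
#!/bin/sh
# Sketch: summarize soft-lockup reports in a syslog-style file.
# Function names are hypothetical helpers, not part of any tool.

# Print the number of lines that report a soft lockup.
count_soft_lockups() {
    grep -c 'BUG: soft lockup' "$1"
}

# Print how many times each "[task:pid]" was reported as stuck.
stuck_tasks() {
    grep 'BUG: soft lockup' "$1" \
        | sed 's/.*\[\(.*\)\]$/\1/' \
        | sort | uniq -c
}

# Usage (log path is an assumption):
#   count_soft_lockups /var/log/messages
#   stuck_tasks /var/log/messages
```

A count that keeps growing after a change suggests the change did not address the lockup.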

A search on Stack Overflow indicated that this is a kernel bug, and that the fix is to upgrade the kernel.

The steps below upgrade the default 3.10 kernel on CentOS 7.0 to version 4.0.2.

1. Import the GPG key for the yum repository

rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org



2. Install the yum repository

rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-2.el7.elrepo.noarch.rpm



3. Install the new kernel

The ELRepo yum repository provides the mainline kernel, which was version 4.0.2 at the time of this upgrade.







[root@ip-10-10-29-201 ~]# yum --enablerepo=elrepo-kernel install  kernel-ml-devel kernel-ml
Loaded plugins: fastestmirror
MooseFS                                                  |  951 B  00:00:00
base                                                     | 3.6 kB  00:00:00
elrepo                                                   | 2.9 kB  00:00:00
elrepo-kernel                                            | 2.9 kB  00:00:00
extras                                                   | 3.4 kB  00:00:00
updates                                                  | 3.4 kB  00:00:00
(1/2): elrepo/primary_db                                 | 233 kB  00:00:02
(2/2): elrepo-kernel/primary_db                          | 782 kB  00:00:04
MooseFS/primary                                          | 4.2 kB  00:00:00
Loading mirror speeds from cached hostfile
 * base: mirrors.yun-idc.com
 * elrepo: repos.lax-noc.com
 * elrepo-kernel: repos.lax-noc.com
 * extras: mirror.bit.edu.cn
 * updates: mirror.bit.edu.cn
MooseFS                                                                   30/30
Resolving Dependencies
--> Running transaction check
---> Package kernel-ml.x86_64 0:4.0.2-1.el7.elrepo will be installed
---> Package kernel-ml-devel.x86_64 0:4.0.2-1.el7.elrepo will be installed
--> Finished Dependency Resolution

Dependencies Resolved

================================================================================
 Package            Arch      Version               Repository           Size
================================================================================
Installing:
 kernel-ml          x86_64    4.0.2-1.el7.elrepo    elrepo-kernel        36 M
 kernel-ml-devel    x86_64    4.0.2-1.el7.elrepo    elrepo-kernel       9.5 M

Transaction Summary
================================================================================
Install  2 Packages

Total download size: 45 M
Installed size: 199 M
Is this ok [y/d/N]: y
Downloading packages:
(1/2): kernel-ml-4.0.2-1.el7.elrepo.x86_64.rpm           |  36 MB  00:00:11
(2/2): kernel-ml-devel-4.0.2-1.el7.elrepo.x86_64.rpm     | 9.5 MB  00:00:31
--------------------------------------------------------------------------------
Total                                         1.5 MB/s   |  45 MB  00:00:31
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
Warning: RPMDB altered outside of yum.
  Installing : kernel-ml-devel-4.0.2-1.el7.elrepo.x86_64                    1/2
  Installing : kernel-ml-4.0.2-1.el7.elrepo.x86_64                          2/2
  Verifying  : kernel-ml-4.0.2-1.el7.elrepo.x86_64                          1/2
  Verifying  : kernel-ml-devel-4.0.2-1.el7.elrepo.x86_64                    2/2

Installed:
  kernel-ml.x86_64 0:4.0.2-1.el7.elrepo      kernel-ml-devel.x86_64 0:4.0.2-1.el7.elrepo

Complete!



4. Check the current kernel version

[root@ip-10-10-29-201 ~]# uname -r
3.10.0-123.el7.x86_64



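Important: the running kernel is still the default version. If you reboot immediately after this step, the system will still boot the default 3.10 kernel rather than the new 4.0.2 kernel. To change the boot order, continue with the following steps.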

List the boot menu entries in their current order:

[root@ip-10-10-29-201 ~]# awk -F\' '$1=="menuentry " {print $2}' /etc/grub2.cfg
CentOS Linux (4.0.2-1.el7.elrepo.x86_64) 7 (Core)
CentOS Linux, with Linux 3.10.0-123.el7.x86_64
CentOS Linux, with Linux 0-rescue-18b184aa09434ecf9739a70c6b63638a


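Because the default entry is chosen by position, it can help to print each entry with its index. The sketch below wraps a variant of the same awk one-liner in a function; the name `list_menu_entries` is illustrative, and the config path in the usage comment is the one used in this post.

```shell
#!/bin/sh
# Sketch: print each top-level GRUB menu entry with its zero-based index.
# list_menu_entries is a hypothetical helper name.
list_menu_entries() {
    # Split each line on single quotes; a top-level entry starts with
    # "menuentry " and its title is the second field.
    awk -F\' '$1=="menuentry " { printf "%d : %s\n", i++, $2 }' "$1"
}

# Usage:
#   list_menu_entries /etc/grub2.cfg
```

Applied to the listing above, the 4.0.2 entry would be printed with index 0 and the 3.10 entry with index 1.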

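Menu entries are numbered starting from 0. The new kernel is inserted at the top of the list, so 4.0.2 is at index 0 and the old 3.10 kernel has moved to index 1. To boot the new kernel by default, select entry 0:

[root@ip-10-10-29-201 ~]# grub2-set-default 0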


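On a host where the new entry is not necessarily first, the index can be looked up by kernel version rather than hard-coded. This is only a sketch under that assumption: `entry_index` is a hypothetical helper, and it matches the version as a plain substring of the entry title.

```shell
#!/bin/sh
# Sketch: print the zero-based index of the first top-level GRUB menu
# entry whose title contains a given substring. Exits non-zero if no
# entry matches. entry_index is a hypothetical helper name.
entry_index() {
    # $1: GRUB config file, $2: substring to look for (e.g. "4.0.2")
    awk -F\' -v want="$2" '
        BEGIN { n = 0 }
        $1=="menuentry " {
            if (index($2, want)) { print n; found = 1; exit }
            n++
        }
        END { if (!found) exit 1 }
    ' "$1"
}

# Usage:
#   idx=$(entry_index /etc/grub2.cfg 4.0.2) && grub2-set-default "$idx"
```

Afterwards, `grub2-editenv list` should show the saved default entry, which is a quick way to confirm the setting before rebooting.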

5. Reboot

reboot



6. Check the kernel version after the reboot

[root@ip-10-10-29-201 conf]# uname -r
4.0.2-1.el7.elrepo.x86_64


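When several hosts need the same upgrade, the check can be scripted instead of read by eye. The sketch below compares the release string against a minimum version; `kernel_at_least` is a hypothetical helper, and it assumes a `sort` that supports `-V` (version sort), as GNU coreutils does.

```shell
#!/bin/sh
# Sketch: succeed if a kernel release is at least a wanted version.
# kernel_at_least is a hypothetical helper; requires `sort -V`.
kernel_at_least() {
    want=$1
    have=${2:-$(uname -r)}   # default to the running kernel
    have=${have%%-*}         # "4.0.2-1.el7.elrepo.x86_64" -> "4.0.2"
    # The smaller of the two versions sorts first; the check passes
    # when that smaller one is the wanted minimum.
    lowest=$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)
    [ "$lowest" = "$want" ]
}

# Usage:
#   kernel_at_least 4.0.2 && echo "new kernel is running"
```

Version sort matters here because a plain string comparison would order 3.10 before 3.9.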

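In the 20 days since the upgrade the problem has not recurred, so I conclude that it was caused by a kernel bug and was resolved by upgrading the kernel.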