A bug in Linux
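A virtual machine we use for testing has recently kept becoming unreachable. Neither ssh nor VNC could connect to it.
At the console the machine was powered on and appeared to be running normally, but when I ran ps its output never completed.
To keep the testers working I restarted the cluster first, which fixed it for the moment, and then took time to investigate.
The system log contained the following records: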
Feb  6 14:13:02 localhost kernel: INFO: task khugepaged:28 blocked for more than 120 seconds.
Feb 6 14:13:02 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 6 14:13:02 localhost kernel: khugepaged D ffff88003b06da78 0 28 2 0x00000000
Feb 6 14:13:02 localhost kernel: ffff88003af69c78 0000000000000046 ffffffff81059d12 0000000000000000
Feb 6 14:13:02 localhost kernel: 0000000000016980 0000000000000000 ffff88003b06da00 000000010100c183
Feb 6 14:13:02 localhost kernel: ffff88003aefe5f8 ffff88003af69fd8 0000000000010518 ffff88003aefe5f8
Feb 6 14:13:02 localhost kernel: Call Trace:
Feb 6 14:13:02 localhost kernel: [<ffffffff81059d12>] ? finish_task_switch+0x42/0xd0
Feb 6 14:13:02 localhost kernel: [<ffffffff8107d5ac>] ? lock_timer_base+0x3c/0x70
Feb 6 14:13:02 localhost kernel: [<ffffffff814ca6b5>] rwsem_down_failed_common+0x95/0x1d0
Feb 6 14:13:02 localhost kernel: [<ffffffff8107e0b2>] ? del_timer_sync+0x22/0x30
Feb 6 14:13:02 localhost kernel: [<ffffffff814ca846>] rwsem_down_read_failed+0x26/0x30
Feb 6 14:13:02 localhost kernel: [<ffffffff81264224>] call_rwsem_down_read_failed+0x14/0x30
Feb 6 14:13:02 localhost kernel: [<ffffffff814c9d44>] ? down_read+0x24/0x30
Feb 6 14:13:02 localhost kernel: [<ffffffff81164a72>] khugepaged+0x1b2/0x1190
Feb 6 14:13:02 localhost kernel: [<ffffffff81091ca0>] ? autoremove_wake_function+0x0/0x40
Feb 6 14:13:02 localhost kernel: [<ffffffff811648c0>] ? khugepaged+0x0/0x1190
Feb 6 14:13:02 localhost kernel: [<ffffffff81091936>] kthread+0x96/0xa0
Feb 6 14:13:02 localhost kernel: [<ffffffff810141ca>] child_rip+0xa/0x20
Feb 6 14:13:02 localhost kernel: [<ffffffff810918a0>] ? kthread+0x0/0xa0
Feb 6 14:13:02 localhost kernel: [<ffffffff810141c0>] ? child_rip+0x0/0x20
Feb 6 14:13:02 localhost kernel: INFO: task fuse_dfs:1593 blocked for more than 120 seconds.
Feb 6 14:13:02 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 6 14:13:02 localhost kernel: fuse_dfs D ffffffff8110c060 0 1593 1 0x00000080
Feb 6 14:13:02 localhost kernel: ffff88003cd599b8 0000000000000086 00000000ffffffff 00010bb591837fe6
Feb 6 14:13:02 localhost kernel: 0000000000000000 ffff88003b4b9980 0000000000266bba ffffffffaeee4f12
Feb 6 14:13:02 localhost kernel: ffff88003cd550a8 ffff88003cd59fd8 0000000000010518 ffff88003cd550a8
Feb 6 14:13:02 localhost kernel: Call Trace:
Feb 6 14:13:02 localhost kernel: [<ffffffff8109b9a9>] ? ktime_get_ts+0xa9/0xe0
Feb 6 14:13:02 localhost kernel: [<ffffffff8110c060>] ? sync_page+0x0/0x50
Feb 6 14:13:02 localhost kernel: [<ffffffff814c8a23>] io_schedule+0x73/0xc0
Feb 6 14:13:02 localhost kernel: [<ffffffff8110c09d>] sync_page+0x3d/0x50
Feb 6 14:13:02 localhost kernel: [<ffffffff814c914a>] __wait_on_bit_lock+0x5a/0xc0
Feb 6 14:13:02 localhost kernel: [<ffffffff8110c037>] __lock_page+0x67/0x70
Feb 6 14:13:02 localhost kernel: [<ffffffff81091ce0>] ? wake_bit_function+0x0/0x50
Feb 6 14:13:02 localhost kernel: [<ffffffff8115b730>] lock_page+0x30/0x40
Feb 6 14:13:02 localhost kernel: [<ffffffff8115bdad>] migrate_pages+0x59d/0x5d0
Feb 6 14:13:02 localhost kernel: [<ffffffff811223b7>] ? ____pagevec_lru_add+0x167/0x180
Feb 6 14:13:02 localhost kernel: [<ffffffff81152470>] ? compaction_alloc+0x0/0x370
Feb 6 14:13:02 localhost kernel: [<ffffffff81151f1c>] compact_zone+0x4ac/0x5e0
Feb 6 14:13:02 localhost kernel: [<ffffffff8111cd1c>] ? get_page_from_freelist+0x15c/0x820
Feb 6 14:13:02 localhost kernel: [<ffffffff811522ce>] compact_zone_order+0x7e/0xb0
Feb 6 14:13:02 localhost kernel: [<ffffffff81152409>] try_to_compact_pages+0x109/0x170
Feb 6 14:13:02 localhost kernel: [<ffffffff8111e62c>] __alloc_pages_nodemask+0x55c/0x810
Feb 6 14:13:02 localhost kernel: [<ffffffff8113eb95>] ? page_add_new_anon_rmap+0xb5/0xd0
Feb 6 14:13:02 localhost kernel: [<ffffffff81150374>] alloc_pages_vma+0x84/0x110
Feb 6 14:13:02 localhost kernel: [<ffffffff8113ef50>] ? anon_vma_prepare+0x30/0x160
Feb 6 14:13:02 localhost kernel: [<ffffffff811673b5>] do_huge_pmd_anonymous_page+0x135/0x360
Feb 6 14:13:02 localhost kernel: [<ffffffff81013c8e>] ? apic_timer_interrupt+0xe/0x20
Feb 6 14:13:02 localhost kernel: [<ffffffff81136455>] handle_mm_fault+0x245/0x2b0
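This looks like a kernel-level problem. On closer inspection, khugepaged is the first task reported as blocked, and the other blocked tasks follow it. In its call trace khugepaged is stuck in rwsem_down_read_failed, that is, waiting on a kernel read-write semaphore, which matches what I found online: the hang is caused by a kernel lock.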
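To check which task was reported as blocked first, the hung-task lines can be pulled out of the log in order. This is a minimal sketch: the function name hung_tasks is made up for illustration, and /var/log/messages is only an assumed location for the system log.

```shell
#!/bin/sh
# Print "task pid" for every task reported by the kernel hung-task detector,
# in the order of first appearance, without repeats.
# Reads syslog-format text on stdin.
hung_tasks() {
    sed -n 's/.*INFO: task \(.*\):\([0-9][0-9]*\) blocked for more than.*/\1 \2/p' |
        awk '!seen[$0]++'
}

# Example (log path is an assumption, adjust for your system):
#   hung_tasks < /var/log/messages
```

The first line printed is the first task the kernel reported as blocked.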
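The system was effectively frozen: still up, but unresponsive. I am not familiar with kernel internals, so I searched online first
and found a bug report about khugepaged: http://bugs.centos.org/view.php?id=5716
I can't think of a better approach for now, so as a first attempt I am reining in khugepaged by turning off its defragmentation, along with transparent huge page defragmentation: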
echo no > /sys/kernel/mm/redhat_transparent_hugepage/khugepaged/defrag
echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
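To confirm that the two writes took effect, the files can be read back. The sketch below assumes these files use the usual sysfs convention of listing every choice and putting the active one in square brackets (for example "always [never]"). The helper name active_value is made up for illustration.

```shell
#!/bin/sh
# Print the active choice from a sysfs multi-choice line such as
# "always [never]": the current value is the one inside brackets.
active_value() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Path taken from the commands above; present only on kernels that
# expose the redhat_transparent_hugepage directory.
THP=/sys/kernel/mm/redhat_transparent_hugepage

for f in "$THP/defrag" "$THP/khugepaged/defrag"; do
    if [ -r "$f" ]; then
        printf '%s: %s\n' "$f" "$(active_value < "$f")"
    fi
done
```

Values written under /sys do not survive a reboot, so if the workaround helps, the two echo commands need to be re-applied at boot, for example from a startup script.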