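Problem description:
In one cluster, one machine shut down unexpectedly every day, with no pattern as to which machine or when. The issue was tracked down and fixed; the notes are shared here in case they help others.
Environment:
Cluster: OpenStack + Ceph converged cluster, versions: Mitaka + Jewel
Network: 10G NICs + bond0 (active-backup mode)
OS version: CentOS 7.3
Error log from messages: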
Aug 30 16:45:14 lxx-4-5 journal: internal error: End of file from monitor
Aug 30 16:45:14 lxx-4-5 avahi-daemon[2412]: Withdrawing address record for fe80::fc16:3eff:fef3:5076 on vnet5.
Aug 30 16:45:14 lxx-4-5 kernel: vlan206: port 3(vnet5) entered disabled state
Aug 30 16:45:14 lxx-4-5 kvm: 10 guests now active
Aug 30 16:45:14 lxx-4-5 avahi-daemon[2412]: Withdrawing workstation service for vnet5.
Aug 30 16:45:14 lxx-4-5 kernel: device vnet5 left promiscuous mode
Aug 30 16:45:14 lxx-4-5 kernel: vlan206: port 3(vnet5) entered disabled state
Aug 30 16:45:14 lxx-4-5 systemd: autolog.service holdoff time over, scheduling restart.
Aug 30 16:45:14 lxx-4-5 systemd: Started Autolog.
Aug 30 16:45:14 lxx-4-5 systemd: Starting Autolog...
Aug 30 16:45:14 lxx-4-5 systemd-machined: Machine qemu-22-instance-000002c9 terminated.
Aug 30 16:45:14 lxx-4-5 autolog: Don't have master process.
Aug 30 16:45:15 l22-4-5 journal: End of file while reading data: Input/output error
Aug 30 16:45:15 lxx-4-5 systemd: autolog.service holdoff time over, scheduling restart.
Aug 30 16:45:15 lxx-4-5 systemd: Started Autolog.
Aug 30 16:45:15 lxx-4-5 systemd: Starting Autolog...
Aug 30 16:45:15 lxx-4-5 autolog: Don't have master process.
Aug 30 16:45:15 lxx-4-5 systemd: autolog.service holdoff time over, scheduling restart.
Aug 30 16:45:15 lxx-4-5 systemd: Started Autolog.
Aug 30 16:45:15 lxx-4-5 systemd: Starting Autolog...
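When the same symptom has to be checked on several hosts, reading the log by eye gets tedious. Below is a minimal Python sketch that pulls out the lines of interest. The marker strings and the machine-name pattern are copied from the excerpt above; the function name `scan_messages` and the choice of markers are my own assumptions, not an exhaustive list of libvirt errors.

```python
import re

# Substrings from the excerpt above that show libvirt lost its connection
# to a QEMU process. Assumption: these two are enough for this symptom.
MARKERS = (
    "End of file from monitor",
    "End of file while reading data",
)

# systemd-machined names the guest that went away.
TERMINATED = re.compile(r"Machine (qemu-\d+-instance-[0-9a-f]+) terminated")


def scan_messages(lines):
    """Return (marker_lines, terminated_machines) found in syslog lines."""
    hits, machines = [], []
    for line in lines:
        if any(marker in line for marker in MARKERS):
            hits.append(line.strip())
        match = TERMINATED.search(line)
        if match:
            machines.append(match.group(1))
    return hits, machines


sample = [
    "Aug 30 16:45:14 lxx-4-5 journal: internal error: End of file from monitor",
    "Aug 30 16:45:14 lxx-4-5 kvm: 10 guests now active",
    "Aug 30 16:45:14 lxx-4-5 systemd-machined: Machine qemu-22-instance-000002c9 terminated.",
]
hits, machines = scan_messages(sample)
print(hits)
print(machines)
```

On a real host the `lines` argument would simply be the opened log file, since a file object iterates line by line.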
Key openstack-compute log entries:
2017-08-30 16:45:20.952 110867 DEBUG nova.compute.manager [req-0602316d-944c-42b4-9d3c-7d1b0e513765 - - - - -] [instance: 26f48b2e-f648-42e2-8133-7ebc060fd7ae] Updated the network info_cache for instance _heal_instance_info_cache /usr/lib/python2.7/site-packages/nova/compute/manager.py:5803
2017-08-30 16:45:30.033 110867 DEBUG nova.virt.driver [-] Emitting event <LifecycleEvent: 1504082715.03, 08330b10-f106-4737-b9db-0e45c84abb2e => Stopped> emit_event /usr/lib/python2.7/site-packages/nova/virt/driver.py:1443
2017-08-30 16:45:30.034 110867 INFO nova.compute.manager [-] [instance: 08330b10-f106-4737-b9db-0e45c84abb2e] VM Stopped (Lifecycle Event)
2017-08-30 16:45:30.076 110867 DEBUG nova.compute.manager [req-5998b542-495c-41f2-8010-7f1c426f0127 - - - - -] [instance: 08330b10-f106-4737-b9db-0e45c84abb2e] Checking state _get_power_state /usr/lib/python2.7/site-packages/nova/compute/manager.py:1347
2017-08-30 16:45:30.079 110867 DEBUG nova.compute.manager [req-5998b542-495c-41f2-8010-7f1c426f0127 - - - - -] [instance: 08330b10-f106-4737-b9db-0e45c84abb2e] Synchronizing instance power state after lifecycle event "Stopped"; current vm_state: active, current task_state: None, current DB power_state: 1, VM power_state: 4 handle_lifecycle_event /usr/lib/python2.7/site-packages/nova/compute/manager.py:1276
2017-08-30 16:45:30.119 110867 INFO nova.compute.manager [req-5998b542-495c-41f2-8010-7f1c426f0127 - - - - -] [instance: 08330b10-f106-4737-b9db-0e45c84abb2e] During _sync_instance_power_state the DB power_state (1) does not match the vm_power_state from the hypervisor (4). Updating power_state in the DB to match the hypervisor.
2017-08-30 16:45:30.177 110867 WARNING nova.compute.manager [req-5998b542-495c-41f2-8010-7f1c426f0127 - - - - -] [instance: 08330b10-f106-4737-b9db-0e45c84abb2e] Instance shutdown by itself. Calling the stop API. Current vm_state: active, current task_state: None, original DB power_state: 1, current VM power_state: 4
2017-08-30 16:45:30.178 110867 DEBUG nova.compute.api [req-5998b542-495c-41f2-8010-7f1c426f0127 - - - - -] [instance: 08330b10-f106-4737-b9db-0e45c84abb2e] Going to try to stop instance force_stop /usr/lib/python2.7/site-packages/nova/compute/api.py:1954
2017-08-30 16:45:30.267 110867 DEBUG oslo_concurrency.lockutils [req-5998b542-495c-41f2-8010-7f1c426f0127 - - - - -] Lock "08330b10-f106-4737-b9db-0e45c84abb2e" acquired by "nova.compute.manager.do_stop_instance" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:270
2017-08-30 16:45:30.268 110867 DEBUG nova.compute.manager [req-5998b542-495c-41f2-8010-7f1c426f0127 - - - - -] [instance: 08330b10-f106-4737-b9db-0e45c84abb2e] Checking state _get_power_state /usr/lib/python2.7/site-packages/nova/compute/manager.py:1347
2017-08-30 16:45:30.270 110867 DEBUG nova.compute.manager [req-5998b542-495c-41f2-8010-7f1c426f0127 - - - - -] [instance: 08330b10-f106-4737-b9db-0e45c84abb2e] Stopping instance; current vm_state: active, current task_state: powering-off, current DB power_state: 4, current VM power_state: 4 do_stop_instance /usr/lib/python2.7/site-packages/nova/compute/manager.py:2545
2017-08-30 16:45:30.270 110867 INFO nova.compute.manager [req-5998b542-495c-41f2-8010-7f1c426f0127 - - - - -] [instance: 08330b10-f106-4737-b9db-0e45c84abb2e] Instance is already powered off in the hypervisor when stop is called.
2017-08-30 16:45:30.271 110867 DEBUG nova.objects.instance [req-5998b542-495c-41f2-8010-7f1c426f0127 - - - - -] Lazy-loading 'metadata' on Instance uuid 08330b10-f106-4737-b9db-0e45c84abb2e obj_load_attr /usr/lib/python2.7/site-packages/nova/objects/instance.py:895
2017-08-30 16:45:30.314 110867 INFO nova.virt.libvirt.driver [req-5998b542-495c-41f2-8010-7f1c426f0127 - - - - -] [instance: 08330b10-f106-4737-b9db-0e45c84abb2e] Instance already shutdown.
2017-08-30 16:45:30.318 110867 INFO nova.virt.libvirt.driver [-] [instance: 08330b10-f106-4737-b9db-0e45c84abb2e] Instance destroyed successfully.
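The numbers in "DB power_state (1)" and "hypervisor (4)" are nova's power state codes: 1 is RUNNING and 4 is SHUTDOWN, so nova still believed the instance was running when the hypervisor reported it as shut down. Here is a rough Python sketch that finds such mismatch lines and decodes them. The regular expressions are written against the line layout shown above and are an assumption on my part; a different logging format would need different patterns. The helper name `find_power_state_mismatches` is made up for illustration.

```python
import re

# Power state codes as defined in nova.compute.power_state.
POWER_STATE = {
    0: "NOSTATE",
    1: "RUNNING",
    3: "PAUSED",
    4: "SHUTDOWN",
    6: "CRASHED",
    7: "SUSPENDED",
}

# Timestamp, pid, level, module, then the "[instance: <uuid>]" tag and message.
LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) \d+ \w+ \S+ "
    r".*?\[instance: (?P<uuid>[0-9a-f-]{36})\] (?P<msg>.*)$"
)
MISMATCH = re.compile(
    r"DB power_state \((\d+)\) does not match "
    r"the vm_power_state from the hypervisor \((\d+)\)"
)


def find_power_state_mismatches(lines):
    """Return (timestamp, uuid, db_state, hypervisor_state) per mismatch line."""
    found = []
    for line in lines:
        parsed = LINE.match(line)
        if not parsed:
            continue
        states = MISMATCH.search(parsed.group("msg"))
        if states:
            db, vm = (int(code) for code in states.groups())
            found.append((
                parsed.group("ts"),
                parsed.group("uuid"),
                POWER_STATE.get(db, str(db)),
                POWER_STATE.get(vm, str(vm)),
            ))
    return found


sample = [
    "2017-08-30 16:45:30.119 110867 INFO nova.compute.manager "
    "[req-5998b542-495c-41f2-8010-7f1c426f0127 - - - - -] "
    "[instance: 08330b10-f106-4737-b9db-0e45c84abb2e] During "
    "_sync_instance_power_state the DB power_state (1) does not match the "
    "vm_power_state from the hypervisor (4). Updating power_state in the DB "
    "to match the hypervisor.",
]
print(find_power_state_mismatches(sample))
```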
Key log lines:
messages: Aug 30 16:45:15 l22-4-5 journal: End of file while reading data: Input/output error
openstack-compute: 2017-08-30 16:45:30.034 110867 INFO nova.compute.manager [-] [instance: 08330b10-f106-4737-b9db-0e45c84abb2e] VM Stopped (Lifecycle Event)
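The two key lines are 15 seconds apart, which raises the question of whether they belong to the same incident. The LifecycleEvent entry in the nova-compute log carries its own epoch timestamp (1504082715.03), and decoding it helps line the two logs up. The sketch below assumes the host logs local time in UTC+8; the source does not state the timezone, so treat that as a guess.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the host writes its logs in UTC+8 (not stated in the source).
HOST_TZ = timezone(timedelta(hours=8))

# Epoch value copied from the LifecycleEvent line in the nova-compute log.
event_epoch = 1504082715.03

# When the stop event actually happened, in host local time.
event_time = datetime.fromtimestamp(event_epoch, tz=HOST_TZ)

# When nova-compute wrote the "Emitting event" line (16:45:30.033).
logged_time = datetime(2017, 8, 30, 16, 45, 30, 33000, tzinfo=HOST_TZ)

print(event_time.strftime("%Y-%m-%d %H:%M:%S"))
print(round((logged_time - event_time).total_seconds()))
```

Under that timezone assumption the event decodes to 16:45:15, the same second as the "End of file while reading data" line in messages, and nova-compute logged it about 15 seconds later. So the two logs appear to describe one and the same incident.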
Solution:
Upgrade libvirt. Package versions:
libvirt-daemon-driver-secret-2.0.0-10.el7_3.9.x86_64
libvirt-daemon-lxc-2.0.0-10.el7_3.9.x86_64
libvirt-daemon-driver-lxc-2.0.0-10.el7_3.9.x86_64
libvirt-python-2.0.0-2.el7.x86_64
libvirt-daemon-2.0.0-10.el7_3.9.x86_64
libvirt-lock-sanlock-2.0.0-10.el7_3.9.x86_64
libvirt-daemon-driver-storage-2.0.0-10.el7_3.9.x86_64
libvirt-gobject-0.2.3-1.el7.x86_64
libvirt-nss-2.0.0-10.el7_3.9.x86_64
libvirt-daemon-driver-nwfilter-2.0.0-10.el7_3.9.x86_64
libvirt-gconfig-0.2.3-1.el7.x86_64
libvirt-snmp-0.0.3-5.el7.x86_64
libvirt-daemon-driver-nodedev-2.0.0-10.el7_3.9.x86_64
libvirt-glib-devel-0.2.3-1.el7.x86_64
libvirt-gobject-devel-0.2.3-1.el7.x86_64
libvirt-java-javadoc-0.4.9-4.el7.noarch
libvirt-daemon-driver-qemu-2.0.0-10.el7_3.9.x86_64
libvirt-daemon-kvm-2.0.0-10.el7_3.9.x86_64
libvirt-gconfig-devel-0.2.3-1.el7.x86_64
libvirt-login-shell-2.0.0-10.el7_3.9.x86_64
libvirt-client-2.0.0-10.el7_3.9.x86_64
libvirt-daemon-driver-interface-2.0.0-10.el7_3.9.x86_64
libvirt-devel-2.0.0-10.el7_3.9.x86_64
libvirt-cim-0.6.3-19.el7.x86_64
libvirt-glib-0.2.3-1.el7.x86_64
libvirt-java-devel-0.4.9-4.el7.noarch
libvirt-daemon-driver-network-2.0.0-10.el7_3.9.x86_64
libvirt-docs-2.0.0-10.el7_3.9.x86_64
libvirt-daemon-config-nwfilter-2.0.0-10.el7_3.9.x86_64
libvirt-2.0.0-10.el7_3.9.x86_64
libvirt-daemon-config-network-2.0.0-10.el7_3.9.x86_64
libvirt-java-0.4.9-4.el7.noarch
Upgrade qemu. Package versions:
qemu-system-lm32-2.0.0-1.el7.6.x86_64
ipxe-roms-qemu-20160127-5.git6366fa7a.el7.noarch
qemu-system-cris-2.0.0-1.el7.6.x86_64
qemu-system-x86-2.0.0-1.el7.6.x86_64
qemu-kvm-tools-1.5.3-126.el7_3.10.x86_64
qemu-system-xtensa-2.0.0-1.el7.6.x86_64
qemu-system-arm-2.0.0-1.el7.6.x86_64
qemu-system-s390x-2.0.0-1.el7.6.x86_64
qemu-system-sh4-2.0.0-1.el7.6.x86_64
qemu-kvm-common-1.5.3-126.el7_3.10.x86_64
qemu-user-2.0.0-1.el7.6.x86_64
qemu-system-unicore32-2.0.0-1.el7.6.x86_64
libvirt-daemon-driver-qemu-2.0.0-10.el7_3.9.x86_64
qemu-guest-agent-2.5.0-3.el7.x86_64
qemu-common-2.0.0-1.el7.6.x86_64
qemu-system-or32-2.0.0-1.el7.6.x86_64
qemu-kvm-1.5.3-126.el7_3.10.x86_64
qemu-system-moxie-2.0.0-1.el7.6.x86_64
qemu-img-1.5.3-126.el7_3.10.x86_64
qemu-system-m68k-2.0.0-1.el7.6.x86_64
qemu-system-alpha-2.0.0-1.el7.6.x86_64
qemu-system-microblaze-2.0.0-1.el7.6.x86_64
qemu-system-mips-2.0.0-1.el7.6.x86_64
qemu-2.0.0-1.el7.6.x86_64
Upgrade the kernel:
[root@~]# rpm -qa|grep kernel
kernel-3.10.0-514.26.2.el7.x86_64
kernel-tools-libs-3.10.0-514.26.2.el7.x86_64
kernel-devel-3.10.0-327.36.3.el7.x86_64
kernel-tools-3.10.0-514.26.2.el7.x86_64
kernel-devel-3.10.0-123.el7.x86_64
kernel-3.10.0-327.36.3.el7.x86_64
abrt-addon-kerneloops-2.1.11-45.el7.centos.x86_64
kernel-3.10.0-514.2.2.el7.x86_64
kernel-3.10.0-123.el7.x86_64
kernel-3.10.0-327.22.2.el7.x86_64
kernel-devel-3.10.0-327.22.2.el7.x86_64
kernel-devel-3.10.0-514.26.2.el7.x86_64
kernel-devel-3.10.0-514.2.2.el7.x86_64
kernel-headers-3.10.0-514.26.2.el7.x86_64
[root@~]# uname -r
3.10.0-514.26.2.el7.x86_64
Note: after upgrading, the host must be rebooted for the fix to take effect. Restarting the services alone does not work!
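Because the reboot is the step that is easy to forget, it can help to check whether a host is still running an older kernel than the newest one installed. The following is a small Python sketch of that check, fed with the `rpm -qa|grep kernel` output and the `uname -r` value. The version comparison is a simplification of my own (it compares only the numeric parts before `.el`) and is not RPM's real version-ordering algorithm; the function names are illustrative. It covers the kernel only, not libvirt or qemu.

```python
import re


def kernel_key(release):
    """Turn '3.10.0-514.26.2.el7.x86_64' into (3, 10, 0, 514, 26, 2)."""
    numeric = release.split(".el")[0]
    return tuple(int(n) for n in re.findall(r"\d+", numeric))


def newest_installed(rpm_lines):
    """Pick the newest 'kernel-<version>' package, ignoring kernel-devel etc."""
    releases = [
        line[len("kernel-"):]
        for line in rpm_lines
        if re.match(r"kernel-\d", line)
    ]
    return max(releases, key=kernel_key)


def reboot_needed(running, rpm_lines):
    """True if the running kernel is older than the newest installed one."""
    return kernel_key(running) < kernel_key(newest_installed(rpm_lines))


# Kernel-related packages as listed by `rpm -qa|grep kernel` above (abridged).
installed = [
    "kernel-3.10.0-514.26.2.el7.x86_64",
    "kernel-tools-3.10.0-514.26.2.el7.x86_64",
    "kernel-3.10.0-327.36.3.el7.x86_64",
    "kernel-3.10.0-514.2.2.el7.x86_64",
    "kernel-3.10.0-123.el7.x86_64",
    "kernel-devel-3.10.0-514.26.2.el7.x86_64",
]

print(reboot_needed("3.10.0-514.26.2.el7.x86_64", installed))
print(reboot_needed("3.10.0-327.36.3.el7.x86_64", installed))
```

On a live Linux host, `platform.release()` from the standard library returns the same string as `uname -r` and could be passed in as `running`.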