Background:

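The physical host running the SSVM and CPVM crashed. Afterwards, CloudStack still reported both system VMs as Running, with the crashed machine listed as their host. I brought the host back up, logged in, and ran virsh list --all to check whether the SSVM and CPVM were really running; they were not. I then checked every other physical host and found neither VM anywhere. The CloudStack UI nevertheless kept showing both as Running on that host, although their IP addresses could not be pinged. I tried to delete the SSVM and CPVM, but that failed, and even the stop operation failed. Oddly, creating new instances still worked. This looks like a serious bug.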

Log output: /var/log/cloudstack/management/management-server.log
2013-12-17 21:33:26,525 DEBUG [cloud.async.AsyncJobManagerImpl] (Job-Executor-130:job-130) Executing org.apache.cloudstack.api.command.admin.systemvm.DestroySystemVmCmd for job-130
2013-12-17 21:33:26,527 DEBUG [cloud.api.ApiServlet] (catalina-exec-9:null) ===END===  10.200.251.246 -- GET  command=destroySystemVm&id=94576696-a734-459b-b697-9ade8d616e68&response=json&sessionkey=yY8M0StWM6ohsnSO3nhPZGj7xKk%3D&_=1387333995495
2013-12-17 21:33:26,612 DEBUG [cloud.capacity.CapacityManagerImpl] (Job-Executor-130:job-130) VM state transitted from :Running to Stopping with event: StopRequestedvm's original host id: 1 new host id: 1 host id before state transition: 1
2013-12-17 21:33:26,618 WARN  [cloud.vm.VirtualMachineManagerImpl] (Job-Executor-130:job-130) Unable to stop vm, agent unavailable: com.cloud.exception.AgentUnavailableException: Resource [Host:1] is unreachable: Host 1: Host with specified id is not in the right state: Disconnected
2013-12-17 21:33:26,618 WARN  [cloud.vm.VirtualMachineManagerImpl] (Job-Executor-130:job-130) Unable to stop vm VM[SecondaryStorageVm|s-1-VM]
2013-12-17 21:33:26,628 DEBUG [cloud.capacity.CapacityManagerImpl] (Job-Executor-130:job-130) VM state transitted from :Stopping to Running with event: OperationFailedvm's original host id: 1 new host id: 1 host id before state transition: 1
2013-12-17 21:33:26,628 DEBUG [cloud.vm.VirtualMachineManagerImpl] (Job-Executor-130:job-130) Unable to stop the VM so we can't expunge it.
2013-12-17 21:33:26,628 DEBUG [cloud.vm.VirtualMachineManagerImpl] (Job-Executor-130:job-130) Unable to destroy the vm because it is not in the correct state: VM[SecondaryStorageVm|s-1-VM]
2013-12-17 21:33:26,628 INFO  [cloud.vm.VirtualMachineManagerImpl] (Job-Executor-130:job-130) Did not expunge VM[SecondaryStorageVm|s-1-VM]
2013-12-17 21:33:26,640 DEBUG [cloud.async.AsyncJobManagerImpl] (Job-Executor-130:job-130) Complete async job-130, jobStatus: 2, resultCode: 530, result: Error Code: 530 Error text: Fail to destroy system vm
2013-12-17 21:33:26,728 DEBUG [agent.transport.Request] (StatsCollector-1:null) Seq 15-1464552034: Received:  { Ans: , MgmtId: 345051385634, via: 15, Ver: v1, Flags: 10, { GetHostStatsAnswer } }
2013-12-17 21:33:27,100 DEBUG [agent.manager.AgentManagerImpl] (AgentManager-Handler-13:null) Ping from 8
2013-12-17 21:33:27,235 DEBUG [agent.manager.AgentManagerImpl] (AgentManager-Handler-9:null) Ping from 14
2013-12-17 21:33:27,454 DEBUG [agent.transport.Request] (AgentManager-Handler-8:null) Seq 8-1342917711: Processing:  { Ans: , MgmtId: 345051385634, via: 8, Ver: v1, Flags: 10, [{"Answer":{"result":false,"details":"timeout","wait":0}}] }
2013-12-17 21:33:27,455 DEBUG [agent.transport.Request] (AgentManager-Handler-12:null) Seq 8-1342917712: Processing:  { Ans: , MgmtId: 345051385634, via: 8, Ver: v1, Flags: 10, [{"Answer":{"result":false,"details":"timeout","wait":0}}] }
2013-12-17 21:33:27,455 DEBUG [agent.transport.Request] (AgentTaskPool-3:null) Seq 8-1342917711: Received:  { Ans: , MgmtId: 345051385634, via: 8, Ver: v1, Flags: 10, { Answer } }
2013-12-17 21:33:27,455 DEBUG [cloud.ha.AbstractInvestigatorImpl] (AgentTaskPool-3:null) host (10.196.53.73) cannot be pinged, returning null ('I don't know')
2013-12-17 21:33:27,455 DEBUG [cloud.ha.UserVmDomRInvestigator] (AgentTaskPool-3:null) sending ping from (9) to agent's host ip address (10.196.53.73)
2013-12-17 21:33:27,455 DEBUG [agent.transport.Request] (AgentTaskPool-16:null) Seq 8-1342917712: Received:  { Ans: , MgmtId: 345051385634, via: 8, Ver: v1, Flags: 10, { Answer } }
2013-12-17 21:33:27,455 DEBUG [cloud.ha.AbstractInvestigatorImpl] (AgentTaskPool-16:null) host (10.196.53.74) cannot be pinged, returning null ('I don't know')
2013-12-17 21:33:27,455 DEBUG [cloud.ha.UserVmDomRInvestigator] (AgentTaskPool-16:null) sending ping from (9) to agent's host ip address (10.196.53.74)
2013-12-17 21:33:27,460 DEBUG [agent.transport.Request] (AgentTaskPool-3:null) Seq 9-241192500: Sending  { Cmd , MgmtId: 345051385634, via: 9, Ver: v1, Flags: 100011, [{"PingTestCommand":{"_computingHostIp":"10.196.53.73","wait":20}}] }
2013-12-17 21:33:27,461 DEBUG [agent.transport.Request] (AgentTaskPool-16:null) Seq 9-241192501: Sending  { Cmd , MgmtId: 345051385634, via: 9, Ver: v1, Flags: 100011, [{"PingTestCommand":{"_computingHostIp":"10.196.53.74","wait":20}}] }
2013-12-17 21:33:27,585 DEBUG [agent.transport.Request] (StatsCollector-1:null) Seq 16-1532317381: Received:  { Ans: , MgmtId: 345051385634, via: 16, Ver: v1, Flags: 10, { GetHostStatsAnswer } }
2013-12-17 21:33:27,890 DEBUG [agent.manager.AgentManagerImpl] (AgentManager-Handler-1:null) Ping from 11

Key message:
Unable to destroy the vm because it is not in the correct state: VM[SecondaryStorageVm|s-1-VM]    
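To spot this failure pattern in a long management log, a small script can filter for the message. The following is a minimal sketch and not part of the original troubleshooting: the helper name `find_destroy_failures` and the regular expression are illustrative assumptions derived from the log lines quoted above.

```python
import re

# Pattern built from the log message quoted above. The helper name and the
# regex are illustrative assumptions, not part of CloudStack itself.
DESTROY_FAILURE = re.compile(
    r"Unable to destroy the vm because it is not in the correct state: "
    r"VM\[(?P<type>\w+)\|(?P<name>[\w-]+)\]"
)


def find_destroy_failures(lines):
    """Return a (vm_type, vm_name) tuple for every destroy failure in the lines."""
    failures = []
    for line in lines:
        match = DESTROY_FAILURE.search(line)
        if match:
            failures.append((match.group("type"), match.group("name")))
    return failures


# Usage on one line copied from the log above; to scan the real file, pass an
# open file object for management-server.log instead of this list.
sample = [
    "2013-12-17 21:33:26,628 DEBUG [cloud.vm.VirtualMachineManagerImpl] "
    "(Job-Executor-130:job-130) Unable to destroy the vm because it is not "
    "in the correct state: VM[SecondaryStorageVm|s-1-VM]",
]
for vm_type, vm_name in find_destroy_failures(sample):
    print(vm_type, vm_name)
```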
Database records:
mysql> SELECT * FROM host WHERE  name like '%s-1-VM%'\G  // the system VM's entry in the host table
*************************** 1. row ***************************
                  id: 21
                name: s-1-VM
                uuid: 986db967-13a9-48ca-815b-c41d6951a3f3
              status: Disconnected
                type: SecondaryStorageVM
  private_ip_address: 10.196.53.74
     private_netmask: 255.255.255.0
 private_mac_address: 06:51:e0:00:00:07
  storage_ip_address: 10.196.53.82
     storage_netmask: 255.255.255.0
 storage_mac_address: 06:51:e0:00:00:07
storage_ip_address_2: NULL
storage_mac_address_2: NULL
   storage_netmask_2: NULL
          cluster_id: NULL
   public_ip_address: 10.196.53.76
      public_netmask: 255.255.255.0
  public_mac_address: 06:e0:2c:00:00:0e
          proxy_port: NULL
      data_center_id: 1
              pod_id: 1
                cpus: NULL
               speed: NULL
                 url: NoIqn
             fs_type: NULL
     hypervisor_type: NULL
  hypervisor_version: NULL
                 ram: 0
            resource: NULL
             version: 4.1.1
              parent: NULL
          total_size: NULL
        capabilities: NULL
                guid: s-1-VM-NfsSecondaryStorageResource
           available: 1
               setup: 0
         dom0_memory: 0
           last_ping: 1354828061
      mgmt_server_id: 345051385634
        disconnected: NULL
             created: 2013-12-18 05:18:54
             removed: NULL
        update_count: 2
      resource_state: Enabled
               owner: NULL
         lastUpdated: NULL
        engine_state: Disabled
1 row in set (0.00 sec)
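mysql> SELECT * FROM vm_instance WHERE  name like '%s-1-VM%'\G // the system VM's entry in the vm_instance table; the state the CloudStack UI shows for instances and system VMs is read from the state column of this table.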
*************************** 1. row ***************************
                id: 22
              name: s-1-VM
              uuid: 8bd3ab0c-a431-4dd2-85a7-013921427f6a
     instance_name: s-1-VM
             state: Running
    vm_template_id: 3
       guest_os_id: 15
private_mac_address: 06:51:e0:00:00:07
private_ip_address: 10.196.53.74
            pod_id: 1
    data_center_id: 1
           host_id: 15
      last_host_id: 15
          proxy_id: 55
 proxy_assign_time: 2013-12-18 05:20:52
      vnc_password: VoRRPovUk7w7/+islEFf9Ai0tbTep0WOJJod0PLOJkU=
        ha_enabled: 0
     limit_cpu_use: 0
      update_count: 3
       update_time: 2013-12-18 05:18:59
           created: 2013-12-18 05:17:04
           removed: NULL
              type: SecondaryStorageVm
           vm_type: SecondaryStorageVm
        account_id: 1
         domain_id: 1
service_offering_id: 9
    reservation_id: a2a55809-abfa-4b6e-92f8-105cf8bef2a8
   hypervisor_type: KVM
  disk_offering_id: NULL
               cpu: NULL
               ram: NULL
             owner: NULL
             speed: NULL
         host_name: NULL
      display_name: NULL
     desired_state: NULL
1 row in set (0.01 sec)
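The crux of the problem:
The two tables disagree. The status column in the host table shows Disconnected, while the state column in the vm_instance table shows Running, and the CloudStack UI likewise shows both system VMs as Running.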
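Solution:
Anyone familiar with these two system VMs knows that they are resilient: once deleted, they are recreated automatically, and faults in them are usually fixed by deleting them and letting them be rebuilt. Since they cannot be deleted through the UI, change the corresponding field directly in the database and set their state to Destroyed.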
UPDATE vm_instance SET state='Destroyed'  WHERE name='s-1-VM';
UPDATE vm_instance SET state='Destroyed'  WHERE name='v-2-VM';
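Editing the database by hand is risky, so it can help to guard the update so that it only touches rows still stuck in Running and is committed as a single transaction. The sketch below illustrates that idea in Python. It uses an in-memory SQLite table as a stand-in for the real MySQL database; the reduced table layout, the helper name `mark_destroyed`, and the instance name `i-2-10-VM` are assumptions made purely for illustration.

```python
import sqlite3


def mark_destroyed(conn, vm_names):
    """Set state to 'Destroyed' for the named VMs that are still 'Running'.

    Returns the number of rows changed. The helper name is hypothetical.
    """
    changed = 0
    with conn:  # commits on success, rolls back if an exception is raised
        for name in vm_names:
            cur = conn.execute(
                "UPDATE vm_instance SET state = 'Destroyed' "
                "WHERE name = ? AND state = 'Running' AND removed IS NULL",
                (name,),
            )
            changed += cur.rowcount
    return changed


# Stand-in for the real database: an in-memory table with only the columns
# used here. 'i-2-10-VM' is a made-up user instance that must stay untouched.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vm_instance (name TEXT, state TEXT, removed TEXT)")
conn.executemany(
    "INSERT INTO vm_instance VALUES (?, ?, NULL)",
    [("s-1-VM", "Running"), ("v-2-VM", "Running"), ("i-2-10-VM", "Running")],
)

print(mark_destroyed(conn, ["s-1-VM", "v-2-VM"]))
print(conn.execute("SELECT name, state FROM vm_instance ORDER BY name").fetchall())
```

The extra conditions on state and removed mean that running the helper a second time changes nothing, which makes an accidental repeat harmless.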
Then go back to the CloudStack UI and check.
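Once the system detects that both original system VMs are in the Destroyed state, it starts building a new SSVM and CPVM. When both show Running, the system is back to normal.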