2021SC@SDUSC
nova/compute/manager.py
ComputeManager overview:
Handles RPC calls related to creating instances (guest VMs). It is responsible for:
① building a disk image
② launching it via the underlying virtualization driver
③ responding to calls to check its state
④ attaching persistent storage
⑤ terminating it
Core source analysis: class ComputeManager(manager.Manager)
< From line 525 in manager.py >
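- **Introduction:** manages the full lifecycle of an instance, from creation to destruction
- Member functions:
**(1) __init__:** loads configuration options and connects to the hypervisor
It first initializes many instance attributes by calling the constructors of the corresponding classes. Most of them are clients or APIs for other services, including: SchedulerReportClient, which reports resource information to Placement for the scheduler; neutron.API for networking; cinder.API for block storage; glance.API for image management; compute.API for compute operations; and compute_rpcapi.ComputeAPI for compute RPC. It also creates a green thread pool whose size comes from CONF (used for syncing power states) and sets up semaphores that cap the number of concurrent builds and snapshots (a configured value of 0 means unlimited). Finally, it loads the compute driver module and uses the members initialized above to create a ResourceTracker, which tracks resource usage over the lifetime of instances.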
def __init__(self, compute_driver=None, *args, **kwargs):
    """Load configuration options and connect to the hypervisor."""
    # We want the ComputeManager, ResourceTracker and ComputeVirtAPI all
    # using the same instance of SchedulerReportClient which has the
    # ProviderTree cache for this compute service.
    self.reportclient = report.SchedulerReportClient()
    self.virtapi = ComputeVirtAPI(self)
    self.network_api = neutron.API()
    self.volume_api = cinder.API()
    self.image_api = glance.API()
    self._last_bw_usage_poll = 0.0
    self.compute_api = compute.API()
    self.compute_rpcapi = compute_rpcapi.ComputeAPI()
    self.compute_task_api = conductor.ComputeTaskAPI()
    self.query_client = query.SchedulerQueryClient()
    self.instance_events = InstanceEvents()
    self._sync_power_pool = eventlet.GreenPool(
        size=CONF.sync_power_state_pool_size)
    self._syncs_in_progress = {}
    self.send_instance_updates = (
        CONF.filter_scheduler.track_instance_changes)
    if CONF.max_concurrent_builds != 0:
        self._build_semaphore = eventlet.semaphore.Semaphore(
            CONF.max_concurrent_builds)
    else:
        self._build_semaphore = compute_utils.UnlimitedSemaphore()
    if CONF.max_concurrent_snapshots > 0:
        self._snapshot_semaphore = eventlet.semaphore.Semaphore(
            CONF.max_concurrent_snapshots)
    else:
        self._snapshot_semaphore = compute_utils.UnlimitedSemaphore()
    if CONF.max_concurrent_live_migrations > 0:
        self._live_migration_executor = futurist.GreenThreadPoolExecutor(
            max_workers=CONF.max_concurrent_live_migrations)
    else:
        # CONF.max_concurrent_live_migrations is 0 (unlimited)
        self._live_migration_executor = futurist.GreenThreadPoolExecutor()
    # This is a dict, keyed by instance uuid, to a two-item tuple of
    # migration object and Future for the queued live migration.
    self._waiting_live_migrations = {}
    super(ComputeManager, self).__init__(service_name="compute",
                                         *args, **kwargs)
    # TODO(sbauza): Remove this call once we delete the V5Proxy class
    self.additional_endpoints.append(_ComputeV5Proxy(self))
    # NOTE(russellb) Load the driver last. It may call back into the
    # compute manager via the virtapi, so we want it to be fully
    # initialized before that happens.
    self.driver = driver.load_compute_driver(self.virtapi, compute_driver)
    self.use_legacy_block_device_info = \
        self.driver.need_legacy_block_device_info
    self.rt = resource_tracker.ResourceTracker(
        self.host, self.driver, reportclient=self.reportclient)
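The "zero means unlimited" semaphore pattern in __init__ can be tried outside Nova. The following is a minimal sketch, not Nova code: it uses the standard library's threading module instead of eventlet, and make_semaphore and run_builds are hypothetical names introduced only for illustration.

```python
import threading
import time
from contextlib import nullcontext


def make_semaphore(limit):
    """Return a context manager that admits at most `limit` holders.

    Hypothetical helper: a limit of 0 (or less) is treated as "unlimited"
    and modelled with a no-op context manager.
    """
    if limit > 0:
        return threading.BoundedSemaphore(limit)
    return nullcontext()


def run_builds(limit, n_workers=6):
    """Run n_workers fake builds; return the peak number running at once."""
    sem = make_semaphore(limit)
    guard = threading.Lock()          # protects the counters below
    state = {"running": 0, "peak": 0}

    def fake_build():
        with sem:                     # wait for a free build slot
            with guard:
                state["running"] += 1
                state["peak"] = max(state["peak"], state["running"])
            time.sleep(0.01)          # pretend to do some work
            with guard:
                state["running"] -= 1

    threads = [threading.Thread(target=fake_build) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["peak"]
```

With a limit of 2 the reported peak can never exceed 2. With a limit of 0 nothing restricts the workers, so how many overlap depends only on thread timing.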
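(2) build_and_run_instance:
It first defines a nested function, _locked_do_build_and_run_instance. The @utils.synchronized(instance.uuid) decorator takes a lock keyed on the instance UUID, so that nothing else operates on this instance while it waits to be built. Inside, a with statement acquires the build semaphore mentioned above, which caps the number of concurrent builds. The function then calls _do_build_and_run_instance to build the instance and records the result. If an exception escapes, the result is set to FAILED and the exception is re-raised. The finally block cleans up according to the result: on failure it deletes the instance's allocations through reportclient, and it calls _build_failed(node) or _build_succeeded(node).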
@utils.synchronized(instance.uuid)
def _locked_do_build_and_run_instance(*args, **kwargs):
    # Hold one build slot while building; it is released when the
    # with block exits.
    with self._build_semaphore:
        try:
            result = self._do_build_and_run_instance(*args, **kwargs)
        except Exception:
            result = build_results.FAILED
            raise
        finally:
            if result == build_results.FAILED:
                self.reportclient.delete_allocation_for_instance(
                    context, instance.uuid)
            if result in (build_results.FAILED,
                          build_results.RESCHEDULED):
                self._build_failed(node)
            else:
                self._build_succeeded(node)
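The combination of a per-instance lock, a global build semaphore and try/except/finally bookkeeping can be reproduced in a small standalone sketch. This is an illustration under assumptions, not Nova's implementation: it uses threading locks in place of utils.synchronized, and locked_build, _lock_for and the stats dict are hypothetical names.

```python
import threading

FAILED, RESCHEDULED, ACTIVE = "failed", "rescheduled", "active"

_locks_guard = threading.Lock()
_instance_locks = {}                               # one lock per instance UUID
_build_semaphore = threading.BoundedSemaphore(2)   # assumed global build limit


def _lock_for(instance_uuid):
    """Return the lock for this UUID, creating it on first use."""
    with _locks_guard:
        return _instance_locks.setdefault(instance_uuid, threading.Lock())


def locked_build(instance_uuid, do_build, stats):
    """Run do_build() under the per-instance lock and the build semaphore.

    stats holds 'failed' and 'succeeded' counters and a 'cleaned' list of
    UUIDs whose (imaginary) allocations were released after a failure.
    """
    with _lock_for(instance_uuid):      # serialize work on one instance
        with _build_semaphore:          # cap the number of parallel builds
            try:
                result = do_build()
            except Exception:
                result = FAILED         # remember the outcome, then re-raise
                raise
            finally:
                if result == FAILED:
                    stats["cleaned"].append(instance_uuid)
                if result in (FAILED, RESCHEDULED):
                    stats["failed"] += 1
                else:
                    stats["succeeded"] += 1
    return result
```

The point of the finally block is that the bookkeeping runs on every path: a normal return, a reschedule, or an exception that is still propagating to the caller.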
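_do_build_and_run_instance(…) builds the instance as follows:
① Set the instance's vm_state to "building" (from the InstanceState class in nova/objects/fields.py, which defines state values such as ACTIVE, BUILDING, PAUSED and SUSPENDED)
② Set the instance's task_state to None (when it has a value, that value comes from the InstanceTaskState class, which includes SCHEDULING, BLOCK_DEVICE_MAPPING, NETWORKING, SPAWNING and others)
③ Call save(…) of the Instance class in nova/objects/instance.py to store the changed values of self.fields in the database
(Internally: it first stores the context, then calls obj_what_changed() to get the changed fields; self.fields includes id, user_id, project_id, image_ref, kernel_id and others. Some exceptions are handled here, and if all goes well it calls instance_extra_update_by_uuid(…) in nova/db/api.py to save the changes to the database.)
④ Handle some exceptions
⑤ Decode injected_files (passed in as a parameter)
⑥ If limits is None, set it to an empty dict
⑦ If no node is given in the parameters, call _get_nodename(…) to get the first available node for the given instance
⑧ Start a timer (the time taken to build the instance is later written out via LOG.info) and call _build_and_run_instance(…), which actually builds the instance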
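Steps ① to ⑧ can be condensed into a toy version. This is a hedged sketch, not the Nova code: ToyInstance, UnexpectedTaskState, decode_files and do_build are hypothetical names, the "database" is a plain dict, the files are assumed to be Base64-encoded, the expected task states passed to save() are an assumption, and "node-1" is a placeholder for the node lookup.

```python
import base64
import time


class UnexpectedTaskState(Exception):
    """Raised when the stored task_state is not one the caller expected."""


class ToyInstance:
    """Hypothetical stand-in for an instance backed by a dict 'database'."""

    def __init__(self, db, uuid):
        self.db, self.uuid = db, uuid
        self.vm_state = db[uuid]["vm_state"]
        self.task_state = db[uuid]["task_state"]

    def save(self, expected_task_state):
        # Write only if the stored task_state is still an expected one, so
        # a concurrent change (e.g. a delete) is detected, not overwritten.
        stored = self.db[self.uuid]["task_state"]
        if stored not in expected_task_state:
            raise UnexpectedTaskState(stored)
        self.db[self.uuid].update(vm_state=self.vm_state,
                                  task_state=self.task_state)


def decode_files(injected_files):
    """Turn (path, base64_text) pairs into (path, bytes) pairs."""
    return [(path, base64.b64decode(data))
            for path, data in injected_files or []]


def do_build(instance, injected_files, limits=None, node=None,
             build=lambda *args: None):
    instance.vm_state = "building"                  # step 1
    instance.task_state = None                      # step 2
    try:
        instance.save(expected_task_state=("scheduling", None))  # step 3
    except UnexpectedTaskState:                     # step 4
        return "failed"
    files = decode_files(injected_files)            # step 5
    if limits is None:                              # step 6
        limits = {}
    if node is None:                                # step 7 (placeholder)
        node = "node-1"
    start = time.monotonic()                        # step 8
    build(instance, files, node, limits)
    print("Took %0.2f seconds to build instance."
          % (time.monotonic() - start))
    return "active"
```

The conditional save is the part worth noticing: the state change is only written if nobody else has moved the instance into an unexpected task state in the meantime.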
The code of _build_and_run_instance, with comments, follows:
def _build_and_run_instance(self, context, instance, image, injected_files,
        admin_password, requested_networks, security_groups,
        block_device_mapping, node, limits, filter_properties,
        request_spec=None, accel_uuids=None):
    # Get the image name from the image dict.
    image_name = image.get('name')
    # Emit the legacy (unversioned) 'create.start' notification.
    self._notify_about_instance_usage(context, instance, 'create.start',
            extra_usage_info={'image_name': image_name})
    # Emit the versioned counterpart of the same "create start"
    # notification.
    compute_utils.notify_about_instance_create(
        context, instance, self.host,
        phase=fields.NotificationPhase.START,
        bdms=block_device_mapping)
    # NOTE(mikal): cache the keystone roles associated with the instance
    # at boot time for later reference
    instance.system_metadata.update(
        {'boot_roles': ','.join(context.roles)})
    # Check that device tagging, if requested for networks or block
    # devices, is supported on this host.
    self._check_device_tagging(requested_networks, block_device_mapping)
    # Check the instance's trusted certificates.
    self._check_trusted_certs(instance)
    # Returns a request_id: provider_id dict.
    provider_mapping = self._get_request_group_mapping(request_spec)
    if provider_mapping:
        try:
            # Update the instance's PCI request based on the request group -
            # resource provider mapping and the device RP name from placement.
            # (I do not fully understand this step.)
            compute_utils\
                .update_pci_request_spec_with_allocated_interface_name(
                    context, self.reportclient,
                    instance.pci_requests.requests, provider_mapping)
        except (exception.AmbiguousResourceProviderForPCIRequest,
                exception.UnexpectedResourceProviderNameForPCIRequest
                ) as e:
            raise exception.BuildAbortException(
                reason=str(e), instance_uuid=instance.uuid)
    # TODO(Luyao) cut over to get_allocs_for_consumer
    # Makes a GET /allocations/{consumer} call to Placement
    allocs = self.reportclient.get_allocations_for_consumer(
            context, instance.uuid)
    try:
        scheduler_hints = self._get_scheduler_hints(filter_properties,
                                                    request_spec)
        with self.rt.instance_claim(context, instance, node, allocs,
                                    limits):
            # NOTE(russellb) It's important that this validation be done
            # *after* the resource tracker instance claim, as that is where
            # the host is set on the instance.
            self._validate_instance_group_policy(context, instance,
                                                 scheduler_hints)
            image_meta = objects.ImageMeta.from_dict(image)
            with self._build_resources(context, instance,
                    requested_networks, security_groups, image_meta,
                    block_device_mapping, provider_mapping,
                    accel_uuids) as resources:
                instance.vm_state = vm_states.BUILDING
                instance.task_state = task_states.SPAWNING
                # NOTE(JoshNang) This also saves the changes to the
                # instance from _allocate_network_async, as they aren't
                # saved in that function to prevent races.
                instance.save(expected_task_state=
                        task_states.BLOCK_DEVICE_MAPPING)
                block_device_info = resources['block_device_info']
                network_info = resources['network_info']
                accel_info = resources['accel_info']
                LOG.debug('Start spawning the instance on the hypervisor.',
                          instance=instance)
                with timeutils.StopWatch() as timer:
                    self.driver.spawn(context, instance, image_meta,
                                      injected_files, admin_password,
                                      allocs, network_info=network_info,
                                      block_device_info=block_device_info,
                                      accel_info=accel_info)
                LOG.info('Took %0.2f seconds to spawn the instance on '
                         'the hypervisor.', timer.elapsed(),
                         instance=instance)
    except (exception.InstanceNotFound,
            exception.UnexpectedDeletingTaskStateError) as e:
        with excutils.save_and_reraise_exception():
            self._notify_about_instance_usage(context, instance,
                'create.error', fault=e)
            compute_utils.notify_about_instance_create(
                context, instance, self.host,
                phase=fields.NotificationPhase.ERROR, exception=e,
                bdms=block_device_mapping)
    except exception.ComputeResourcesUnavailable as e:
        LOG.debug(e.format_message(), instance=instance)
        self._notify_about_instance_usage(context, instance,
                'create.error', fault=e)
        compute_utils.notify_about_instance_create(
                context, instance, self.host,
                phase=fields.NotificationPhase.ERROR, exception=e,
                bdms=block_device_mapping)
        raise exception.RescheduledException(
                instance_uuid=instance.uuid, reason=e.format_message())
    except exception.BuildAbortException as e:
        with excutils.save_and_reraise_exception():
            LOG.debug(e.format_message(), instance=instance)
            self._notify_about_instance_usage(context, instance,
                'create.error', fault=e)
            compute_utils.notify_about_instance_create(
                context, instance, self.host,
                phase=fields.NotificationPhase.ERROR, exception=e,
                bdms=block_device_mapping)
    except exception.NoMoreFixedIps as e:
        LOG.warning('No more fixed IP to be allocated',
                    instance=instance)
        self._notify_about_instance_usage(context, instance,
                'create.error', fault=e)
        compute_utils.notify_about_instance_create(
                context, instance, self.host,
                phase=fields.NotificationPhase.ERROR, exception=e,
                bdms=block_device_mapping)
        msg = _('Failed to allocate the network(s) with error %s, '
                'not rescheduling.') % e.format_message()
        raise exception.BuildAbortException(instance_uuid=instance.uuid,
                reason=msg)
    except (exception.ExternalNetworkAttachForbidden,
            exception.VirtualInterfaceCreateException,
            exception.VirtualInterfaceMacAddressException,
            exception.FixedIpInvalidOnHost,
            exception.UnableToAutoAllocateNetwork,
            exception.NetworksWithQoSPolicyNotSupported) as e:
        LOG.exception('Failed to allocate network(s)',
                      instance=instance)
        self._notify_about_instance_usage(context, instance,
                'create.error', fault=e)
        compute_utils.notify_about_instance_create(
                context, instance, self.host,
                phase=fields.NotificationPhase.ERROR, exception=e,
                bdms=block_device_mapping)
        msg = _('Failed to allocate the network(s), not rescheduling.')
        raise exception.BuildAbortException(instance_uuid=instance.uuid,
                reason=msg)
    except (exception.FlavorDiskTooSmall,
            exception.FlavorMemoryTooSmall,
            exception.ImageNotActive,
            exception.ImageUnacceptable,
            exception.InvalidDiskInfo,
            exception.InvalidDiskFormat,
            cursive_exception.SignatureVerificationError,
            exception.CertificateValidationFailed,
            exception.VolumeEncryptionNotSupported,
            exception.InvalidInput,
            # TODO(mriedem): We should be validating RequestedVRamTooHigh
            # in the API during server create and rebuild.
            exception.RequestedVRamTooHigh) as e:
        self._notify_about_instance_usage(context, instance,
                'create.error', fault=e)
        compute_utils.notify_about_instance_create(
                context, instance, self.host,
                phase=fields.NotificationPhase.ERROR, exception=e,
                bdms=block_device_mapping)
        raise exception.BuildAbortException(instance_uuid=instance.uuid,
                reason=e.format_message())
    except Exception as e:
        LOG.exception('Failed to build and run instance',
                      instance=instance)
        self._notify_about_instance_usage(context, instance,
                'create.error', fault=e)
        compute_utils.notify_about_instance_create(
                context, instance, self.host,
                phase=fields.NotificationPhase.ERROR, exception=e,
                bdms=block_device_mapping)
        raise exception.RescheduledException(
                instance_uuid=instance.uuid, reason=str(e))
    # NOTE(alaski): This is only useful during reschedules, remove it now.
    instance.system_metadata.pop('network_allocated', None)
    # If CONF.default_access_ip_network_name is set, grab the
    # corresponding network and set the access ip values accordingly.
    network_name = CONF.default_access_ip_network_name
    if (network_name and not instance.access_ip_v4 and
            not instance.access_ip_v6):
        # Note that when there are multiple ips to choose from, an
        # arbitrary one will be chosen.
        for vif in network_info:
            if vif['network']['label'] == network_name:
                for ip in vif.fixed_ips():
                    if not instance.access_ip_v4 and ip['version'] == 4:
                        instance.access_ip_v4 = ip['address']
                    if not instance.access_ip_v6 and ip['version'] == 6:
                        instance.access_ip_v6 = ip['address']
                break
    # Update some of the instance's state, such as vm_state.
    self._update_instance_after_spawn(instance)
    try:
        # Persist the instance's current state.
        instance.save(expected_task_state=task_states.SPAWNING)
    except (exception.InstanceNotFound,
            exception.UnexpectedDeletingTaskStateError) as e:
        with excutils.save_and_reraise_exception():
            self._notify_about_instance_usage(context, instance,
                'create.error', fault=e)
            compute_utils.notify_about_instance_create(
                context, instance, self.host,
                phase=fields.NotificationPhase.ERROR, exception=e,
                bdms=block_device_mapping)
    # Send the new or updated instance info to the scheduler.
    self._update_scheduler_instance_info(context, instance)
    # Emit the legacy 'create.end' notification.
    self._notify_about_instance_usage(context, instance, 'create.end',
            extra_usage_info={'message': _('Success')},
            network_info=network_info)
    # Emit the versioned notification with phase
    # NotificationPhase.END ("end").
    compute_utils.notify_about_instance_create(context, instance,
            self.host, phase=fields.NotificationPhase.END,
            bdms=block_device_mapping)
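The long except chain in _build_and_run_instance boils down to two outcomes: abort the build, or ask for a reschedule. The sketch below illustrates that classification under simplifying assumptions. It is not Nova code: the exception classes are local stand-ins, guarded_spawn and ABORT_ERRORS are hypothetical names, and the success notification is emitted directly after the spawn call for brevity.

```python
class BuildAbort(Exception):
    """Stand-in for BuildAbortException: give up on this build."""


class Reschedule(Exception):
    """Stand-in for RescheduledException: the build may be retried."""


class ResourcesUnavailable(Exception):
    """Stand-in for ComputeResourcesUnavailable."""


class NoMoreFixedIps(Exception):
    """Stand-in for a network allocation failure."""


class ImageUnacceptable(Exception):
    """Stand-in for an image or flavor validation failure."""


# Errors that are turned into an abort ("not rescheduling" in the messages).
ABORT_ERRORS = (NoMoreFixedIps, ImageUnacceptable)


def guarded_spawn(spawn, notify):
    """Call spawn(); notify about the outcome and classify any failure."""
    try:
        spawn()
    except BuildAbort as e:
        notify("create.error", e)
        raise                                   # already an abort: re-raise
    except ABORT_ERRORS as e:
        notify("create.error", e)
        raise BuildAbort(str(e)) from e
    except Exception as e:
        # Everything else, including ResourcesUnavailable, is retried.
        notify("create.error", e)
        raise Reschedule(str(e)) from e
    notify("create.end", None)
```

Every failure path sends the error notification before raising, which mirrors the repeated notify calls in each except clause of the original function.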