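Introduction to the Hadoop Scheduler
The scheduler is the core component of the ResourceManager (RM), which acts as the master in Hadoop YARN's master/slave architecture. It schedules all incoming resource requests; the resources currently supported are memory and CPU. YARN currently ships three main schedulers: FifoScheduler, CapacityScheduler, and FairScheduler. Our company uses FairScheduler, so this post analyzes the core FairScheduler source code in detail. The Hadoop version analyzed is hadoop2.6.0-cdh5.12.1.
FairScheduler currently schedules in one of two ways. In the first, scheduling is driven only by NM heartbeats that report available resources. In the second, scheduling does not wait for each NM heartbeat; instead the scheduler iterates over all nodes and schedules wherever it finds resources available. Continuing from the previous post, this post analyzes heartbeat-driven scheduling.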
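1. Source walkthrough: how the RM handles one NM heartbeat
A scheduler, as the name suggests, allocates the resources that are available for use. How does the scheduler know that resources are available? The owner of the resources has to tell it. Here the real owner of the resources is the NodeManager (NM), which uses heartbeats to tell the RM periodically whether it currently has resources available.
Here is the RM-side code that handles an NM heartbeat: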
/**
* How FairScheduler handles an NM heartbeat
*/
private synchronized void nodeUpdate(RMNode nm) {
long start = getClock().getTime();
if (LOG.isDebugEnabled()) {
LOG.debug("nodeUpdate: " + nm +
" cluster capacity: " + getClusterResource());
}
eventLog.log("HEARTBEAT", nm.getHostName());
FSSchedulerNode node = getFSSchedulerNode(nm.getNodeID());
// Container updates reported by the NM
List<UpdatedContainerInfo> containerInfoList = nm.pullContainerUpdates();
// Newly launched containers
List<ContainerStatus> newlyLaunchedContainers = new ArrayList<ContainerStatus>();
// Containers that have finished running
List<ContainerStatus> completedContainers = new ArrayList<ContainerStatus>();
for(UpdatedContainerInfo containerInfo : containerInfoList) {
newlyLaunchedContainers.addAll(containerInfo.getNewlyLaunchedContainers());
completedContainers.addAll(containerInfo.getCompletedContainers());
}
// Processing the newly launched containers
for (ContainerStatus launchedContainer : newlyLaunchedContainers) {
containerLaunchedOnNode(launchedContainer.getContainerId(), node);
}
// Process completed containers
for (ContainerStatus completedContainer : completedContainers) {
ContainerId containerId = completedContainer.getContainerId();
LOG.debug("Container FINISHED: " + containerId);
completedContainer(getRMContainer(containerId),
completedContainer, RMContainerEventType.FINISHED);
}
// If the node is decommissioning, send an update to have the total
// resource equal to the used resource, so no available resource to
// schedule.
if (nm.getState() == NodeState.DECOMMISSIONING) {
this.rmContext
.getDispatcher()
.getEventHandler()
.handle(
new RMNodeResourceUpdateEvent(nm.getNodeID(), ResourceOption
.newInstance(getSchedulerNode(nm.getNodeID())
.getUsedResource(), 0)));
}
// Is continuous scheduling enabled?
if (continuousSchedulingEnabled) {
if (!completedContainers.isEmpty()) {
attemptScheduling(node);
}
} else {
attemptScheduling(node);
}
long duration = getClock().getTime() - start;
fsOpDurations.addNodeUpdateDuration(duration);
}
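FairScheduler handles NM heartbeats in a synchronized method, so heartbeats are processed one at a time under the scheduler lock, which is relatively inefficient. Inside the scheduler a node is represented as an FSSchedulerNode. The container updates that the NM sends with its heartbeat are stored in containerInfoList. The method then iterates over containerInfoList, collecting the newly launched containers into newlyLaunchedContainers and the finished containers into completedContainers.
- Trigger the RM-side state machine transition for newly launched containers
The method iterates over the newly launched containers and calls containerLaunchedOnNode on each one to update the container's state on the RM side. Here is the code of that function: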
protected synchronized void containerLaunchedOnNode(
ContainerId containerId, SchedulerNode node) {
// Look up the scheduler-side attempt of the app that owns this container
SchedulerApplicationAttempt application = getCurrentAttemptForContainer
(containerId);
if (application == null) {
LOG.info("Unknown application "
+ containerId.getApplicationAttemptId().getApplicationId()
+ " launched container " + containerId + " on node: " + node);
this.rmContext.getDispatcher().getEventHandler()
.handle(new RMNodeCleanContainerEvent(node.getNodeID(), containerId));
return;
}
application.containerLaunchedOnNode(containerId, node.getNodeID());
}
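The function looks up the application attempt that owns the containerId. If no attempt is found, it asks the node to clean up the container; otherwise it calls application.containerLaunchedOnNode: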
public synchronized void containerLaunchedOnNode(ContainerId containerId,
NodeId nodeId) {
// Notify the RMContainer
RMContainer rmContainer = getRMContainer(containerId);
if (rmContainer == null) {
// Some unknown container sneaked into the system. Kill it.
rmContext.getDispatcher().getEventHandler()
.handle(new RMNodeCleanContainerEvent(nodeId, containerId));
return;
}
rmContainer.handle(new RMContainerEvent(containerId,
RMContainerEventType.LAUNCHED));
}
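RMContainer is the RM-side representation of a container. If the RMContainer for the containerId is not null, the method sends it an RMContainerEventType.LAUNCHED event. LAUNCHED is an event type rather than a state: it triggers the RMContainer state machine transition that marks this appAttempt's container as launched. The main purpose is to record the container's state on the RM side.
- Trigger the RM-side state machine transitions and data cleanup for completed containers
The method iterates over the completed containers and calls completedContainer on each one to update the container's state on the RM side. Here is the code of that function: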
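The event-driven transition described above can be illustrated with a small table-driven state machine. This is only a sketch: the class name, the reduced set of states and events, and the choice to throw on an invalid event are assumptions made for illustration, and the real RMContainer state machine has more states and transitions than shown here.

```java
import java.util.EnumMap;
import java.util.Map;

// Simplified, hypothetical model of an RM-side container state machine.
final class ContainerStateSketch {
    enum State { ALLOCATED, ACQUIRED, RUNNING, COMPLETED }
    enum Event { ACQUIRED, LAUNCHED, FINISHED }

    // Transition table: current state -> (event -> next state).
    private static final Map<State, Map<Event, State>> TABLE = new EnumMap<>(State.class);

    static {
        addTransition(State.ALLOCATED, Event.ACQUIRED, State.ACQUIRED);
        addTransition(State.ACQUIRED, Event.LAUNCHED, State.RUNNING);
        addTransition(State.RUNNING, Event.FINISHED, State.COMPLETED);
    }

    private static void addTransition(State from, Event on, State to) {
        TABLE.computeIfAbsent(from, k -> new EnumMap<>(Event.class)).put(on, to);
    }

    private State state = State.ALLOCATED;

    State getState() {
        return state;
    }

    // Apply an event; an event that is not legal in the current state is rejected.
    void handle(Event event) {
        Map<Event, State> row = TABLE.get(state);
        State next = (row == null) ? null : row.get(event);
        if (next == null) {
            throw new IllegalStateException("Invalid event " + event + " in state " + state);
        }
        state = next;
    }

    public static void main(String[] args) {
        ContainerStateSketch container = new ContainerStateSketch();
        container.handle(Event.ACQUIRED);
        container.handle(Event.LAUNCHED); // the event sent when the NM reports a launch
        System.out.println(container.getState());
    }
}
```

The point of the table is that the caller only sends events; which state the container ends up in is decided by the state machine, not by the code that handles the heartbeat.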
/**
* Clean up a completed container.
*/
@Override
protected synchronized void completedContainer(RMContainer rmContainer,
ContainerStatus containerStatus, RMContainerEventType event) {
if (rmContainer == null) {
LOG.info("Container " + containerStatus.getContainerId()
+ " completed with event " + event);
return;
}
Container container = rmContainer.getContainer();
// Get the FSAppAttempt and the appId that own the completed container
FSAppAttempt application =
getCurrentAttemptForContainer(container.getId());
ApplicationId appId =
container.getId().getApplicationAttemptId().getApplicationId();
if (application == null) {
LOG.info("Container " + container + " of" +
" finished application " + appId +
" completed with event " + event);
return;
}
// Get the node on which the container was allocated
FSSchedulerNode node = getFSSchedulerNode(container.getNodeId());
if (rmContainer.getState() == RMContainerState.RESERVED) {
application.unreserve(rmContainer.getReservedPriority(), node);
} else {
application.containerCompleted(rmContainer, containerStatus, event);
node.releaseContainer(container);
updateRootQueueMetrics();
}
LOG.info("Application attempt " + application.getApplicationAttemptId()
+ " released container " + container.getId() + " on node: " + node
+ " with event: " + event);
}
The key part is this block:
if (rmContainer.getState() == RMContainerState.RESERVED) {
application.unreserve(rmContainer.getReservedPriority(), node);
} else {
application.containerCompleted(rmContainer, containerStatus, event);
node.releaseContainer(container);
updateRootQueueMetrics();
}
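Cleanup and the corresponding state machine events are handled separately for the reserved case and the non-reserved case. The cleanup covers the data structures in the FSAppAttempt (the RM-side representation of an application attempt), in the RMContainer, and in the FSSchedulerNode (the scheduler-level representation of a node). In the non-reserved branch the root queue metrics are updated last.
Now for the most important part: the scheduling process
So far we have briefly looked at how the newly launched and completed containers reported in an NM heartbeat change the RM-side data structures and state machines. Next is a detailed analysis of the scheduling that a single heartbeat triggers: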
if (continuousSchedulingEnabled) {
// Continuous scheduling is enabled: only schedule if containers completed
if (!completedContainers.isEmpty()) {
attemptScheduling(node);
}
} else {
attemptScheduling(node);
}
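The code first checks whether continuous scheduling is enabled. Continuous scheduling polls all nodes, so a node it visits does not necessarily have resources to schedule; for that reason the heartbeat path calls attemptScheduling(node) only when the heartbeat reports completed containers. With heartbeat-driven scheduling, every heartbeat triggers a scheduling attempt. Next we look at attemptScheduling(node) in detail: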
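The decision above reduces to a small predicate. The following sketch restates it outside the scheduler; the class and method names are hypothetical and only the two inputs used by the original branch are modeled.

```java
// Hypothetical restatement of the heartbeat-side scheduling trigger.
final class HeartbeatTriggerSketch {
    // Returns true when a heartbeat should lead to a scheduling attempt on the node.
    static boolean shouldAttemptScheduling(boolean continuousSchedulingEnabled,
                                           int completedContainerCount) {
        if (continuousSchedulingEnabled) {
            // Only react when the heartbeat reports completed containers.
            return completedContainerCount > 0;
        }
        // Heartbeat-driven mode: every heartbeat is a scheduling opportunity.
        return true;
    }

    public static void main(String[] args) {
        System.out.println(shouldAttemptScheduling(true, 0));
        System.out.println(shouldAttemptScheduling(true, 2));
        System.out.println(shouldAttemptScheduling(false, 0));
    }
}
```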
synchronized void attemptScheduling(FSSchedulerNode node) {
if (rmContext.isWorkPreservingRecoveryEnabled()
&& !rmContext.isSchedulerReadyForAllocatingContainers()) {
return;
}
final NodeId nodeID = node.getNodeID();
if (!nodeTracker.exists(nodeID)) {
// The node might have just been removed while this thread was waiting
// on the synchronized lock before it entered this synchronized method
LOG.info("Skipping scheduling as the node " + nodeID +
" has been removed");
return;
}
// Assign new containers...
// 1. Check for reserved applications
// 2. Schedule if there are no reservations
boolean validReservation = false;
FSAppAttempt reservedAppSchedulable = node.getReservedAppSchedulable();
if (reservedAppSchedulable != null) {
validReservation = reservedAppSchedulable.assignReservedContainer(node);
}
if (!validReservation) {
// No reservation, schedule at queue which is farthest below fair share
int assignedContainers = 0;
Resource assignedResource = Resources.clone(Resources.none());
Resource maxResourcesToAssign =
Resources.multiply(node.getAvailableResource(), 0.5f);
while (node.getReservedContainer() == null) {
boolean assignedContainer = false;
Resource assignment = queueMgr.getRootQueue().assignContainer(node);
if (!assignment.equals(Resources.none())) {
assignedContainers++;
assignedContainer = true;
Resources.addTo(assignedResource, assignment);
}
if (!assignedContainer) { break; }
if (!shouldContinueAssigning(assignedContainers,
maxResourcesToAssign, assignedResource)) {
break;
}
}
}
updateRootQueueMetrics();
}
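The non-reserved branch of the method above keeps assigning containers in a loop, bounded by half of the node's available resources. The sketch below models that loop with memory only. The body of shouldContinueAssigning is not shown in this post, so the stop rule used here (continue while the total assigned still fits within the cap) is an assumption, as are the class, interface, and method names.

```java
// Hypothetical, memory-only model of the assignment loop in attemptScheduling.
final class AssignLoopSketch {
    // Stand-in for the root queue: returns the MB granted by one attempt, 0 if nothing fits.
    interface Allocator {
        int assignOnce(int freeMb);
    }

    // Returns the total MB assigned during one scheduling attempt on a node.
    static int assignUpToHalf(int availableMb, Allocator allocator) {
        int capMb = availableMb / 2; // mirrors multiply(available, 0.5f)
        int assignedMb = 0;
        int freeMb = availableMb;
        while (true) {
            int grantedMb = allocator.assignOnce(freeMb);
            if (grantedMb == 0) {
                break; // nothing could be assigned on this pass
            }
            assignedMb += grantedMb;
            freeMb -= grantedMb;
            if (assignedMb > capMb) {
                break; // assumed stop rule: the running total no longer fits in the cap
            }
        }
        return assignedMb;
    }

    public static void main(String[] args) {
        // Every request in this toy example is a 1024 MB container.
        Allocator oneGbContainers = free -> free >= 1024 ? 1024 : 0;
        System.out.println(assignUpToHalf(8192, oneGbContainers));
    }
}
```

Under this assumed rule the check runs after each assignment, so the total can exceed the cap by up to one container.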
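The method first checks whether some appAttempt holds a reservation on this node. If so, it calls reservedAppSchedulable.assignReservedContainer(node) to try to turn the reservation into an actual allocation first: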
boolean assignReservedContainer(FSSchedulerNode node) {
RMContainer rmContainer = node.getReservedContainer();
Priority reservedPriority = rmContainer.getReservedPriority();
if (!isValidReservation(node)) {
// Don't hold the reservation if app can no longer use it
LOG.info("Releasing reservation that cannot be satisfied for " +
"application " + getApplicationAttemptId() + " on node " + node);
unreserve(reservedPriority, node);
return false;
}
// Reservation valid; try to fulfill the reservation
if (LOG.isDebugEnabled()) {
LOG.debug("Trying to fulfill reservation for application "
+ getApplicationAttemptId() + " on node: " + node);
}
// Fail early if the reserved container won't fit.
// Note that we have an assumption here that
// there's only one container size per priority.
if (Resources.fitsIn(node.getReservedContainer().getReservedResource(),
node.getAvailableResource())) {
assignContainer(node, true);
}
return true;
}
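The control flow of the method above has three outcomes, which the following sketch makes explicit. The Res and Outcome types and the tryFulfil name are hypothetical; the fitsIn check compares memory and vcores dimension by dimension, which is how the resource comparison is modeled here.

```java
// Hypothetical model of the three outcomes of assignReservedContainer.
final class ReservationFitSketch {
    static final class Res {
        final int memoryMb;
        final int vcores;

        Res(int memoryMb, int vcores) {
            this.memoryMb = memoryMb;
            this.vcores = vcores;
        }
    }

    enum Outcome { RELEASED, WAITING, ASSIGNED }

    // True when every dimension of 'smaller' fits inside 'bigger'.
    static boolean fitsIn(Res smaller, Res bigger) {
        return smaller.memoryMb <= bigger.memoryMb && smaller.vcores <= bigger.vcores;
    }

    static Outcome tryFulfil(boolean reservationStillValid, Res reserved, Res available) {
        if (!reservationStillValid) {
            return Outcome.RELEASED; // the method unreserves and returns false
        }
        // The reservation is kept in both remaining cases (the method returns true).
        return fitsIn(reserved, available) ? Outcome.ASSIGNED : Outcome.WAITING;
    }

    public static void main(String[] args) {
        Res reserved = new Res(4096, 2);
        System.out.println(tryFulfil(true, reserved, new Res(2048, 4)));
        System.out.println(tryFulfil(true, reserved, new Res(8192, 4)));
    }
}
```

The WAITING outcome matters for the caller: because the method still returns true, attemptScheduling skips the regular assignment loop for that node until the reserved container fits or the reservation is released.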
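If the reservation is still valid and the reserved container fits, the appAttempt's assignContainer function is called. The same function is also called when no appAttempt holds a reservation on the node; its second parameter indicates whether the allocation is for a reserved container: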
private Resource assignContainer(FSSchedulerNode node, boolean reserved) {
if (LOG.isTraceEnabled()) {
LOG.trace("Node offered to app: " + getName() + " reserved: " + reserved);
}
Collection<Priority> prioritiesToTry = (reserved) ?
Arrays.asList(node.getReservedContainer().getReservedPriority()) :
getPriorities();
// For each priority, see if we can schedule a node local, rack local
// or off-switch request. Rack of off-switch requests may be delayed
// (not scheduled) in order to promote better locality.
synchronized (this) {
for (Priority priority : prioritiesToTry) {
// Skip it for reserved container, since
// we already check it in isValidReservation.
if (!reserved && !hasContainerForNode(priority, node)) {
continue;
}
addSchedulingOpportunity(priority);
ResourceRequest rackLocalRequest = getResourceRequest(priority,
node.getRackName());
ResourceRequest localRequest = getResourceRequest(priority,
node.getNodeName());
if (localRequest != null && !localRequest.getRelaxLocality()) {
LOG.warn("Relax locality off is not supported on local request: "
+ localRequest);
}
NodeType allowedLocality;
if (scheduler.isContinuousSchedulingEnabled()) {
allowedLocality = getAllowedLocalityLevelByTime(priority,
scheduler.getNodeLocalityDelayMs(),
scheduler.getRackLocalityDelayMs(),
scheduler.getClock().getTime());
} else {
allowedLocality = getAllowedLocalityLevel(priority,
scheduler.getNumClusterNodes(),
scheduler.getNodeLocalityThreshold(),
scheduler.getRackLocalityThreshold());
}
if (rackLocalRequest != null && rackLocalRequest.getNumContainers() != 0
&& localRequest != null && localRequest.getNumContainers() != 0) {
return assignContainer(node, localRequest,
NodeType.NODE_LOCAL, reserved);
}
if (rackLocalRequest != null && !rackLocalRequest.getRelaxLocality()) {
continue;
}
if (rackLocalRequest != null && rackLocalRequest.getNumContainers() != 0
&& (allowedLocality.equals(NodeType.RACK_LOCAL) ||
allowedLocality.equals(NodeType.OFF_SWITCH))) {
return assignContainer(node, rackLocalRequest,
NodeType.RACK_LOCAL, reserved);
}
ResourceRequest offSwitchRequest =
getResourceRequest(priority, ResourceRequest.ANY);
if (offSwitchRequest != null && !offSwitchRequest.getRelaxLocality()) {
continue;
}
if (offSwitchRequest != null &&
offSwitchRequest.getNumContainers() != 0) {
if (!hasNodeOrRackLocalRequests(priority) ||
allowedLocality.equals(NodeType.OFF_SWITCH)) {
return assignContainer(
node, offSwitchRequest, NodeType.OFF_SWITCH, reserved);
}
}
}
}
return Resources.none();
}
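For a single priority, the order of checks in the method above can be condensed into one function. This sketch is a simplification under stated assumptions: the class and method names are hypothetical, pending container counts are passed in as plain integers, and the relaxLocality checks and the loop over priorities are left out.

```java
import java.util.Optional;

// Hypothetical condensation of the per-priority locality decision.
final class LocalityChoiceSketch {
    enum NodeType { NODE_LOCAL, RACK_LOCAL, OFF_SWITCH }

    // Returns the locality level to allocate at, or empty if this priority is skipped.
    static Optional<NodeType> choose(int nodePending, int rackPending, int anyPending,
                                     boolean hasNodeOrRackRequests, NodeType allowed) {
        // Node-local is always allowed when this node and its rack both have pending requests.
        if (rackPending > 0 && nodePending > 0) {
            return Optional.of(NodeType.NODE_LOCAL);
        }
        // Rack-local needs the allowed level to have been relaxed at least to RACK_LOCAL.
        if (rackPending > 0
                && (allowed == NodeType.RACK_LOCAL || allowed == NodeType.OFF_SWITCH)) {
            return Optional.of(NodeType.RACK_LOCAL);
        }
        // Off-switch is taken when the app never asked for a node or rack,
        // or when the allowed level has been relaxed all the way.
        if (anyPending > 0 && (!hasNodeOrRackRequests || allowed == NodeType.OFF_SWITCH)) {
            return Optional.of(NodeType.OFF_SWITCH);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(choose(0, 1, 1, true, NodeType.NODE_LOCAL));
        System.out.println(choose(0, 1, 1, true, NodeType.RACK_LOCAL));
    }
}
```

An empty result is how delay scheduling shows up in this model: the request is not served on this node now, in the hope that a node with better locality offers resources before the allowed level is relaxed.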
Returning to the case without a reservation, the following code is called:
Resource assignment = queueMgr.getRootQueue().assignContainer(node);
Allocation starts at the root queue, descends through the child queues, and finally reaches a specific appAttempt.
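The descent from the root queue to an appAttempt can be sketched as a recursive call over a tree. Everything in this sketch is an assumption made for illustration: the Schedulable interface, the App and Queue classes, the memory-only accounting, and in particular the ordering rule (children with the least memory in use are tried first), which stands in for the scheduling policy's real comparator.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of hierarchical assignment: queue -> child queue -> app.
final class QueueDescentSketch {
    interface Schedulable {
        int usedMb();
        int assignContainer(int freeMb); // returns MB granted, 0 if nothing was assigned
    }

    // Leaf of the tree: an app with a fixed container size and a pending count.
    static final class App implements Schedulable {
        private final int containerMb;
        private int pending;
        private int usedMb;

        App(int containerMb, int pending) {
            this.containerMb = containerMb;
            this.pending = pending;
        }

        public int usedMb() {
            return usedMb;
        }

        public int assignContainer(int freeMb) {
            if (pending == 0 || containerMb > freeMb) {
                return 0;
            }
            pending--;
            usedMb += containerMb;
            return containerMb;
        }
    }

    // Inner node: offers the node to its children, least-used child first.
    static final class Queue implements Schedulable {
        private final List<Schedulable> children = new ArrayList<>();

        Queue add(Schedulable child) {
            children.add(child);
            return this;
        }

        public int usedMb() {
            int sum = 0;
            for (Schedulable child : children) {
                sum += child.usedMb();
            }
            return sum;
        }

        public int assignContainer(int freeMb) {
            List<Schedulable> ordered = new ArrayList<>(children);
            ordered.sort(Comparator.comparingInt(Schedulable::usedMb)); // assumed ordering
            for (Schedulable child : ordered) {
                int granted = child.assignContainer(freeMb);
                if (granted > 0) {
                    return granted; // first child that accepts the offer wins
                }
            }
            return 0;
        }
    }

    public static void main(String[] args) {
        Queue root = new Queue()
                .add(new Queue().add(new App(1024, 2)))
                .add(new Queue().add(new App(2048, 1)));
        System.out.println(root.assignContainer(4096));
        System.out.println(root.assignContainer(3072));
    }
}
```

Each call grants at most one container, which matches the way the caller loops over the root queue's assignContainer and adds up what was assigned.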