Background

As an HDFS cluster grows, a few "slow nodes" inevitably appear, typically showing up as slow network transfers or slow disk reads and writes. Day to day these nodes are hard to spot: usually we only notice that the cluster has slowed down when a job reading or writing data on those nodes starts running longer than expected, and only then do we go looking for the specific node.

Slow nodes are therefore a standing concern in HDFS cluster operations. Since Hadoop 2.9, the community has supported viewing slow nodes via the NameNode's JMX interface.

The metrics look like the following; note that at most the top 5 nodes/disks are shown:

"SlowPeersReport":[{"SlowNode":"node4","ReportingNodes":["node1"]},{"SlowNode":"node2","ReportingNodes":["node1","node3"]},{"SlowNode":"node1","ReportingNodes":["node2"]}]

"SlowDisksReport":[{"SlowDiskID":"dn3:disk1","Latencies":{"WRITE":1000.1}},{"SlowDiskID":"dn2:disk2","Latencies":{"WRITE":1000.1}},{"SlowDiskID":"dn1:disk2","Latencies":{"READ":1000.3}},{"SlowDiskID":"dn1:disk1","Latencies":{"METADATA":1000.1,"READ":1000.8}}]

Slow Network Monitoring

Principle

The idea behind detecting a DN with a slow network is to record the packet transfer latency between DNs across the cluster, pick out the outliers, and report them to the NN as slow nodes. Normally the transfer rates between nodes are roughly uniform; if transfers from A to B become abnormally slow, A reports B to the NN as a slow node.

To compute the average latency of sends to downstream DNs, each DN internally maintains a Map<String, LinkedBlockingDeque<SumAndCount>>. The key is the downstream DN's IP; the value is a queue of SumAndCount objects, each recording the number of packets sent to that DN and how long they took.
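
A rough sketch of that bookkeeping (the real class is DataNodePeerMetrics; the names below are simplified for illustration, not the actual HDFS API):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingDeque;

class PeerLatencyTracker {
  /** One sample batch: total send time and packet count for a downstream DN. */
  static class SumAndCount {
    final double sum;  // accumulated latency in ms
    final long count;  // number of packets
    SumAndCount(double sum, long count) { this.sum = sum; this.count = count; }
  }

  // key: downstream DN address; value: queued latency samples
  private final Map<String, LinkedBlockingDeque<SumAndCount>> samples =
      new ConcurrentHashMap<>();

  /** Called after each packet is mirrored downstream. */
  void addSendPacketDownstream(String peer, long elapsedMs) {
    samples.computeIfAbsent(peer, k -> new LinkedBlockingDeque<>())
        .add(new SumAndCount(elapsedMs, 1));
  }

  /** Mean latency for one peer, the basis for averageLatency below. */
  double averageLatency(String peer) {
    LinkedBlockingDeque<SumAndCount> q = samples.get(peer);
    if (q == null || q.isEmpty()) {
      return 0;
    }
    double sum = 0;
    long count = 0;
    for (SumAndCount sc : q) {
      sum += sc.sum;
      count += sc.count;
    }
    return sum / count;
  }
}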

When a DN sends a heartbeat, it checks whether a SlowPeerReport is due and, if so, includes it in the heartbeat to the NN. The report interval is controlled by dfs.datanode.outliers.report.interval, default 30 min. The DN first averages the packet transfer latencies drained from each queue to get a per-node averageLatency, then uses those averages to compute the reporting threshold upperLimitLatency. Any node whose averageLatency exceeds upperLimitLatency is considered a slow network node and is reported by the DN that observed it. Finally the corresponding SlowPeerReport is generated and delivered to the NN via the heartbeat.

Computing the slow-node threshold upperLimitLatency

First compute the median of the transfer latencies to all downstream DNs, then the median absolute deviation (mad):

// MAD_MULTIPLIER = 1.4826
mad = median(|list[i]-median(list)|) * MAD_MULTIPLIER

The final upperLimitLatency is:

// lowThresholdMs = 5ms
upperLimitLatency = max(lowThresholdMs, median * 3, median + mad * 3)
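
As a worked example (with at least 10 peers, since detection is skipped below that): given latencies of [4, 4, 5, 5, 5, 5, 5, 6, 6, 100] ms, median = 5 ms; the sorted absolute deviations are [0, 0, 0, 0, 0, 1, 1, 1, 1, 95] with median 0.5, so mad = 0.5 * 1.4826 ≈ 0.74. Then upperLimitLatency = max(5, 5 * 3, 5 + 3 * 0.74) = 15 ms, and only the peer averaging 100 ms is reported as slow.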

The code is as follows:
org.apache.hadoop.hdfs.server.datanode.metrics.OutlierDetector.java

public Map<String, Double> getOutliers(Map<String, Double> stats) {
    // minNumResources = 10; with fewer than 10 resources, detection is skipped
    if (stats.size() < minNumResources) {
      LOG.debug("Skipping statistical outlier detection as we don't have " +
              "latency data for enough resources. Have {}, need at least {}",
          stats.size(), minNumResources);
      return ImmutableMap.of();
    }
    final List<Double> sorted = new ArrayList<>(stats.values());
    Collections.sort(sorted);
    // compute the median
    final Double median = computeMedian(sorted);
    // compute the median absolute deviation (mad)
    final Double mad = computeMad(sorted);
    // compute the outlier threshold upperLimitLatency
    Double upperLimitLatency = Math.max(
        lowThresholdMs, median * MEDIAN_MULTIPLIER);
    upperLimitLatency = Math.max(
        upperLimitLatency, median + (DEVIATION_MULTIPLIER * mad));

    final Map<String, Double> slowResources = new HashMap<>();

    // find the nodes exceeding the outlier threshold
    for (Map.Entry<String, Double> entry : stats.entrySet()) {
      if (entry.getValue() > upperLimitLatency) {
        slowResources.put(entry.getKey(), entry.getValue());
      }
    }

    return slowResources;
}

public static Double computeMad(List<Double> sortedValues) {
    ...
    // compute the median
    Double median = computeMedian(sortedValues);
    List<Double> deviations = new ArrayList<>(sortedValues);

    // compute the absolute deviations
    for (int i = 0; i < sortedValues.size(); ++i) {
      deviations.set(i, Math.abs(sortedValues.get(i) - median));
    }

    Collections.sort(deviations);
    // MAD_MULTIPLIER = 1.4826
    return computeMedian(deviations) * MAD_MULTIPLIER;
}

public static Double computeMedian(List<Double> sortedValues) {
    ...
    Double median = sortedValues.get(sortedValues.size() / 2);
    if (sortedValues.size() % 2 == 0) {
      median += sortedValues.get((sortedValues.size() / 2) - 1);
      median /= 2;
    }
    return median;
}
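
To sanity-check the math, here is a small standalone re-implementation of the two helpers run against the worked example above; this is an illustrative sketch, not the HDFS source:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class OutlierMathDemo {
  static final double MAD_MULTIPLIER = 1.4826;

  public static void main(String[] args) {
    List<Double> latencies = new ArrayList<>(List.of(
        4.0, 4.0, 5.0, 5.0, 5.0, 5.0, 5.0, 6.0, 6.0, 100.0));
    Collections.sort(latencies);
    double median = median(latencies);
    double mad = mad(latencies);
    double upperLimit = Math.max(5.0,  // lowThresholdMs
        Math.max(median * 3, median + mad * 3));
    // expected output: median=5.0 mad=0.7413 upperLimitLatency=15.0
    System.out.printf("median=%s mad=%s upperLimitLatency=%s%n",
        median, mad, upperLimit);
  }

  static double mad(List<Double> sorted) {
    double median = median(sorted);
    List<Double> deviations = new ArrayList<>();
    for (double v : sorted) {
      deviations.add(Math.abs(v - median));
    }
    Collections.sort(deviations);
    return median(deviations) * MAD_MULTIPLIER;
  }

  static double median(List<Double> sorted) {
    double m = sorted.get(sorted.size() / 2);
    if (sorted.size() % 2 == 0) {
      m = (m + sorted.get(sorted.size() / 2 - 1)) / 2;
    }
    return m;
  }
}
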
Code walkthrough of the monitoring flow

org.apache.hadoop.hdfs.server.datanode.DataNode.java
It starts in the DataNode's startDataNode method, which creates the DataNodePeerMetrics object:

void startDataNode(List<StorageLocation> dataDirectories,
                     SecureResources resources
                     ) throws IOException {
   ...
   // peerStatsEnabled is controlled by the dfs.datanode.peer.stats.enabled parameter
   peerMetrics = dnConf.peerStatsEnabled ?
        DataNodePeerMetrics.create(getDisplayName(), getConf()) : null;
   ...
}

org.apache.hadoop.hdfs.server.datanode.BlockReceiver.java
The BlockReceiver class measures each packet's transfer time and records it in the DataNodePeerMetrics object. Note that only the penultimate node in the pipeline records the latency of its send to the last node (see isPenultimateNode below):

private int receivePacket() throws IOException {
    ...
    //First write the packet to the mirror:
    if (mirrorOut != null && !mirrorError) {
      try {
        // record the start time
        long begin = Time.monotonicNow();
        DataNodeFaultInjector.get().stopSendingPacketDownstream(mirrorAddr);
        packetReceiver.mirrorPacketTo(mirrorOut);
        mirrorOut.flush();
        long now = Time.monotonicNow();
        this.lastSentTime.set(now);
        // compute the packet transfer time
        long duration = now - begin;
        DataNodeFaultInjector.get().logDelaySendingPacketDownstream(
            mirrorAddr,
            duration);
        // record the latency in DataNodePeerMetrics
        trackSendPacketToLastNodeInPipeline(duration);
        if (duration > datanodeSlowLogThresholdMs && LOG.isWarnEnabled()) {
          LOG.warn("Slow BlockReceiver write packet to mirror took " + duration
              + "ms (threshold=" + datanodeSlowLogThresholdMs + "ms), "
              + "downstream DNs=" + Arrays.toString(downstreamDNs)
              + ", blockId=" + replicaInfo.getBlockId());
        }
      } catch (IOException e) {
        handleMirrorOutError(e);
      }
    }
    ...
}

private void trackSendPacketToLastNodeInPipeline(final long elapsedMs) {
    // get the DataNodePeerMetrics object
    final DataNodePeerMetrics peerMetrics = datanode.getPeerMetrics();
    // whether peerMetrics is null is determined by dfs.datanode.peer.stats.enabled
    if (peerMetrics != null && isPenultimateNode) {
      peerMetrics.addSendPacketDownstream(mirrorNameForMetrics, elapsedMs);
    }
  }

org.apache.hadoop.hdfs.server.datanode.BPServiceActor.java
When BPServiceActor sends a heartbeat, it pulls the slow-node data from the DataNodePeerMetrics object, builds a SlowPeerReports, and sends it to the NN:

HeartbeatResponse sendHeartBeat(boolean requestBlockReportLease)
      throws IOException {
    ...
    // check whether the report interval (default 30 min) has elapsed
    final boolean outliersReportDue = scheduler.isOutliersReportDue(now);
    // generate the slow-peer report only when due
    final SlowPeerReports slowPeers =
        outliersReportDue && dn.getPeerMetrics() != null ?
            SlowPeerReports.create(dn.getPeerMetrics().getOutliers()) :
            SlowPeerReports.EMPTY_REPORT;
    final SlowDiskReports slowDisks =
        outliersReportDue && dn.getDiskMetrics() != null ?
            SlowDiskReports.create(dn.getDiskMetrics().getDiskOutliersStats()) :
            SlowDiskReports.EMPTY_REPORT;

    HeartbeatResponse response = bpNamenode.sendHeartbeat(bpRegistration,
        reports,
        dn.getFSDataset().getCacheCapacity(),
        dn.getFSDataset().getCacheUsed(),
        dn.getXmitsInProgress(),
        dn.getXceiverCount(),
        numFailedVolumes,
        volumeFailureSummary,
        requestBlockReportLease,
        // the slow-peer report is sent to the NN with the heartbeat
        slowPeers,
        slowDisks);
    ...
  }

org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.java
On the NameNode side, the report is handled in DatanodeManager's handleHeartbeat:

public DatanodeCommand[] handleHeartbeat(DatanodeRegistration nodeReg,
      StorageReport[] reports, final String blockPoolId,
      long cacheCapacity, long cacheUsed, int xceiverCount, 
      int maxTransfers, int failedVolumes,
      VolumeFailureSummary volumeFailureSummary,
      @Nonnull SlowPeerReports slowPeers,
      @Nonnull SlowDiskReports slowDisks) throws IOException {
    ...
    // whether slowPeerTracker is null is determined by dfs.datanode.peer.stats.enabled
    if (slowPeerTracker != null) {
      final Map<String, Double> slowPeersMap = slowPeers.getSlowPeers();
      if (!slowPeersMap.isEmpty()) {
        if (LOG.isDebugEnabled()) {
          LOG.debug("DataNode " + nodeReg + " reported slow peers: " +
              slowPeersMap);
        }
        for (String slowNodeId : slowPeersMap.keySet()) {
          // aggregate the slow-node info into the slowPeerTracker object
          slowPeerTracker.addReport(slowNodeId, nodeReg.getIpcAddr(false));
        }
      }
    }
    ...
}
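
Inside the tracker, the bookkeeping boils down to a two-level map from each slow node to the nodes that reported it, with the report timestamp kept so stale entries can be expired later. A simplified sketch (names loosely follow SlowPeerTracker, with details trimmed):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class SlowPeerReportsStore {
  // slowNode -> (reportingNode -> timestamp of its latest report)
  private final ConcurrentMap<String, ConcurrentMap<String, Long>> allReports =
      new ConcurrentHashMap<>();

  /** Record that reportingNode flagged slowNode at time nowMs. */
  void addReport(String slowNode, String reportingNode, long nowMs) {
    allReports.computeIfAbsent(slowNode, k -> new ConcurrentHashMap<>())
        .put(reportingNode, nowMs);
  }
}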

org.apache.hadoop.hdfs.server.blockmanagement.SlowPeerTracker.java
SlowPeerTracker's getJsonReports is eventually called by the NN to produce the slow-node JSON:

private Collection<ReportForJson> getJsonReports(int numNodes) {
    ...
    // min-heap ordered by how many nodes reported each slow node, used to keep the top N
    final PriorityQueue<ReportForJson> topNReports =
        new PriorityQueue<>(allReports.size(),
            new Comparator<ReportForJson>() {
          @Override
          public int compare(ReportForJson o1, ReportForJson o2) {
            return Ints.compare(o1.reportingNodes.size(),
                o2.reportingNodes.size());
          }
        });
    // record the current time
    final long now = timer.monotonicNow();

    for (Map.Entry<String, ConcurrentMap<String, Long>> entry :
        allReports.entrySet()) {
      // filter out expired slow-node reports
      SortedSet<String> validReports = filterNodeReports(
          entry.getValue(), now);
      if (!validReports.isEmpty()) {
        // numNodes is fixed at 5, so only the top 5 nodes are kept
        if (topNReports.size() < numNodes) {
          topNReports.add(new ReportForJson(entry.getKey(), validReports));
        } else if (topNReports.peek().getReportingNodes().size() <
            validReports.size()){
          topNReports.poll();
          topNReports.add(new ReportForJson(entry.getKey(), validReports));
        }
      }
    }
    return topNReports;
}
Related community patches

https://issues.apache.org/jira/browse/HDFS-10917(Collect peer performance statistics on DataNode.)
https://issues.apache.org/jira/browse/HDFS-11194(Maintain aggregated peer performance metrics on NameNode)

Related parameters
<property>
	<name>dfs.datanode.peer.stats.enabled</name>
	<value>false</value>
	<description>A switch to turn on/off tracking DataNode peer statistics.</description>
</property>
<property>
	<name>dfs.datanode.peer.metrics.min.outlier.detection.samples</name>
	<value>1000</value>
	<description>Minimum number of packet send samples which are required to qualify for outlier detection. If the number of samples is below this then outlier detection is skipped.</description>
</property>
<property>
	<name>dfs.datanode.outliers.report.interval</name>
	<value>30m</value>
	<description>This setting controls how frequently DataNodes will report their peer latencies to the NameNode via heartbeats.</description>
</property>
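
Peer stats are off by default; to turn them on, override the switch in hdfs-site.xml:

<property>
	<name>dfs.datanode.peer.stats.enabled</name>
	<value>true</value>
</property>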

Slow Disk Monitoring

Principle

The idea behind detecting a slow disk is to record the latency of read/write operations on all of a node's disks, pick out the outliers, and report them to the NN as slow disks.

When a DataNode starts, if dfs.datanode.fileio.profiling.sampling.percentage is greater than 0, it initializes a DataNodeDiskMetrics object, which in turn starts a background thread. Every dfs.datanode.outliers.report.interval (default 30 min), the thread pulls each disk's average metadata, read, and write operation latencies from DataNodeVolumeMetrics and computes the slow-disk threshold upperLimitLatency (the same math as for slow network nodes). Any disk whose average latency for one of these operations exceeds upperLimitLatency is considered slow; a SlowDiskReports object is then generated and sent to the NN via the heartbeat.
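
Note that dfs.datanode.fileio.profiling.sampling.percentage also controls what fraction of file I/O events are timed at all. A hedged sketch of percentage-based sampling (the real logic lives in ProfilingFileIoEvents; this simplified class is illustrative only):

import java.util.concurrent.ThreadLocalRandom;

class IoSampler {
  // value of dfs.datanode.fileio.profiling.sampling.percentage (0-100)
  private final int samplingPercentage;

  IoSampler(int samplingPercentage) {
    this.samplingPercentage = samplingPercentage;
  }

  /** Decide whether to time this file I/O event. */
  boolean shouldSample() {
    return samplingPercentage > 0
        && ThreadLocalRandom.current().nextInt(100) < samplingPercentage;
  }
}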

Code walkthrough of the monitoring flow

org.apache.hadoop.hdfs.server.datanode.DataNode.java
Again, the DataNode's startDataNode method creates the DataNodeDiskMetrics object:

void startDataNode(List<StorageLocation> dataDirectories,
                     SecureResources resources
                     ) throws IOException {
   ...
   // diskStatsEnabled is derived from the dfs.datanode.fileio.profiling.sampling.percentage parameter
   if (dnConf.diskStatsEnabled) {
      // create the DataNodeDiskMetrics object
      diskMetrics = new DataNodeDiskMetrics(this,
          dnConf.outliersReportIntervalMs);
   }
   ...
}

org.apache.hadoop.hdfs.server.datanode.metrics.DataNodeDiskMetrics.java
The DataNodeDiskMetrics class starts a disk-checking thread that identifies disks slow at metadata, readIo, and writeIo operations:

public DataNodeDiskMetrics(DataNode dn, long diskOutlierDetectionIntervalMs) {
    this.dn = dn;
    // the check interval is set by dfs.datanode.outliers.report.interval
    this.detectionInterval = diskOutlierDetectionIntervalMs;
    slowDiskDetector = new OutlierDetector(MIN_OUTLIER_DETECTION_DISKS,
        SLOW_DISK_LOW_THRESHOLD_MS);
    shouldRun = true;
    // start the disk outlier detection thread
    startDiskOutlierDetectionThread();
}

private void startDiskOutlierDetectionThread() {
    slowDiskDetectionDaemon = new Daemon(new Runnable() {
      @Override
      public void run() {
        while (shouldRun) {
          if (dn.getFSDataset() != null) {
            // maps holding per-volume average operation latencies
            Map<String, Double> metadataOpStats = Maps.newHashMap();
            Map<String, Double> readIoStats = Maps.newHashMap();
            Map<String, Double> writeIoStats = Maps.newHashMap();
            FsDatasetSpi.FsVolumeReferences fsVolumeReferences = null;
            try {
              // get references to all of the DataNode's volumes
              fsVolumeReferences = dn.getFSDataset().getFsVolumeReferences();
              Iterator<FsVolumeSpi> volumeIterator = fsVolumeReferences
                  .iterator();
              // iterate over each volume
              while (volumeIterator.hasNext()) {
                FsVolumeSpi volume = volumeIterator.next();
                // get the DataNodeVolumeMetrics object
                DataNodeVolumeMetrics metrics = volume.getMetrics();
                // get the volume path
                String volumeName = volume.getBaseURI().getPath();

                // store each volume's average operation latencies in the maps
                metadataOpStats.put(volumeName,
                    metrics.getMetadataOperationMean());
                readIoStats.put(volumeName, metrics.getReadIoMean());
                writeIoStats.put(volumeName, metrics.getWriteIoMean());
              }
            } finally {
              if (fsVolumeReferences != null) {
                try {
                  fsVolumeReferences.close();
                } catch (IOException e) {
                  LOG.error("Error in releasing FS Volume references", e);
                }
              }
            }
            if (metadataOpStats.isEmpty() && readIoStats.isEmpty()
                && writeIoStats.isEmpty()) {
              LOG.debug("No disk stats available for detecting outliers.");
              continue;
            }
            // check for slow disks
            detectAndUpdateDiskOutliers(metadataOpStats, readIoStats,
                writeIoStats);
          }

          try {
            Thread.sleep(detectionInterval);
          } catch (InterruptedException e) {
            LOG.error("Disk Outlier Detection thread interrupted", e);
            Thread.currentThread().interrupt();
          }
        }
      }
    });
    slowDiskDetectionDaemon.start();
  }

private void detectAndUpdateDiskOutliers(Map<String, Double> metadataOpStats,
      Map<String, Double> readIoStats, Map<String, Double> writeIoStats) {
    Map<String, Map<DiskOp, Double>> diskStats = Maps.newHashMap();

    // find disks slow at metadata operations
    Map<String, Double> metadataOpOutliers = slowDiskDetector
        .getOutliers(metadataOpStats);
    for (Map.Entry<String, Double> entry : metadataOpOutliers.entrySet()) {
      addDiskStat(diskStats, entry.getKey(), DiskOp.METADATA, entry.getValue());
    }

    // find disks slow at read I/O
    Map<String, Double> readIoOutliers = slowDiskDetector
        .getOutliers(readIoStats);
    for (Map.Entry<String, Double> entry : readIoOutliers.entrySet()) {
      addDiskStat(diskStats, entry.getKey(), DiskOp.READ, entry.getValue());
    }

    // find disks slow at write I/O
    Map<String, Double> writeIoOutliers = slowDiskDetector
        .getOutliers(writeIoStats);
    for (Map.Entry<String, Double> entry : writeIoOutliers.entrySet()) {
      addDiskStat(diskStats, entry.getKey(), DiskOp.WRITE, entry.getValue());
    }
    if (overrideStatus) {
      // publish the slow-disk data to diskOutliersStats
      diskOutliersStats = diskStats;
      LOG.debug("Updated disk outliers.");
    }
}

org.apache.hadoop.hdfs.server.datanode.BPServiceActor.java
Likewise, when BPServiceActor sends a heartbeat it pulls the slow-disk data from the DataNodeDiskMetrics object, builds a SlowDiskReports, and sends it to the NN:

HeartbeatResponse sendHeartBeat(boolean requestBlockReportLease)
      throws IOException {
    ...
    // check whether the report interval (default 30 min) has elapsed
    final boolean outliersReportDue = scheduler.isOutliersReportDue(now);
    final SlowPeerReports slowPeers =
        outliersReportDue && dn.getPeerMetrics() != null ?
            SlowPeerReports.create(dn.getPeerMetrics().getOutliers()) :
            SlowPeerReports.EMPTY_REPORT;
    // generate the slow-disk report only when due
    final SlowDiskReports slowDisks =
        outliersReportDue && dn.getDiskMetrics() != null ?
            SlowDiskReports.create(dn.getDiskMetrics().getDiskOutliersStats()) :
            SlowDiskReports.EMPTY_REPORT;

    HeartbeatResponse response = bpNamenode.sendHeartbeat(bpRegistration,
        reports,
        dn.getFSDataset().getCacheCapacity(),
        dn.getFSDataset().getCacheUsed(),
        dn.getXmitsInProgress(),
        dn.getXceiverCount(),
        numFailedVolumes,
        volumeFailureSummary,
        requestBlockReportLease,
        slowPeers,
        // the slow-disk report is sent to the NN with the heartbeat
        slowDisks);
    ...
  }

org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.java
On the NameNode side, the report is handled in DatanodeManager's handleHeartbeat:

public DatanodeCommand[] handleHeartbeat(DatanodeRegistration nodeReg,
      StorageReport[] reports, final String blockPoolId,
      long cacheCapacity, long cacheUsed, int xceiverCount, 
      int maxTransfers, int failedVolumes,
      VolumeFailureSummary volumeFailureSummary,
      @Nonnull SlowPeerReports slowPeers,
      @Nonnull SlowDiskReports slowDisks) throws IOException {
    ...
    // whether slowDiskTracker is null is determined by dfs.datanode.fileio.profiling.sampling.percentage
    if (slowDiskTracker != null) {
      if (!slowDisks.getSlowDisks().isEmpty()) {
        if (LOG.isDebugEnabled()) {
          LOG.debug("DataNode " + nodeReg + " reported slow disks: " +
              slowDisks.getSlowDisks());
        }
        // store the slow-disk info in the slowDiskTracker object
        slowDiskTracker.addSlowDiskReport(nodeReg.getIpcAddr(false), slowDisks);
      }
      slowDiskTracker.checkAndUpdateReportIfNecessary();
    }
    ...
}

org.apache.hadoop.hdfs.server.blockmanagement.SlowDiskTracker.java
SlowDiskTracker builds the slow-disk JSON with essentially the same logic as the slow-node path above:

private ArrayList<DiskLatency> getSlowDisks(
      Map<String, DiskLatency> reports, int numDisks, long now) {
    ...
    // min-heap ordered by max latency, used to keep the top N
    final PriorityQueue<DiskLatency> topNReports = new PriorityQueue<>(
        reports.size(),
        new Comparator<DiskLatency>() {
          @Override
          public int compare(DiskLatency o1, DiskLatency o2) {
            return Doubles.compare(
                o1.getMaxLatency(), o2.getMaxLatency());
          }
        });

    ArrayList<DiskLatency> oldSlowDiskIDs = Lists.newArrayList();

    for (Map.Entry<String, DiskLatency> entry : reports.entrySet()) {
      DiskLatency diskLatency = entry.getValue();
      // filter out expired slow-disk reports
      if (now - diskLatency.timestamp < reportValidityMs) {
        // numDisks is fixed at 5, so only the top 5 disks are kept
        if (topNReports.size() < numDisks) {
          topNReports.add(diskLatency);
        } else if (topNReports.peek().getMaxLatency() <
            diskLatency.getMaxLatency()) {
          topNReports.poll();
          topNReports.add(diskLatency);
        }
      } else {
        oldSlowDiskIDs.add(diskLatency);
      }
    }

    oldSlowDisksCheck = oldSlowDiskIDs;

    return Lists.newArrayList(topNReports);
  }
Related community patches

https://issues.apache.org/jira/browse/HDFS-10959(Adding per disk IO statistics and metrics in DataNode.)
https://issues.apache.org/jira/browse/HDFS-11545(Propagate DataNode’s slow disks info to the NameNode via Heartbeat)
https://issues.apache.org/jira/browse/HDFS-11551(Handle SlowDiskReport from DataNode at the NameNode)

Related parameters
<property>
	<name>dfs.datanode.fileio.profiling.sampling.percentage</name>
	<value>0</value>
	<description>This setting controls the percentage of file I/O events which will be profiled for DataNode disk statistics. The default value of 0 disables disk statistics. Set to an integer value between 1 and 100 to enable disk statistics.</description>
</property>
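
Disk stats are off by default (0); for example, to profile 10% of file I/O events (any integer from 1 to 100 enables the feature):

<property>
	<name>dfs.datanode.fileio.profiling.sampling.percentage</name>
	<value>10</value>
</property>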
