Background
Once an HDFS cluster grows large enough, some "slow nodes" inevitably appear in it, showing up mainly as slow network transfers or slow disk reads/writes. These nodes are hard to spot in day-to-day operation: usually we only notice the cluster has slowed down when a job's reads or writes happen to touch one of them and the job runs longer than expected, and only then do we go looking for the node that slowed down.
Slow nodes have therefore always been a key concern in HDFS cluster operations. Since Hadoop 2.9, the community has supported viewing slow nodes through the NameNode JMX.
The metrics look like the following; note that at most the top 5 nodes/disks are shown:
"SlowPeersReport":[{"SlowNode":"node4","ReportingNodes":["node1"]},{"SlowNode":"node2","ReportingNodes":["node1","node3"]},{"SlowNode":"node1","ReportingNodes":["node2"]}]
"SlowDisksReport":[{"SlowDiskID":"dn3:disk1","Latencies":{"WRITE":1000.1}},{"SlowDiskID":"dn2:disk2","Latencies":{"WRITE":1000.1}},{"SlowDiskID":"dn1:disk2","Latencies":{"READ":1000.3}},{"SlowDiskID":"dn1:disk1","Latencies":{"METADATA":1000.1,"READ":1000.8}}]
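As a quick way to consume this metric, the sketch below pulls the SlowNode names out of a SlowPeersReport string with a plain regex. This is only an illustration for reading the JMX output; the class name and the regex approach are my own, not part of Hadoop:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Extracts the "SlowNode" values from a SlowPeersReport JSON string.
public class SlowPeersReportParser {
    private static final Pattern SLOW_NODE =
        Pattern.compile("\"SlowNode\":\"([^\"]+)\"");

    public static List<String> slowNodes(String report) {
        List<String> nodes = new ArrayList<>();
        Matcher m = SLOW_NODE.matcher(report);
        while (m.find()) {
            nodes.add(m.group(1));
        }
        return nodes;
    }

    public static void main(String[] args) {
        String report = "[{\"SlowNode\":\"node4\",\"ReportingNodes\":[\"node1\"]},"
            + "{\"SlowNode\":\"node2\",\"ReportingNodes\":[\"node1\",\"node3\"]}]";
        System.out.println(slowNodes(report)); // prints [node4, node2]
    }
}
```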
Slow Network Monitoring
Principle
The idea behind detecting a DN with a slow network is to record the packet transfer latency between the DNs in the cluster, pick out the outliers, and report them to the NN as slow nodes. Under normal conditions the transfer rates between nodes are roughly equal; if transfers from A to B take abnormally long, A reports B to the NN as a slow node.
To compute the average latency of transfers to downstream DNs, each DN internally maintains a Map<String, LinkedBlockingDeque<SumAndCount>>, where the key is the downstream DN's IP and the value is a queue of SumAndCount objects, each recording the number of packets sent to that downstream DN and the time they took.
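A minimal sketch of this bookkeeping is shown below. SumAndCount here is a simplified stand-in for Hadoop's class of the same name, and addSample/averageLatency are hypothetical helper names, shown only to make clear how the per-peer average falls out of the queue:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingDeque;

// Simplified sketch of the per-peer latency bookkeeping described above.
public class PeerLatencySketch {
    static class SumAndCount {
        final double sum;  // total packet transfer time, ms
        final long count;  // number of packets
        SumAndCount(double sum, long count) { this.sum = sum; this.count = count; }
    }

    // key: downstream DN ip, value: queue of (sum, count) samples
    private final Map<String, LinkedBlockingDeque<SumAndCount>> samples =
        new ConcurrentHashMap<>();

    void addSample(String downstreamIp, double elapsedMs) {
        samples.computeIfAbsent(downstreamIp, k -> new LinkedBlockingDeque<>())
               .add(new SumAndCount(elapsedMs, 1));
    }

    // averageLatency for one downstream DN, as used for outlier detection
    double averageLatency(String downstreamIp) {
        double sum = 0;
        long count = 0;
        for (SumAndCount sc : samples.getOrDefault(downstreamIp,
                new LinkedBlockingDeque<>())) {
            sum += sc.sum;
            count += sc.count;
        }
        return count == 0 ? 0 : sum / count;
    }
}
```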
When sending a heartbeat, the DN checks whether it is time to generate a SlowPeerReport and, if so, includes it in the heartbeat sent to the NN. The generation interval is controlled by the dfs.datanode.outliers.report.interval parameter, default 30 min. The DN first takes all the recorded packet transfer times out of each queue and averages them into an averageLatency per downstream node, then computes the slow-node reporting threshold upperLimitLatency from those averageLatency values. Any node whose averageLatency exceeds upperLimitLatency is considered a slow network node, and the observing DN reports it. Finally the DN generates the corresponding SlowPeerReport and sends it to the NN via the heartbeat.
How the slow-node threshold upperLimitLatency is computed
First compute the median of all downstream DNs' transfer latencies, then the median absolute deviation (mad):
// MAD_MULTIPLIER = 1.4826
mad = median(|list[i]-median(list)|) * MAD_MULTIPLIER
The final upperLimitLatency is:
// lowThresholdMs = 5ms
upperLimitLatency = max(lowThresholdMs, median * 3, median + mad * 3)
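Plugging numbers in makes the formula concrete. The sketch below uses the same constants as the Hadoop code (MAD_MULTIPLIER = 1.4826, multipliers of 3, lowThresholdMs = 5) on a hypothetical set of ten per-peer average latencies:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Worked example of the upperLimitLatency formula with hypothetical latencies.
public class ThresholdExample {
    static final double MAD_MULTIPLIER = 1.4826;
    static final long LOW_THRESHOLD_MS = 5;

    static double median(List<Double> sorted) {
        double m = sorted.get(sorted.size() / 2);
        if (sorted.size() % 2 == 0) {
            m = (m + sorted.get(sorted.size() / 2 - 1)) / 2;
        }
        return m;
    }

    static double mad(List<Double> sorted) {
        double med = median(sorted);
        List<Double> dev = new ArrayList<>();
        for (double v : sorted) {
            dev.add(Math.abs(v - med));
        }
        Collections.sort(dev);
        return median(dev) * MAD_MULTIPLIER;
    }

    static double upperLimitLatency(List<Double> latencies) {
        List<Double> sorted = new ArrayList<>(latencies);
        Collections.sort(sorted);
        double med = median(sorted);
        return Math.max(LOW_THRESHOLD_MS,
            Math.max(med * 3, med + 3 * mad(sorted)));
    }

    public static void main(String[] args) {
        // ten downstream DNs, one clearly slow
        List<Double> avg = Arrays.asList(
            4.0, 4.2, 4.1, 3.9, 4.0, 4.3, 4.1, 4.0, 3.8, 40.0);
        System.out.println(upperLimitLatency(avg));
    }
}
```

With these numbers the median is 4.05 ms, so the threshold comes out to median * 3 ≈ 12.15 ms, and only the 40 ms peer is flagged.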
The code in detail:
org.apache.hadoop.hdfs.server.datanode.metrics.OutlierDetector.java
public Map<String, Double> getOutliers(Map<String, Double> stats) {
// minNumResources = 10: skip detection when there are fewer than 10 nodes
if (stats.size() < minNumResources) {
LOG.debug("Skipping statistical outlier detection as we don't have " +
"latency data for enough resources. Have {}, need at least {}",
stats.size(), minNumResources);
return ImmutableMap.of();
}
final List<Double> sorted = new ArrayList<>(stats.values());
Collections.sort(sorted);
// compute the median
final Double median = computeMedian(sorted);
// compute the median absolute deviation (mad)
final Double mad = computeMad(sorted);
// compute the outlier threshold upperLimitLatency
Double upperLimitLatency = Math.max(
lowThresholdMs, median * MEDIAN_MULTIPLIER);
upperLimitLatency = Math.max(
upperLimitLatency, median + (DEVIATION_MULTIPLIER * mad));
final Map<String, Double> slowResources = new HashMap<>();
// collect nodes whose latency exceeds the threshold
for (Map.Entry<String, Double> entry : stats.entrySet()) {
if (entry.getValue() > upperLimitLatency) {
slowResources.put(entry.getKey(), entry.getValue());
}
}
return slowResources;
}
public static Double computeMad(List<Double> sortedValues) {
...
// compute the median
Double median = computeMedian(sortedValues);
List<Double> deviations = new ArrayList<>(sortedValues);
// compute the absolute deviations
for (int i = 0; i < sortedValues.size(); ++i) {
deviations.set(i, Math.abs(sortedValues.get(i) - median));
}
Collections.sort(deviations);
// MAD_MULTIPLIER = 1.4826
return computeMedian(deviations) * MAD_MULTIPLIER;
}
public static Double computeMedian(List<Double> sortedValues) {
...
Double median = sortedValues.get(sortedValues.size() / 2);
if (sortedValues.size() % 2 == 0) {
median += sortedValues.get((sortedValues.size() / 2) - 1);
median /= 2;
}
return median;
}
Code walkthrough of the monitoring flow
org.apache.hadoop.hdfs.server.datanode.DataNode.java
First, the DataNode's startDataNode method creates the DataNodePeerMetrics object:
void startDataNode(List<StorageLocation> dataDirectories,
SecureResources resources
) throws IOException {
...
// peerStatsEnabled is controlled by the dfs.datanode.peer.stats.enabled parameter
peerMetrics = dnConf.peerStatsEnabled ?
DataNodePeerMetrics.create(getDisplayName(), getConf()) : null;
...
}
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.java
The BlockReceiver class records each packet's transfer latency and writes it into the DataNodePeerMetrics object:
private int receivePacket() throws IOException {
...
//First write the packet to the mirror:
if (mirrorOut != null && !mirrorError) {
try {
// record the start time
long begin = Time.monotonicNow();
DataNodeFaultInjector.get().stopSendingPacketDownstream(mirrorAddr);
packetReceiver.mirrorPacketTo(mirrorOut);
mirrorOut.flush();
long now = Time.monotonicNow();
this.lastSentTime.set(now);
// compute the packet transfer latency
long duration = now - begin;
DataNodeFaultInjector.get().logDelaySendingPacketDownstream(
mirrorAddr,
duration);
// record the latency in DataNodePeerMetrics
trackSendPacketToLastNodeInPipeline(duration);
if (duration > datanodeSlowLogThresholdMs && LOG.isWarnEnabled()) {
LOG.warn("Slow BlockReceiver write packet to mirror took " + duration
+ "ms (threshold=" + datanodeSlowLogThresholdMs + "ms), "
+ "downstream DNs=" + Arrays.toString(downstreamDNs)
+ ", blockId=" + replicaInfo.getBlockId());
}
} catch (IOException e) {
handleMirrorOutError(e);
}
}
...
}
private void trackSendPacketToLastNodeInPipeline(final long elapsedMs) {
// get the DataNodePeerMetrics object
final DataNodePeerMetrics peerMetrics = datanode.getPeerMetrics();
// peerMetrics is non-null only when dfs.datanode.peer.stats.enabled is true
if (peerMetrics != null && isPenultimateNode) {
peerMetrics.addSendPacketDownstream(mirrorNameForMetrics, elapsedMs);
}
}
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.java
When sending the heartbeat, the BPServiceActor class pulls the slow-node data from the DataNodePeerMetrics object, assembles it into a SlowPeerReports, and sends it to the NN:
HeartbeatResponse sendHeartBeat(boolean requestBlockReportLease)
throws IOException {
...
// check whether the report interval (default 30 min) has elapsed
final boolean outliersReportDue = scheduler.isOutliersReportDue(now);
// whether to generate the slow peer report
final SlowPeerReports slowPeers =
outliersReportDue && dn.getPeerMetrics() != null ?
SlowPeerReports.create(dn.getPeerMetrics().getOutliers()) :
SlowPeerReports.EMPTY_REPORT;
final SlowDiskReports slowDisks =
outliersReportDue && dn.getDiskMetrics() != null ?
SlowDiskReports.create(dn.getDiskMetrics().getDiskOutliersStats()) :
SlowDiskReports.EMPTY_REPORT;
HeartbeatResponse response = bpNamenode.sendHeartbeat(bpRegistration,
reports,
dn.getFSDataset().getCacheCapacity(),
dn.getFSDataset().getCacheUsed(),
dn.getXmitsInProgress(),
dn.getXceiverCount(),
numFailedVolumes,
volumeFailureSummary,
requestBlockReportLease,
// the slow peer report is sent to the NN with the heartbeat
slowPeers,
slowDisks);
...
}
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.java
On the NameNode side, the report is handled in DatanodeManager's handleHeartbeat:
public DatanodeCommand[] handleHeartbeat(DatanodeRegistration nodeReg,
StorageReport[] reports, final String blockPoolId,
long cacheCapacity, long cacheUsed, int xceiverCount,
int maxTransfers, int failedVolumes,
VolumeFailureSummary volumeFailureSummary,
@Nonnull SlowPeerReports slowPeers,
@Nonnull SlowDiskReports slowDisks) throws IOException {
...
// slowPeerTracker is non-null only when dfs.datanode.peer.stats.enabled is true
if (slowPeerTracker != null) {
final Map<String, Double> slowPeersMap = slowPeers.getSlowPeers();
if (!slowPeersMap.isEmpty()) {
if (LOG.isDebugEnabled()) {
LOG.debug("DataNode " + nodeReg + " reported slow peers: " +
slowPeersMap);
}
for (String slowNodeId : slowPeersMap.keySet()) {
// aggregate the slow-node info into the slowPeerTracker object
slowPeerTracker.addReport(slowNodeId, nodeReg.getIpcAddr(false));
}
}
}
...
}
org.apache.hadoop.hdfs.server.blockmanagement.SlowPeerTracker.java
SlowPeerTracker's getJsonReports is ultimately invoked by the NN to generate the slow-node JSON:
private Collection<ReportForJson> getJsonReports(int numNodes) {
...
// priority queue used to rank slow nodes by how many nodes reported them
final PriorityQueue<ReportForJson> topNReports =
new PriorityQueue<>(allReports.size(),
new Comparator<ReportForJson>() {
@Override
public int compare(ReportForJson o1, ReportForJson o2) {
return Ints.compare(o1.reportingNodes.size(),
o2.reportingNodes.size());
}
});
// record the current time
final long now = timer.monotonicNow();
for (Map.Entry<String, ConcurrentMap<String, Long>> entry :
allReports.entrySet()) {
// filter out expired slow-node reports
SortedSet<String> validReports = filterNodeReports(
entry.getValue(), now);
if (!validReports.isEmpty()) {
// numNodes is fixed at 5: keep only the top 5 nodes
if (topNReports.size() < numNodes) {
topNReports.add(new ReportForJson(entry.getKey(), validReports));
} else if (topNReports.peek().getReportingNodes().size() <
validReports.size()){
topNReports.poll();
topNReports.add(new ReportForJson(entry.getKey(), validReports));
}
}
}
return topNReports;
}
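The selection above is the standard fixed-size min-heap pattern: keep a PriorityQueue ordered ascending by the ranking key, and once it holds N entries, evict the smallest whenever a larger candidate arrives. A generic sketch of the same idea (TopN is a made-up name, not a Hadoop class):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Generic version of the fixed-size min-heap used by SlowPeerTracker:
// keeps the n largest elements according to the comparator.
public class TopN {
    public static <T> List<T> topN(Iterable<T> items, int n, Comparator<T> cmp) {
        PriorityQueue<T> heap = new PriorityQueue<>(n, cmp); // min-heap
        for (T item : items) {
            if (heap.size() < n) {
                heap.add(item);
            } else if (cmp.compare(heap.peek(), item) < 0) {
                heap.poll();  // evict the current smallest
                heap.add(item);
            }
        }
        return new ArrayList<>(heap); // unordered, like getJsonReports
    }
}
```

This keeps memory at O(N) no matter how many reports arrive, which is presumably why both SlowPeerTracker and SlowDiskTracker are written this way.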
Related community patches
https://issues.apache.org/jira/browse/HDFS-10917(Collect peer performance statistics on DataNode.)
https://issues.apache.org/jira/browse/HDFS-11194(Maintain aggregated peer performance metrics on NameNode)
Related parameters
<property>
<name>dfs.datanode.peer.stats.enabled</name>
<value>false</value>
<description>A switch to turn on/off tracking DataNode peer statistics.</description>
</property>
<property>
<name>dfs.datanode.peer.metrics.min.outlier.detection.samples</name>
<value>1000</value>
<description>Minimum number of packet send samples which are required to qualify for outlier detection. If the number of samples is below this then outlier detection is skipped.</description>
</property>
<property>
<name>dfs.datanode.outliers.report.interval</name>
<value>30m</value>
<description>This setting controls how frequently DataNodes will report their peer latencies to the NameNode via heartbeats.</description>
</property>
Slow Disk Monitoring
Principle
The idea behind detecting a slow disk is to record the latency of read/write operations on all of a node's disks, pick out the outliers, and report them to the NN as slow disks.
When the DataNode starts, if the dfs.datanode.fileio.profiling.sampling.percentage parameter is greater than 0, it initializes a DataNodeDiskMetrics object, which starts a background thread. Every dfs.datanode.outliers.report.interval (default 30 min), the thread reads each disk's mean metadata, read, and write operation latency from DataNodeVolumeMetrics and computes the slow-disk reporting threshold upperLimitLatency (the same logic as for slow network nodes). Any disk whose mean latency for some operation exceeds upperLimitLatency is considered a slow disk; a SlowDiskReports object is generated and sent to the NN via the heartbeat.
Code walkthrough of the monitoring flow
org.apache.hadoop.hdfs.server.datanode.DataNode.java
First, the DataNode's startDataNode method creates the DataNodeDiskMetrics object:
void startDataNode(List<StorageLocation> dataDirectories,
SecureResources resources
) throws IOException {
...
// diskStatsEnabled is true when dfs.datanode.fileio.profiling.sampling.percentage > 0
if (dnConf.diskStatsEnabled) {
// create the DataNodeDiskMetrics object
diskMetrics = new DataNodeDiskMetrics(this,
dnConf.outliersReportIntervalMs);
}
...
}
org.apache.hadoop.hdfs.server.datanode.metrics.DataNodeDiskMetrics.java
The DataNodeDiskMetrics class starts a disk-checking thread that identifies disks with slow metadata, readIo, and writeIo operations:
public DataNodeDiskMetrics(DataNode dn, long diskOutlierDetectionIntervalMs) {
this.dn = dn;
// the check interval is set by the dfs.datanode.outliers.report.interval parameter
this.detectionInterval = diskOutlierDetectionIntervalMs;
slowDiskDetector = new OutlierDetector(MIN_OUTLIER_DETECTION_DISKS,
SLOW_DISK_LOW_THRESHOLD_MS);
shouldRun = true;
// start the disk outlier detection thread
startDiskOutlierDetectionThread();
}
private void startDiskOutlierDetectionThread() {
slowDiskDetectionDaemon = new Daemon(new Runnable() {
@Override
public void run() {
while (shouldRun) {
if (dn.getFSDataset() != null) {
// initialize the maps that hold per-disk operation latencies
Map<String, Double> metadataOpStats = Maps.newHashMap();
Map<String, Double> readIoStats = Maps.newHashMap();
Map<String, Double> writeIoStats = Maps.newHashMap();
FsDatasetSpi.FsVolumeReferences fsVolumeReferences = null;
try {
// get all of the DataNode's volumes
fsVolumeReferences = dn.getFSDataset().getFsVolumeReferences();
Iterator<FsVolumeSpi> volumeIterator = fsVolumeReferences
.iterator();
// iterate over the volumes
while (volumeIterator.hasNext()) {
FsVolumeSpi volume = volumeIterator.next();
// get the DataNodeVolumeMetrics object
DataNodeVolumeMetrics metrics = volume.getMetrics();
// get the volume path
String volumeName = volume.getBaseURI().getPath();
// store each volume's mean operation latencies in the maps
metadataOpStats.put(volumeName,
metrics.getMetadataOperationMean());
readIoStats.put(volumeName, metrics.getReadIoMean());
writeIoStats.put(volumeName, metrics.getWriteIoMean());
}
} finally {
if (fsVolumeReferences != null) {
try {
fsVolumeReferences.close();
} catch (IOException e) {
LOG.error("Error in releasing FS Volume references", e);
}
}
}
if (metadataOpStats.isEmpty() && readIoStats.isEmpty()
&& writeIoStats.isEmpty()) {
LOG.debug("No disk stats available for detecting outliers.");
continue;
}
// check for slow disks
detectAndUpdateDiskOutliers(metadataOpStats, readIoStats,
writeIoStats);
}
try {
Thread.sleep(detectionInterval);
} catch (InterruptedException e) {
LOG.error("Disk Outlier Detection thread interrupted", e);
Thread.currentThread().interrupt();
}
}
}
});
slowDiskDetectionDaemon.start();
}
private void detectAndUpdateDiskOutliers(Map<String, Double> metadataOpStats,
Map<String, Double> readIoStats, Map<String, Double> writeIoStats) {
Map<String, Map<DiskOp, Double>> diskStats = Maps.newHashMap();
// find disks with slow metadata operations
Map<String, Double> metadataOpOutliers = slowDiskDetector
.getOutliers(metadataOpStats);
for (Map.Entry<String, Double> entry : metadataOpOutliers.entrySet()) {
addDiskStat(diskStats, entry.getKey(), DiskOp.METADATA, entry.getValue());
}
// find disks with slow readIo operations
Map<String, Double> readIoOutliers = slowDiskDetector
.getOutliers(readIoStats);
for (Map.Entry<String, Double> entry : readIoOutliers.entrySet()) {
addDiskStat(diskStats, entry.getKey(), DiskOp.READ, entry.getValue());
}
// find disks with slow writeIo operations
Map<String, Double> writeIoOutliers = slowDiskDetector
.getOutliers(writeIoStats);
for (Map.Entry<String, Double> entry : writeIoOutliers.entrySet()) {
addDiskStat(diskStats, entry.getKey(), DiskOp.WRITE, entry.getValue());
}
if (overrideStatus) {
// publish the slow-disk stats to diskOutliersStats
diskOutliersStats = diskStats;
LOG.debug("Updated disk outliers.");
}
}
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.java
Likewise, when sending the heartbeat, the BPServiceActor class pulls the slow-disk data from the DataNodeDiskMetrics object, assembles it into a SlowDiskReports, and sends it to the NN:
HeartbeatResponse sendHeartBeat(boolean requestBlockReportLease)
throws IOException {
...
// check whether the report interval (default 30 min) has elapsed
final boolean outliersReportDue = scheduler.isOutliersReportDue(now);
final SlowPeerReports slowPeers =
outliersReportDue && dn.getPeerMetrics() != null ?
SlowPeerReports.create(dn.getPeerMetrics().getOutliers()) :
SlowPeerReports.EMPTY_REPORT;
// whether to generate the slow disk report
final SlowDiskReports slowDisks =
outliersReportDue && dn.getDiskMetrics() != null ?
SlowDiskReports.create(dn.getDiskMetrics().getDiskOutliersStats()) :
SlowDiskReports.EMPTY_REPORT;
HeartbeatResponse response = bpNamenode.sendHeartbeat(bpRegistration,
reports,
dn.getFSDataset().getCacheCapacity(),
dn.getFSDataset().getCacheUsed(),
dn.getXmitsInProgress(),
dn.getXceiverCount(),
numFailedVolumes,
volumeFailureSummary,
requestBlockReportLease,
slowPeers,
// the slow disk report is sent to the NN with the heartbeat
slowDisks);
...
}
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.java
On the NameNode side, the report is handled in DatanodeManager's handleHeartbeat:
public DatanodeCommand[] handleHeartbeat(DatanodeRegistration nodeReg,
StorageReport[] reports, final String blockPoolId,
long cacheCapacity, long cacheUsed, int xceiverCount,
int maxTransfers, int failedVolumes,
VolumeFailureSummary volumeFailureSummary,
@Nonnull SlowPeerReports slowPeers,
@Nonnull SlowDiskReports slowDisks) throws IOException {
...
// slowDiskTracker is non-null only when dfs.datanode.fileio.profiling.sampling.percentage > 0
if (slowDiskTracker != null) {
if (!slowDisks.getSlowDisks().isEmpty()) {
if (LOG.isDebugEnabled()) {
LOG.debug("DataNode " + nodeReg + " reported slow disks: " +
slowDisks.getSlowDisks());
}
// store the slow-disk info in the slowDiskTracker object
slowDiskTracker.addSlowDiskReport(nodeReg.getIpcAddr(false), slowDisks);
}
slowDiskTracker.checkAndUpdateReportIfNecessary();
}
...
}
org.apache.hadoop.hdfs.server.blockmanagement.SlowDiskTracker.java
The logic in SlowDiskTracker for generating the slow-disk JSON is essentially the same as the slow-node logic above:
private ArrayList<DiskLatency> getSlowDisks(
Map<String, DiskLatency> reports, int numDisks, long now) {
...
// priority queue used for ranking
final PriorityQueue<DiskLatency> topNReports = new PriorityQueue<>(
reports.size(),
new Comparator<DiskLatency>() {
@Override
public int compare(DiskLatency o1, DiskLatency o2) {
return Doubles.compare(
o1.getMaxLatency(), o2.getMaxLatency());
}
});
ArrayList<DiskLatency> oldSlowDiskIDs = Lists.newArrayList();
for (Map.Entry<String, DiskLatency> entry : reports.entrySet()) {
DiskLatency diskLatency = entry.getValue();
// filter out expired slow-disk reports
if (now - diskLatency.timestamp < reportValidityMs) {
// numDisks is fixed at 5: keep the top 5 disks
if (topNReports.size() < numDisks) {
topNReports.add(diskLatency);
} else if (topNReports.peek().getMaxLatency() <
diskLatency.getMaxLatency()) {
topNReports.poll();
topNReports.add(diskLatency);
}
} else {
oldSlowDiskIDs.add(diskLatency);
}
}
oldSlowDisksCheck = oldSlowDiskIDs;
return Lists.newArrayList(topNReports);
}
Related community patches
https://issues.apache.org/jira/browse/HDFS-10959(Adding per disk IO statistics and metrics in DataNode.)
https://issues.apache.org/jira/browse/HDFS-11545(Propagate DataNode’s slow disks info to the NameNode via Heartbeat)
https://issues.apache.org/jira/browse/HDFS-11551(Handle SlowDiskReport from DataNode at the NameNode)
Related parameters
<property>
<name>dfs.datanode.fileio.profiling.sampling.percentage</name>
<value>0</value>
<description>This setting controls the percentage of file I/O events which will be profiled for DataNode disk statistics. The default value of 0 disables disk statistics. Set to an integer value between 1 and 100 to enable disk statistics.</description>
</property>
References