Recently, Spark jobs in our production SaaS environment kept failing, and we noticed that the executor task failure count on certain nodes was abnormally high. The investigation took quite a few detours, so I'm writing up the troubleshooting process for this tricky case.

The problem looked like this:

[Screenshot: Spark UI showing abnormally high executor task failure counts on certain nodes]

The executor log:

22/07/22 10:20:28 WARN DFSClient: Error Recovery for block BP-888461729-172.16.34.148-1397820377004:blk_15089246483_16183344527 in pipeline 172.16.34.64:50010, 172.16.34.223:50010: bad datanode 172.16.34.64:50010 [DataStreamer for file /bdp/data/u9083189ae0349dbaf8f2f2ee3351579_temp/.hive-staging_hive_2022-07-22_10-16-47_909_6320972307505128897-167/-ext-10000/_temporary/0/_temporary/attempt_20220722101347_2794_m_000034_0/part-00034-c0b2f98a-7049-4a5d-bcf3-b3042cd16645-c000 block BP-888461729-172.16.34.148-1397820377004:blk_15089246483_16183344527]
22/07/22 10:21:07 WARN DFSClient: Slow ReadProcessor read fields took 51412ms (threshold=30000ms); ack: seqno: 10 status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 1027966, targets: [172.16.34.64:50010, 172.16.34.95:50010] [ResponseProcessor for block BP-888461729-172.16.34.148-1397820377004:blk_15089250682_16183349712]
22/07/22 10:22:18 WARN DFSClient: DFSOutputStream ResponseProcessor exception  for block BP-888461729-172.16.34.148-1397820377004:blk_15089250682_16183349712 [ResponseProcessor for block BP-888461729-172.16.34.148-1397820377004:blk_15089250682_16183349712]
java.io.EOFException: Premature EOF: no length prefix available
	at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2207)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:867)
22/07/22 10:23:09 WARN DFSClient: DataStreamer Exception [DataStreamer for file /bdp/data/9ca8f3a33ce543909594541286116aec_temp/.hive-staging_hive_2022-07-22_10-19-47_845_1102463989488675558-143/-ext-10000/_temporary/0/_temporary/attempt_20220722101658_2821_m_000049_0/part-00049-b0b65b5f-f8f0-4b60-9e72-04dde1343ede-c000 block BP-888461729-172.16.34.148-1397820377004:blk_15089250682_16183349712]
java.io.IOException: Broken pipe
	at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
	at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
	at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
	at sun.nio.ch.IOUtil.write(IOUtil.java:65)
	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
	at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
	at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
	at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
	at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
	at java.io.DataOutputStream.write(DataOutputStream.java:107)
	at org.apache.hadoop.hdfs.DFSOutputStream$Packet.writeTo(DFSOutputStream.java:321)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:643)
22/07/22 10:23:09 WARN DFSClient: Error Recovery for block BP-888461729-172.16.34.148-1397820377004:blk_15089250682_16183349712 in pipeline 172.16.34.64:50010, 172.16.34.95:50010: bad datanode 172.16.34.64:50010 [DataStreamer for file /bdp/data/9ca8f3a33ce543909594541286116aec_temp/.hive-staging_hive_2022-07-22_10-19-47_845_1102463989488675558-143/-ext-10000/_temporary/0/_temporary/attempt_20220722101658_2821_m_000049_0/part-00049-b0b65b5f-f8f0-4b60-9e72-04dde1343ede-c000 block BP-888461729-172.16.34.148-1397820377004:blk_15089250682_16183349712]

The executor error log shows DFSClient hitting "Premature EOF: no length prefix available", which triggers block error recovery and retries.

So the next stop was the DataNode log on that node:

2022-07-21 12:04:27,437 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-888461729-172.16.34.148-1397820377004:blk_15087313175_16181097975, type=HAS_DOWNSTREAM_IN_PIPELINE terminating
2022-07-21 12:04:27,437 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-888461729-172.16.34.148-1397820377004:blk_15087313175_16181097975
java.io.IOException: Premature EOF from inputStream
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:496)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:889)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:757)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:239)
        at java.lang.Thread.run(Thread.java:745)
2022-07-21 12:04:27,438 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-888461729-172.16.34.148-1397820377004:blk_15087313175_16181097975 received exception java.io.IOException: Premature EOF from inputStream
2022-07-21 12:04:27,438 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: bdp-171:50010:DataXceiver error processing WRITE_BLOCK operation  src: /172.16.34.214:9107 dst: /172.16.34.214:50010
java.io.IOException: Premature EOF from inputStream
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:496)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:889)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:757)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:239)
        at java.lang.Thread.run(Thread.java:745)
2022-07-21 12:04:27,441 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-888461729-172.16.34.148-1397820377004:blk_15087313175_16181097975 src: /172.16.34.214:10348 dest: /172.16.34.214:50010
2022-07-21 12:04:27,441 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover RBW replica BP-888461729-172.16.34.148-1397820377004:blk_15087313175_16181097975
2022-07-21 12:04:27,441 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recovering ReplicaBeingWritten, blk_15087313175_16181097975, RBW
  getNumBytes()     = 1040384
  getBytesOnDisk()  = 1040384
  getVisibleLength()= 0
  getVolume()       = /data3/hdfs/data/current
  getBlockFile()    = /data3/hdfs/data/current/BP-888461729-172.16.34.148-1397820377004/current/rbw/blk_15087313175
  bytesAcked=0
  bytesOnDisk=1040384

I had never seen this exception before, so I googled the DataNode error message and landed on this post: https://stackoverflow.com/questions/32060049/hadoop-mapreduce-job-i-o-exception-due-to-premature-eof-from-inputstream


In short: a premature EOF can happen for several reasons, one of which is a large number of threads being spawned on a single node to write to disk, for example via FileOutputCommitter. The MultipleOutputs class lets you write files with custom names, and to do that it spawns one thread per file and binds a port to it for the disk write. The DataNode-side limit on these concurrent transfer threads is dfs.datanode.max.transfer.threads: the community default is 256, and the valid range is [1, 8192].
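Before changing anything it's worth confirming what value the node is actually running with. A minimal check (run on the DataNode host itself, so it picks up that node's hdfs-site.xml):

# print the effective value of the transfer-thread limit
hdfs getconf -confKey dfs.datanode.max.transfer.threads

Raising it means editing hdfs-site.xml on every DataNode and restarting the DataNode processes.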

So the first step was to check hdfs-site.xml: the parameter was already configured, and the value was anything but small.

[Screenshot: hdfs-site.xml with dfs.datanode.max.transfer.threads already set to a large value]

That was a little strange. The next guess was that the open file handle limit might be too low:

# check the open file handle limit
ulimit -n
# result
1024000

So the handle limit was fine too.
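One caveat worth noting: ulimit -n in an interactive shell shows the current user's limit, not necessarily what the already-running DataNode JVM got at startup. Reading the limit straight off the process is more direct (a minimal sketch, assuming the DataNode runs as a single JVM on the node):

# find the DataNode JVM and read the open-file limit it actually runs with
DN_PID=$(pgrep -f DataNode | head -n 1)
grep 'open files' /proc/$DN_PID/limits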

Later I saw another article suggesting packet loss as a possible cause, so I checked whether this node was dropping packets; it turned out not to be a network issue either.

[Screenshot: packet loss check on the node, showing no loss]
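The check itself was nothing fancy; something along these lines is enough (a sketch: the interface name eth0 is an assumption, and 172.16.34.64 is the DataNode flagged as bad in the client logs above):

# send 100 ICMP packets to the suspect node and read the loss percentage
ping -c 100 172.16.34.64 | tail -n 2
# RX/TX error and drop counters per interface
ip -s link show eth0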

Finally I started to wonder about the disks: this physical machine had been in service for a long time, so could it be a disk problem? I ran a disk speed test on both a healthy node and the problem node:
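The exact command is only visible in the screenshots below, but a typical raw-throughput check looks like this (a sketch: the device name is an assumption, and /data3/hdfs is one of the DataNode data directories seen in the log above):

# buffered sequential read speed of the data disk (device name is a guess)
hdparm -t /dev/sdb
# sequential write throughput, bypassing the page cache
dd if=/dev/zero of=/data3/hdfs/ddtest bs=1M count=1024 oflag=direct
rm -f /data3/hdfs/ddtest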

1. Healthy node

[Screenshot: disk read/write speed test on a healthy node]

2. Problem node

[Screenshot: disk read/write speed test on the problem node]

The disk read/write speed on the problem node turned out to be extremely slow, so we could be fairly confident this was a disk problem.
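If you want harder evidence than a slow benchmark run, SMART data from the drive is the usual next step (again a sketch; the device name is an assumption):

# overall SMART health verdict for the drive
smartctl -H /dev/sdb
# reallocated or pending sectors are classic symptoms of a dying disk
smartctl -A /dev/sdb | grep -Ei 'Reallocated_Sector|Current_Pending'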

Takeaway: for more than half of the mysterious problems in the software industry, remember to go find your ops team.