Background

While writing a large file to HDFS through DataX, the job failed with "File could only be replicated to 0 nodes instead of 1":

[2024-06-02 05:50:20,972] {bash.py:173} INFO - Jun 02, 2024 5:50:20 AM org.apache.hadoop.hive.ql.io.orc.WriterImpl flushStripe
[2024-06-02 05:50:20,972] {bash.py:173} INFO - INFO: Padding ORC by 1953751 bytes (<=  0.03 * 67108864)
[2024-06-02 05:50:21,043] {bash.py:173} INFO - Jun 02, 2024 5:50:21 AM org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer run
[2024-06-02 05:50:21,043] {bash.py:173} INFO - WARNING: DataStreamer Exception
[2024-06-02 05:50:21,043] {bash.py:173} INFO - org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /tmp/shopping_bags/2024-06-01__1ec2c285_cb44_46c4_96f8_e5f5458d73da/shopping_bags__3a032631_8f43_4208_bf0f_39518c6016f4 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.
[2024-06-02 05:50:21,043] {bash.py:173} INFO - 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1702)
[2024-06-02 05:50:21,043] {bash.py:173} INFO - 	at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:265)
[2024-06-02 05:50:21,043] {bash.py:173} INFO - 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2590)
[2024-06-02 05:50:21,043] {bash.py:173} INFO - 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:834)
[2024-06-02 05:50:21,043] {bash.py:173} INFO - 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510)
[2024-06-02 05:50:21,043] {bash.py:173} INFO - 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
[2024-06-02 05:50:21,043] {bash.py:173} INFO - 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
[2024-06-02 05:50:21,043] {bash.py:173} INFO - 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
[2024-06-02 05:50:21,043] {bash.py:173} INFO - 	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:859)
[2024-06-02 05:50:21,044] {bash.py:173} INFO - 	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:802)
[2024-06-02 05:50:21,044] {bash.py:173} INFO - 	at java.security.AccessController.doPrivileged(Native Method)
[2024-06-02 05:50:21,044] {bash.py:173} INFO - 	at javax.security.auth.Subject.doAs(Subject.java:422)
[2024-06-02 05:50:21,044] {bash.py:173} INFO - 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1911)
[2024-06-02 05:50:21,044] {bash.py:173} INFO - 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2519)

Cause

The Amazon EMR documentation describes this error as follows:

File could only be replicated to 0 nodes instead of 1

When a file is written to HDFS, it is replicated to multiple core nodes. When you see this error,
it means that the NameNode daemon does not have any available DataNode instances to write
data to in HDFS. In other words, block replication is not taking place. This error can be caused by a
number of issues:
• The HDFS filesystem may have run out of space. This is the most likely cause.
• DataNode instances may not have been available when the job was run.
• DataNode instances may have been blocked from communication with the master node.
• Instances in the core instance group might not be available.
• Permissions may be missing. For example, the JobTracker daemon may not have permissions to
create job tracker information.
• The reserved space setting for a DataNode instance may be insufficient. Check whether this is the
case by checking the dfs.datanode.du.reserved configuration setting.
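
The last bullet can be checked directly: hdfs getconf reads the effective value of a configuration key on a cluster node (a quick sketch; the key defaults to 0 bytes if unset):

```shell
# Print the per-volume reserved space configured for DataNodes, in bytes.
hdfs getconf -confKey dfs.datanode.du.reserved
```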

Following the documentation, the first thing to check was HDFS capacity, which was indeed running low:

(base) bieyangdeMBP-4:dags bieyang$ ssh hdfs
Last login: Mon Jul  1 10:46:47 2024

Welcome to Alibaba Cloud Elastic Compute Service !

[hdfs@emr-header-1 ~]$ hdfs dfs -df -h
Filesystem                                Size   Used  Available  Use%
hdfs://emr-header-1.cluster-206169:9000  2.9 T  2.5 T    398.8 G   87%

Solution

Free up HDFS space, either by deleting data directly, or, for data worth keeping, by backing it up to OSS first and then deleting it from HDFS.

After inspecting the files on HDFS, the main candidates for cleanup were /localdb/orders_hdfs and /localdb/metrics_orders:

[hdfs@emr-header-1 ~]$ hdfs dfs -du -h / | grep 'G' | sort -rn
974.9 G  /localdb
156.6 G  /user
148.4 G  /tmp
[hdfs@emr-header-1 ~]$ hdfs dfs -du -h /localdb/ | grep 'G' | sort -rn
357.2 G   /localdb/metrics_orders
268.1 G   /localdb/orders_hdfs
121.6 G   /localdb/products
115.8 G   /localdb/skus
50.2 G    /localdb/packages
20.6 G    /localdb/users
17.5 G    /localdb/product_comments
9.8 G     /localdb/follow_ups
3.8 G     /localdb/hourly_merchant_discounts
3.3 G     /localdb/shipping_progress_stages
3.2 G     /localdb/product_comment_tag
1.1 G     /localdb/merchandise_stamps

orders_hdfs

Looking at orders_hdfs, this table only needs to keep the most recent 8 days of partitions; files older than that can be deleted directly:

[hdfs@emr-header-1 ~]$ hdfs dfs -du -h /localdb/orders_hdfs/ | grep 'G' | sort -rn
28.6 G  /localdb/orders_hdfs/ds=2024-06-30
28.6 G  /localdb/orders_hdfs/ds=2024-06-29
28.6 G  /localdb/orders_hdfs/ds=2024-06-28
28.6 G  /localdb/orders_hdfs/ds=2024-06-26
28.6 G  /localdb/orders_hdfs/ds=2024-06-25
28.6 G  /localdb/orders_hdfs/ds=2024-06-24
28.6 G  /localdb/orders_hdfs/ds=2024-06-23
26.8 G  /localdb/orders_hdfs/ds=2024-06-27
24.8 G  /localdb/orders_hdfs/ds=2023-04-19
16.3 G  /localdb/orders_hdfs/ds=2021-03-29

## delete the stale orders partitions
[hdfs@emr-header-1 ~]$ hdfs dfs -rm -r -skipTrash /localdb/orders_hdfs/ds=2023-04-19 /localdb/orders_hdfs/ds=2021-03-29
Deleted /localdb/orders_hdfs/ds=2023-04-19
Deleted /localdb/orders_hdfs/ds=2021-03-29
[hdfs@emr-header-1 ~]$ hdfs dfs -du -h /localdb/orders_hdfs/
28.6 G  /localdb/orders_hdfs/ds=2024-06-23
28.6 G  /localdb/orders_hdfs/ds=2024-06-24
28.6 G  /localdb/orders_hdfs/ds=2024-06-25
28.6 G  /localdb/orders_hdfs/ds=2024-06-26
26.8 G  /localdb/orders_hdfs/ds=2024-06-27
28.6 G  /localdb/orders_hdfs/ds=2024-06-28
28.6 G  /localdb/orders_hdfs/ds=2024-06-29
28.6 G  /localdb/orders_hdfs/ds=2024-06-30

## orders_hdfs now takes up less space
[hdfs@emr-header-1 ~]$ hdfs dfs -du -h /localdb/ | grep 'G' | sort -rn
357.2 G   /localdb/metrics_orders
226.9 G   /localdb/orders_hdfs
121.6 G   /localdb/products
115.8 G   /localdb/skus
50.2 G    /localdb/packages
20.6 G    /localdb/users
17.5 G    /localdb/product_comments
9.8 G     /localdb/follow_ups
3.8 G     /localdb/hourly_merchant_discounts
3.3 G     /localdb/shipping_progress_stages
3.2 G     /localdb/product_comment_tag
1.1 G     /localdb/merchandise_stamps
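
Since new daily partitions keep arriving, this cleanup will recur, so the 8-day retention for orders_hdfs can also be scripted. A minimal sketch, assuming partition directories named ds=YYYY-MM-DD (ISO dates sort lexicographically, so a string compare works) and GNU date; the actual delete is left commented out as a dry run:

```shell
#!/bin/bash
# Print orders_hdfs partitions older than the 8-day retention window.
cutoff=$(date -d '8 days ago' +%F)      # GNU date; e.g. 2024-06-23

hdfs dfs -ls /localdb/orders_hdfs/ | awk '{print $NF}' | grep 'ds=' |
     while read -r path
     do
       ds=${path##*=}                   # ds=YYYY-MM-DD -> YYYY-MM-DD
       if [ "$ds" \< "$cutoff" ]; then  # string compare is safe for ISO dates
         echo "stale partition: $path"
         # hdfs dfs -rm -r -skipTrash "$path"
       fi
     done
```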

metrics_orders

The data in metrics_orders is important, so it has to be backed up to OSS before being deleted from HDFS.

HDFS holds a large number of metrics_orders partitions; a single partition is about 574 MB:

[hdfs@emr-header-1 ~]$ hdfs dfs -du -s -h /localdb/metrics_orders/ds=2024-06-27/
574.1 M  /localdb/metrics_orders/ds=2024-06-27
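
That per-partition size lines up with the 357.2 G total seen above: at roughly 574 MB each, there are on the order of 637 ds= partitions to back up:

```shell
# Rough partition count: total size (converted to MB) / per-partition size.
awk 'BEGIN { printf "%d\n", 357.2 * 1024 / 574.1 }'   # -> 637
```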

The following shell script downloads the data from HDFS to the local disk first, then uploads it to OSS with ossutil:

[root@emr-header-1 transfer-localdb-to-oss]# cat transfer-localdb-to-oss.sh
#!/bin/bash

hdfs dfs -ls /localdb/metrics_orders/ | grep '/localdb/' | awk '{print $NF}' |
     while read -r path
     do
       folder=${path##*/}                 # partition dir name, e.g. ds=2024-06-27
       hdfs dfs -get "$path" "$folder"    # download the partition to local disk
       ossutil cp -r -u "$folder" "oss://bxl-hive/borderx/metrics_orders/$folder"
       rm -rf "$folder"                   # free local disk before the next partition
     done

After the script finished, listing the metrics_orders files in OSS confirmed the backup was complete:

(base) bieyangdeMBP-4:dags bieyang$ ossutil ls -d oss://bxl-hive/borderx/metrics_orders/ | tail -f
oss://bxl-hive/borderx/metrics_orders/ds=2024-06-21/
oss://bxl-hive/borderx/metrics_orders/ds=2024-06-22/
oss://bxl-hive/borderx/metrics_orders/ds=2024-06-23/
oss://bxl-hive/borderx/metrics_orders/ds=2024-06-24/
oss://bxl-hive/borderx/metrics_orders/ds=2024-06-25/
oss://bxl-hive/borderx/metrics_orders/ds=2024-06-26/
oss://bxl-hive/borderx/metrics_orders/ds=2024-06-27/
Object and Directory Number is: 1580
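
Before dropping anything from Hive, it is also worth a quick count to confirm that every partition still on HDFS has a copy in OSS. A sketch, assuming the oss://bxl-hive/borderx/metrics_orders/ layout produced by transfer-localdb-to-oss.sh:

```shell
# Count ds= partition dirs on HDFS and in OSS; the backup is complete
# when every HDFS partition also appears in OSS.
hdfs_count=$(hdfs dfs -ls /localdb/metrics_orders/ | grep -c 'ds=')
oss_count=$(ossutil ls -d oss://bxl-hive/borderx/metrics_orders/ | grep -c 'ds=')
echo "hdfs=$hdfs_count oss=$oss_count"
```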

With the metrics_orders partitions synced to OSS, the old partitions can be dropped in Hive:

0: jdbc:hive2://172.17.8.76:10000/borderx> ALTER TABLE borderx.metrics_orders DROP IF EXISTS PARTITION (ds<'2024-06-27') PURGE;
Dropped the partition ds=2022-08-01
Dropped the partition ds=2022-08-02
...
...
Dropped the partition ds=2024-06-26
OK
No rows affected (5.39 seconds)
0: jdbc:hive2://172.17.8.76:10000/borderx> show partitions metrics_orders;
OK
+----------------+
|   partition    |
+----------------+
| ds=2024-06-27  |
+----------------+
1 rows selected (0.132 seconds)

After the drop, metrics_orders takes up only about 0.5 GB!

[hdfs@emr-header-1 ~]$ hdfs dfs -du -s -h /localdb/metrics_orders/
574.1 M  /localdb/metrics_orders/ds=2024-06-27

After also removing some logs and temporary files, HDFS usage dropped from 87% to 65%. Rerunning the DataX job to write the file to HDFS then succeeded:

[hdfs@emr-header-1 ~]$ hdfs dfs -df -h
Filesystem                                Size   Used  Available  Use%
hdfs://emr-header-1.cluster-206169:9000  2.9 T  1.9 T      1.0 T   65%