As you can see, hadoop fsck and hadoop fs -dus report the effective HDFS storage space used, i.e. they show the "normal" file size (as you would see on a local filesystem) and do not account for replication in HDFS. In this case, the directory path/to/directory stores 16565944775310 bytes (15.1 TB) of data. fsck also tells us that the average replication factor for all files in path/to/directory is exactly 3.0. This means that the total raw HDFS storage space used by these files (i.e. factoring in replication) is actually:
3.0 x 16565944775310 (15.1 TB) = 49697834325930 Bytes (45.2 TB)
This is how much raw HDFS storage is consumed by the files in path/to/directory.
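To reproduce both numbers yourself, a minimal sketch (the directory name is just a placeholder) would be:

hadoop fs -du -s /path/to/directory                                  # logical size in bytes, one copy only
hadoop fsck /path/to/directory | grep 'Average block replication'   # e.g. "Average block replication: 3.0"

On recent Hadoop versions, hadoop fs -du -s is the non-deprecated form of hadoop fs -dus.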
The hdfs du command counts only one copy of the data (it does not include replication).
If you never change the default HDFS replication factor of 3 for any files you store in your Hadoop cluster, this means, in a nutshell, that you should always multiply the numbers reported by hadoop fsck or hadoop fs -dus by 3 when you reason about HDFS space quotas.
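If a space quota has been set on the directory, you can also check how it is being charged; a sketch, again assuming the placeholder path:

hadoop fs -count -q /path/to/directory
# output columns: QUOTA REMAINING_QUOTA SPACE_QUOTA REMAINING_SPACE_QUOTA DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
# the space quota columns are accounted in raw bytes, i.e. with replication already included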
References:
http://www.michael-noll.com/blog/2011/10/20/understanding-hdfs-quotas-and-hadoop-fs-and-fsck-tools/
There is also an answer on Stack Overflow:
https://stackoverflow.com/questions/11574410/how-to-find-the-size-of-a-hdfs-file
hadoop fs -dus /user/frylock/input
This returns the total size (in bytes) of all of the files in the "/user/frylock/input" directory.
Also, keep in mind that HDFS stores data redundantly so the actual physical storage used up by a file might be 3x or more than what is reported by hadoop fs -ls and hadoop fs -dus.
du reports the size of a single copy of the data. To get the actual storage consumed, first get the average replication factor, then multiply it by the size reported by du; the result is the total space the data occupies.
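Putting the two pieces together, a small sketch that does this multiplication automatically (the path is a placeholder, and the parsing assumes fsck prints its usual "Average block replication:" line):

logical=$(hadoop fs -du -s /path/to/directory | awk '{print $1}')                    # first column is the logical size in bytes
repl=$(hadoop fsck /path/to/directory | grep 'Average block replication' | awk '{print $NF}')
echo "raw HDFS space = $(echo "$logical * $repl" | bc) bytes"                        # with the numbers above: 3.0 * 16565944775310 = 49697834325930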