As you can see, hadoop fsck and hadoop fs -dus report the effective HDFS storage space used, i.e. they show the "normal" file size (as you would see on a local filesystem) and do not account for replication in HDFS. In this case, the directory path/to/directory stores 16565944775310 bytes (15.1 TB) of data. fsck also tells us that the average replication factor for all files in path/to/directory is exactly 3.0. This means that the total raw HDFS storage space used by these files (i.e. factoring in replication) is actually:
3.0 x 16565944775310 (15.1 TB) = 49697834325930 Bytes (45.2 TB)
This is how much raw HDFS storage is consumed by the files in path/to/directory.
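To reproduce both numbers yourself, a minimal sketch (the directory name is just a placeholder) would be:

hadoop fs -du -s /path/to/directory                                  # logical size in bytes, one copy only
hadoop fsck /path/to/directory | grep 'Average block replication'   # e.g. "Average block replication: 3.0"

On recent Hadoop versions, hadoop fs -du -s is the non-deprecated form of hadoop fs -dus.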
The hdfs du command counts only one copy of the data (it does not include replication).
If you never change the default HDFS replication factor of 3 for any files you store in your Hadoop cluster, this means, in a nutshell, that you should always multiply the numbers reported by hadoop fsck or hadoop fs -dus by 3 when you reason about HDFS space quotas.
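If a space quota has been set on the directory, you can also check how it is being charged; a sketch, again assuming the placeholder path:

hadoop fs -count -q /path/to/directory
# output columns: QUOTA REMAINING_QUOTA SPACE_QUOTA REMAINING_SPACE_QUOTA DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
# the space quota columns are accounted in raw bytes, i.e. with replication already included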
References:
http://www.michael-noll.com/blog/2011/10/20/understanding-hdfs-quotas-and-hadoop-fs-and-fsck-tools/
There is also an answer on Stack Overflow:
https://stackoverflow.com/questions/11574410/how-to-find-the-size-of-a-hdfs-file
hadoop fs -dus /user/frylock/input
This returns the total size (in bytes) of all of the files in the "/user/frylock/input" directory.
Also, keep in mind that HDFS stores data redundantly so the actual physical storage used up by a file might be 3x or more than what is reported by hadoop fs -ls and hadoop fs -dus.
du reports the size of a single copy of the data. To get the actual storage consumed, first get the average replication factor, then multiply it by the size reported by du; the result is the total space the data occupies.
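Putting the two pieces together, a small sketch that does this multiplication automatically (the path is a placeholder, and the parsing assumes fsck prints its usual "Average block replication:" line):

logical=$(hadoop fs -du -s /path/to/directory | awk '{print $1}')                    # first column is the logical size in bytes
repl=$(hadoop fsck /path/to/directory | grep 'Average block replication' | awk '{print $NF}')
echo "raw HDFS space = $(echo "$logical * $repl" | bc) bytes"                        # with the numbers above: 3.0 * 16565944775310 = 49697834325930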