Choosing the Right Compression Format

Translation

mtj66 2022-01-04 14:36:47

https://www.cloudera.com/documentation/enterprise/latest/topics/admin_data_compression_performance.html

Guidelines for Choosing a Compression Type

GZIP compression uses more CPU resources than Snappy or LZO, but provides a higher compression ratio. GZIP is often a good choice for cold data, which is accessed infrequently. Snappy or LZO is a better choice for hot data, which is accessed frequently.
BZip2 can also produce more compression than GZIP for some types of files, at the cost of some speed when compressing and decompressing. HBase does not support BZip2 compression.
Snappy often performs better than LZO. It is worth running tests on your own data to see whether the difference is significant.
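
One way to run such a test is to compress a sample of your own data with each codec and compare the results. The sketch below is only a rough stand-in: it uses the command-line tools `gzip`, `bzip2`, and `lzop` (skipping any that are not installed) rather than the Hadoop codecs themselves, it leaves Snappy out because Snappy has no standard command-line tool, and the sample data and file names are invented for illustration.

```shell
#!/bin/sh
# Sketch: compare compressed sizes for whichever codec CLIs are installed.
# The sample data and file names are made up for illustration.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
sample="$workdir/sample.txt"

# Build a repetitive, log-like sample; replace this with a slice of real data.
seq 1 50000 | sed 's/^/2022-01-04 INFO request id=/' > "$sample"
raw_size=$(wc -c < "$sample" | tr -d ' ')
echo "raw: $raw_size bytes"

for tool in gzip bzip2 lzop; do
  # Skip codecs whose CLI is not available on this machine.
  command -v "$tool" >/dev/null 2>&1 || continue
  "$tool" -c "$sample" > "$workdir/sample.$tool"
  size=$(wc -c < "$workdir/sample.$tool" | tr -d ' ')
  echo "$tool: $size bytes ($((size * 100 / raw_size))% of raw)"
done
```

This reports only the compression-ratio side of the trade-off. Wrapping each compression call in `time` gives a first impression of the CPU cost, though results from CLI tools will not match the Hadoop codecs exactly.
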
For MapReduce, if you need your compressed data to be splittable, use BZip2 or LZO, both of which can be split. Snappy and GZIP blocks are not splittable, but files with Snappy blocks inside a container file format such as SequenceFile or Avro can be split. Snappy is therefore intended to be used with a container format, like SequenceFiles or Avro data files, rather than directly on plain text, because a Snappy-compressed plain-text file is not splittable and cannot be processed in parallel using MapReduce. Splittability is not relevant to HBase data.
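
The idea behind container formats can be illustrated with ordinary tools. The sketch below is not the SequenceFile or Avro format (those store compressed blocks inside one file, separated by sync markers); it only mimics the principle by compressing fixed-size groups of records independently, so that a reader can decode any one group without reading the others. The file names and the block size of 1000 records are arbitrary choices for the example.

```shell
#!/bin/sh
# Sketch: independently compressed blocks can be decoded in isolation.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

# 5000 one-line records.
seq 1 5000 > records.txt

# Whole-file compression: a single stream that a reader must decode from byte 0.
gzip -c records.txt > records.txt.gz

# Container-style layout: cut into blocks of 1000 records, compress each on its own.
split -l 1000 records.txt block_
for b in block_??; do
  gzip "$b"
done

# A worker assigned the second block decodes it without touching the others.
first_record=$(gzip -dc block_ab.gz | head -n 1)
echo "first record of block 2: $first_record"
```

Each `block_*.gz` file here plays the role of one compressed block inside a container file, which is what lets MapReduce hand different blocks to different mappers.
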
For MapReduce, you can compress the intermediate data, the output, or both. Adjust the parameters you provide for the MapReduce job accordingly. The following examples compress both the intermediate data and the output. MR2 is shown first, followed by MR1.
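- MRv2

# Compress both the map output (intermediate data) and the final job output with GzipCodec.
# Property names follow Hadoop 2's mapred-default.xml.
hadoop jar hadoop-examples-.jar sort \
  "-Dmapreduce.map.output.compress=true" \
  "-Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.GzipCodec" \
  "-Dmapreduce.output.fileoutputformat.compress=true" \
  "-Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec" \
  -outKey org.apache.hadoop.io.Text -outValue org.apache.hadoop.io.Text input output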

- MRv1

# Same job with the older MR1 (mapred.*) property names.
hadoop jar hadoop-examples-.jar sort \
  "-Dmapred.compress.map.output=true" \
  "-Dmapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" \
  "-Dmapred.output.compress=true" \
  "-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" \
  -outKey org.apache.hadoop.io.Text -outValue org.apache.hadoop.io.Text input output
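
To compress only the intermediate data, the output-related options can simply be left out. The sketch below builds such a command and prints it instead of running it, since the jar name is elided in the source. It assumes the Hadoop 2 property names from mapred-default.xml and picks `org.apache.hadoop.io.compress.SnappyCodec` for the map output, which in turn assumes Snappy support is available on the cluster. The variable names are made up for the example.

```shell
#!/bin/sh
# Dry run: print (do not execute) a sort job that compresses only the map output.
MAP_CODEC=org.apache.hadoop.io.compress.SnappyCodec   # assumption: Snappy is available on the cluster
EXAMPLES_JAR=hadoop-examples-.jar                     # elided version kept from the source

# Only the intermediate (map output) settings are passed; the job output stays uncompressed.
opts="-Dmapreduce.map.output.compress=true -Dmapreduce.map.output.compress.codec=$MAP_CODEC"

cmd="hadoop jar $EXAMPLES_JAR sort $opts -outKey org.apache.hadoop.io.Text -outValue org.apache.hadoop.io.Text input output"
echo "$cmd"
```

A fast codec is a common choice for intermediate data because map output is written once and read back shortly afterwards, which matches the hot-data guideline above.
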
