https://www.cloudera.com/documentation/enterprise/latest/topics/admin_data_compression_performance.html
Guidelines for Choosing a Compression Type
- GZip compression uses more CPU resources than Snappy or LZO, but provides a higher compression ratio. GZip is often a good choice for cold data, which is accessed infrequently. Snappy or LZO is a better choice for hot data, which is accessed frequently.
- BZip2 can produce a higher compression ratio than GZip for some types of files, at the cost of some speed when compressing and decompressing. HBase does not support BZip2 compression.
- Snappy often performs better than LZO. It is worth running tests to see whether the difference is significant for your workload.
- For MapReduce, if you need your compressed data to be splittable, use BZip2 or LZO, both of which can be split. Snappy and GZip blocks are not splittable, but files with Snappy blocks inside a container file format such as SequenceFile or Avro can be split. Snappy is intended to be used with a container format such as SequenceFile or Avro data files rather than directly on plain text, because Snappy-compressed plain text is not splittable and cannot be processed in parallel by MapReduce. Splittability is not relevant to HBase data.
- For MapReduce, you can compress the intermediate data, the output, or both. Adjust the parameters you provide for the MapReduce job accordingly. The following examples compress both the intermediate data and the output. MRv2 is shown first, followed by MRv1.
MRv2
hadoop jar hadoop-examples-.jar sort \
  "-Dmapreduce.map.output.compress=true" \
  "-Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.GzipCodec" \
  "-Dmapreduce.output.fileoutputformat.compress=true" \
  "-Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec" \
  -outKey org.apache.hadoop.io.Text -outValue org.apache.hadoop.io.Text input output
MRv1
hadoop jar hadoop-examples-.jar sort \
  "-Dmapred.compress.map.output=true" \
  "-Dmapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" \
  "-Dmapred.output.compress=true" \
  "-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" \
  -outKey org.apache.hadoop.io.Text -outValue org.apache.hadoop.io.Text input output
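
The splittability guideline can be illustrated with a small Python sketch. It is not Hadoop code and makes several assumptions: zlib (DEFLATE, the algorithm behind GZip) stands in for any non-splittable codec, a plain Python list stands in for a container file, and the record layout and block size are made up. Real container formats such as SequenceFile and Avro data files write sync markers between blocks so a reader can find block boundaries; the list indices play that role here.

```python
import zlib

# Hypothetical records; each one ends with a newline.
records = [f"record-{i:05d},value={i * i}\n".encode() for i in range(3000)]

# Case 1: one compressed stream over the whole file (like a plain .gz file).
whole = zlib.compress(b"".join(records))
try:
    # A worker handed the second half of the file has no valid starting point.
    zlib.decompress(whole[len(whole) // 2:])
    mid_stream_ok = True
except zlib.error:
    mid_stream_ok = False

# Case 2: a container of independently compressed blocks.
BLOCK = 1000  # records per block (hypothetical size)
blocks = [
    zlib.compress(b"".join(records[i:i + BLOCK]))
    for i in range(0, len(records), BLOCK)
]

# A worker assigned only block 1 can decode it without blocks 0 or 2.
second = zlib.decompress(blocks[1]).splitlines(keepends=True)

print("decode from the middle of one stream:", mid_stream_ok)
print("block 1 decoded on its own:", second == records[BLOCK:2 * BLOCK])
```

Because each block is a complete compressed stream, separate workers can process separate blocks in parallel, which is the property MapReduce needs from a splittable input.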
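
Since the guidelines recommend testing codecs on your own data, here is a minimal Python sketch of such a test. It only covers what the Python standard library offers: zlib (DEFLATE, the algorithm GZip uses) and bz2. Snappy and LZO would need third-party bindings and are left out. The sample data and the `measure` helper are hypothetical, so treat the numbers as illustrative rather than as a statement about Hadoop's codecs.

```python
import bz2
import time
import zlib


def measure(name, compress, data):
    """Return (name, compression ratio, seconds) for one codec on `data`."""
    start = time.perf_counter()
    packed = compress(data)
    elapsed = time.perf_counter() - start
    return name, len(data) / len(packed), elapsed


# Hypothetical sample: repetitive log-like text. Replace with your own data.
sample = b"".join(
    b"2024-01-01 INFO user=%d action=view page=/item/%d\n" % (i % 97, i % 13)
    for i in range(20000)
)

results = [
    measure("deflate (GZip's algorithm)", lambda d: zlib.compress(d, 6), sample),
    measure("bzip2", lambda d: bz2.compress(d, 9), sample),
]

for name, ratio, seconds in results:
    print(f"{name:28s} ratio={ratio:6.1f}x  time={seconds * 1000:7.1f} ms")
```

Running the same measurement on a representative slice of real data, for both compression and decompression, gives a better basis for choosing a codec than results on synthetic text.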