File compression brings two major benefits: 1) it reduces storage space, and 2) it speeds up transfers over the network or to and from disk. Large-scale data transfers therefore generally go through compression.
Compression Formats
Format | Tool | Algorithm | File extension | Splittable |
DEFLATE | N/A | DEFLATE | .deflate | No |
gzip | gzip | DEFLATE | .gz | No |
bzip2 | bzip2 | bzip2 | .bz2 | Yes |
LZO | lzop | LZO | .lzo | No (yes, if indexed) |
Snappy | N/A | Snappy | .snappy | No |
Compression and Decompression
A file decompression example, which infers the codec from the file's extension:
package com.bigdata.io;

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class FileDecompressor {
    public static void main(String[] args) throws IOException {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        Path inputPath = new Path(uri);
        // Infer the codec from the file extension, using the codecs
        // registered in the io.compression.codecs property
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(inputPath);
        if (null == codec) {
            System.err.println("No codec for " + uri);
            System.exit(-1);
        }

        // Strip the compression suffix (e.g. .gz) to form the output path
        String outputUri =
            CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());

        InputStream in = null;
        OutputStream out = null;
        try {
            in = codec.createInputStream(fs.open(inputPath));
            out = fs.create(new Path(outputUri));
            IOUtils.copyBytes(in, out, conf);
        } finally {
            IOUtils.closeStream(in);
            IOUtils.closeStream(out);
        }
    }
}
$ hadoop jar stream.jar com.bigdata.io.FileDecompressor test.txt.gz
12/06/14 09:48:28 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/06/14 09:48:28 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
Native Libraries
Compared with the built-in Java implementation, the native gzip library reduces decompression time by about 50% and compression time by about 10%.
To disable native libraries, set the property hadoop.native.lib to false.
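A minimal sketch of doing this programmatically (in practice the property is usually set in core-site.xml):

Configuration conf = new Configuration();
// Fall back to the pure-Java codec implementations
conf.setBoolean("hadoop.native.lib", false);

Because creating compressor objects is relatively expensive, CodecPool lets you reuse them. The following program compresses data read from standard input to standard output, borrowing a Compressor from the pool: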
package com.bigdata.io;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.Compressor;
import org.apache.hadoop.util.ReflectionUtils;

public class PooledStreamCompressor {
    public static void main(String[] args) throws ClassNotFoundException {
        String codecClassName = args[0];
        Class<?> codecClass = Class.forName(codecClassName);
        Configuration conf = new Configuration();
        CompressionCodec codec =
            (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
        Compressor compressor = null;
        try {
            // Borrow a Compressor from the pool rather than allocating a new one
            compressor = CodecPool.getCompressor(codec);
            CompressionOutputStream out =
                codec.createOutputStream(System.out, compressor);
            IOUtils.copyBytes(System.in, out, 4096, false);
            out.finish(); // flush the compressed stream without closing System.out
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Return the compressor to the pool so it can be reused
            CodecPool.returnCompressor(compressor);
        }
    }
}
echo "Text" | hadoop jar stream.jar com.bigdata.io.PooledStreamCompressor org.apache.hadoop.io.compress.GzipCodec | gunzip -
12/06/14 10:57:49 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/06/14 10:57:49 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
12/06/14 10:57:49 INFO compress.CodecPool: Got brand-new compressor
Text
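The pool works the same way for decompression. Here is a sketch (not from the original text) of a pooled variant of FileDecompressor, reusing a Decompressor from CodecPool:

package com.bigdata.io;

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.Decompressor;

public class PooledFileDecompressor {
    public static void main(String[] args) throws IOException {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path inputPath = new Path(uri);
        CompressionCodec codec =
            new CompressionCodecFactory(conf).getCodec(inputPath);
        Decompressor decompressor = null;
        InputStream in = null;
        OutputStream out = null;
        try {
            // Borrow a Decompressor from the pool instead of creating one per file
            decompressor = CodecPool.getDecompressor(codec);
            in = codec.createInputStream(fs.open(inputPath), decompressor);
            out = fs.create(new Path(CompressionCodecFactory.removeSuffix(
                uri, codec.getDefaultExtension())));
            IOUtils.copyBytes(in, out, conf);
        } finally {
            IOUtils.closeStream(in);
            IOUtils.closeStream(out);
            CodecPool.returnDecompressor(decompressor);
        }
    }
}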
Guidelines for Choosing a Compression Format
Which compression format to use depends on your application. Do you want to maximize the speed of your application, or are you more concerned with keeping storage costs down? In general, you should try different strategies, benchmarking them against your own datasets to find the best approach.
For large files, such as log files, the options are:
1. Store the files uncompressed.
2. Use a compression format that supports splitting, such as bzip2 (although bzip2 is fairly slow), or LZO (which becomes splittable once indexed).
3. Split the file into chunks in the application and compress each chunk separately, using any compression format (it does not matter whether the format is splittable in HDFS). In this case, choose the chunk size so that the compressed chunks are close to the size of one HDFS block.
4. Use SequenceFiles, which support both compression and splitting (see the sketch after this list).
5. Use Avro data files, which support compression and splitting just like SequenceFiles, but are more advanced and can be read and written from many languages.
For large files, you should not use a compression format that does not support splitting: you lose data locality, which makes MapReduce applications very inefficient.
For archival purposes, consider the Hadoop archive format, although it does not support compression.
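As an illustration of option 4, here is a minimal sketch of writing a block-compressed SequenceFile; the class name, output path, and key/value types are illustrative, not from the original text:

package com.bigdata.io;

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class SequenceFileCompressWriter {
    public static void main(String[] args) throws IOException {
        String uri = args[0]; // output path, e.g. numbers.seq (hypothetical)
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

        SequenceFile.Writer writer = null;
        try {
            // BLOCK compression compresses groups of records together,
            // which usually yields better ratios than per-record compression
            writer = SequenceFile.createWriter(fs, conf, new Path(uri),
                    IntWritable.class, Text.class,
                    SequenceFile.CompressionType.BLOCK, codec);
            for (int i = 0; i < 100; i++) {
                writer.append(new IntWritable(i), new Text("record-" + i));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}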
Using Compression in MapReduce
Steps:
1. Set mapred.output.compress=true
2. Set mapred.output.compression.codec to the class name of the codec to use (one of the codecs listed above)
Or, equivalently, in code:
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperatureWithCompression {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: MaxTemperatureWithCompression <input path> " +
                "<output path>");
            System.exit(-1);
        }

        Job job = new Job();
        job.setJarByClass(MaxTemperature.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Compress the job output with gzip
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

        // Mapper and reducer from the basic MaxTemperature example
        job.setMapperClass(MaxTemperatureMapper.class);
        job.setCombinerClass(MaxTemperatureReducer.class);
        job.setReducerClass(MaxTemperatureReducer.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
For the SequenceFile format
The default compression type is RECORD; changing it to BLOCK is recommended, which compresses groups of records together.
Set the compression type with SequenceFileOutputFormat.setOutputCompressionType().
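A minimal sketch, assuming a new-API job whose output format is SequenceFileOutputFormat (the surrounding job setup is assumed):

import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// ... inside job setup ...
job.setOutputFormatClass(SequenceFileOutputFormat.class);
SequenceFileOutputFormat.setCompressOutput(job, true);
SequenceFileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);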
Summary of MapReduce compression properties
Property name | Type | Default value | Description |
mapred.output.compress | boolean | false | Compress outputs |
mapred.output.compression.codec | Class name | org.apache.hadoop.io.compress.DefaultCodec | The compression codec to use for outputs |
mapred.output.compression.type | String | RECORD | The type of compression to use for SequenceFile outputs: NONE, RECORD, or BLOCK. |
Compressing map output (i.e., compressing intermediate results)
Fast codecs such as LZO or Snappy are recommended.
Summary of map output compression properties
Property name | Type | Default value | Description |
mapred.compress.map.output | boolean | false | Compress map outputs |
mapred.map.output.compression.codec | Class | org.apache.hadoop.io.compress.DefaultCodec | The compression codec to use for map outputs. |
Example of enabling compression for map output:
Configuration conf = new Configuration();
conf.setBoolean("mapred.compress.map.output", true);
conf.setClass("mapred.map.output.compression.codec",
        GzipCodec.class, CompressionCodec.class);
Job job = new Job(conf);
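With the old MapReduce API, JobConf provides convenience methods that set the same two properties; a minimal sketch:

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf();
conf.setCompressMapOutput(true);                    // mapred.compress.map.output
conf.setMapOutputCompressorClass(GzipCodec.class);  // mapred.map.output.compression.codec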