A Hadoop folder full of .gz files: Hadoop data compression

Reposted

mob64ca140eb362 2024-04-19 16:14:11


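Hadoop 3.x (MapReduce): Hadoop Data Compression

1. Overview

1. Pros and cons of compression
2. Compression principles

2. Compression codecs supported by MR
3. Choosing a compression method

1. Gzip compression
2. Bzip2 compression
3. Lzo compression
4. Snappy compression
5. Choosing where to compress

4. Compression parameter configuration
5. Hands-on compression examples

1. Compressing the map output
2. Compressing the reduce output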

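1. Overview

1. Pros and cons of compression

Advantages of compression: it reduces disk I/O and the disk space needed for storage.
Disadvantage of compression: it increases CPU overhead.

2. Compression principles

For compute-intensive jobs, use compression sparingly.
For I/O-intensive jobs, use compression more.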
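The trade-off between CPU time and data size can be observed without a cluster. The following is a minimal sketch that uses the JDK's own DEFLATE implementation (`java.util.zip`), not Hadoop's codecs; the class name `CompressionTradeoff` and the generated sample text are assumptions made for illustration only.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper class, not part of Hadoop.
final class CompressionTradeoff {

    /** Compresses data with the JDK Deflater; level 1 is fastest, level 9 is smallest. */
    static byte[] compress(byte[] data, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Restores the original bytes; used here only to check the round trip. */
    static byte[] decompress(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("compressed data is truncated or unsupported");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("not valid DEFLATE data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /** Generates repetitive text, which compresses well (made-up sample data). */
    static byte[] sampleText() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 20_000; i++) {
            sb.append("hadoop mapreduce compression example line ").append(i % 100).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = sampleText();
        // Compare the fastest and the strongest level: size shrinks, CPU time grows.
        for (int level : new int[] {Deflater.BEST_SPEED, Deflater.BEST_COMPRESSION}) {
            long start = System.nanoTime();
            byte[] packed = compress(raw, level);
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.printf("level %d: %d -> %d bytes in %d us%n",
                    level, raw.length, packed.length, micros);
        }
    }
}
```

The timings vary from machine to machine, so treat the printed numbers as an indication of the trend rather than a benchmark.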

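2. Compression codecs supported by MR

Comparison of compression algorithms

Format	Bundled with Hadoop?	Algorithm	File extension	Splittable?	Does the program need changes after switching to this format?
DEFLATE	Yes, usable directly	DEFLATE	.deflate	No	No changes needed; handled like plain text
Gzip	Yes, usable directly	DEFLATE	.gz	No	No changes needed; handled like plain text
bzip2	Yes, usable directly	bzip2	.bz2	Yes	No changes needed; handled like plain text
LZO	No, must be installed	LZO	.lzo	Yes	An index must be built and the input format must be specified
Snappy	Yes, usable directly	Snappy	.snappy	No	No changes needed; handled like plain text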
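The table can be turned into a small lookup keyed by file extension. This is an illustrative sketch in plain Java: the class `CompressionFormats` and its methods are hypothetical names, and the values are copied from the table. Hadoop itself resolves a codec from a file suffix with `CompressionCodecFactory`, which is not used here so that the example runs without Hadoop on the classpath.

```java
import java.util.Optional;

// Hypothetical lookup that mirrors the comparison table; not a Hadoop class.
final class CompressionFormats {

    enum Format {
        DEFLATE(".deflate", false),
        GZIP(".gz", false),
        BZIP2(".bz2", true),
        LZO(".lzo", true),      // splittable only after an index has been built
        SNAPPY(".snappy", false);

        final String extension;
        final boolean splittable;

        Format(String extension, boolean splittable) {
            this.extension = extension;
            this.splittable = splittable;
        }
    }

    /** Finds the format whose extension matches the end of the file name. */
    static Optional<Format> fromFileName(String fileName) {
        for (Format format : Format.values()) {
            if (fileName.endsWith(format.extension)) {
                return Optional.of(format);
            }
        }
        return Optional.empty();
    }

    /** Files without a known compression extension are treated as splittable plain text. */
    static boolean isSplittable(String fileName) {
        return fromFileName(fileName).map(f -> f.splittable).orElse(true);
    }

    public static void main(String[] args) {
        for (String name : new String[] {"access.log", "access.log.gz", "access.log.bz2"}) {
            System.out.println(name + " splittable: " + isSplittable(name));
        }
    }
}
```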


Compression performance comparison

Algorithm	Original size	Compressed size	Compression speed	Decompression speed
gzip	8.3GB	1.8GB	17.5MB/s	58MB/s
bzip2	8.3GB	1.1GB	2.4MB/s	9.5MB/s
LZO	8.3GB	2.9GB	49.3MB/s	74.6MB/s
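The figures in the table can be converted into a compression ratio and a rough single-threaded compression time. The sketch below does that arithmetic under two assumptions that are not stated in the table: 1 GB is taken as 1024 MB, and the compression speed is measured against the uncompressed input. The class name `CompressionBenchmarkMath` is hypothetical.

```java
// Hypothetical helper for the arithmetic; the input numbers come from the table above.
final class CompressionBenchmarkMath {

    /** Compressed size as a fraction of the original size (smaller is better). */
    static double ratio(double originalGb, double compressedGb) {
        return compressedGb / originalGb;
    }

    /** Seconds needed to process sizeGb at the given throughput, assuming 1 GB = 1024 MB. */
    static double seconds(double sizeGb, double mbPerSecond) {
        return sizeGb * 1024.0 / mbPerSecond;
    }

    public static void main(String[] args) {
        double originalGb = 8.3;
        String[] names = {"gzip", "bzip2", "LZO"};
        double[] compressedGb = {1.8, 1.1, 2.9};
        double[] compressMbPerSec = {17.5, 2.4, 49.3};

        for (int i = 0; i < names.length; i++) {
            System.out.printf("%-5s ratio %.1f%%, compression time %.0f s%n",
                    names[i],
                    100.0 * ratio(originalGb, compressedGb[i]),
                    seconds(originalGb, compressMbPerSec[i]));
        }
    }
}
```

Under these assumptions gzip keeps roughly 22% of the original size and needs about 8 minutes, bzip2 keeps about 13% but needs close to an hour, and LZO keeps about 35% in under 3 minutes, which is the ratio-versus-speed trade-off the table describes.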


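3. Choosing a compression method

When choosing a compression method, the key factors are: compression/decompression speed, compression ratio (size after compression), and whether the compressed file can still be split.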
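Splittability matters because it determines how many map tasks can work on one file in parallel. The sketch below estimates the number of input splits for a single file. It is a simplified model, not Hadoop code: `SplitEstimator` is a hypothetical name, the 128 MB split size is an assumed default, and the 1.1 slop factor follows the tolerance `FileInputFormat` applies to the last split.

```java
// Hypothetical estimator; a simplified model of how a file is cut into input splits.
final class SplitEstimator {

    // Tolerance for the final split: a tail of up to 10% is merged into the last split.
    private static final double SPLIT_SLOP = 1.1;

    /**
     * Estimates the number of splits for one file.
     * A file in a non-splittable format is always processed as a single split.
     */
    static int estimateSplits(long fileBytes, long splitBytes, boolean splittable) {
        if (fileBytes <= 0 || splitBytes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        if (!splittable) {
            return 1;
        }
        int splits = 0;
        long remaining = fileBytes;
        while ((double) remaining / splitBytes > SPLIT_SLOP) {
            splits++;
            remaining -= splitBytes;
        }
        if (remaining > 0) {
            splits++;
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileBytes = 1024 * mb;   // a 1 GB file
        long splitBytes = 128 * mb;   // assumed split size
        System.out.println("bzip2 (splittable): " + estimateSplits(fileBytes, splitBytes, true));
        System.out.println("gzip (not splittable): " + estimateSplits(fileBytes, splitBytes, false));
    }
}
```

In this model a 1 GB bzip2 file can feed eight map tasks, while the same data as one gzip file is read by a single map task, however large the file is.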

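1. Gzip compression

Pros: relatively high compression ratio
Cons: does not support splitting; compression/decompression speed is moderate

2. Bzip2 compression

Pros: high compression ratio; supports splitting
Cons: slow compression/decompression

3. Lzo compression

Pros: fairly fast compression/decompression; supports splitting
Cons: moderate compression ratio; splitting requires building an additional index

4. Snappy compression

Pros: fast compression and decompression
Cons: does not support splitting; moderate compression ratio

5. Choosing where to compress

Compression can be enabled at any stage of a MapReduce job: on the input, on the map output, and on the reduce output.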


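4. Compression parameter configuration

To support multiple compression/decompression algorithms, Hadoop introduces codecs (coder/decoders).

Format	Corresponding codec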
DEFLATE	org.apache.hadoop.io.compress.DefaultCodec
gzip	org.apache.hadoop.io.compress.GzipCodec
bzip2	org.apache.hadoop.io.compress.BZip2Codec
LZO	com.hadoop.compression.lzo.LzopCodec
Snappy	org.apache.hadoop.io.compress.SnappyCodec
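The idea behind a codec is that one interface hides the algorithm, so the job only has to name a class. The sketch below models that idea with the JDK's `java.util.zip` streams. `StreamCodec` is a hypothetical, simplified stand-in whose method names are modeled on Hadoop's `CompressionCodec` interface; it is not the Hadoop interface itself, and the two implementations are JDK-based approximations of `GzipCodec` and `DefaultCodec`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical sketch of the codec abstraction; not Hadoop code.
final class CodecSketch {

    /** Simplified stand-in for a compression codec: it wraps raw streams. */
    interface StreamCodec {
        OutputStream createOutputStream(OutputStream raw) throws IOException;
        InputStream createInputStream(InputStream raw) throws IOException;
        String getDefaultExtension();
    }

    static final class GzipStreamCodec implements StreamCodec {
        @Override
        public OutputStream createOutputStream(OutputStream raw) throws IOException {
            return new GZIPOutputStream(raw);
        }
        @Override
        public InputStream createInputStream(InputStream raw) throws IOException {
            return new GZIPInputStream(raw);
        }
        @Override
        public String getDefaultExtension() {
            return ".gz";
        }
    }

    static final class DeflateStreamCodec implements StreamCodec {
        @Override
        public OutputStream createOutputStream(OutputStream raw) {
            return new DeflaterOutputStream(raw);
        }
        @Override
        public InputStream createInputStream(InputStream raw) {
            return new InflaterInputStream(raw);
        }
        @Override
        public String getDefaultExtension() {
            return ".deflate";
        }
    }

    /** Compresses and decompresses data through the given codec, whatever its algorithm. */
    static byte[] roundTrip(StreamCodec codec, byte[] data) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            try (OutputStream out = codec.createOutputStream(sink)) {
                out.write(data);
            }
            ByteArrayOutputStream restored = new ByteArrayOutputStream();
            try (InputStream in = codec.createInputStream(new ByteArrayInputStream(sink.toByteArray()))) {
                byte[] buffer = new byte[4096];
                int n;
                while ((n = in.read(buffer)) != -1) {
                    restored.write(buffer, 0, n);
                }
            }
            return restored.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The calling code in `roundTrip` never mentions gzip or DEFLATE; swapping the algorithm means passing a different codec object, which is the same reason a Hadoop job can switch formats by changing one class name.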


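To enable compression in Hadoop, configure the relevant parameters. Map output compression is controlled by mapreduce.map.output.compress and mapreduce.map.output.compress.codec, and reduce output compression is enabled through FileOutputFormat, as the examples in section 5 show.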


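5. Hands-on compression examples

1. Compressing the map output

Even if the input and output files of your MapReduce job are uncompressed, you can still compress the intermediate output of the map tasks. That output is written to disk and transferred over the network to the reduce nodes, so compressing it can improve performance considerably. Only two properties need to be set; the code below shows how.

The Hadoop source provided for this walkthrough supports the following compression codecs: BZip2Codec and DefaultCodec.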

package com.fickler.mapreduce.compress;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * @author dell
 * @version 1.0
 */
public class WordCountDriver {

    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {

        Configuration configuration = new Configuration();

        // Enable compression of the map output
        configuration.setBoolean("mapreduce.map.output.compress", true);

        // Set the codec used to compress the map output
        configuration.setClass("mapreduce.map.output.compress.codec", BZip2Codec.class, CompressionCodec.class);

        Job job = Job.getInstance(configuration);

        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.setInputPaths(job, new Path("C:\\Users\\dell\\Desktop\\input"));
        FileOutputFormat.setOutputPath(job, new Path("C:\\Users\\dell\\Desktop\\output"));

        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);

    }

}

The Mapper stays unchanged

package com.fickler.mapreduce.compress;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * @author dell
 * @version 1.0
 */
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    Text k = new Text();
    IntWritable v = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, IntWritable>.Context context) throws IOException, InterruptedException {

        String line = value.toString();
        String[] split = line.split(" ");
        for (String word : split){
            k.set(word);
            context.write(k, v);
        }

    }
}

The Reducer stays unchanged

package com.fickler.mapreduce.compress;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * @author dell
 * @version 1.0
 */
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    IntWritable v = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {

        int sum = 0;
        for (IntWritable value : values){
            sum += value.get();
        }

        v.set(sum);

        context.write(key, v);

    }
}

2. Compressing the reduce output

Modify the driver

package com.fickler.mapreduce.compress;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * @author dell
 * @version 1.0
 */
public class WordCountDriver {

    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {

        Configuration configuration = new Configuration();

        // Enable compression of the map output
        configuration.setBoolean("mapreduce.map.output.compress", true);

        // Set the codec used to compress the map output
        configuration.setClass("mapreduce.map.output.compress.codec", BZip2Codec.class, CompressionCodec.class);

        Job job = Job.getInstance(configuration);

        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.setInputPaths(job, new Path("C:\\Users\\dell\\Desktop\\input"));
        FileOutputFormat.setOutputPath(job, new Path("C:\\Users\\dell\\Desktop\\output"));

        // Enable compression of the reduce output
        FileOutputFormat.setCompressOutput(job, true);

        // Set the codec for the reduce output (the alternatives are left commented out)
//        FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
//        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        FileOutputFormat.setOutputCompressorClass(job, DefaultCodec.class);

        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);

    }

}
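With `DefaultCodec` the result file gets a `.deflate` extension and is no longer readable as plain text. The following sketch shows one way to inspect such a file with the JDK alone. It rests on the assumption that `DefaultCodec` writes zlib-format data that `InflaterInputStream` can read and that `GzipCodec` output can be read by `GZIPInputStream`; bzip2 is not covered because the JDK has no bzip2 stream. `CompressedOutputReader` is a hypothetical name, and the file path is whatever you pass on the command line.

```java
import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical helper for inspecting compressed job output; not part of Hadoop.
final class CompressedOutputReader {

    /** Picks a decompressing stream from the file extension; other files are read as-is. */
    static InputStream open(String fileName, InputStream raw) throws IOException {
        if (fileName.endsWith(".gz")) {
            return new GZIPInputStream(raw);
        }
        if (fileName.endsWith(".deflate")) {
            return new InflaterInputStream(raw);
        }
        return raw;
    }

    /** Reads all lines of a (possibly compressed) text stream as UTF-8. */
    static List<String> readLines(String fileName, InputStream raw) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(open(fileName, raw), StandardCharsets.UTF_8))) {
            List<String> lines = new ArrayList<>();
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Produces zlib-format bytes in memory so the reader can be tried without a job run. */
    static byte[] zlibBytes(String text) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(sink)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: CompressedOutputReader <path to part-r-00000.deflate or .gz>");
            return;
        }
        Path file = Paths.get(args[0]);
        try (InputStream in = Files.newInputStream(file)) {
            readLines(file.getFileName().toString(), in).forEach(System.out::println);
        }
    }
}
```

Pointing it at the `part-r-00000.deflate` file in the output directory should print the word counts as tab-separated lines, provided the zlib-format assumption holds for your Hadoop build.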
