Data Cleaning, Chained MapReduce Jobs, and Compression

  • I. Data Cleaning
  • Example
  • Data
  • Expected result
  • Map-side cleaning
  • Driver
  • II. Counters
  • III. Chained MapReduce Jobs
  • Example:
  • Data
  • Approach
  • Map side
  • 1. Averaging mapper
  • 2. Counting mapper
  • 3. Summing mapper
  • Reduce side
  • 1. Averaging reducer
  • 2. Counting reducer
  • Driver
  • IV. Compression
  • Example:
  • Unzip
  • zip


I. Data Cleaning

Purpose: the raw data collected by Flume is usually irregular and does not match the required format, so erroneous and invalid records need to be removed.


Data sources:
	data from web applications (user operation logs), apps, crawlers, ... ...

Requirements:
	the data has problems with both its format and its content, so it must be filtered:
	fix the format, drop bad content, keep only the fields that are needed

Example

Data crawled from the Maoyan movie site (JSON):
		[
			{"":"","":"",.... .....},
			{},
			{},
			{},
			... ... ...
		]

	Expected output format:
		field1 | field1 | field1 ... ...

	Requirements: strip the JSON punctuation "{}", ":", double quotes, "[]", ","
		 and drop erroneous records (records with missing attributes)

	Required jar: json-lib.jar
Data
[
{"rank" : "1","name" :     "霸王别姬1",      "actors":"张国荣1,巩俐1","time":     "1993-01-01","score":"9.3"},
{"rank":"2","name":    "霸王别姬2","actors":"张国荣2,巩俐2","time":"1993-01-02","score":"9.2"},	
{"rank":"3","name":      "霸王别姬3","actors":        "张国荣3,巩俐3","time":"1993-01-03","score":"9.1"}, 
{"rank":"4","name":      "","actors":        "张国荣4,巩俐4","time":"1993-01-04","score":""}, 
{"rank":"5","name":      "霸王别姬5","actors":        "张国荣5,巩俐5","time":"1993-01-05","score":"9.5",
]
Expected result
1|霸王别姬1|张国荣1,巩俐1|1993-01-01|9.3
2|霸王别姬2|张国荣2,巩俐2|1993-01-02|9.2
3|霸王别姬3|张国荣3,巩俐3|1993-01-03|9.1
5|霸王别姬5|张国荣5,巩俐5|1993-01-05|9.5
Map-side cleaning
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import net.sf.json.JSONObject;

public class CleanMap extends Mapper<LongWritable, Text, Text, NullWritable> {

	@Override
	protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, NullWritable>.Context context)
			throws IOException, InterruptedException {

		// 1. Read one line of input
		String line = value.toString().trim();
		// 2. Skip empty lines and the enclosing "[" / "]" lines
		if (line.isEmpty() || line.startsWith("[") || line.startsWith("]")) {
			return;
		}

		// 3. Drop the trailing comma, if any
		if (line.endsWith(",")) {
			line = line.substring(0, line.length() - 1);
		}

		// Repair records that are missing the closing brace
		if (!line.endsWith("}")) {
			line = line.concat("}");
		}

		// 4. Parse the JSON record and pull out the five fields
		JSONObject objJSON = JSONObject.fromObject(line);
		String[] datas = new String[5];
		datas[0] = objJSON.getString("rank");
		datas[1] = objJSON.getString("name");
		datas[2] = objJSON.getString("actors");
		datas[3] = objJSON.getString("time");
		datas[4] = objJSON.getString("score");

		// 5. Discard invalid records (any empty field makes the record invalid)
		for (String data : datas) {
			if (data.isEmpty()) {
				context.getCounter("data", "dataError").increment(1);
				return;
			}
		}
		context.getCounter("data", "dataSuccess").increment(1);

		// 6. Build the expected output format: field1|field2|...
		String result = "";
		for (String string : datas) {
			result = result + string + "|";
		}
		result = result.substring(0, result.length() - 1);

		// 7. Emit the cleaned record
		Text resultText = new Text();
		resultText.set(result);
		context.write(resultText, NullWritable.get());
	}

}
Driver
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ETLFormatCheckDemo {

	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		// 1. Build the job
		Job job = Job.getInstance(conf);
		// 2. Configure the job
		job.setJarByClass(ETLFormatCheckDemo.class);

		job.setMapperClass(CleanMap.class);

		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(NullWritable.class);

		Path in = new Path("e://json/in");
		Path out = new Path("e://json/out");

		FileInputFormat.addInputPath(job, in);
		FileOutputFormat.setOutputPath(job, out);

		// Map-only job: no reduce phase is needed for cleaning
		job.setNumReduceTasks(0);

		// 3. Submit the job and wait for it to finish
		job.waitForCompletion(true);
	}

}

II. Counters

Concept:
	a tool for recording the progress and status of a job (a kind of operation log for the job)
	counters are incremented at chosen points in the program to record how the data processing is going

Purpose:
	observe the runtime details of an MR job
	tune MR performance

Usage:
	custom counters
	1. Get a counter
		Counter counter = context.getCounter(String groupName, String counterName);
	2. Set an initial value
		counter.setValue(long value)
	3. Update the value
		counter.increment(long incr)

Example:
record how many records were cleaned
and how many of them were valid

counters can be used in the map phase, the reduce phase, or any other phase of the program, as in the CleanMap mapper above; the values can be read back in the driver, as in the sketch below
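
The dataError / dataSuccess counters incremented in CleanMap can be read back in the driver once the job has finished. A minimal sketch, reusing the CleanMap mapper and the counter names from above (the input/output paths here are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CounterReadDemo {

	public static void main(String[] args) throws Exception {
		Job job = Job.getInstance(new Configuration());
		job.setJarByClass(CounterReadDemo.class);
		job.setMapperClass(CleanMap.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(NullWritable.class);
		job.setNumReduceTasks(0);
		FileInputFormat.addInputPath(job, new Path("e://json/in"));
		FileOutputFormat.setOutputPath(job, new Path("e://json/out_counters"));

		if (job.waitForCompletion(true)) {
			// Read the custom counters back from the finished job
			Counter ok = job.getCounters().findCounter("data", "dataSuccess");
			Counter bad = job.getCounters().findCounter("data", "dataError");
			System.out.println("valid records  : " + ok.getValue());
			System.out.println("invalid records: " + bad.getValue());
		}
	}

}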

III. Chained MapReduce Jobs

Scenario:
	when the processing is too complex for a single MR job, split the work into several sub-jobs and run them in sequence

Requirement: compute the average of a set of numbers
	10873
	334
	34
	1231
	9890734
	234

	split the task into three jobs:
		1. job1
			computes the sum of all of the numbers
			input: the file containing the numbers
				E:\avg\in
			output: a dedicated directory
				E:\avg\out\sum_count\sum
		2. job2
			counts how many numbers there are
			input: the file containing the numbers
				E:\avg\in
			output: a dedicated directory
				E:\avg\out\sum_count\count

		3. job3
			computes the average
			input: the outputs of job1 and job2
				E:\avg\out\sum_count

			output: a dedicated directory
				E:\avg\out\avgResult

	Note: job3 may only run after job1 and job2 have finished,
		 because the data job3 reads is the result data produced by job1 and job2

Example:

Compute the average, the count, and the sum of a set of numbers
Data
2342
43534
1
2
34
5
436
45
7
56
3
453

Result
avg	3909
count	12
numSum	46918
Approach
Coding:
	create three jobs

	pay attention to the order in which the jobs are submitted

	methods that can be overridden in a Mapper (all run within one map task, i.e. for one split):
		setup()  /  map()  /  cleanup()

	methods that can be overridden in a Reducer (reduce() is called once per key group):
		setup()  /  reduce()  /  cleanup()

	Note:
		the Text class stores character data;
		use its toString() method to get the value back as a String
Map side
1. Averaging mapper
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class AvgMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

	@Override
	protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, LongWritable>.Context context)
			throws IOException, InterruptedException {
		// Each input line comes from job1/job2 output, e.g. "numSum\t46918" or "count\t12"
		String line = value.toString();
		String[] datas = line.split("\t");
		long valueResult = Long.parseLong(datas[1]);

		// Re-emit the key ("numSum" or "count") with its value
		context.write(new Text(datas[0]), new LongWritable(valueResult));
	}

}
2. Counting mapper
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

	@Override
	protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, LongWritable>.Context context)
			throws IOException, InterruptedException {
		// 1. One input line holds one number
		String line = value.toString();
		// 2. Parse it
		long num = Long.parseLong(line);
		// 3. Emit everything under the same key so the reducer can count the values
		context.write(new Text("count"), new LongWritable(num));
	}

}
3. Summing mapper
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SumMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

	long numSum = 0;

	@Override
	protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, LongWritable>.Context context)
			throws IOException, InterruptedException {
		// Accumulate each number into the running sum
		numSum += Long.parseLong(value.toString());
	}

	@Override
	protected void cleanup(Mapper<LongWritable, Text, Text, LongWritable>.Context context)
			throws IOException, InterruptedException {
		// Emit the sum once, after all lines of the split have been processed
		context.write(new Text("numSum"), new LongWritable(numSum));
	}

}
Reduce side
1. Averaging reducer
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AvgReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

	long sum = 0;
	long count = 0;

	@Override
	protected void reduce(Text key, Iterable<LongWritable> values,
			Reducer<Text, LongWritable, Text, LongWritable>.Context context) throws IOException, InterruptedException {

		// The keys are "count" (from job2's output) and "numSum" (from job1's output)
		String k = key.toString();
		for (LongWritable longWritable : values) {
			if (k.equals("count")) {
				count = longWritable.get();
			} else {
				sum = longWritable.get();
			}
		}
	}

	@Override
	protected void cleanup(Reducer<Text, LongWritable, Text, LongWritable>.Context context)
			throws IOException, InterruptedException {
		// Both key groups have been processed by now (single reduce task), so the average can be written
		context.write(new Text("avg"), new LongWritable(sum / count));
	}

}
2. Counting reducer
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

	@Override
	protected void reduce(Text key, Iterable<LongWritable> values,
			Reducer<Text, LongWritable, Text, LongWritable>.Context context) throws IOException, InterruptedException {
		// All values arrive under the key "count"; counting them gives the number of input lines
		long count = 0;
		for (LongWritable longWritable : values) {
			count++;
		}
		context.write(key, new LongWritable(count));
	}

}
Driver
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AvgClient {

	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();

		// Paths shared by the three jobs
		Path inSrcPath = new Path("E:/avg/in");
		Path sumDesPath = new Path("E:/avg/out/sum_count/sum");
		Path countDesPath = new Path("E:/avg/out/sum_count/count");
		Path avgDesPath = new Path("E:/avg/out/avgResult");

		// 1. job1: sum all of the numbers (map-only; the sum is emitted in cleanup())
		Job job1 = Job.getInstance(conf, "job1");
		job1.setJarByClass(AvgClient.class);

		job1.setMapperClass(SumMapper.class);

		FileInputFormat.addInputPath(job1, inSrcPath);
		FileOutputFormat.setOutputPath(job1, sumDesPath);

		job1.setOutputKeyClass(Text.class);           // "numSum"
		job1.setOutputValueClass(LongWritable.class); // e.g. 46918

		// 2. job2: count how many numbers there are
		Job job2 = Job.getInstance(conf, "job2");
		job2.setJarByClass(AvgClient.class);

		job2.setMapperClass(CountMapper.class);   // emits ("count", n) for every input line
		job2.setReducerClass(CountReducer.class); // counts the values per key

		FileInputFormat.addInputPath(job2, inSrcPath);
		FileOutputFormat.setOutputPath(job2, countDesPath);

		job2.setOutputKeyClass(Text.class);           // "count"
		job2.setOutputValueClass(LongWritable.class); // e.g. 12

		// 3. job3: compute the average from the outputs of job1 and job2
		Job job3 = Job.getInstance(conf, "job3");
		job3.setJarByClass(AvgClient.class);

		job3.setMapperClass(AvgMapper.class);
		job3.setReducerClass(AvgReducer.class);

		FileInputFormat.addInputPath(job3, sumDesPath);
		FileInputFormat.addInputPath(job3, countDesPath);
		FileOutputFormat.setOutputPath(job3, avgDesPath);

		job3.setOutputKeyClass(Text.class);           // "avg"
		job3.setOutputValueClass(LongWritable.class);

		// 4. Submit the jobs in dependency order: job3 runs only after job1 and job2 succeed
		if (job1.waitForCompletion(true) && job2.waitForCompletion(true)) {
			job3.waitForCompletion(true);
		}
	}

}
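
Instead of ordering the waitForCompletion() calls by hand, the same dependency (job3 after job1 and job2) can also be expressed with Hadoop's JobControl / ControlledJob API. A minimal sketch, assuming the three Job objects are configured exactly as in AvgClient above (the class and method names here are illustrative):

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

public class AvgJobControlClient {

	// job1, job2, job3 are assumed to be built the same way as in AvgClient
	public static void runChained(Job job1, Job job2, Job job3) throws Exception {
		ControlledJob cJob1 = new ControlledJob(job1, new ArrayList<ControlledJob>());
		ControlledJob cJob2 = new ControlledJob(job2, new ArrayList<ControlledJob>());

		// job3 depends on job1 and job2
		List<ControlledJob> deps = new ArrayList<ControlledJob>();
		deps.add(cJob1);
		deps.add(cJob2);
		ControlledJob cJob3 = new ControlledJob(job3, deps);

		JobControl control = new JobControl("avgGroup");
		control.addJob(cJob1);
		control.addJob(cJob2);
		control.addJob(cJob3);

		// JobControl is a Runnable: run it on its own thread and wait until all jobs finish
		Thread t = new Thread(control);
		t.setDaemon(true);
		t.start();
		while (!control.allFinished()) {
			Thread.sleep(1000);
		}
		control.stop();
	}

}

One advantage of this style is that the dependency information lives in one place, and jobs without dependencies (job1 and job2 here) can be submitted concurrently.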

IV. Compression

Benefits:
	smaller data to transfer, so data moves between nodes more efficiently
	less disk IO when reading data, so reads are faster
	less disk space used when data is persisted

	IO-bound scenarios benefit from compression; compute-bound scenarios are usually not a good fit

Where compression can be applied in MapReduce:
	1. before map reads its input
	2. during shuffle (the data exchanged between map and reduce)
	3. after reduce writes its output


Choosing a compression format: weigh the time it takes, the compression ratio, and whether splits are supported
	for compression before map, the key question is whether the format can be split

	common compression formats:
		deflate   ships with Hadoop     .deflate    not splittable
		bzip2     ships with Hadoop     .bz2        splittable
		gzip      ships with Hadoop     .gz         not splittable
		Snappy    must be installed     .snappy     not splittable
		LZO       must be installed     .lzo        splittable (with an index)

Codecs:
	a codec is a utility class that implements one compression/decompression algorithm
	each compression algorithm has its own codec


	codec classes:
		org.apache.hadoop.io.compress.BZip2Codec
		org.apache.hadoop.io.compress.DeflateCodec
		org.apache.hadoop.io.compress.GzipCodec
		org.apache.hadoop.io.compress.Lz4Codec


	// obtain a codec instance
	CompressionCodec codec = (CompressionCodec) ReflectionUtils
			.newInstance(
					Class.forName("org.apache.hadoop.io.compress.GzipCodec"),
					new Configuration());
	// wrap plain streams with a decompressing input stream / compressing output stream
	CompressionInputStream cis = codec.createInputStream(fis);
	CompressionOutputStream cos = codec.createOutputStream(out);
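
A codec can also be looked up from a file name instead of being instantiated by class name: CompressionCodecFactory maps the file extension to the matching codec. A minimal sketch, assuming an arbitrary local file path as input:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecLookupDemo {

	public static void main(String[] args) {
		Configuration conf = new Configuration();
		CompressionCodecFactory factory = new CompressionCodecFactory(conf);

		// The factory picks the codec from the file extension (.gz -> GzipCodec, .bz2 -> BZip2Codec, ...)
		Path path = new Path("E:/zip/yy.gz"); // placeholder path
		CompressionCodec codec = factory.getCodec(path);

		if (codec == null) {
			System.out.println("no codec registered for " + path);
		} else {
			System.out.println("codec: " + codec.getClass().getSimpleName()
					+ ", default extension: " + codec.getDefaultExtension());
		}
	}

}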


1. Snappy compression (typically used during shuffle)
	compresses very large data sets at high speed
	moderate compression ratio
	does not support splitting

2. LZO compression (typically used before map reads its input)
	compresses very large data sets at high speed
	lower compression ratio than gzip
	supports splitting (with an index)


Examples of compression at each stage:
	before map reads its input:
		example 1: compression and decompression in code
			the user compresses/decompresses the data by hand (see the Unzip / zip examples below)

	shuffle stage:
		example 2:
		// enable compression of the map output
			conf.setBoolean("mapreduce.map.output.compress", true);
		// choose the compression codec
			conf.setClass("mapreduce.map.output.compress.codec", BZip2Codec.class, CompressionCodec.class);

	reduce output stage:
		example 3:
			// enable output compression through the FileOutputFormat API
			FileOutputFormat.setCompressOutput(cwJob, true);

			// choose the compression codec (pick one of the following)
			FileOutputFormat.setOutputCompressorClass(cwJob, BZip2Codec.class);
			FileOutputFormat.setOutputCompressorClass(cwJob, GzipCodec.class);


Example:

Unzip
import java.io.FileInputStream;
import java.io.FileOutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionInputStream;
import org.apache.hadoop.util.ReflectionUtils;

public class UnzipClient {
	public static void main(String[] args) throws Exception {
		// 1. Open the streams and wrap the input with a decompressing stream
		FileInputStream fis = new FileInputStream("E:\\zip\\yy.gz");
		FileOutputStream fos = new FileOutputStream("E:\\zip\\centos.iso");
		CompressionCodec codec = (CompressionCodec) ReflectionUtils
				.newInstance(
						Class.forName("org.apache.hadoop.io.compress.GzipCodec"),
						new Configuration());
		CompressionInputStream cis = codec.createInputStream(fis);

		// 2. Copy the decompressed bytes to the output file
		IOUtils.copyBytes(cis, fos, 1024 * 10);
		// 3. Release the resources
		cis.close();
		fos.close();
		System.out.println("unzip over............");
	}
}
zip
import java.io.FileInputStream;
import java.io.FileOutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.DeflateCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.Lz4Codec;
import org.apache.hadoop.util.ReflectionUtils;

/**
 * Compare several compression codecs on the same input file.
 */
public class ZipClient {

	public static void main(String[] args) throws Exception {
		Class<?>[] codecClasses = new Class<?>[4];
		codecClasses[0] = Lz4Codec.class;
		codecClasses[1] = DeflateCodec.class;
		codecClasses[2] = GzipCodec.class;
		codecClasses[3] = BZip2Codec.class;
		// SnappyCodec can be added here once the native snappy library is installed
		Configuration conf = new Configuration();

		for (Class<?> class1 : codecClasses) {
			// 1. Open the input and a compressing output stream for this codec
			FileInputStream in = new FileInputStream("F:/other_soft/os/CentOS-7-x86_64-Minimal-1611.iso");
			long startTime = System.currentTimeMillis();

			CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(class1, conf);
			FileOutputStream out = new FileOutputStream("e://zip/yy" + codec.getDefaultExtension());
			CompressionOutputStream cos = codec.createOutputStream(out);

			// 2. Copy (and compress) the data
			IOUtils.copyBytes(in, cos, 1024 * 5);

			// 3. Release the resources
			cos.close();
			in.close();
			System.out.println("codec: " + class1.getSimpleName() + ", time taken (ms): "
					+ (System.currentTimeMillis() - startTime));
			System.out.println("----------------------------------------");
		}
	}

}