Data Cleaning, Multiple MapReduce Jobs, and Compression Algorithms
- 1. Data Cleaning
  - Case Study
  - Data
  - Expected Output
  - Map-Side Cleaning
  - Client Side
- 2. Counters
- 3. Chained MapReduce Jobs
  - Case Study:
  - Data
  - Approach
  - Map Side
    - 1. Average mapper
    - 2. Count mapper
    - 3. Sum mapper
  - Reduce Side
    - 1. Average reducer
    - 2. Count reducer
  - Client Side
- 4. Compression
  - Example:
  - Unzip
  - Zip
1. Data Cleaning
Purpose: the raw data collected by Flume is usually irregular and its format does not meet requirements, so erroneous and invalid records must be removed.
Data sources:
web application data (user operation logs), app data, crawler data, ... ...
Requirements:
filter out records whose format or content is problematic;
check the format, check the field contents, and keep only the fields the downstream needs
Case Study
JSON data crawled from the Maoyan movie site:
[
{"":"","":"",.... .....},
{},
{},
{},
... ... ...
]
Expected output format:
field1 | field2 | field3 ... ...
Requirement: strip the JSON punctuation: "{}", ":", quotation marks, "[]", ","
and drop erroneous records (records missing a field)
Dependency: json-lib.jar
Data
[
{"rank" : "1","name" : "霸王别姬1", "actors":"张国荣1,巩俐1","time": "1993-01-01","score":"9.3"},
{"rank":"2","name": "霸王别姬2","actors":"张国荣2,巩俐2","time":"1993-01-02","score":"9.2"},
{"rank":"3","name": "霸王别姬3","actors": "张国荣3,巩俐3","time":"1993-01-03","score":"9.1"},
{"rank":"4","name": "","actors": "张国荣4,巩俐4","time":"1993-01-04","score":""},
{"rank":"5","name": "霸王别姬5","actors": "张国荣5,巩俐5","time":"1993-01-05","score":"9.5",
]
Expected Output
1|霸王别姬1|张国荣1,巩俐1|1993-01-01|9.3
2|霸王别姬2|张国荣2,巩俐2|1993-01-02|9.2
3|霸王别姬3|张国荣3,巩俐3|1993-01-03|9.1
5|霸王别姬5|张国荣5,巩俐5|1993-01-05|9.5
Map-Side Cleaning
import java.io.IOException;
import net.sf.json.JSONObject;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CleanMap extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // 1. Read one line of input
        String line = value.toString();
        // 2. Skip empty lines and the enclosing "[" / "]" lines
        if (line.trim().isEmpty() || line.indexOf("[") == 0 || line.indexOf("]") == 0) {
            return;
        }
        // 3. Drop the trailing character (the "," record separator),
        //    then restore the closing "}" if it was removed or was missing
        line = line.substring(0, line.length() - 1);
        if (!line.endsWith("}")) {
            line = line.concat("}");
        }
        // 4. Parse the JSON record and extract the five fields
        JSONObject objJSON = JSONObject.fromObject(line);
        String[] datas = new String[5];
        datas[0] = objJSON.getString("rank");
        datas[1] = objJSON.getString("name");
        datas[2] = objJSON.getString("actors");
        datas[3] = objJSON.getString("time");
        datas[4] = objJSON.getString("score");
        // 5. Discard invalid records (any missing field makes the record invalid)
        for (String data : datas) {
            if (data.isEmpty()) {
                context.getCounter("data", "dataError").increment(1);
                return;
            }
        }
        context.getCounter("data", "dataSuccess").increment(1);
        // 6. Join the fields with "|" into the expected output format
        StringBuilder result = new StringBuilder();
        for (String field : datas) {
            result.append(field).append("|");
        }
        result.setLength(result.length() - 1); // drop the trailing "|"
        // 7. Emit the cleaned record
        context.write(new Text(result.toString()), NullWritable.get());
    }
}
Client Side
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ETLFormatCheckDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // 1. Build the job
        Job job = Job.getInstance(conf);
        // 2. Configure it
        job.setJarByClass(ETLFormatCheckDemo.class);
        job.setMapperClass(CleanMap.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path("E:/json/in"));
        FileOutputFormat.setOutputPath(job, new Path("E:/json/out"));
        job.setNumReduceTasks(0); // map-only job: no shuffle, no reduce
        // 3. Submit the job and wait
        job.waitForCompletion(true);
    }
}
2. Counters
Concept:
a tool that records a job's execution progress and status (for example, statistics over user operation logs)
counters are incremented at chosen points in the program to record how far data processing has progressed
Purpose:
observe the execution details of a running MR job
guide MR performance tuning
Usage:
custom counters
1. Get (or create) a counter
Counter counter = context.getCounter(String groupName, String counterName);
2. Set an initial value
counter.setValue(long value)
3. Update the value
counter.increment(long incr)
Case study:
count the records processed during data cleaning
and the number of valid records
counters can be used in the map phase, the reduce phase, or any other phase of the program; the client can read them back after the job finishes, as in the sketch below
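A minimal sketch of reading counters back, assuming the CleanMap job from section 1 (which uses group "data"); the class name CounterReport is illustrative:
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;

public class CounterReport {
    // Call after job.waitForCompletion(true) in the client
    public static void report(Job job) throws Exception {
        Counters counters = job.getCounters();
        Counter ok = counters.findCounter("data", "dataSuccess");
        Counter bad = counters.findCounter("data", "dataError");
        System.out.println("dataSuccess = " + ok.getValue());
        System.out.println("dataError   = " + bad.getValue());
    }
}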
3. Chained MapReduce Jobs
Scenario:
when the MR processing is too complex for a single job, split the business logic into several sub-jobs that run in sequence
Requirement: compute the average of a set of numbers
10873
334
34
1231
9890734
234
Split the task into three jobs:
1. job1
sums all the numbers
input: the file containing the numbers
E:\avg\in
output: a dedicated directory
E:\avg\out\sum_count\sum
2. job2
counts how many numbers there are
input: the same source file
E:\avg\in
output: a dedicated directory (each job needs its own output directory, since a job fails if its output directory already exists)
E:\avg\out\sum_count\count
3. job3
computes the average
input: the combined results of job1 and job2
E:\avg\out\sum_count
output:
E:\avg\out\avgResult
Note: job3 may run only after job1 and job2 have both finished,
because the data job3 operates on is the result data produced by job1 and job2 (a JobControl alternative is sketched after the client code below)
Case Study:
compute the average, the count, and the sum of a set of numbers
Data
2342
43534
1
2
34
5
436
45
7
56
3
453
Result (the combined output of the three jobs)
avg 3909
count 12
numSum 46918
Approach
Coding:
create three jobs
submit them in the right order (job3 last)
When overriding the Mapper methods, remember that one map task handles one split and calls:
setup() / map() / cleanup()
When overriding the Reducer methods, each reduce() call handles the data of one key group:
setup() / reduce() / cleanup()
Note:
the Text class stores character data;
retrieve it with the toString() method
Map Side
1. Average mapper
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class AvgMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each input line is a key/value pair written by job1 or job2,
        // e.g. "numSum\t46918" or "count\t12"
        String line = value.toString();
        String[] datas = line.split("\t");
        long valueResult = Long.parseLong(datas[1]);
        // Re-emit the pair so the reducer receives both the sum and the count
        context.write(new Text(datas[0]), new LongWritable(valueResult));
    }
}
2. Count mapper
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // 1. Read one line (one number per line)
        String line = value.toString();
        // 2. Parse it (the value itself is unused downstream; the reducer only counts)
        long num = Long.parseLong(line);
        // 3. Emit everything under the single key "count" so one reduce group sees all records
        context.write(new Text("count"), new LongWritable(num));
    }
}
3. Sum mapper
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SumMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    // Running total for this map task's split
    long numSum = 0;

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // 1. Accumulate the number on this line
        numSum += Long.parseLong(value.toString());
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emit the per-split total once, after every line of the split is processed
        context.write(new Text("numSum"), new LongWritable(numSum));
    }
}
Reduce Side
1. Average reducer
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AvgReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    // Shared across all reduce() calls of this reduce task
    long sum = 0;
    long count = 0;

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        String k = key.toString();
        for (LongWritable longWritable : values) {
            if (k.equals("count")) {
                count += longWritable.get();
            } else {
                // accumulate, in case several map tasks each emitted a partial sum
                sum += longWritable.get();
            }
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Both groups ("count" and "numSum") have been reduced by now;
        // emit the final average once
        context.write(new Text("avg"), new LongWritable(sum / count));
    }
}
2. Count reducer
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        // All records arrive under the single key "count";
        // the values themselves are ignored, only how many there are matters
        long count = 0;
        for (LongWritable ignored : values) {
            count++;
        }
        context.write(key, new LongWritable(count));
    }
}
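Note that CountReducer cannot be reused as a combiner: a combiner's output is fed back into the reducer as ordinary values, so pre-aggregated counts would each be counted as a single record and the total would come out too small. To make the job combiner-friendly, the mapper would have to emit a constant 1 per record and the reduce logic would have to sum the values instead of counting iterations.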
Client Side
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AvgClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // All the paths used by the three jobs
        Path inSrcPath = new Path("E:/avg/in");
        Path sumDesPath = new Path("E:/avg/out/sum_count/sum");
        Path countDesPath = new Path("E:/avg/out/sum_count/count");
        Path avgDesPath = new Path("E:/avg/out/avgResult");

        // 1. job1: sum all numbers (SumMapper accumulates per split;
        //    the default identity reducer writes the partial sums out)
        Job job1 = Job.getInstance(conf, "job1");
        job1.setJarByClass(AvgClient.class);
        job1.setMapperClass(SumMapper.class);
        FileInputFormat.addInputPath(job1, inSrcPath);
        FileOutputFormat.setOutputPath(job1, sumDesPath);
        job1.setOutputKeyClass(Text.class);       // "numSum"
        job1.setOutputValueClass(LongWritable.class);

        // 2. job2: count the numbers
        Job job2 = Job.getInstance(conf, "job2");
        job2.setJarByClass(AvgClient.class);
        job2.setMapperClass(CountMapper.class);   // emits ("count", n) per line
        job2.setReducerClass(CountReducer.class); // counts the values per key
        FileInputFormat.addInputPath(job2, inSrcPath);
        FileOutputFormat.setOutputPath(job2, countDesPath);
        job2.setOutputKeyClass(Text.class);       // "count"
        job2.setOutputValueClass(LongWritable.class);

        // 3. job3: read both result directories and compute the average
        Job job3 = Job.getInstance(conf, "job3");
        job3.setJarByClass(AvgClient.class);
        job3.setMapperClass(AvgMapper.class);
        job3.setReducerClass(AvgReducer.class);
        FileInputFormat.addInputPath(job3, sumDesPath);
        FileInputFormat.addInputPath(job3, countDesPath);
        FileOutputFormat.setOutputPath(job3, avgDesPath);
        job3.setOutputKeyClass(Text.class);       // "avg"
        job3.setOutputValueClass(LongWritable.class);

        // 4. Submit in order: job3 runs only after job1 and job2 both succeed
        if (job1.waitForCompletion(true) && job2.waitForCompletion(true)) {
            job3.waitForCompletion(true);
        }
    }
}
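As an alternative to hand-ordered waitForCompletion calls, Hadoop ships a JobControl API (package org.apache.hadoop.mapreduce.lib.jobcontrol) that declares the dependency of job3 on job1 and job2 explicitly. A minimal sketch, assuming the same three Job objects built in AvgClient; the class name AvgJobControl and helper run are illustrative:
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

public class AvgJobControl {
    public static void run(Job job1, Job job2, Job job3) throws Exception {
        ControlledJob c1 = new ControlledJob(job1, null);
        ControlledJob c2 = new ControlledJob(job2, null);
        ControlledJob c3 = new ControlledJob(job3, null);
        c3.addDependingJob(c1); // job3 starts only after job1 succeeds
        c3.addDependingJob(c2); // ...and after job2 succeeds

        JobControl control = new JobControl("avgChain");
        control.addJob(c1);
        control.addJob(c2);
        control.addJob(c3);

        Thread t = new Thread(control); // JobControl implements Runnable
        t.setDaemon(true);
        t.start();
        while (!control.allFinished()) { // poll until every job has finished
            Thread.sleep(1000);
        }
        control.stop();
    }
}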
4. Compression
Benefits:
smaller transfer sizes, so data moves between nodes faster
less disk IO when reading data
less disk space when persisting data
compression suits IO-bound scenarios; CPU-bound scenarios are poor candidates, because codecs cost CPU time
Where compression can be applied in MapReduce:
1. before the map receives its input
2. during the shuffle (the data exchange between map and reduce)
3. after the reduce writes its output
Choosing a format means trading off time spent, compression ratio, and split support.
For compression before the map, split support determines whether the input can be split and processed in parallel.
Common formats:
deflate   ships with Hadoop      .deflate   not splittable
bzip2     ships with Hadoop      .bz2       splittable
gzip      ships with Hadoop      .gz        not splittable
Snappy    requires installation  .snappy    not splittable
LZO       requires installation  .lzo       splittable (once indexed)
Codecs:
a codec is a utility class that implements one compression/decompression algorithm
each compression algorithm has a matching codec class
Codec classes:
org.apache.hadoop.io.compress.BZip2Codec
org.apache.hadoop.io.compress.DeflateCodec
org.apache.hadoop.io.compress.GzipCodec
org.apache.hadoop.io.compress.Lz4Codec
// instantiate a codec reflectively
CompressionCodec codec = (CompressionCodec) ReflectionUtils
        .newInstance(
                Class.forName("org.apache.hadoop.io.compress.GzipCodec"),
                new Configuration());
// wrap plain streams with compression-aware ones (fis/out are existing streams)
CompressionInputStream cis = codec.createInputStream(fis);
CompressionOutputStream cos = codec.createOutputStream(out);
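When the codec is not known in advance, Hadoop's CompressionCodecFactory can pick it from the file extension instead of a hard-coded Class.forName. A minimal sketch; the class name CodecByExtension and the helper open are illustrative:
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecByExtension {
    public static InputStream open(String file) throws Exception {
        CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
        CompressionCodec codec = factory.getCodec(new Path(file)); // matched by extension
        InputStream in = new FileInputStream(file);
        // null means the extension matched no known codec: read the file as-is
        return codec == null ? in : codec.createInputStream(in);
    }
}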
Typical choices (original diagram omitted):
1. Snappy (during the shuffle)
compresses very large data sets quickly
modest compression ratio
no split support
2. LZO (before the map receives its input)
also fast on very large data sets
lower compression ratio than gzip
supports splitting (once an index is built)
Examples by stage:
Before the map receives its input:
Example 1: encode and decode
the user's own code compresses and decompresses the data (see the Unzip and Zip programs below)
During the shuffle:
Example 2:
// enable compression of the map output
conf.setBoolean("mapreduce.map.output.compress", true);
// choose the compression codec
conf.setClass("mapreduce.map.output.compress.codec", BZip2Codec.class, CompressionCodec.class);
After the reduce writes its output:
Example 3:
// enable output compression through the FileOutputFormat API
FileOutputFormat.setCompressOutput(cwJob, true);
// choose one codec (BZip2Codec, GzipCodec, ...); if set repeatedly, the last call wins
FileOutputFormat.setOutputCompressorClass(cwJob, BZip2Codec.class);
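The two settings can live in one driver. The sketch below combines them: bzip2 for the intermediate map output and gzip for the final reduce output. The class name, argument-based paths, and the elided mapper/reducer setup are placeholders, not from the original notes:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // shuffle: compress the intermediate map output with bzip2
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                BZip2Codec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compressedJob");
        job.setJarByClass(CompressedJobDriver.class);
        // ... set mapper/reducer/key/value classes here ...
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // final output: gzip the reduce output files
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}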
Example:
Unzip
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionInputStream;
import org.apache.hadoop.util.ReflectionUtils;

public class UnzipClient {
    public static void main(String[] args) throws Exception {
        // 1. Open the compressed source and the plain destination
        FileInputStream fis = new FileInputStream("E:\\zip\\yy.gz");
        FileOutputStream fos = new FileOutputStream("E:\\zip\\centos.iso");
        CompressionCodec codec = (CompressionCodec) ReflectionUtils
                .newInstance(
                        Class.forName("org.apache.hadoop.io.compress.GzipCodec"),
                        new Configuration());
        CompressionInputStream cis = codec.createInputStream(fis);
        // 2. Copy the decompressed bytes across
        IOUtils.copyBytes(cis, fos, 1024 * 10);
        // 3. Release resources
        cis.close();
        fos.close();
        System.out.println("unzip over............");
    }
}
Zip
/**
 * Compare several compression codecs on the same input file.
 */
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.DeflateCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.Lz4Codec;
import org.apache.hadoop.util.ReflectionUtils;

public class ZipClient {
    public static void main(String[] args) throws Exception {
        // SnappyCodec is omitted here because it needs native libraries installed
        Class<?>[] codecClasses = {
                Lz4Codec.class,
                DeflateCodec.class,
                GzipCodec.class,
                BZip2Codec.class,
        };
        Configuration conf = new Configuration();
        for (Class<?> codecClass : codecClasses) {
            // 1. Open the input and a codec-specific output
            FileInputStream in = new FileInputStream("F:/other_soft/os/CentOS-7-x86_64-Minimal-1611.iso");
            long startTime = System.currentTimeMillis();
            CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
            FileOutputStream out = new FileOutputStream("E:/zip/yy" + codec.getDefaultExtension());
            CompressionOutputStream cos = codec.createOutputStream(out);
            // 2. Copy the data through the compressing stream
            IOUtils.copyBytes(in, cos, 1024 * 5);
            // 3. Release resources
            cos.close();
            in.close();
            System.out.println("codec: " + codecClass.getSimpleName()
                    + ", time taken: " + (System.currentTimeMillis() - startTime) + " ms");
            System.out.println("----------------------------------------");
        }
    }
}