数据去重"主要是为了掌握和利用并行化思想来对数据进行有意义的筛选。统计大数据集上的数据种类个数、从网站日志中计算访问地等这些看似庞杂的任务都会涉及数据去重。下面就进入这个实例的MapReduce程序设计。
1.1 实例描述
对数据文件中的数据进行去重。数据文件中的每行都是一个数据。
输入如下所示:
1.txt 内容
1
1
2
2
2.txt内容
4
4
3
3
样例输出如下:
1
2
3
4
将测试数据上传到hdfs上/input目录下,同时确保输出目录不存在。
1.2 设计思路
数据去重的最终目标是让原始数据中出现次数超过一次的数据在输出文件中只出现一次。
Map输入key 为行号,value为行的内容(假设这是我们所需要取得值,当然如果不是,可是使用字符分割得到)
Map输出key为行的内容,value为NullWritable(无值)
Reduce输入Key为值,value为NullWritable
Reduce原样输出
中间可在map端使用combine
1.3 程序代码
package test;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class Dedub {
public static class Map extends Mapper<Object, Text, Text, NullWritable>{
protected void map(Object key, Text value, Context context)
throws java.io.IOException ,InterruptedException {
context.write(value, NullWritable.get());
};
}
public static class Reduce extends Reducer<Text, NullWritable, Text, NullWritable>{
protected void reduce(Text key, Iterable<NullWritable> values, Context context)
throws java.io.IOException ,InterruptedException {
context.write(key, NullWritable.get());
};
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
if (otherArgs.length != 2) {
System.err.println("Usage: Data Deduplication <in> <out>");
System.exit(2);
}
Job job = new Job(conf, "Data Duplication");
job.setJarByClass(Dedub.class);
//设置Map、Combine和Reduce处理类
job.setMapperClass(Map.class);
job.setCombinerClass(Reduce.class);
job.setReducerClass(Reduce.class);
//设置输出类型
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);
//设置输入和输出目录
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
4、设置程序输入参数,myeclipse设置。运行,得到结果:
14/06/15 22:01:32 INFO mapred.JobClient: map 100% reduce 100%
14/06/15 22:01:32 INFO mapred.JobClient: Job complete: job_local_0001
14/06/15 22:01:32 INFO mapred.JobClient: Counters: 19
14/06/15 22:01:32 INFO mapred.JobClient: File Output Format Counters
14/06/15 22:01:32 INFO mapred.JobClient: Bytes Written=9
14/06/15 22:01:32 INFO mapred.JobClient: FileSystemCounters
14/06/15 22:01:32 INFO mapred.JobClient: FILE_BYTES_READ=81479
14/06/15 22:01:32 INFO mapred.JobClient: HDFS_BYTES_READ=43
14/06/15 22:01:32 INFO mapred.JobClient: FILE_BYTES_WRITTEN=279482
14/06/15 22:01:32 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=9
14/06/15 22:01:32 INFO mapred.JobClient: File Input Format Counters
14/06/15 22:01:32 INFO mapred.JobClient: Bytes Read=17
14/06/15 22:01:32 INFO mapred.JobClient: Map-Reduce Framework
14/06/15 22:01:32 INFO mapred.JobClient: Map output materialized bytes=31
14/06/15 22:01:32 INFO mapred.JobClient: Map input records=9
14/06/15 22:01:32 INFO mapred.JobClient: Reduce shuffle bytes=0
14/06/15 22:01:32 INFO mapred.JobClient: Spilled Records=10
14/06/15 22:01:32 INFO mapred.JobClient: Map output bytes=17
14/06/15 22:01:32 INFO mapred.JobClient: Total committed heap usage (bytes)=492109824
14/06/15 22:01:32 INFO mapred.JobClient: SPLIT_RAW_BYTES=190
14/06/15 22:01:32 INFO mapred.JobClient: Combine input records=9
14/06/15 22:01:32 INFO mapred.JobClient: Reduce input records=5
14/06/15 22:01:32 INFO mapred.JobClient: Reduce input groups=5
14/06/15 22:01:32 INFO mapred.JobClient: Combine output records=5
14/06/15 22:01:32 INFO mapred.JobClient: Reduce output records=5
14/06/15 22:01:32 INFO mapred.JobClient: Map output records=9