Requirements
There are three files, each containing some words. Count how many times each word appears in each of the files.
Data Input
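The original post shows the input as a screenshot. Below is a hypothetical reconstruction of the three input files (assumed contents, chosen so the counts match the expected output that follows):

a.txt:
atguigu pingping
atguigu ss
atguigu ss

b.txt:
atguigu pingping
atguigu pingping
pingping ss

c.txt:
atguigu ss
atguigu pingping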
Expected Output
For example: atguigu c.txt-->2 b.txt-->2 a.txt-->3
Analysis
If a requirement cannot be met by a single MR Job, split it into several Jobs and run them one after another according to their dependencies!
Job1
Mapper: by default, one MapTask processes the data of a single split, and under the default split strategy a split belongs to exactly one file.
- keyin-valuein: atguigu pingping
- keyout-valueout: atguigu-a.txt,1
Reducer:
- keyin-valuein: atguigu-a.txt,1 (the Mapper's output becomes the Reducer's input)
- keyout-valueout:
  atguigu-a.txt,3
  pingping-a.txt,2
  atguigu-b.txt,3
  pingping-b.txt,2
Job2
Mapper: by default, one MapTask processes the data of a single split, and under the default split strategy a split belongs to exactly one file.
- keyin-valuein: pingping,a.txt-2 (the previous Job's Reducer output becomes this Job's Mapper input)
- keyout-valueout: pingping,a.txt-2 (written out unchanged)
Reducer:
- keyin-valuein:
  pingping,a.txt-2
  pingping,b.txt-2
- keyout-valueout: pingping,a.txt-2 b.txt-2 (simply concatenate all values under the same key)
Code Implementation
Mapper1.java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/*
 * 1. Input
 *    atguigu pingping
 * 2. Output
 *    atguigu-a.txt,1
 */
public class Example1Mapper1 extends Mapper<LongWritable, Text, Text, IntWritable> {

	private String filename;
	private Text out_key = new Text();
	private IntWritable out_value = new IntWritable(1);

	@Override
	protected void setup(Context context) throws IOException, InterruptedException {
		// setup() runs once per split; since a split never spans files,
		// the file name can be resolved here instead of once per record
		InputSplit inputSplit = context.getInputSplit();
		FileSplit split = (FileSplit) inputSplit;
		filename = split.getPath().getName();
	}

	@Override
	protected void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {
		String[] words = value.toString().split(" ");
		for (String word : words) {
			// emit word-filename as the key so Job1 counts per (word, file) pair
			out_key.set(word + "-" + filename);
			context.write(out_key, out_value);
		}
	}
}
Reducer1.java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/*
 * 1. Input
 *    atguigu-a.txt,1
 *    atguigu-a.txt,1
 *    atguigu-a.txt,1
 * 2. Output
 *    atguigu-a.txt,3
 */
public class Example1Reducer1 extends Reducer<Text, IntWritable, Text, IntWritable> {

	private IntWritable out_value = new IntWritable();

	@Override
	protected void reduce(Text key, Iterable<IntWritable> values, Context context)
			throws IOException, InterruptedException {
		// sum all the 1s emitted for this word-filename key
		int sum = 0;
		for (IntWritable value : values) {
			sum += value.get();
		}
		out_value.set(sum);
		context.write(key, out_value);
	}
}
Mapper2.java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/*
 * 1. Input
 *    atguigu-a.txt\t3
 *    atguigu-b.txt\t3
 *    With KeyValueTextInputFormat, a separator can be configured; everything
 *    before the first separator becomes the key, everything after it the value.
 * 2. Output
 *    atguigu,a.txt\t3
 *    atguigu,b.txt\t3
 */
public class Example1Mapper2 extends Mapper<Text, Text, Text, Text> {
	// No need to override map(): the inherited map() casts the input
	// key-value to the output types and writes it out unchanged
}
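For reference, the identity behavior relied on here comes from Hadoop's base Mapper class; its default map() (shown below, essentially as in the Hadoop 2.x source, modulo formatting) forwards each input key-value pair after casting it to the output types. Combined with the separator "-" configured in the Driver, a line like atguigu-a.txt\t3 is split at the first "-" into key atguigu and value a.txt\t3 before reaching this mapper.

// Default map() inherited from org.apache.hadoop.mapreduce.Mapper
protected void map(KEYIN key, VALUEIN value, Context context)
		throws IOException, InterruptedException {
	context.write((KEYOUT) key, (VALUEOUT) value);
}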
Reducer2.java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/*
 * 1. Input
 *    atguigu,a.txt\t3
 *    atguigu,b.txt\t3
 * 2. Output
 *    atguigu,a.txt\t3 b.txt\t3
 */
public class Example1Reducer2 extends Reducer<Text, Text, Text, Text> {

	private Text out_value = new Text();

	@Override
	protected void reduce(Text key, Iterable<Text> values, Context context)
			throws IOException, InterruptedException {
		StringBuilder sb = new StringBuilder();
		// concatenate all values (filename\tcount pairs) for the same word
		for (Text value : values) {
			sb.append(value.toString()).append(" ");
		}
		out_value.set(sb.toString());
		context.write(key, out_value);
	}
}
Driver.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/*
 * 1. Example1Driver submits two Jobs.
 *    Job2 depends on Job1: it may only run after Job1 has finished and produced its output!
 *
 * 2. JobControl: defines a group of MR jobs, along with their dependencies.
 *    Jobs are added to a JobControl via addJob(ControlledJob aJob).
 *
 * 3. ControlledJob: a Job wrapper that can carry dependencies.
 *    addDependingJob(ControlledJob dependingJob): adds a Job that the current Job depends on.
 *    public ControlledJob(Configuration conf): builds a ControlledJob from a Configuration.
 */
public class Example1Driver {

	public static void main(String[] args) throws Exception {
		// define the paths
		Path inputPath = new Path("e:/mrinput/index");
		Path outputPath = new Path("e:/mroutput/index");
		Path finalOutputPath = new Path("e:/mroutput/finalindex");

		// one Configuration per Job
		Configuration conf1 = new Configuration();
		Configuration conf2 = new Configuration();
		// Job2 reads Job1's output; split each line at the first "-" into key and value
		conf2.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", "-");

		// make sure the output directories do not exist yet
		FileSystem fs = FileSystem.get(conf1);
		if (fs.exists(outputPath)) {
			fs.delete(outputPath, true);
		}
		if (fs.exists(finalOutputPath)) {
			fs.delete(finalOutputPath, true);
		}

		// ① create the Jobs
		Job job1 = Job.getInstance(conf1);
		Job job2 = Job.getInstance(conf2);

		// set the Job names
		job1.setJobName("index1");
		job2.setJobName("index2");

		// ② configure Job1
		job1.setMapperClass(Example1Mapper1.class);
		job1.setReducerClass(Example1Reducer1.class);
		job1.setOutputKeyClass(Text.class);
		job1.setOutputValueClass(IntWritable.class);

		// set Job1's input and output directories
		FileInputFormat.setInputPaths(job1, inputPath);
		FileOutputFormat.setOutputPath(job1, outputPath);

		// ③ configure Job2
		job2.setMapperClass(Example1Mapper2.class);
		job2.setReducerClass(Example1Reducer2.class);
		job2.setOutputKeyClass(Text.class);
		job2.setOutputValueClass(Text.class);

		// Job2's input is Job1's output directory
		FileInputFormat.setInputPaths(job2, outputPath);
		FileOutputFormat.setOutputPath(job2, finalOutputPath);

		// set Job2's input format
		job2.setInputFormatClass(KeyValueTextInputFormat.class);

		//--------------------------------------------------------
		// build the JobControl
		JobControl jobControl = new JobControl("index");

		// wrap the Jobs as ControlledJobs
		ControlledJob controlledJob1 = new ControlledJob(job1.getConfiguration());
		ControlledJob controlledJob2 = new ControlledJob(job2.getConfiguration());

		// declare the dependency: Job2 runs only after Job1 has completed
		controlledJob2.addDependingJob(controlledJob1);

		// register the jobs with the JobControl
		jobControl.addJob(controlledJob1);
		jobControl.addJob(controlledJob2);

		// JobControl implements Runnable, so run it in its own thread
		Thread jobControlThread = new Thread(jobControl);
		// mark the thread as a daemon so it cannot keep the JVM alive
		jobControlThread.setDaemon(true);
		jobControlThread.start();

		// poll until the whole group of jobs has finished
		while (true) {
			if (jobControl.allFinished()) {
				System.out.println(jobControl.getSuccessfulJobList());
				return;
			}
			Thread.sleep(500); // avoid busy-waiting
		}
	}
}
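Note that allFinished() also becomes true when a job fails, so the loop above exits either way. A possible refinement of the polling loop, using JobControl's getFailedJobList(), would report failures as well:

// possible refinement of the polling loop: also report failed jobs
if (jobControl.allFinished()) {
	System.out.println("Succeeded: " + jobControl.getSuccessfulJobList());
	if (!jobControl.getFailedJobList().isEmpty()) {
		System.out.println("Failed: " + jobControl.getFailedJobList());
	}
	return;
}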
Output
Job1's output:
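The original screenshot is not reproduced. Assuming the hypothetical input files sketched above, Job1's output (word-filename key, then a tab, then the count) would look like:

atguigu-a.txt	3
atguigu-b.txt	2
atguigu-c.txt	2
pingping-a.txt	1
pingping-b.txt	3
pingping-c.txt	1
ss-a.txt	2
ss-b.txt	1
ss-c.txt	1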
Final output:
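Again assuming the hypothetical input, the final output would look roughly like the following (the word, then the concatenated filename/count pairs; the order of files within a line is not guaranteed, since no secondary sort is configured):

atguigu	a.txt	3 b.txt	2 c.txt	2
pingping	a.txt	1 b.txt	3 c.txt	1
ss	a.txt	2 b.txt	1 c.txt	1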