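1. Problem description
Given a child-parent table, derive the grandchild-grandparent relationships implied by its parent-child records and output them as a table.
The input file contains: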
child parent
Steven Lucy
Steven Jack
Jone Lucy
Jone Jack
Lucy Mary
Lucy Frank
Jack Alice
Jack Jesse
David Alice
David Jesse
Philip David
Philip Alma
Mark David
Mark Alma
The grandchild-grandparent relationships are derived from the parent-child records. For example:
Steven Jack
Jack Alice
Jack Jesse
From these three records, Jack is Steven's parent, and Alice and Jesse are Jack's parents, so Steven is a grandchild of both Alice and Jesse. The derived result is:
grandson grandparent
Steven Jesse
Steven Alice
The task is to derive all grandchild-grandparent relationships with MapReduce.
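Before turning to MapReduce, it may help to pin down what the expected result is. The following is a minimal in-memory sketch, not the MapReduce solution: it joins the table with itself by brute force, matching the parent of one row against the child of another. The class name NaiveGrandparentJoin and the method join are hypothetical names chosen for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, used only to illustrate the expected result.
class NaiveGrandparentJoin {
    // Each row is {child, parent}. Rows (c, p) and (p, g) together yield (c, g).
    static List<String[]> join(String[][] rows) {
        List<String[]> result = new ArrayList<>();
        for (String[] lower : rows) {
            for (String[] upper : rows) {
                // The parent in the lower row is the child in the upper row.
                if (lower[1].equals(upper[0])) {
                    result.add(new String[]{lower[0], upper[1]});
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"Steven", "Jack"}, {"Jack", "Alice"}, {"Jack", "Jesse"}
        };
        // Prints one "grandchild grandparent" pair per line.
        for (String[] pair : join(rows)) {
            System.out.println(pair[0] + " " + pair[1]);
        }
    }
}
```

For the three sample records this yields the same two pairs as above, although the order in which they appear may differ from the MapReduce output. The nested loop is quadratic in the number of rows, which is one reason to let the shuffle do the grouping instead.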
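2. Analysis
The solution relies on a small trick: a single-table join, that is, joining the table with itself. In the Map phase, each row is emitted as a key-value pair (child, parent) and also, reversed, as a value-key pair (parent, child). Take two rows as an example: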
Steven Jack
Jack Alice
Emitting both the key-value and the value-key form turns them into four rows:
Steven Jack
Jack Alice
Jack Steven
Alice Jack
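After the shuffle, Jack, as the key, acts as the bridge between the two generations: the values grouped under Jack include Alice and Steven, so Alice and Steven must be grandparent and grandchild. To mark which values belong to the grandchild generation and which to the grandparent generation, the Map phase adds a prefix: "-" for the younger generation and "+" for the elder one. In the Reduce phase, the values can then be separated into two groups by prefix.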
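The tagging idea can be tried without a cluster. The sketch below imitates the three phases in plain Java, under the assumption that a sorted map grouping values by key is a fair stand-in for the shuffle. The names TaggedJoinSketch, mapAndShuffle and reduce are hypothetical and are not part of Hadoop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory imitation of the map, shuffle and reduce phases.
class TaggedJoinSketch {
    // "Map" plus "shuffle": emit every row in both directions, grouped by key.
    static Map<String, List<String>> mapAndShuffle(String[][] rows) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String[] row : rows) {
            String child = row[0];
            String parent = row[1];
            // Under the parent's key, the child is the younger side: tag "-".
            groups.computeIfAbsent(parent, k -> new ArrayList<>()).add("-" + child);
            // Under the child's key, the parent is the elder side: tag "+".
            groups.computeIfAbsent(child, k -> new ArrayList<>()).add("+" + parent);
        }
        return groups;
    }

    // "Reduce": split one key's values by prefix, then pair younger with elder.
    static List<String> reduce(List<String> values) {
        List<String> younger = new ArrayList<>();
        List<String> elder = new ArrayList<>();
        for (String v : values) {
            if (v.startsWith("-")) {
                younger.add(v.substring(1));
            } else if (v.startsWith("+")) {
                elder.add(v.substring(1));
            }
        }
        List<String> pairs = new ArrayList<>();
        for (String g : younger) {
            for (String e : elder) {
                pairs.add(g + " " + e);
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        String[][] rows = {{"Steven", "Jack"}, {"Jack", "Alice"}};
        Map<String, List<String>> groups = mapAndShuffle(rows);
        System.out.println(groups.get("Jack")); // prints [-Steven, +Alice]
        for (List<String> values : groups.values()) {
            for (String pair : reduce(values)) {
                System.out.println(pair); // prints Steven Alice
            }
        }
    }
}
```

Only the key Jack has values with both prefixes, so it is the only key that produces a pair. Keys such as Alice or Steven, which hold values from one side only, produce nothing.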
3. MapReduce program
package com.cl.hadoop.relations;
import com.cl.hadoop.FileUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
import java.util.ArrayList;
import java.util.StringTokenizer;
public class RelationShip {
    public static class RsMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // The default input format delivers one line per call, so one tokenizer is enough.
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            if (tokenizer.countTokens() < 2) {
                return; // skip blank or malformed lines
            }
            String son = tokenizer.nextToken();
            String parent = tokenizer.nextToken();
            // Skip the header by content. A static line counter would skip the first
            // line seen by every map JVM, which is a data record for any split
            // other than the first.
            if ("child".equals(son) && "parent".equals(parent)) {
                return;
            }
            // Key = parent: the child is the younger side, tagged "-".
            context.write(new Text(parent), new Text("-" + son));
            // Key = child: the parent is the elder side, tagged "+".
            context.write(new Text(son), new Text("+" + parent));
        }
    }
    public static class RsReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Write the header once per reduce task, i.e. once per output part file.
            context.write(new Text("grandson"), new Text("grandparent"));
        }

        @Override
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            // Hadoop reuses the Text instance while iterating, so copy each value out as a String.
            ArrayList<String> grandChild = new ArrayList<String>();
            ArrayList<String> grandParent = new ArrayList<String>();
            for (Text val : values) {
                String s = val.toString();
                if (s.startsWith("-")) {
                    grandChild.add(s.substring(1));
                } else if (s.startsWith("+")) {
                    grandParent.add(s.substring(1));
                }
            }
            // Each child of this key paired with each parent of this key is a grandchild-grandparent pair.
            for (String child : grandChild) {
                for (String grand : grandParent) {
                    context.write(new Text(child), new Text(grand));
                }
            }
        }
    }
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // FileUtil is a project-local helper (com.cl.hadoop.FileUtil), not part of Hadoop;
        // it presumably clears the previous output directory so the job can be rerun.
        FileUtil.deleteDir("output");
        Configuration conf = new Configuration();
        // Input and output paths are hard-coded.
        String[] otherArgs = new String[]{"input/relations/table.txt", "output"};
        if (otherArgs.length != 2) {
            System.err.println("Invalid arguments");
            System.exit(2);
        }
        // Pass the configuration to the job instead of leaving it unused.
        Job job = Job.getInstance(conf);
        job.setJarByClass(RelationShip.class);
        job.setMapperClass(RsMapper.class);
        job.setReducerClass(RsReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
4. Output
grandson grandparent
Mark Jesse
Mark Alice
Philip Jesse
Philip Alice
Jone Jesse
Jone Alice
Steven Jesse
Steven Alice
Steven Frank
Steven Mary
Jone Frank
Jone Mary