Serialization in Hadoop:
Serialization and deserialization are the conversions between structured objects and byte streams; they are used mainly for interprocess communication and for persistent storage.
Hadoop uses RPC for communication between its nodes. The RPC protocol serializes a message into a binary byte stream and sends it to the remote node, which deserializes the stream back into the original message. An RPC serialization format should satisfy four requirements:
1. Compact: a compact format makes the best use of network bandwidth, which is a scarce resource.
2. Fast: interprocess communication is the backbone of a distributed system, so serialization and deserialization must add as little overhead as possible.
3. Extensible: when a newer server adds a parameter for new clients, old clients should continue to work.
4. Interoperable: clients written in different languages should be supported.
Hadoop's own serialization format consists of the classes that implement the Writable interface.
Writable satisfies only the first two requirements, compact and fast; it is not easy to extend and it is not cross-language.
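The "compact" property can be illustrated without Hadoop at all, using the same java.io stream classes that Writable builds on. The standalone comparison below (class name CompactDemo is made up for the demo) contrasts the fixed 4-byte binary encoding of an int with its text encoding:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Illustration only, not Hadoop code: binary vs. text encoding of the same number.
public class CompactDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream binBuf = new ByteArrayOutputStream();
        new DataOutputStream(binBuf).writeInt(1234567);   // fixed 4-byte binary form

        ByteArrayOutputStream txtBuf = new ByteArrayOutputStream();
        new DataOutputStream(txtBuf).writeUTF("1234567"); // 2-byte length prefix + 7 chars

        System.out.println(binBuf.size() + " vs " + txtBuf.size()); // 4 vs 9
    }
}
```

The gap grows with the magnitude of the number, which is why a binary wire format uses less bandwidth than a textual one.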
(Figure: the Writable class hierarchy)
The built-in Writable classes do not always meet our needs; in that case we can write a custom Writable to make a JavaBean serializable.
Next, let's work through a playing-card example.
Problem statement:
Suppose 3 face cards (J, Q, K, i.e. numbers 11-13) have been removed from a deck. How can we use MapReduce to find which suits are missing cards?
Input file (red = hearts, rect = diamonds, black = spades, flower = clubs):
red-1
red-2
red-3
red-4
red-5
red-6
red-7
red-8
red-9
red-10
red-11
red-13
rect-1
rect-2
rect-3
rect-4
rect-5
rect-6
rect-7
rect-8
rect-9
rect-10
rect-11
rect-12
black-1
black-2
black-3
black-4
black-5
black-6
black-7
black-8
black-9
black-10
black-11
black-12
black-13
flower-1
flower-2
flower-3
flower-4
flower-5
flower-6
flower-7
flower-8
flower-9
flower-10
flower-12
flower-13
The JavaBean (CardBean):
package SerializableTest;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class CardBean implements Writable {

    private String kind;
    private int number;

    public String getKind() {
        return kind;
    }

    public void setKind(String kind) {
        this.kind = kind;
    }

    public int getNumber() {
        return number;
    }

    public void setNumber(int number) {
        this.number = number;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(kind);
        out.writeInt(number);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // fields must be read in exactly the order they were written
        kind = in.readUTF();
        number = in.readInt();
    }
}
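The write/readFields pair can be sanity-checked with a plain in-memory round trip. The sketch below mirrors CardBean's field order using only java.io streams, so it runs without Hadoop on the classpath (the class name CardRoundTrip is made up for the demo):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Round trip of the same field layout CardBean uses: writeUTF(kind), writeInt(number).
public class CardRoundTrip {
    public static void main(String[] args) throws IOException {
        // serialize, in the same order as CardBean.write()
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeUTF("red");
        out.writeInt(12);

        // deserialize, in the same order as CardBean.readFields()
        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(buf.toByteArray()));
        String kind = in.readUTF();
        int number = in.readInt();
        System.out.println(kind + "-" + number); // red-12
    }
}
```

If the read order ever drifts from the write order, readUTF/readInt will misinterpret the bytes, which is exactly the mistake the comment in readFields warns against.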
Because CardBean is used only as a value, implementing Writable is enough; if it were used as a key it would also need WritableComparable, since keys must be sortable during the shuffle.
Mapper:
package SerializableTest;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PokerMapper extends Mapper<LongWritable, Text, Text, CardBean> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] strs = line.split("-");
        if (strs.length == 2) {
            CardBean cardBean = new CardBean();
            cardBean.setKind(strs[0]);
            cardBean.setNumber(Integer.valueOf(strs[1]));
            if (cardBean.getNumber() > 10) {
                // only face cards (number > 10) need to reach the reducer
                context.write(new Text(strs[0]), cardBean);
            }
        }
    }
}
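The parsing and filtering logic of map() can be dry-run outside Hadoop. The sketch below (class name MapLogicDemo is hypothetical) applies the same split-and-filter steps to a few sample input lines:

```java
import java.util.Arrays;
import java.util.List;

// Dry run of the mapper's core logic: split on "-", keep face cards only.
public class MapLogicDemo {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList("red-2", "red-11", "flower-13");
        for (String line : lines) {
            String[] strs = line.split("-");
            // same filter as PokerMapper: well-formed line, number above 10
            if (strs.length == 2 && Integer.valueOf(strs[1]) > 10) {
                System.out.println(strs[0] + " -> " + strs[1]);
            }
        }
    }
}
```

Only red-11 and flower-13 survive the filter; red-2 is dropped before the shuffle, which keeps the reducer's input small.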
Reducer:
package SerializableTest;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class PokerReduce extends Reducer<Text, CardBean, Text, LongWritable> {

    @Override
    protected void reduce(Text key, Iterable<CardBean> iter, Context context)
            throws IOException, InterruptedException {
        // count the face cards present for this suit; iterate the Iterable
        // once with for-each rather than calling iter.iterator() repeatedly
        int count = 0;
        for (CardBean card : iter) {
            count++;
        }
        // a complete suit has 3 face cards; fewer means this suit is missing some
        if (count < 3) {
            context.write(key, new LongWritable(count));
        }
    }
}
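A note on the loop: writing `while (iter.iterator().hasNext())` happens to work inside Hadoop only because Hadoop's value Iterable typically hands back the same iterator instance on every call; on a standard java.util Iterable it would spin forever, since each iterator() call starts over. The for-each form consumes a single iterator and is correct in both cases, as this standalone sketch shows (a plain list stands in for the reducer's values):

```java
import java.util.Arrays;

// Counting values the way the fixed reducer does, with a for-each loop.
public class ReduceLogicDemo {
    public static void main(String[] args) {
        // hearts after removing the queen: only J (11) and K (13) remain
        Iterable<Integer> faceCards = Arrays.asList(11, 13);
        int count = 0;
        for (int card : faceCards) {
            count++;
        }
        // fewer than 3 face cards means this suit is missing one
        if (count < 3) {
            System.out.println("red\t" + count);
        }
    }
}
```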
Runner:
package SerializableTest;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PokerRunner {

    public static void main(String[] args) {
        if (args.length != 2) {
            System.out.println("Usage: PokerRunner <input path> <output path>");
            System.exit(-1);
        }
        Configuration conf = new Configuration();
        try {
            // Job.getInstance() creates a new job bound to this configuration
            Job job = Job.getInstance(conf);

            // the jar, mapper, and reducer classes for this job
            job.setJarByClass(PokerRunner.class);
            job.setMapperClass(PokerMapper.class);
            job.setReducerClass(PokerReduce.class);

            // key/value types emitted by the mapper
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(CardBean.class);

            // key/value types emitted by the reducer
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);

            FileInputFormat.setInputPaths(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            job.waitForCompletion(true);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
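To submit the job, the three classes can be packaged into a jar and launched with the hadoop command. A rough sketch (the jar name poker.jar and the HDFS paths are placeholders, not from the original):

```shell
# submit the job: jar, main class, input path, output path
hadoop jar poker.jar SerializableTest.PokerRunner /poker/input /poker/output

# inspect the reducer output
hadoop fs -cat /poker/output/part-r-00000
```

The output directory must not exist beforehand, or FileOutputFormat will fail the job at submission time.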
Output file:
red 2
rect 2
flower 2