Total upstream, downstream, and combined traffic per phone number (serialization)
Goal: compute the total upstream traffic and total downstream traffic consumed by each phone number.
Data preparation
Input format:
One record per line: timestamp, phone number, base-station MAC address, IP of the visited site, website domain, packets sent, packets received, upstream traffic, downstream traffic, HTTP response code
Output format:
13560436666 1116 954 2070 (phone number, upstream traffic, downstream traffic, total traffic)
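For illustration only, a made-up tab-separated input line matching this layout (every value invented for this example) might look like:
1363157985066  13726230503  00-FD-07-A4-72-B8:CMCC  120.196.100.82  www.example.com  24  27  2481  24681  200
Note that the phone number sits at field index 1 and the two traffic fields sit third- and second-from-last; the Mapper in section 2 indexes them exactly that way.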
Analysis - basic approach:
Map stage:
(1) Read one line of input and split it into fields.
(2) Extract the phone number, the upstream traffic, and the downstream traffic.
(3) Emit the phone number as the key and a bean object as the value, i.e. context.write(phone, bean);
Reduce stage:
(1) Sum the upstream and downstream traffic to obtain the total traffic.
(2) Implement a custom bean that encapsulates the traffic fields; when sorting is required, the bean itself is transferred as the map output key.
(3) MapReduce sorts records while processing them (map output key/value pairs are sorted before they reach the reducer), and the sort is based on the map output key.
Therefore, to impose our own sort order, we can put the sort fields into the key and have the key class implement the WritableComparable interface (plain Writable is not enough, because sorting requires comparison); a sketch follows.
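The sorting variant is not needed in this section (the bean travels as a value, so plain Writable suffices), but for reference, here is a minimal hypothetical sketch of a comparable bean. The class name LiuLiangSortBean and the descending order by total traffic are assumptions for illustration:
package liu.liang;
import org.apache.hadoop.io.WritableComparable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
//Hypothetical: the bean becomes the map output key and gains a compareTo()
public class LiuLiangSortBean implements WritableComparable<LiuLiangSortBean> {
    private long upflow;
    private long downflow;
    private long sumflow;
    //No-arg constructor required by Hadoop's reflection-based deserialization
    public LiuLiangSortBean() {
    }
    public LiuLiangSortBean(long upflow, long downflow) {
        this.upflow = upflow;
        this.downflow = downflow;
        this.sumflow = upflow + downflow;
    }
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(upflow);
        out.writeLong(downflow);
        out.writeLong(sumflow);
    }
    @Override
    public void readFields(DataInput in) throws IOException {
        upflow = in.readLong();
        downflow = in.readLong();
        sumflow = in.readLong();
    }
    @Override
    public int compareTo(LiuLiangSortBean o) {
        //Assumption: sort descending by total traffic
        return Long.compare(o.sumflow, this.sumflow);
    }
    @Override
    public String toString() {
        return upflow + "\t" + downflow + "\t" + sumflow;
    }
}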
1. The LiuLiangBean wrapper class - implements the Writable interface (serialization)
package liu.liang;
import lombok.Getter;
import lombok.NoArgsConstructor;
import lombok.Setter;
import org.apache.hadoop.io.Writable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
/**
 * LiuLiangBean wrapper class, implementing the Writable interface (serialization).
 * For a custom data type to be transferred around a Hadoop cluster, it must
 * implement Hadoop's own serialization framework.
 */
@Getter
@Setter
@NoArgsConstructor
public class LiuLiangBean implements Writable {
//Upstream traffic
private long upflow;
//Downstream traffic
private long downflow;
//Total traffic
private long sumflow;
public LiuLiangBean(long upflow, long downflow) {
this.upflow = upflow;
this.downflow = downflow;
this.sumflow = upflow+downflow;
}
/** Serialization: write the fields to be transferred into the byte stream.
 * @param dataOutput
 * @throws IOException
 */
@Override
public void write(DataOutput dataOutput) throws IOException {
dataOutput.writeLong(upflow);
dataOutput.writeLong(downflow);
dataOutput.writeLong(sumflow);
}
/** Deserialization: restore each field from the byte stream, in exactly the
 * order it was written. Reflection needs a no-arg constructor (supplied here
 * by Lombok's @NoArgsConstructor).
 * @param dataInput
 * @throws IOException
 */
@Override
public void readFields(DataInput dataInput) throws IOException {
this.upflow = dataInput.readLong();
this.downflow = dataInput.readLong();
this.sumflow = dataInput.readLong();
}
@Override
public String toString() {
return upflow + "\t" + downflow + "\t" + sumflow;
}
}
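A quick local sanity check of the write()/readFields() pair is a round trip through plain Java streams. The LiuLiangBeanTest class below is a hypothetical helper, not part of the original program:
package liu.liang;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
public class LiuLiangBeanTest {
    public static void main(String[] args) throws IOException {
        LiuLiangBean original = new LiuLiangBean(1116, 954);
        //Serialize the bean into an in-memory byte stream
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));
        //Deserialize into a fresh instance (relies on the no-arg constructor)
        LiuLiangBean copy = new LiuLiangBean();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        //Expected output: 1116	954	2070
        System.out.println(copy);
    }
}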
2. The field-splitting class - extends Mapper
package liu.liang;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
/**
 * Field-splitting class: extends Mapper. Parses one input line at a time and
 * emits (phone number, LiuLiangBean) pairs.
 */
public class LiuLiangMapper extends Mapper<LongWritable, Text, Text, LiuLiangBean> {
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
String[] fields = line.split("\t");
//Convert types: String -> long; index from the end of the record so the
//parse still works if the number of middle fields varies
long upflow = Long.parseLong(fields[fields.length - 3]);
long downflow = Long.parseLong(fields[fields.length - 2]);
//Phone number as the key; upstream and downstream traffic as the value
context.write(new Text(fields[1]), new LiuLiangBean(upflow, downflow));
}
}
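One design note: context.write() serializes the key and value immediately, so it is safe (and common in MapReduce code) to reuse a single output key and value object across map() calls instead of allocating new ones per record. A sketch of that variant, assuming the Lombok-generated setters from section 1 (the class name LiuLiangMapperReuse is invented here):
package liu.liang;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class LiuLiangMapperReuse extends Mapper<LongWritable, Text, Text, LiuLiangBean> {
    //Reused across calls; safe because context.write() copies the serialized bytes
    private final Text phone = new Text();
    private final LiuLiangBean bean = new LiuLiangBean();
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        phone.set(fields[1]);
        long upflow = Long.parseLong(fields[fields.length - 3]);
        long downflow = Long.parseLong(fields[fields.length - 2]);
        bean.setUpflow(upflow);
        bean.setDownflow(downflow);
        bean.setSumflow(upflow + downflow);
        context.write(phone, bean);
    }
}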
3. The aggregation class - totals the upstream and downstream traffic
package liu.liang;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
/**
 * Aggregation class: totals the upstream and downstream traffic per phone number.
 */
public class LiuLiangReducer extends Reducer<Text, LiuLiangBean, Text, LiuLiangBean> {
@Override
protected void reduce(Text key, Iterable<LiuLiangBean> values, Context context) throws IOException, InterruptedException {
long sumUpFlow = 0;
long sumDownFlow = 0;
for (LiuLiangBean value : values) {
    sumUpFlow += value.getUpflow();
    sumDownFlow += value.getDownflow();
}
context.write(key, new LiuLiangBean(sumUpFlow, sumDownFlow));
}
}
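Because the aggregation is just addition (associative and commutative) and the reducer's input and output types are both (Text, LiuLiangBean), this same class can optionally serve as a combiner to pre-aggregate map output locally and shrink shuffle traffic. If desired, one extra line in the driver of section 4 enables it:
//Optional: pre-aggregate on the map side with the same summing logic
job.setCombinerClass(LiuLiangReducer.class);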
4. The driver class
package liu.liang;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
/**
 * Driver class: configures the job and submits it.
 */
public class LiuLiangDriver {
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
long startTime = System.currentTimeMillis();
//Hard-coded local paths for testing; these override any command-line arguments
args = new String[]{"D:/phone_data.txt", "D:/HDFS/p_d"};
//1. Get the configuration and create the job
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
//2. Register the driver, mapper, and reducer classes
job.setJarByClass(LiuLiangDriver.class);
job.setMapperClass(LiuLiangMapper.class);
job.setReducerClass(LiuLiangReducer.class);
//3. Key/value types of the map output and of the final (reduce) output
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LiuLiangBean.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LiuLiangBean.class);
//4. Input and output paths
FileInputFormat.setInputPaths(job,new Path(args[0]));
FileOutputFormat.setOutputPath(job,new Path(args[1]));
//5. Submit the job and wait for completion
job.waitForCompletion(true);
long endTime = System.currentTimeMillis();
System.out.println("程序运行的时间为:"+(endTime-startTime));
}
}
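For anything beyond local testing, a common hardening of main() is to take the paths from the command line and propagate the job result as the process exit code. A minimal sketch of the changed lines (replacing the hard-coded args and the bare waitForCompletion call):
//Validate command-line arguments instead of hard-coding local paths
if (args.length < 2) {
    System.err.println("Usage: LiuLiangDriver <input path> <output path>");
    System.exit(2);
}
//...job setup as above...
boolean success = job.waitForCompletion(true);
System.exit(success ? 0 : 1);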
5. Summary
Note:
In map(LongWritable key, Text value, Context context), the value parameter carries a single line of input: map() is invoked once per line.
In reduce(Text key, Iterable<LiuLiangBean> values, Context context), the values parameter carries all of the values that share one key: reduce() is invoked once per key.
In other words, each key arriving at reduce() is unique, which is why a for loop is always needed to walk through its values. An easy way to keep the two apart is the parameter name: singular value in map() versus plural values in reduce().