MapReduce in Practice: A Traffic Summary Case Study

Original post by 年轻即出发, 2022-11-11 10:51:12. Category: Hadoop. Tags: apache, hadoop, mapreduce.

© Copyright belongs to the author. This is an original work by 年轻即出发 on the 51CTO blog; contact the author for permission before reposting.

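2.1 Requirement 1: Total upstream, downstream, and overall traffic per phone number (serialization)

1) Requirement:

Compute the total upstream traffic, downstream traffic, and overall traffic consumed by each phone number.

2) Data preparation

Each record is one line whose fields are tab-separated in the input file. The second field is the phone number; the third-from-last and second-from-last fields are the upstream and downstream traffic. Some fields in the middle are empty in some records, which is why the traffic fields are indexed from the end of the line.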

1363157985066    13726230503   00-FD-07-A4-72-B8:CMCC   120.196.100.82   i02.c.aliimg.com       24   27   2481   24681   200
1363157995052    13826544101   5C-0E-8B-C7-F1-E0:CMCC   120.197.40.4           4   0   264   0   200
1363157991076    13926435656   20-10-7A-28-CC-0A:CMCC   120.196.100.99           2   4   132   1512   200
1363154400022    13926251106   5C-0E-8B-8B-B1-50:CMCC   120.197.40.4           4   0   240   0   200
1363157993044    18211575961   94-71-AC-CD-E6-18:CMCC-EASY   120.196.100.99   iface.qiyi.com   视频网站   15   12   1527   2106   200
1363157995074    84138413   5C-0E-8B-8C-E8-20:7DaysInn   120.197.40.4   122.72.52.12       20   16   4116   1432   200
1363157993055    13560439658   C4-17-FE-BA-DE-D9:CMCC   120.196.100.99           18   15   1116   954   200
1363157995033    15920133257   5C-0E-8B-C7-BA-20:CMCC   120.197.40.4   sug.so.360.cn   信息安全   20   20   3156   2936   200
1363157983019    13719199419   68-A1-B7-03-07-B1:CMCC-EASY   120.196.100.82           4   0   240   0   200
1363157984041    13660577991   5C-0E-8B-92-5C-20:CMCC-EASY   120.197.40.4   s19.cnzz.com   站点统计   24   9   6960   690   200
1363157973098    15013685858   5C-0E-8B-C7-F7-90:CMCC   120.197.40.4   rank.ie.sogou.com   搜索引擎   28   27   3659   3538   200
1363157986029    15989002119   E8-99-C4-4E-93-E0:CMCC-EASY   120.196.100.99   www.umeng.com   站点统计   3   3   1938   180   200
1363157992093    13560439658   C4-17-FE-BA-DE-D9:CMCC   120.196.100.99           15   9   918   4938   200
1363157986041    13480253104   5C-0E-8B-C7-FC-80:CMCC-EASY   120.197.40.4           3   3   180   180   200
1363157984040    13602846565   5C-0E-8B-8B-B6-00:CMCC   120.197.40.4   2052.flash2-http.qq.com   综合门户   15   12   1938   2910   200
1363157995093    13922314466   00-FD-07-A2-EC-BA:CMCC   120.196.100.82   img.qfc.cn       12   12   3008   3720   200
1363157982040    13502468823   5C-0A-5B-6A-0B-D4:CMCC-EASY   120.196.100.99   y0.ifengimg.com   综合门户   57   102   7335   110349   200
1363157986072    18320173382   84-25-DB-4F-10-1A:CMCC-EASY   120.196.100.99   input.shouji.sogou.com   搜索引擎   21   18   9531   2412   200
1363157990043    13925057413   00-1F-64-E1-E6-9A:CMCC   120.196.100.55   t3.baidu.com   搜索引擎   69   63   11058   48243   200
1363157988072    13760778710   00-FD-07-A4-7B-08:CMCC   120.196.100.82           2   2   120   120   200
1363157985066    13560436666   00-FD-07-A4-72-B8:CMCC   120.196.100.82   i02.c.aliimg.com       24   27   2481   24681   200
1363157993055    13560436666   C4-17-FE-BA-DE-D9:CMCC   120.196.100.99           18   15   1116   954   200

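3) Analysis

Basic approach:

Map phase:

(1) Read one line and split it into fields.

(2) Extract the phone number, upstream traffic, and downstream traffic.

(3) Emit the phone number as the key and a bean object as the value, i.e. context.write(phoneNumber, bean);

Reduce phase:

(1) For each phone number, sum the upstream traffic and the downstream traffic; the total traffic is the sum of the two.

(2) The traffic figures are wrapped in a custom bean that implements Hadoop's serialization interface. In this requirement the bean travels as the map output value; in the sorting requirement (2.3) it travels as the map output key.

(3) MapReduce sorts the data as part of processing: the key-value pairs emitted by map are sorted by the map output key before they reach reduce.

So to apply a custom sort order, put the sort criterion into the key and have the key class implement the WritableComparable interface.

Then override the key's compareTo method.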
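The map and reduce steps above can be tried out without a cluster. The following is a minimal plain-Java sketch, not the Hadoop job itself: FlowSketch, parse, and aggregate are hypothetical names introduced here, and the sample lines are rebuilt from the data above on the assumption that the fields are tab-separated.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the original job: mimics map + reduce in memory.
class FlowSketch {

    // "Map" step: returns {phone, upFlow, downFlow} for one tab-separated line.
    static String[] parse(String line) {
        String[] fields = line.split("\t");
        // Middle fields may be empty or absent, so the traffic columns are indexed from the end.
        return new String[] {fields[1], fields[fields.length - 3], fields[fields.length - 2]};
    }

    // "Reduce" step: sums traffic per phone number.
    // Each map value holds {upFlow, downFlow, sumFlow}.
    static Map<String, long[]> aggregate(List<String> lines) {
        Map<String, long[]> totals = new LinkedHashMap<>();
        for (String line : lines) {
            String[] r = parse(line);
            long[] t = totals.computeIfAbsent(r[0], k -> new long[3]);
            t[0] += Long.parseLong(r[1]);
            t[1] += Long.parseLong(r[2]);
            t[2] = t[0] + t[1];
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "1363157993055\t13560439658\tC4-17-FE-BA-DE-D9:CMCC\t120.196.100.99\t18\t15\t1116\t954\t200",
            "1363157992093\t13560439658\tC4-17-FE-BA-DE-D9:CMCC\t120.196.100.99\t15\t9\t918\t4938\t200");
        long[] t = aggregate(lines).get("13560439658");
        System.out.println("13560439658\t" + t[0] + "\t" + t[1] + "\t" + t[2]);
    }
}
```

The two sample records share one phone number, so the sketch adds 1116 + 918 upstream and 954 + 4938 downstream, which is the same accumulation the reducer performs per key.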

4) Write the MapReduce program

(1) Write the bean object for the traffic statistics

package com.atguigu.mapreduce.flowsum;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// 1 Implement the Writable interface
public class FlowBean implements Writable {

    private long upFlow;
    private long downFlow;
    private long sumFlow;

    // 2 Deserialization calls the no-arg constructor via reflection, so it must exist
    public FlowBean() {
        super();
    }

    public FlowBean(long upFlow, long downFlow) {
        super();
        this.upFlow = upFlow;
        this.downFlow = downFlow;
        this.sumFlow = upFlow + downFlow;
    }

    // 3 Serialization method
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(upFlow);
        out.writeLong(downFlow);
        out.writeLong(sumFlow);
    }

    // 4 Deserialization method
    // 5 Fields must be read in exactly the order in which write() wrote them
    @Override
    public void readFields(DataInput in) throws IOException {
        this.upFlow = in.readLong();
        this.downFlow = in.readLong();
        this.sumFlow = in.readLong();
    }

    // 6 toString defines how the bean is printed in the text output
    @Override
    public String toString() {
        return upFlow + "\t" + downFlow + "\t" + sumFlow;
    }

    public long getUpFlow() {
        return upFlow;
    }

    public void setUpFlow(long upFlow) {
        this.upFlow = upFlow;
    }

    public long getDownFlow() {
        return downFlow;
    }

    public void setDownFlow(long downFlow) {
        this.downFlow = downFlow;
    }

    public long getSumFlow() {
        return sumFlow;
    }

    public void setSumFlow(long sumFlow) {
        this.sumFlow = sumFlow;
    }

}
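To see why the read order has to mirror the write order, the serialization pair can be exercised with the JDK's own DataOutput and DataInput streams, which are the interfaces the two methods take. This is a sketch under assumptions: FlowRecord, toBytes, and fromBytes are hypothetical stand-ins introduced here so that the example runs without Hadoop on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for the bean: same write/readFields bodies, no Hadoop dependency.
class FlowRecord {
    long upFlow;
    long downFlow;
    long sumFlow;

    FlowRecord() { }

    FlowRecord(long upFlow, long downFlow) {
        this.upFlow = upFlow;
        this.downFlow = downFlow;
        this.sumFlow = upFlow + downFlow;
    }

    void write(DataOutput out) throws IOException {
        out.writeLong(upFlow);
        out.writeLong(downFlow);
        out.writeLong(sumFlow);
    }

    // Reads the three longs back in exactly the order write() emitted them.
    void readFields(DataInput in) throws IOException {
        upFlow = in.readLong();
        downFlow = in.readLong();
        sumFlow = in.readLong();
    }

    // Serializes a record into a byte array (three 8-byte longs).
    static byte[] toBytes(FlowRecord r) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            r.write(new DataOutputStream(buffer));
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Creates an empty record first, then fills it from the bytes,
    // which is why a no-arg constructor is needed.
    static FlowRecord fromBytes(byte[] bytes) {
        try {
            FlowRecord r = new FlowRecord();
            r.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
            return r;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The byte stream carries no field names, only three consecutive longs, so swapping two reads in readFields would silently swap the corresponding values rather than raise an error.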

(2) Write the mapper

package com.atguigu.mapreduce.flowsum;

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FlowCountMapper extends Mapper<LongWritable, Text, Text, FlowBean> {

    // Reused across map() calls so that no new objects are allocated per record
    private final Text k = new Text();
    private final FlowBean v = new FlowBean();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        // 1 Get one line
        String line = value.toString();

        // 2 Split the fields
        String[] fields = line.split("\t");

        // 3 Populate the output objects
        // Phone number
        String phoneNum = fields[1];
        // Upstream and downstream traffic, indexed from the end of the line
        long upFlow = Long.parseLong(fields[fields.length - 3]);
        long downFlow = Long.parseLong(fields[fields.length - 2]);

        k.set(phoneNum);
        v.setUpFlow(upFlow);
        v.setDownFlow(downFlow);
        v.setSumFlow(upFlow + downFlow);

        // 4 Write out
        context.write(k, v);
    }
}

(3) Write the reducer

package com.atguigu.mapreduce.flowsum;

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class FlowCountReducer extends Reducer<Text, FlowBean, Text, FlowBean> {

    @Override
    protected void reduce(Text key, Iterable<FlowBean> values, Context context)
            throws IOException, InterruptedException {

        long sumUpFlow = 0;
        long sumDownFlow = 0;

        // 1 Iterate over all beans, accumulating upstream and downstream traffic separately
        for (FlowBean flowBean : values) {
            sumUpFlow += flowBean.getUpFlow();
            sumDownFlow += flowBean.getDownFlow();
        }

        // 2 Build the result; the constructor computes the total traffic
        FlowBean resultBean = new FlowBean(sumUpFlow, sumDownFlow);

        // 3 Write out
        context.write(key, resultBean);
    }
}

(4) Write the driver

package com.atguigu.mapreduce.flowsum;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FlowsumDriver {

    public static void main(String[] args) throws IllegalArgumentException, IOException, ClassNotFoundException, InterruptedException {

        // 1 Get the configuration and create the job instance
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration);

        // 2 Identify the jar that contains this program by naming a class inside it
        job.setJarByClass(FlowsumDriver.class);

        // 3 Set the mapper and reducer classes for this job
        job.setMapperClass(FlowCountMapper.class);
        job.setReducerClass(FlowCountReducer.class);

        // 4 Set the key/value types of the mapper output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(FlowBean.class);

        // 5 Set the key/value types of the final output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);

        // 6 Set the input path (args[0]) and the output path (args[1])
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // 7 Submit the job configuration and the jar with the job's classes to YARN, then wait for completion
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}

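2.2 Requirement 2: Write the results to different files by the province of the phone number (Partitioner)

0) Requirement: write the statistics to different files according to the province each phone number belongs to (partitioning).

1) Data preparation

Same data as above.

2) Analysis

(1) MapReduce assigns every key-value pair emitted by map to a reduce task, so that all pairs with the same key reach the same reduce task. The default rule is hash partitioning: the key's hashCode modulo the number of reduce tasks.

(2) To distribute the data by our own rule, replace the partitioning component, Partitioner:

Define a custom class that extends the abstract class Partitioner (ProvincePartitioner below).

(3) Register the custom partitioner in the job driver: job.setPartitionerClass(ProvincePartitioner.class)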
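For reference, the default hash rule can be written out in a few lines. Hadoop's HashPartitioner clears the sign bit of the hash before taking the modulo, so a negative hashCode cannot produce a negative partition number. The sketch below shows that arithmetic in plain Java; HashPartitionSketch and its method names are hypothetical, and it uses String keys, whose hashCode differs from that of Hadoop's Text, so the partition numbers it prints are illustrative only.

```java
// Hypothetical helper illustrating the default hash-partitioning arithmetic.
class HashPartitionSketch {

    // Clears the sign bit, then takes the remainder, so the result is always in [0, numReduceTasks).
    static int partitionForHash(int hashCode, int numReduceTasks) {
        return (hashCode & Integer.MAX_VALUE) % numReduceTasks;
    }

    static int partitionForKey(Object key, int numReduceTasks) {
        return partitionForHash(key.hashCode(), numReduceTasks);
    }

    public static void main(String[] args) {
        for (String phone : new String[] {"13726230503", "13826544101", "13926435656"}) {
            System.out.println(phone + " -> partition " + partitionForKey(phone, 5));
        }
    }
}
```

Because the result depends only on the hash, phone numbers from the same province end up scattered across reduce tasks, which is why this requirement needs a custom rule.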

3) Building on requirement 1, add a partitioner class

package com.atguigu.mapreduce.flowsum;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class ProvincePartitioner extends Partitioner<Text, FlowBean> {

    @Override
    public int getPartition(Text key, FlowBean value, int numPartitions) {
        // 1 Take the first three digits of the phone number
        String preNum = key.toString().substring(0, 3);

        // Any prefix not listed below goes to partition 4
        int partition = 4;

        // 2 Determine the province from the prefix
        if ("136".equals(preNum)) {
            partition = 0;
        } else if ("137".equals(preNum)) {
            partition = 1;
        } else if ("138".equals(preNum)) {
            partition = 2;
        } else if ("139".equals(preNum)) {
            partition = 3;
        }

        return partition;
    }
}
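The prefix rule itself is easy to check outside Hadoop. What follows is a table-driven variant offered as a sketch, not the class used in the job: PrefixPartitionSketch and its members are hypothetical names, and the guard for keys shorter than three characters is an addition that the original class does not have.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical table-driven variant of the prefix check.
class PrefixPartitionSketch {
    static final int DEFAULT_PARTITION = 4;
    static final Map<String, Integer> PREFIX_TO_PARTITION = new HashMap<>();

    static {
        PREFIX_TO_PARTITION.put("136", 0);
        PREFIX_TO_PARTITION.put("137", 1);
        PREFIX_TO_PARTITION.put("138", 2);
        PREFIX_TO_PARTITION.put("139", 3);
    }

    static int partitionFor(String phone) {
        // Keys shorter than three characters fall into the catch-all partition
        // instead of making substring throw.
        if (phone == null || phone.length() < 3) {
            return DEFAULT_PARTITION;
        }
        return PREFIX_TO_PARTITION.getOrDefault(phone.substring(0, 3), DEFAULT_PARTITION);
    }
}
```

With the sample data, 13726230503 maps to partition 1 and 84138413 to the catch-all partition 4. Since the rule produces five distinct partition numbers (0 through 4), the job has to run with a matching number of reduce tasks.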

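4) In the driver, register the custom partitioner and set the number of reduce tasks

ProvincePartitioner returns partition numbers 0 through 4, so the job is configured with five reduce tasks.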

package com.atguigu.mapreduce.flowsum;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FlowsumDriver {

    public static void main(String[] args) throws IllegalArgumentException, IOException, ClassNotFoundException, InterruptedException {

        // 1 Get the configuration and create the job instance
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration);

        // 2 Identify the jar that contains this program by naming a class inside it
        job.setJarByClass(FlowsumDriver.class);

        // 3 Set the mapper and reducer classes for this job
        job.setMapperClass(FlowCountMapper.class);
        job.setReducerClass(FlowCountReducer.class);

        // 4 Set the key/value types of the mapper output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(FlowBean.class);

        // 5 Set the key/value types of the final output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);

        // 6 Register the custom partitioner
        job.setPartitionerClass(ProvincePartitioner.class);
        // 7 Set a matching number of reduce tasks (partitions 0 to 4)
        job.setNumReduceTasks(5);

        // 8 Set the input path (args[0]) and the output path (args[1])
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // 9 Submit the job configuration and the jar with the job's classes to YARN, then wait for completion
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}

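2.3 Requirement 3: Sort the results by total traffic in descending order (total sort)

0) Requirement

Take the output of requirement 1 and sort it by total traffic.

1) Data preparation

Same data as above.

2) Analysis

(1) Split the work into two jobs: the first computes the per-phone totals as in requirement 1, and the second reads that output and sorts it.

(2) In the second job the mapper emits the bean as the key and the phone number as the value, i.e. context.write(bean, phoneNumber), so that the framework sorts by the bean. With a single reduce task, which is the default, the resulting output file is sorted as a whole.

(3) FlowBean implements the WritableComparable interface and overrides compareTo:

@Override
public int compareTo(FlowBean o) {
    // Descending order, largest total first; equal totals compare as 0
    return Long.compare(o.getSumFlow(), this.sumFlow);
}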
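One way to write a descending comparison that also handles ties is Long.compare with its arguments swapped. The sketch below tries this on a plain Comparable outside Hadoop; SumKey and sortedTotals are hypothetical names introduced for the illustration, and the totals in the usage note are computed from the sample data.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the sort key: only the field that drives the ordering.
class SumKey implements Comparable<SumKey> {
    final long sumFlow;

    SumKey(long sumFlow) {
        this.sumFlow = sumFlow;
    }

    @Override
    public int compareTo(SumKey o) {
        // Arguments are swapped relative to ascending order, so larger totals sort first;
        // equal totals return 0, as the Comparable contract expects.
        return Long.compare(o.sumFlow, this.sumFlow);
    }

    // Returns the given totals in the order this compareTo produces.
    static List<Long> sortedTotals(Long... totals) {
        List<SumKey> keys = new ArrayList<>();
        for (long t : totals) {
            keys.add(new SumKey(t));
        }
        Collections.sort(keys);
        List<Long> out = new ArrayList<>();
        for (SumKey k : keys) {
            out.add(k.sumFlow);
        }
        return out;
    }
}
```

Calling SumKey.sortedTotals(27162L, 117684L, 59301L) yields 117684, 59301, 27162. A comparison that never returns 0 would break the symmetry that Comparable requires: for two equal totals, a.compareTo(b) and b.compareTo(a) would both report "greater".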

3) Implementation

(1) FlowBean: the class from requirement 1 with comparison added

package com.atguigu.mapreduce.sort;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class FlowBean implements WritableComparable<FlowBean> {

    private long upFlow;
    private long downFlow;
    private long sumFlow;

    // Deserialization calls the no-arg constructor via reflection, so it must exist
    public FlowBean() {
        super();
    }

    public FlowBean(long upFlow, long downFlow) {
        super();
        this.upFlow = upFlow;
        this.downFlow = downFlow;
        this.sumFlow = upFlow + downFlow;
    }

    public void set(long upFlow, long downFlow) {
        this.upFlow = upFlow;
        this.downFlow = downFlow;
        this.sumFlow = upFlow + downFlow;
    }

    public long getSumFlow() {
        return sumFlow;
    }

    public void setSumFlow(long sumFlow) {
        this.sumFlow = sumFlow;
    }

    public long getUpFlow() {
        return upFlow;
    }

    public void setUpFlow(long upFlow) {
        this.upFlow = upFlow;
    }

    public long getDownFlow() {
        return downFlow;
    }

    public void setDownFlow(long downFlow) {
        this.downFlow = downFlow;
    }

    /**
     * Serialization method.
     * @param out
     * @throws IOException
     */
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(upFlow);
        out.writeLong(downFlow);
        out.writeLong(sumFlow);
    }

    /**
     * Deserialization method. The read order must match the write order exactly.
     * @param in
     * @throws IOException
     */
    @Override
    public void readFields(DataInput in) throws IOException {
        upFlow = in.readLong();
        downFlow = in.readLong();
        sumFlow = in.readLong();
    }

    @Override
    public String toString() {
        return upFlow + "\t" + downFlow + "\t" + sumFlow;
    }

    @Override
    public int compareTo(FlowBean o) {
        // Descending order, largest total first; equal totals compare as 0
        return Long.compare(o.getSumFlow(), this.sumFlow);
    }
}

(2) Write the mapper

package com.atguigu.mapreduce.sort;

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FlowCountSortMapper extends Mapper<LongWritable, Text, FlowBean, Text> {

    // Reused across map() calls
    FlowBean bean = new FlowBean();
    Text v = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        // 1 Get one line of the requirement 1 output: phone, upFlow, downFlow, sumFlow
        String line = value.toString();

        // 2 Split the fields
        String[] fields = line.split("\t");

        // 3 Populate the output objects
        String phoneNbr = fields[0];
        long upFlow = Long.parseLong(fields[1]);
        long downFlow = Long.parseLong(fields[2]);

        bean.set(upFlow, downFlow);
        v.set(phoneNbr);

        // 4 Write out: the bean is the key, so the framework sorts by it
        context.write(bean, v);
    }
}

(3) Write the reducer

package com.atguigu.mapreduce.sort;

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class FlowCountSortReducer extends Reducer<FlowBean, Text, Text, FlowBean> {

    @Override
    protected void reduce(FlowBean key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {

        // Loop over the values so that phone numbers with the same total traffic are all written out
        for (Text text : values) {
            context.write(text, key);
        }
    }
}
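The reason for looping over the values can be seen in a small in-memory imitation of the sort job. This is a sketch under assumptions, not the Hadoop reducer: SortReduceSketch and emit are hypothetical names, a TreeMap in reverse order stands in for the framework's sort-and-group step, and the input lines use the phone, upFlow, downFlow, sumFlow layout of the requirement 1 output.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory imitation of the sort job's grouping and output loop.
class SortReduceSketch {

    // Groups phone numbers by total traffic (largest total first), then writes one
    // "phone\ttotal" line per phone number in each group.
    static List<String> emit(List<String> jobOneOutput) {
        Map<Long, List<String>> grouped =
            new TreeMap<Long, List<String>>(Collections.<Long>reverseOrder());
        for (String line : jobOneOutput) {
            String[] f = line.split("\t");
            long total = Long.parseLong(f[1]) + Long.parseLong(f[2]);
            grouped.computeIfAbsent(total, t -> new ArrayList<>()).add(f[0]);
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<Long, List<String>> e : grouped.entrySet()) {
            // Writing only the first phone of a group would drop every tied phone number.
            for (String phone : e.getValue()) {
                out.add(phone + "\t" + e.getKey());
            }
        }
        return out;
    }
}
```

In the sample data, 13760778710 (120 up, 120 down) and 13926251106 (240 up, 0 down) both total 240, so they land in the same group and the inner loop emits both.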

(4) Write the driver

package com.atguigu.mapreduce.sort;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FlowCountSortDriver {

    public static void main(String[] args) throws ClassNotFoundException, IOException, InterruptedException {

        // 1 Get the configuration and create the job instance
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration);

        // 2 Identify the jar that contains this program by naming a class inside it
        job.setJarByClass(FlowCountSortDriver.class);

        // 3 Set the mapper and reducer classes for this job
        job.setMapperClass(FlowCountSortMapper.class);
        job.setReducerClass(FlowCountSortReducer.class);

        // 4 Set the key/value types of the mapper output (the bean is the key here)
        job.setMapOutputKeyClass(FlowBean.class);
        job.setMapOutputValueClass(Text.class);

        // 5 Set the key/value types of the final output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);

        // 6 Set the input path (args[0], the requirement 1 output) and the output path (args[1])
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // 7 Submit the job configuration and the jar with the job's classes to YARN, then wait for completion
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}

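2.4 Requirement 4: Sort within each province's output file (partial sort)

1) Requirement

Within the output file for each province, sort the phone numbers by total traffic.

2) Analysis:

Building on requirement 3, add a custom partitioner class and register it in the driver as in requirement 2, with job.setPartitionerClass(ProvincePartitioner.class) and job.setNumReduceTasks(5).

3) Hands-on example

(1) Add the custom partitioner class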

package com.atguigu.mapreduce.sort;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class ProvincePartitioner extends Partitioner<FlowBean, Text> {

    @Override
    public int getPartition(FlowBean key, Text value, int numPartitions) {

        // 1 Take the first three digits of the phone number (the value in this job)
        String preNum = value.toString().substring(0, 3);

        // Any prefix not listed below goes to partition 4
        int partition = 4;

        // 2 Choose the partition from the province prefix
        if ("136".equals(preNum)) {
            partition = 0;
        } else if ("137".equals(preNum)) {
            partition = 1;
        } else if ("138".equals(preNum)) {
            partition = 2;
        } else if ("139".equals(preNum)) {
            partition = 3;
        }

        return partition;
    }
}
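The combined effect of partitioning and sorting can be previewed in memory. This is a rough sketch under assumptions rather than the job itself: PartialSortSketch and its methods are hypothetical names, each inner list stands in for one reduce task's output file, and the input lines use the phone, upFlow, downFlow, sumFlow layout of the requirement 1 output.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory imitation of "partition by prefix, sort within each partition".
class PartialSortSketch {

    static int partitionFor(String phone) {
        String prefix = phone.length() >= 3 ? phone.substring(0, 3) : "";
        switch (prefix) {
            case "136": return 0;
            case "137": return 1;
            case "138": return 2;
            case "139": return 3;
            default: return 4;
        }
    }

    static long total(String line) {
        String[] f = line.split("\t");
        return Long.parseLong(f[1]) + Long.parseLong(f[2]);
    }

    // Returns five lists, one per partition, each ordered by total traffic, largest first.
    static List<List<String>> partitionAndSort(List<String> jobOneOutput) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            partitions.add(new ArrayList<String>());
        }
        for (String line : jobOneOutput) {
            String phone = line.split("\t")[0];
            partitions.get(partitionFor(phone)).add(line);
        }
        for (List<String> p : partitions) {
            // Ordering applies only inside a partition; nothing is compared across partitions.
            p.sort((a, b) -> Long.compare(total(b), total(a)));
        }
        return partitions;
    }
}
```

Each list is sorted on its own, which is what "partial sort" means here: the file for the 137 prefix and the file for the 139 prefix are each in descending order, but no order holds between them.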