Analyzing Mobile Internet Logs with MapReduce (Sorting)

Reposted

mob604756fda125 2016-12-14 12:54:00


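1. Background

1.1 Workflow

This post implements sorting. Grouping was already covered in the previous post, using a Partitioner.

The steps: implement the interface and let the IDE generate its methods, declare the fields, generate getters and setters, serialize and deserialize the fields, write the comparison method, and override toString. A parameterized constructor would make assignment convenient, but then the map method has to new an object for every record. LongWritable has a set method, and so does Text, so the bean gets a set method as well, plus a default (no-arg) constructor.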

  public void set(String account,double income,double expense) {
    this.account = account;
    this.income = income;
    this.expense = expense;
    //surplus is always derived, so it is not a parameter
    this.surplus = income-expense;
  }

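1.2 Dataset

To build on the previous post, the dataset has been changed, although the file name is the same.

[Image: input dataset]

Below is the output. MapReduce does sort by key automatically, but String (Text) keys are sorted in lexicographic order rather than by numeric value.

[Image: job output]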

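2. Theory

String concatenation: I wrote about this a while ago, so it is worth revisiting: javascript:void(0)archive/2012/10/18/2729112.html

In short: String is final and immutable, so it can be neither modified nor subclassed. Every "change" to a String actually creates a new String object and repoints the reference to it. Strings whose content changes often are therefore better not built with String: each new object has a cost, and once enough unreferenced objects pile up in memory the JVM's garbage collector has to step in, which can slow things down noticeably.

Suppose a for loop runs 10,000 times over string += "hello";. Each iteration takes the content of the object that string currently refers to, concatenates "hello" to it, stores the result in a new String object, and points string at that new object. The decompiled bytecode (as produced by Java 8 and earlier compilers) shows this clearly: every iteration creates a new StringBuilder, calls append, and then calls toString to get a String back. So by the time the loop finishes, 10,000 StringBuilder objects have been created, along with the intermediate Strings. If they are not collected promptly, memory is wasted, and doing this repeatedly can make the system stall. In other words, the compiler (not the JVM at run time) rewrites string += "hello" into roughly:

StringBuilder str = new StringBuilder(string);

str.append("hello");

string = str.toString();

If a single StringBuilder is created before the loop and only append is called inside it, just one StringBuilder is created, which is much more efficient.

StringBuffer, by contrast, is thread-safe: its methods are synchronized, so concurrent threads access the buffer one at a time.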
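The difference between the two approaches can be sketched as follows. This is a minimal illustration, not a benchmark; the class name ConcatDemo and its method names are made up for this example.

```java
final class ConcatDemo {
    // Repeated +=: every iteration produces a brand-new String
    // (Java 8 and earlier compilers also allocate a temporary StringBuilder each time).
    static String concatWithPlus(String piece, int times) {
        String result = "";
        for (int i = 0; i < times; i++) {
            result += piece;
        }
        return result;
    }

    // One StringBuilder, created before the loop and reused for every append.
    static String concatWithBuilder(String piece, int times) {
        StringBuilder sb = new StringBuilder(piece.length() * times);
        for (int i = 0; i < times; i++) {
            sb.append(piece);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String a = concatWithPlus("hello", 10_000);
        String b = concatWithBuilder("hello", 10_000);
        System.out.println(a.equals(b)); // true: same content either way
        System.out.println(b.length());  // 50000
    }
}
```

Both methods return the same text; only the number of temporary objects differs.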

3. Entity Class

Records are sorted by income from high to low; when incomes are equal, by expense from low to high.
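The ordering rule can be tried out in plain Java, without Hadoop. The sketch below is only an illustration: OrderingDemo, Row, and the sample figures are made up for this example and are not part of the job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class OrderingDemo {
    // Hypothetical stand-in for the bean: only the fields the ordering needs.
    static final class Row {
        final String account;
        final double income;
        final double expense;

        Row(String account, double income, double expense) {
            this.account = account;
            this.income = income;
            this.expense = expense;
        }
    }

    // Income from high to low; on equal income, expense from low to high.
    static final Comparator<Row> INCOME_DESC_EXPENSE_ASC =
            Comparator.comparingDouble((Row r) -> r.income).reversed()
                      .thenComparingDouble(r -> r.expense);

    // Returns the account names in sorted order, leaving the input list untouched.
    static List<String> sortedAccounts(List<Row> rows) {
        List<Row> copy = new ArrayList<>(rows);
        copy.sort(INCOME_DESC_EXPENSE_ASC);
        List<String> accounts = new ArrayList<>();
        for (Row r : copy) {
            accounts.add(r.account);
        }
        return accounts;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row("a", 3000, 200),  // made-up sample data
                new Row("b", 3000, 100),
                new Row("c", 5000, 900));
        System.out.println(sortedAccounts(rows)); // [c, b, a]
    }
}
```

Account c comes first because it has the highest income; a and b tie on income, so the lower expense (b) wins.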

package cn.app.hadoop.mr.sort;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

//Writable is Hadoop's serialization interface
//The type parameter is InfoBean itself: like comparing student records (score, gender, ...), everything is wrapped in one bean
//WritableComparable already combines serialization/deserialization with comparison
public class InfoBean implements WritableComparable<InfoBean>{
  
  
  private String account;
  //Money should really be BigDecimal because double loses precision; I was not sure how to serialize it below, so double is used for now (writeUTF would probably work)
  private double income;
  private double expense;
  private double surplus;
  
  
  public String getAccount() {
    return account;
  }
  public void setAccount(String account) {
    this.account = account;
  }
  public double getIncome() {
    return income;
  }
  public void setIncome(double income) {
    this.income = income;
  }
  public double getExpense() {
    return expense;
  }
  public void setExpense(double expense) {
    this.expense = expense;
  }
  public double getSurplus() {
    return surplus;
  }
  public void setSurplus(double surplus) {
    this.surplus = surplus;
  }
  public void readFields(DataInput in) throws IOException {
    //deserialize: fields must be read in exactly the order write() wrote them
    this.account = in.readUTF();
    this.income = in.readDouble();
    this.expense = in.readDouble();
    this.surplus = in.readDouble();
  }
  public void write(DataOutput out) throws IOException {
    //serialize
    out.writeUTF(account);
    out.writeDouble(income);
    out.writeDouble(expense);
    out.writeDouble(surplus);
  }
  
  public void set(String account,double income,double expense) {
    this.account = account;
    this.income = income;
    this.expense = expense;
    this.surplus = income - expense;
  }
  

  //Hadoop creates the bean by reflection when deserializing, so a no-arg constructor is required
  public InfoBean() {
  }
  @Override
  public String toString() {
    return "InfoBean [income=" + income + ", expense=" + expense
        + ", surplus=" + surplus + "]";
  }
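  public int compareTo(InfoBean o) {
    //income from high to low
    int byIncome = Double.compare(o.getIncome(), this.income);
    if (byIncome != 0) {
      return byIncome;
    }
    //same income: expense from low to high
    int byExpense = Double.compare(this.expense, o.getExpense());
    if (byExpense != 0) {
      return byExpense;
    }
    //tie-break on account: compareTo must return 0 for equal keys, and only identical records should be grouped together
    return this.account.compareTo(o.getAccount());
  }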
}
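The comment on the money fields guesses that writeUTF could carry a BigDecimal. A minimal sketch of that idea follows, using only java.io so it runs without Hadoop; MoneyIo and its method names are hypothetical, and this is one possible encoding rather than anything the job above uses. It relies on BigDecimal.toString() and the BigDecimal(String) constructor round-tripping exactly.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.math.BigDecimal;

final class MoneyIo {
    // What a write(DataOutput) method could do for a BigDecimal field.
    static void writeMoney(DataOutput out, BigDecimal value) throws IOException {
        out.writeUTF(value.toString());
    }

    // The matching step for readFields(DataInput).
    static BigDecimal readMoney(DataInput in) throws IOException {
        return new BigDecimal(in.readUTF());
    }

    // Serializes to an in-memory buffer and reads the value back.
    static BigDecimal roundTrip(BigDecimal value) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(buffer)) {
                writeMoney(out, value);
            }
            try (DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(buffer.toByteArray()))) {
                return readMoney(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(0.1 + 0.2); // 0.30000000000000004 (double rounding)
        BigDecimal sum = new BigDecimal("0.10").add(new BigDecimal("0.20"));
        System.out.println(roundTrip(sum)); // 0.30
    }
}
```

The string form is larger on the wire than a fixed 8-byte double, which is the trade-off for exact decimal values.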

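4. First Implementation

4.1 Mapper

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

//For text input the first type parameter (the input key) is usually LongWritable or Object
//Each line of text arrives as a Text
//The output key is the account (phone number), so it is a Text
//The output value is an InfoBean, which must implement Writable
public class InfoSortMapper extends Mapper<LongWritable, Text, Text, InfoBean> {

  
  private InfoBean v = new InfoBean();
  private Text k = new Text();
  
  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String[] fields = line.split("\t");
    String account = fields[0];
    double in = Double.parseDouble(fields[1]);
    double out = Double.parseDouble(fields[2]);
    
    //Reuse k and v instead of calling new for every record; even if the old references were dropped right away, allocating per record would still waste resources
    k.set(account);
    v.set(account, in, out);
    
    context.write(k, v);
  }
}

4.2 Reducer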

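import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class InfoSortReducer extends Reducer<Text, InfoBean, Text, InfoBean> {

  //the output key is just the incoming key, so only the value object is needed
  private InfoBean v = new InfoBean();
  @Override
  public void reduce(Text key, Iterable<InfoBean> value, Context context)
      throws IOException, InterruptedException {
    //sum income and expense over all records of this account
    double incomeSum = 0;
    double expenseSum = 0;
    for (InfoBean o : value) {
      incomeSum += o.getIncome();
      expenseSum += o.getExpense();
    }
    v.set(key.toString(), incomeSum, expenseSum);
    //the output format calls InfoBean.toString() when writing the value
    context.write(key,v);
  }
}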
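What this map and reduce pair computes can be reproduced locally with plain collections, which is handy for checking expectations before running the job. The sketch below is not Hadoop code: LocalSumDemo and the sample lines are made up, and a TreeMap stands in for the framework's sorting of Text keys.

```java
import java.util.Map;
import java.util.TreeMap;

final class LocalSumDemo {
    // Parses tab-separated "account, income, expense" lines and sums per account.
    // Index 0 of each array holds the total income, index 1 the total expense.
    static Map<String, double[]> sumByAccount(String[] lines) {
        Map<String, double[]> totals = new TreeMap<>(); // keys kept in lexicographic order
        for (String line : lines) {
            String[] fields = line.split("\t");
            double[] t = totals.computeIfAbsent(fields[0], k -> new double[2]);
            t[0] += Double.parseDouble(fields[1]);
            t[1] += Double.parseDouble(fields[2]);
        }
        return totals;
    }

    public static void main(String[] args) {
        String[] lines = {          // made-up sample input
            "a\t100\t30",
            "b\t50\t50",
            "a\t20\t10",
        };
        for (Map.Entry<String, double[]> e : sumByAccount(lines).entrySet()) {
            double income = e.getValue()[0];
            double expense = e.getValue()[1];
            System.out.println(e.getKey() + "\t" + income + "\t" + expense
                    + "\t" + (income - expense));
        }
    }
}
```

The result is ordered by account name only, which is why a second job with InfoBean as the key is needed to sort by income and expense.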

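5. Second Implementation

5.1 Mapper

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

//To sort by InfoBean, the bean itself is used as k2
public class SortMapper extends Mapper<LongWritable, Text, InfoBean, NullWritable> {

  
  private InfoBean k = new InfoBean();
  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String[] fields = line.split("\t");
    String account = fields[0];
    double in = Double.parseDouble(fields[1]);
    double out = Double.parseDouble(fields[2]);
    
    //Reuse k instead of calling new for every record; allocating per record would waste resources
    k.set(account, in, out);
    //The value must be NullWritable.get(); writing NullWritable alone does not compile because it is a class name, not a variable
    context.write(k, NullWritable.get());
  }
}

5.2 Reducer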

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

//k2 is the InfoBean, already sorted by the framework; v2 is NullWritable
//Assumption: the class name SortReducer and the body are inferred from the job configuration in section 6 (output key Text, output value InfoBean)
public class SortReducer extends Reducer<InfoBean, NullWritable, Text, InfoBean> {

  private Text k = new Text();
  @Override
  public void reduce(InfoBean bean, Iterable<NullWritable> values, Context context)
      throws IOException, InterruptedException {
    //write one line per value so that records comparing as equal are not collapsed
    for (NullWritable ignored : values) {
      k.set(bean.getAccount());
      context.write(k, bean);
    }
  }
}

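6. Closing Notes

If k2/v2 and k4/v4 differ, i.e. the mapper's output types are not the same as the reducer's output types, the mapper's output types must also be set in Main. The second implementation above is such a case.

job.setMapOutputKeyClass(InfoBean.class);
job.setMapOutputValueClass(NullWritable.class);

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(InfoBean.class);

Otherwise the code still compiles without errors and the job fails at run time; the type-mismatch error only became visible after adding log4j.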

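Author: 火星十一郎

Copyright of this article belongs to the author, 火星十一郎. Reposting and commercial use are welcome, but unless the author agrees otherwise this notice must be kept and a link to the original article must be shown prominently on the page; otherwise the author reserves the right to pursue legal liability.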
