"Data Algorithm" Reading Notes (6): Moving Average

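Before discussing moving averages, we first need to understand time series data.

1. Time series data

Time series data represents the values of a variable over a period of time. Loosely speaking, it can be formalized as a sequence of triples (k, t, v), where k is a key (such as a stock symbol), t is a point in time, and v is the value recorded at that time.
In general, recording the same measurement repeatedly over a period of time yields time series data.
The average of the time series values over several consecutive periods is called a moving average. "Moving" means that as new time series data arrives, the average has to be recomputed continually: the oldest value is dropped and the newest value is added, so the average "moves" accordingly.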
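
The "drop the oldest value, add the newest value" idea can be sketched in a few lines of plain Java. This is only an illustrative sketch and not code from the book or from this post: the class name SimpleMovingAverage and its add() method are made up for the example. Note one assumption: until the window is full, it averages the values seen so far, whereas the MapReduce output later in this post reports 0.0 for the first record of each stock.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Simple moving average over the most recent `window` values (illustrative sketch). */
class SimpleMovingAverage {
    private final int window;                                 // maximum number of values kept
    private final Deque<Double> values = new ArrayDeque<>();  // oldest value at the head
    private double sum = 0.0;                                 // running sum of the values in the window

    SimpleMovingAverage(int window) {
        if (window <= 0) {
            throw new IllegalArgumentException("window must be positive");
        }
        this.window = window;
    }

    /** Adds the newest value, evicts the oldest one if the window overflows, and returns the current average. */
    double add(double value) {
        values.addLast(value);
        sum += value;
        if (values.size() > window) {
            sum -= values.removeFirst();
        }
        return sum / values.size();
    }

    public static void main(String[] args) {
        SimpleMovingAverage sma = new SimpleMovingAverage(2);
        // Three GOOG prices from the test data, in date order.
        for (double price : new double[] {194.87, 191.67, 184.70}) {
            System.out.println(sma.add(price));
        }
    }
}
```

Keeping a running sum makes each update O(1) regardless of the window size, at the cost of possible floating-point drift over very long series.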

2. Requirement

In this example, I use simulated stock data and compute its moving average over a specified window.

3. Test data

3.1.1 Test input 1
GOOG,2004-11-04,184.70
GOOG,2004-11-03,191.67
GOOG,2004-11-02,194.87
GOOG,2013-07-19,896.60
GOOG,2013-07-18,910.68
GOOG,2013-07-17,918.55
3.1.1 Test output 1
GOOG, 2004-11-02, 194.87, 0.0
GOOG, 2004-11-03, 191.67, 193.26999999999998
GOOG, 2004-11-04, 184.7, 188.185
GOOG, 2013-07-17, 918.55, 551.625
GOOG, 2013-07-18, 910.68, 914.615
GOOG, 2013-07-19, 896.6, 903.64
3.1.2 Test input 2
GOOG,2004-11-04,184.70
GOOG,2004-11-03,191.67
GOOG,2004-11-02,194.87
GOOG,2013-07-19,896.60
GOOG,2013-07-18,910.68
GOOG,2013-07-17,918.55
APPL,2013-10-04,483.22
APPL,2013-10-07,485.39
APPL,2013-10-08,484.345
APPL,2013-10-09,483.765
IBM,2013-09-26,189.845
IBM,2013-09-27,188.57
IBM,2013-09-30,186.05
3.1.2 Test output 2
APPL, 2013-10-04, 483.22, 0.0
APPL, 2013-10-07, 485.39, 484.305
APPL, 2013-10-08, 484.345, 484.8675
APPL, 2013-10-09, 483.765, 484.055
GOOG, 2004-11-02, 194.87, 0.0
GOOG, 2004-11-03, 191.67, 193.26999999999998
GOOG, 2004-11-04, 184.7, 188.185
GOOG, 2013-07-17, 918.55, 551.625
GOOG, 2013-07-18, 910.68, 914.615
GOOG, 2013-07-19, 896.6, 903.64
IBM, 2013-09-26, 189.845, 0.0
IBM, 2013-09-27, 188.57, 189.20749999999998
IBM, 2013-09-30, 186.05, 187.31

4. Solving the moving average problem with a plain Java program

Omitted here.
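
Since this section is omitted, here is a minimal sketch of what a plain Java solution could look like. It is not the author's code: the class PlainMovingAverage and its compute() method are hypothetical, and the window is hard-coded to 2. It assumes input lines of the form symbol,date,price as in the test data, sorts them by symbol and then by date, and reports 0.0 for the first record of each symbol, mirroring the format of the test output above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Plain Java sketch of a window-2 moving average per stock symbol (hypothetical helper). */
class PlainMovingAverage {

    /** Parses "symbol,date,price" lines and returns "symbol, date, price, avg" lines in sorted order. */
    static List<String> compute(List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            rows.add(line.trim().split(","));
        }
        // Sort by symbol, then by date; yyyy-MM-dd dates sort correctly as plain strings.
        rows.sort(Comparator.comparing((String[] r) -> r[0]).thenComparing(r -> r[1]));

        List<String> out = new ArrayList<>();
        String priorSymbol = null; // symbol of the previous row
        double priorPrice = 0.0;   // price of the previous row
        for (String[] r : rows) {
            double price = Double.parseDouble(r[2]);
            // The first row of each symbol has no predecessor, so its average is reported as 0.0.
            double avg = r[0].equals(priorSymbol) ? (price + priorPrice) / 2 : 0.0;
            out.add(r[0] + ", " + r[1] + ", " + price + ", " + avg);
            priorSymbol = r[0];
            priorPrice = price;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "IBM,2013-09-30,186.05",
                "IBM,2013-09-26,189.845",
                "IBM,2013-09-27,188.57");
        for (String line : compute(input)) {
            System.out.println(line);
        }
    }
}
```

Everything is sorted in memory here, which is fine for small files; the MapReduce version in the next section leaves the sorting to the framework instead.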

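5. Solving the moving average problem with a MapReduce job

There is too much code to list all of it here; the main techniques are secondary sort and grouping. Below are the two important classes, Stock and MoveAvgReducer.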

5.1 Stock
package data_algorithm.chapter_6;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class Stock implements Writable, WritableComparable<Stock> {
    private String com_name; // company name (stock symbol)
    private String date;     // trading date, e.g. 2013-07-19
    private double price;    // stock price on that date

    private double moveAvg;  // moving average; not serialized by write()/readFields()


    public Stock() {
    }

    public Stock(String com_name, String date, double price) {
        this.com_name = com_name;
        this.date = date;
        this.price = price;
    }

    public String getCom_name() {
        return com_name;
    }

    public void setCom_name(String com_name) {
        this.com_name = com_name;
    }

    public String getDate() {
        return date;
    }

    public void setDate(String date) {
        this.date = date;
    }

    public double getPrice() {
        return price;
    }

    public void setPrice(double price) {
        this.price = price;
    }

    public double getMoveAvg() {
        return moveAvg;
    }

    public void setMoveAvg(double moveAvg) {
        this.moveAvg = moveAvg;
    }

    @Override
    public int compareTo(Stock o) {
        int comp = this.com_name.compareTo(o.com_name);
        if (comp == 0) {
            comp = this.date.compareTo(o.date);
        }
        return comp;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(com_name);
        out.writeUTF(date);
        out.writeDouble(price);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.com_name = in.readUTF();
        this.date = in.readUTF();
        this.price = in.readDouble();
    }

    @Override
    public String toString() {
        return  com_name +
                ", " + date  +
                ", " + price +
                ", " + moveAvg ;
    }
}
5.2 MoveAvgReducer
package data_algorithm.chapter_6;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class MoveAvgReducer extends Reducer<Stock,DoubleWritable,Stock,NullWritable> {
    private int window = 2; // moving-average window size; the formula below averages exactly two prices

    @Override
    protected void reduce(Stock key, Iterable<DoubleWritable> values, Context context) throws IOException, InterruptedException {
        double priorVal = 0;   // price of the previous record
        int count = 0;         // number of records seen so far in this group
        String priorKey = "";  // company name of the previous record
        for (DoubleWritable dw : values) {
            key.setMoveAvg(0); 
            // From the second record of the same company on, average the current and previous prices.
            if (count != 0 && priorKey.equals(key.getCom_name())) {
                key.setMoveAvg((dw.get() + priorVal) / window);
            }
            priorVal = dw.get();
            priorKey = key.getCom_name();
            count++;
            context.write(key, NullWritable.get());
        }
    }
}

The full code is available on my GitHub.

6 Notes

6.1

Attentive readers may have noticed the following line in the Reducer code:

key.setMoveAvg(0); 

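So what is this line for? Is it redundant?

Its purpose is to reset the moveAvg of each Stock object to 0. This looks redundant, but it is not: if the line is commented out, the job produces the following result: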

APPL, 2013-10-04, 483.22, 0.0
APPL, 2013-10-07, 485.39, 484.305
APPL, 2013-10-08, 484.345, 484.8675
APPL, 2013-10-09, 483.765, 484.055
GOOG, 2004-11-02, 194.87, 484.055
GOOG, 2004-11-03, 191.67, 193.26999999999998
GOOG, 2004-11-04, 184.7, 188.185
GOOG, 2013-07-17, 918.55, 551.625
GOOG, 2013-07-18, 910.68, 914.615
GOOG, 2013-07-19, 896.6, 903.64
IBM, 2013-09-26, 189.845, 903.64
IBM, 2013-09-27, 188.57, 189.20749999999998
IBM, 2013-09-30, 186.05, 187.31

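As you can see, for every stock except the first one, the first record of the group has a nonzero moveAvg. This is clearly wrong: the moving-average window is set to 2, and the first record of a stock has no previous price to average with. So values such as GOOG, 2004-11-02, 194.87, 484.055 and IBM, 2013-09-26, 189.845, 903.64 are incorrect. Note that each wrong value equals the moveAvg of the last record of the preceding stock. But why does the value get assigned at all? [I am not sure of the exact reason, which is why I added the seemingly redundant setMoveAvg() call up front.]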
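
One plausible explanation, offered here as an assumption rather than something verified against this job: Hadoop reuses the same key object between records and fills it through readFields(), and Stock.readFields() restores only com_name, date and price, so moveAvg keeps whatever the previous record left in it. The self-contained sketch below simulates that effect without Hadoop. MiniStock and ReusedKeyDemo are hypothetical stand-ins written for this illustration; they are not part of the project.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Simulates deserializing several records into one reused key object (illustrative sketch). */
class ReusedKeyDemo {

    /** Stand-in for Stock: moveAvg is deliberately not part of write()/readFields(). */
    static class MiniStock {
        String name;
        double price;
        double moveAvg;

        void write(DataOutputStream out) throws IOException {
            out.writeUTF(name);
            out.writeDouble(price);
        }

        void readFields(DataInputStream in) throws IOException {
            name = in.readUTF();
            price = in.readDouble();
            // moveAvg is not touched here, so a reused object keeps its old value.
        }
    }

    /** Serializes one record to bytes, the way a key would travel between map and reduce. */
    static byte[] serialize(String name, double price) throws IOException {
        MiniStock s = new MiniStock();
        s.name = name;
        s.price = price;
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        s.write(new DataOutputStream(bytes));
        return bytes.toByteArray();
    }

    /** Reads two records into the same object and returns the moveAvg seen after the second read. */
    static double staleMoveAvgAfterReuse() {
        try {
            MiniStock reused = new MiniStock();
            reused.readFields(new DataInputStream(new ByteArrayInputStream(serialize("APPL", 483.765))));
            reused.moveAvg = 484.055; // what the reducer stored for the last APPL record
            reused.readFields(new DataInputStream(new ByteArrayInputStream(serialize("GOOG", 194.87))));
            return reused.moveAvg;    // still the APPL value, although name and price are now GOOG's
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(staleMoveAvgAfterReuse()); // prints 484.055
    }
}
```

If this is indeed the cause, two fixes are possible: keep the explicit setMoveAvg(0) reset as the reducer does, or have readFields() reset moveAvg itself so that a reused object never carries a stale value.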