I. Hadoop Streaming and Python


Compared with the Java-based MapReduce programming framework introduced earlier, Hadoop Streaming is an alternative way to write MapReduce programs. It lets map tasks and reduce tasks read and write their data through standard input and output, one line at a time. Any program that can read and write data via standard input/output can be used with Hadoop Streaming, so you can write jobs in dynamic scripting languages such as Python or Ruby. Compared with Java, the advantage of this approach is that you can try out ideas quickly; the disadvantages are runtime performance and the lack of compile-time type checking. So in the early analysis and modeling phase we can use Streaming + Python to improve development efficiency, and in the later production system we can switch to Java to guarantee performance.
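
To make the contract concrete, here is a minimal, purely illustrative streaming program (a hypothetical identity.sh, not part of the original examples). Anything that behaves like this, in any language, can serve as a mapper or reducer:

#!/bin/sh
# identity.sh: read lines from stdin and write them unchanged to stdout.
# This is the entire interface a Hadoop Streaming mapper or reducer must satisfy.
while read line; do
    echo "$line"
done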


Below is a Streaming + Python implementation of WordCount.


1. mapper.py


#!/usr/bin/python
# -*- coding: utf-8 -*- 
"""a python script for hadoop streaming map """

import sys

def map(input):
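    # for every word in the input, emit a "<word>\t1" line;
    # the Hadoop shuffle (or a local sort) groups identical words together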
    for line in input:
        line = line.strip()
        words = line.split()
        for word in words:
            print '%s\t%s' % (word, 1)

def main():
    map(sys.stdin)

if __name__ == "__main__":
    main()

2. reducer.py


#!/usr/bin/python
# -*- coding: utf-8 -*- 
"""a python script for hadoop streaming map """

import sys

def reduce(input):
    current_word = None
    current_count = 0
    word = None
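    # the input is sorted by key (by Hadoop's shuffle, or by a local sort),
    # so all counts for the same word arrive on consecutive lines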
    for line in input:
        line = line.strip()
        
        word, count = line.split('\t', 1)
        try:
            count = int(count)
        except ValueError:
            continue

        if current_word == word:
            current_count += count
        else:
            if current_word:
                print '%s\t%s' %(current_word, current_count)
            current_count = count
            current_word = word

    if current_word == word:
        print '%s\t%s' % (current_word, current_count)

def main():
    reduce(sys.stdin)


if __name__ == "__main__":
    main()

3. exec_streaming.sh

#!/bin/sh
hadoop dfs -rmr streaming_out
hadoop jar /home/hadoop/cloud/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar -file mapper.py -mapper mapper.py -file reducer.py -reducer reducer.py -input test2.txt -output streaming_out
hadoop dfs -cat streaming_out/part-00000

This script contains three commands.

The first command deletes the streaming_out directory so the job does not fail because the output directory already exists.

The second command runs the job. The path /home/hadoop/cloud/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar must be adjusted to wherever Hadoop is installed on your machine; -file mapper.py tells Hadoop to ship mapper.py to the cluster, and -mapper mapper.py tells it to use mapper.py as the map program; -file reducer.py and -reducer reducer.py do the same for the reduce program; -input test2.txt specifies the input file (test2.txt was uploaded to HDFS earlier), and -output streaming_out specifies streaming_out as the output directory.

The third command prints the result.
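
Before going to the cluster, the same WordCount logic can be checked with an ordinary shell pipeline, where sort plays the role of Hadoop's shuffle phase (a quick local sanity check, assuming mapper.py and reducer.py are executable in the current directory):

echo "hello world hello" | ./mapper.py | sort | ./reducer.py
# expected output:
# hello	2
# world	1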

II. Programming Example: Patent Citation Data Analysis

This is the example from Chapter 4 of Hadoop in Action. I implement map and reduce with Python first, and then with Java.

1. Format of cite75_99.txt

The file is comma-separated, with a header row and two columns: the first column is the citing patent number and the second is the cited patent number. The first data row, 3858241,956203, means that patent 3858241 cites patent 956203. The whole file has more than 16 million rows.

"CITING","CITED"
3858241,956203
3858241,1324234
3858241,3398406
3858241,3557384
3858241,3634889
3858242,1515701
3858242,3319261
3858242,3668705
3858242,3707004
3858243,2949611
3858243,3146465
3858243,3156927
3858243,3221341
3858243,3574238
3858243,3681785
3858243,3684611
3858244,14040
3858244,17445
3858244,2211676
3858244,2635670
3858244,2838924
3858244,2912700

2. Inverting the citation data (invert)

The output format is shown below; the first column is the cited patent and the second column lists the patents that cite it.

100000	5031388
1000006	4714284
1000007	4766693
1000011	5033339
1000017	3908629
1000026	4043055
1000033	4190903,4975983
1000043	4091523
1000044	4055371,4082383
10000	4539112
1000045	4290571
1000046	5525001,5918892
1000049	5996916
1000051	4541310
1000054	4946631
1000065	4748968
1000067	4944640,5071294,5312208
1000070	4928425,5009029
1000073	4107819,5474494
1000076	4867716,5845593
1000083	5322091,5566726
1000084	4182197,4683770
1000086	4178246,4217220,4686189,4839046
1000089	5277853,5395228,5503546,5505607,5505610,5505611,5540869,5544405,5571464,5807591
1000094	4897975,4920718,5713167
1000102	5120183,5791855

2.1 mapper.py

This follows the WordCount template; the statement print '%s\t%s' % (words[1], words[0]) swaps the two columns.

#!/usr/bin/python
# -*- coding: utf-8 -*- 

import sys

def map(input):
    for line in input:
        line = line.strip()
        words = line.split(',')
        if len(words) == 2 :
            print '%s\t%s' %(words[1],words[0])
def main():
    map(sys.stdin)

if __name__ == "__main__":
    main()

2.2 reducer.py

Based on the WordCount reducer.py, with a few changes: current_key is the cited patent number and current_value is the list of patents citing it, separated by commas; key and value are separated by \t.

#!/usr/bin/python
# -*- coding: utf-8 -*- 

import sys

def reduce(input):
    current_key = None
    current_value = None
    key = None
    for line in input:
        line = line.strip()
        
        key, value = line.split('\t')
        
        if current_key == key:
            current_value += (',' + value)
        else:
            if current_key:
                print '%s\t%s' %(current_key, current_value)
            current_value = value
            current_key = key

    if current_key == key:
        print '%s\t%s' % (current_key, current_value)

def main():
    reduce(sys.stdin)

if __name__ == "__main__":
    main()



2.3 Local verification

Even with 16 million rows the speed is still acceptable. more output.txt is too slow; looking at the first lines is enough. Note that the mapper output must be piped through sort, otherwise the result is wrong: the reducer assumes that all lines with the same key arrive consecutively, which is exactly what the shuffle phase guarantees on the cluster.

$ wc -l cite75_99.txt
16522439 cite75_99.txt
$ cat cite75_99.txt | ./mapper.py | sort | ./reducer.py > output.txt
$ wc -l output.txt
3276728 output.txt
$ more output.txt

2.4 Running on Hadoop with a script

This exec_streaming.sh differs from the WordCount script because I set up a new pseudo-distributed environment in an Ubuntu VM, so I no longer need to start three VMs during development.

#!/bin/sh
hadoop dfs -rmr /examples/patent/streaming_out
hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar -file mapper.py -mapper mapper.py -file reducer.py -reducer reducer.py -input /examples/patent/cite75_99.txt -output /examples/patent/streaming_out
hadoop dfs -cat /examples/patent/streaming_out/part-00000

3. Counting and histogram

The previous step produced, for each patent, the list of patents that cite it. Counting is one of the most basic statistics, and a small modification of reducer.py implements it. reducer_count.py is as follows:

#!/usr/bin/python
# -*- coding: utf-8 -*- 

import sys

def reduce(input):
    current_key = None
    current_count = 0
    key = None
    for line in input:
        line = line.strip()

        key, value = line.split('\t')
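        # value is either a single citing patent (raw mapper output) or a
        # comma-separated list (inverted output); counting its entries works in both cases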
        count = len(value.split(','))

        if current_key == key:
            current_count += count
        else:
            if current_key:
                print '%s\t%s' %(current_key, current_count)
            current_count = count
            current_key = key

    if current_key == key:
        print '%s\t%s' % (current_key, current_count)

def main():
    reduce(sys.stdin)

if __name__ == "__main__":
    main()



The counting output, output_count.txt, looks like this:

100000  1
1000006 1
1000007 1
1000011 1
1000017 1
1000026 1
1000033 2
1000043 1
1000044 2
10000   1
1000045 1
1000046 2
1000049 1
1000051 1
1000054 1
1000065 1
1000067 3
1000070 2
1000073 2
1000076 2
1000083 2
1000084 2
1000086 4
1000089 10

Taking output_count.txt as input, we compute a histogram on top of these counts. mapper_histogram.py is as follows:

#!/usr/bin/python
# -*- coding: utf-8 -*- 

import sys

def map(input):
    for line in input:
        line = line.strip()
        words = line.split('\t')
        if len(words) == 2 :
            print '%s\t%s' %(words[1],1)
def main():
    map(sys.stdin)


if __name__ == "__main__":
    main()

reducer_histogram.py is as follows:

#!/usr/bin/python
# -*- coding: utf-8 -*- 

import sys


def reduce(input):
    current_key = None
    current_count = 0
    key = None
    for line in input:
        line = line.strip()
        
        words  = line.split('\t')
        key = words[0]
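        # each well-formed line represents one patent with this citation count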
        count = 0

        if len(words) == 2:
            count = 1

        if current_key == key:
            current_count += count
        else:
            if current_key:
                print '%s\t%s' %(current_key, current_count)
            current_count = count
            current_key = key

    if current_key == key:
        print '%s\t%s' % (current_key, current_count)

def main():
    reduce(sys.stdin)


if __name__ == "__main__":
    main()

Run: cat output_count.txt | ./mapper_histogram.py | sort -k1 -n | ./reducer_histogram.py > output_histogram.txt

This produces the histogram below. The first column is the number of times a patent was cited and the second is the number of patents; the first row means that 942,232 patents were cited exactly once. (A quick consistency check follows the listing.)

1	942232
2	551843
3	379462
4	277848
5	210438
6	162891
7	127743
8	102050
9	82048
10	66578
11	53835
12	44966
13	37055
14	31178
15	26208
16	22024
17	18896
18	16123
19	13697
20	11856
21	10348
22	9028
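
As a consistency check (my own addition, assuming output_histogram.txt holds the complete histogram), the patent counts in the second column should add up to the number of distinct cited patents, which was 3,276,728 in the local run above:

awk -F'\t' '{ s += $2 } END { print s }' output_histogram.txt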

4. Java implementation

4.1 Inverting the patent citations

PatentInvert.java

import java.io.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PatentInvert
{
    public static class PatentMapper extends Mapper<Object, Text, Text, Text>
    {
        private Text key2 = new Text();
        private Text value2 = new Text();
        
        public void map(Object key1, Text value1, Context context) throws IOException, InterruptedException
        {
            String[] words = value1.toString().split(",");
            if(words != null && words.length == 2)
            {
                key2.set(words[1]);
                value2.set(words[0]);
                context.write(key2, value2);
            }
        }
    }


    public static class PatentReducer extends Reducer<Text, Text, Text, Text>
    {
        public void reduce(Text key2, Iterable<Text> values2, Context context) throws IOException, InterruptedException
        {
            Text key3 = key2;
            String value3 = "";
            
            for(Text val: values2)
            {
                if(value3.length() > 0)
                   value3 += ",";

                value3 +=  val.toString();
            }
            context.write(key3, new Text(value3));
        }
    }

    public static void main(String[] args) throws Exception
    {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "Patent invert");
        job.setJarByClass(PatentInvert.class);
        job.setMapperClass(PatentMapper.class);
        job.setReducerClass(PatentReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true)?0:1);
    }
}

build.sh

The script takes an optional argument so the same script can build any of the examples, and it prints a message if the source file to compile does not exist.

#!/bin/sh
HADOOP_LIB_DIR=/usr/local/hadoop/share/hadoop

FILE_NAME=PatentInvert

if [ $# -eq 1 ]; then
    FILE_NAME=$1
fi

rm -f ./*.class
rm -f ./${FILE_NAME}.jar

if [ -f ./${FILE_NAME}.java ]; then

    javac -classpath $HADOOP_LIB_DIR/common/hadoop-common-2.6.0.jar:$HADOOP_LIB_DIR/common/lib/commons-cli-1.2.jar:$HADOOP_LIB_DIR/common/lib/hadoop-annotations-2.6.0.jar:$HADOOP_LIB_DIR/mapreduce/hadoop-mapreduce-client-core-2.6.0.jar -d . ./${FILE_NAME}.java

#package

    jar -cvf ${FILE_NAME}.jar ./*.class

else
   echo "${FILE_NAME}.java is not exist !"
fi

Build and package: build.sh

Run: hadoop jar PatentInvert.jar PatentInvert /examples/patent/cite75_99.txt /examples/patent/out_invert

View the result: hadoop dfs -cat /examples/patent/out_invert/part-r-00000

Viewing the full result takes patience; I only looked at the beginning and then interrupted it.
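
A quicker way (just a convenience, using the same output path as above) is to pipe the output through head so only the first lines are printed:

hadoop dfs -cat /examples/patent/out_invert/part-r-00000 | head -n 20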


4.2 Counting the inverted patent citations

PatentCount.java

import java.io.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PatentCount
{
    public static class PatentMapper extends Mapper<Object, Text, IntWritable, IntWritable>
    {
        private IntWritable key2 = new IntWritable(0);
        private IntWritable value2 = new IntWritable(1);
        
        public void map(Object key1, Text value1, Context context) throws IOException, InterruptedException
        {
            String[] words = value1.toString().split(",");
            if(words != null && words.length == 2)
            {
                try
                {
                    key2.set(Integer.parseInt(words[1].trim()));
                    context.write(key2, value2);
                }catch(Exception e){
                    context.write(new IntWritable(0), value2); 
                }
            }
        }
    }

    public static class PatentReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable>
    {
        public void reduce(IntWritable key2, Iterable<IntWritable> values2, Context context) throws IOException, InterruptedException
        {
            IntWritable key3 = key2;
            IntWritable value3 = new IntWritable(0);

            int total = 0;
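            // every value is a 1 emitted by the mapper; only the number of values matters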
            
            for(IntWritable val: values2)
            {
                total++;
            }
            value3.set(total);
            context.write(key3, value3);
        }
    }

    public static void main(String[] args) throws Exception
    {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "Patent count");
        job.setJarByClass(PatentCount.class);
        job.setMapperClass(PatentMapper.class);
        job.setReducerClass(PatentReducer.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true)?0:1);
    }
}

Build and package: build.sh PatentCount

Run: hadoop jar PatentCount.jar PatentCount /examples/patent/cite75_99.txt /examples/patent/out_count

View the result: hadoop dfs -cat /examples/patent/out_count/part-r-00000


4.3 Patent citation histogram

PatentHistogram.java

import java.io.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PatentHistogram
{
    public static class PatentMapper extends Mapper<Object, Text, IntWritable, IntWritable>
    {
        private IntWritable key2 = new IntWritable(0);
        private IntWritable value2 = new IntWritable(1);
        
        public void map(Object key1, Text value1, Context context) throws IOException, InterruptedException
        {
            String[] words = value1.toString().split("\t");
            if(words != null && words.length == 2)
            {
                try
                {
                    key2.set(Integer.parseInt(words[1].trim()));
                    context.write(key2, value2);
                }catch(Exception e){
                    context.write(new IntWritable(0), value2); 
                }
            }
        }
    }

    public static class PatentReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable>
    {
        public void reduce(IntWritable key2, Iterable<IntWritable> values2, Context context) throws IOException, InterruptedException
        {
            IntWritable key3 = key2;
            IntWritable value3 = new IntWritable(0);

            int total = 0;
            
            for(IntWritable val: values2)
            {
                total++;
            }
            value3.set(total);
            context.write(key3, value3);
        }
    }

    public static void main(String[] args) throws Exception
    {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "Patent histogram");
        job.setJarByClass(PatentHistogram.class);
        job.setMapperClass(PatentMapper.class);
        job.setReducerClass(PatentReducer.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true)?0:1);
    }
}



Build and package: build.sh PatentHistogram

Move the count output into place: hadoop dfs -mv /examples/patent/out_count/part-r-00000 /examples/patent/cite75_99_count.txt

Run: hadoop jar PatentHistogram.jar PatentHistogram /examples/patent/cite75_99_count.txt /examples/patent/out_histogram

View the result: hadoop dfs -cat /examples/patent/out_histogram/part-r-00000

III. Takeaways


Using Streaming + Python as a development and debugging tool really is convenient.


In this example, even though the raw data has more than 16 million records, the plain Python scripts processed it at an acceptable speed; the data set is apparently still too small for Hadoop's advantages to show.


Once you understand the Mapper and Reducer classes and remember the basic WordCount example, getting started with MapReduce programming is not hard: you override the map method of Mapper and the reduce method of Reducer. Their signatures look roughly like this:


public class Mapper<K1, V1, K2, V2>
{
    void map(K1 key, V1 value, Mapper.Context context) throws IOException, InterruptedException
    {...}
}

public class Reducer<K2, V2, K3, V3>
{
    void reduce(K2 key, Iterable<V2> values, Reducer.Context context) throws IOException, InterruptedException
    {...}
}