Although the Hadoop framework is written in Java, Hadoop programs do not have to be: they can also be implemented in languages such as C++ or Python. The example program on the official Hadoop website uses Jython and is packaged into a jar file, which is clearly inconvenient. Fortunately, that is not the only way to do it — we can program against Hadoop directly with Python.

What do we want to do?

We will write a simple MapReduce program in CPython, rather than a Jython program packaged into a jar file.
Our example mimics WordCount and is implemented in Python: it reads text files and counts how often each word occurs. The result is also written as text, with each line containing a word and the number of times it occurred, separated by a tab.

Prerequisites

Before writing this program you need a working Hadoop cluster, so that you do not get stuck later on. If you have not set one up yet, the following short tutorials show how to do it on Ubuntu Linux (they apply equally to other Linux and Unix distributions):

How to set up a single-node Hadoop cluster with the Hadoop Distributed File System (HDFS) on Ubuntu Linux
How to set up a multi-node Hadoop cluster with the Hadoop Distributed File System (HDFS) on Ubuntu Linux

MapReduce code in Python

The trick to writing MapReduce code in Python is to use Hadoop Streaming, which passes data between the Map and Reduce phases via STDIN (standard input) and STDOUT (standard output). We simply read input with Python's sys.stdin and write output with sys.stdout; Hadoop Streaming takes care of everything else.
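To see why this line-based protocol is enough, here is a minimal local simulation of what Hadoop Streaming does between the two phases (the function names and sample input are our own illustration, not part of Hadoop or of the scripts below): the framework sorts the mapper's output by key before feeding it to the reducer, which is what makes per-key aggregation possible.

```python
from itertools import groupby

def map_phase(lines):
    """Emit (word, 1) once per occurrence, as mapper.py will do."""
    for line in lines:
        for word in line.strip().split():
            yield (word, 1)

def simulate_streaming(lines):
    """Sort the mapper's output by key (Hadoop's 'shuffle'), then sum per key."""
    pairs = sorted(map_phase(lines))  # Streaming sorts between map and reduce
    return {word: sum(count for _, count in group)
            for word, group in groupby(pairs, key=lambda pair: pair[0])}

print(simulate_streaming(["foo foo bar", "bar foo"]))  # {'bar': 2, 'foo': 3}
```

On the cluster the sorting and grouping happen across machines, but the contract seen by our two scripts is exactly this simple.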

 

Map: mapper.py

Save the following code in /root/zhangjian/test/mapper.py (any location on the system will do). It reads data from STDIN, splits each line into words, and emits a stream of (word, count) pairs mapping each word to an occurrence count:
Note: make sure the script is executable (chmod +x /root/zhangjian/test/mapper.py).

#!/usr/bin/env python

import sys
import re

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words on commas, spaces, colons and periods
    # words = line.split()  # simple whitespace-only alternative
    words = re.split(r',| |:|\.', line)
    # increase counters
    for word in words:
        if not word:
            continue  # skip empty strings produced by adjacent delimiters
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the Reduce step,
        # i.e. the input for reducer.py
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))

Note that this script does not compute the total number of occurrences of each word. It immediately emits "<word> 1", even though <word> may appear multiple times in the input; the counting is left to the subsequent Reduce step (the reducer program).

 

Reduce: reducer.py

Save the following code in /root/zhangjian/test/reducer.py. This script reads the results of mapper.py from STDIN, sums the number of occurrences of each word, and writes the result to STDOUT.
Again, make sure the script is executable: chmod +x /root/zhangjian/test/reducer.py


#!/usr/bin/env python

from operator import itemgetter
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
        word2count[word] = word2count.get(word, 0) + count
    except ValueError:
        # count was not a number, so silently ignore/discard this line
        pass

# sort the words lexicographically
#
# this step is NOT required; we just do it so that our final output
# looks more like the official Hadoop word count examples
sorted_word2count = sorted(word2count.items(), key=itemgetter(0))

# write the results to STDOUT (standard output)
for word, count in sorted_word2count:
    print('%s\t%s' % (word, count))
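Because Hadoop Streaming delivers the reducer's input sorted by key, the dictionary above is not strictly necessary: all lines for a given word arrive consecutively, so they can be summed group by group without holding every distinct word in memory. A hedged alternative sketch (the function name `sum_sorted_pairs` is ours, not part of the tutorial's scripts):

```python
from itertools import groupby

def sum_sorted_pairs(lines):
    """Sum tab-delimited 'word<TAB>count' lines that are already sorted
    by word, yielding (word, total) one group at a time."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield word, sum(int(count) for _, count in group)

# sorted mapper output, as Hadoop Streaming would deliver it
lines = ["bar\t1", "foo\t1", "foo\t1"]
print(list(sum_sorted_pairs(lines)))  # [('bar', 1), ('foo', 2)]
```

This mirrors how Java reducers iterate over grouped values; the dict-based reducer.py above is simpler, but its memory use grows with the number of distinct words.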

 

Testing the code

Before running the MapReduce job, test your mapper.py and reducer.py scripts by hand, so that you do not end up with a job that returns no results at all.
Test the Map and Reduce functionality:

echo "zhangjian come from shandong,shandong is a good space" | /root/zhangjian/test/mapper.py

Output:

zhangjian	1
come	1
from	1
shandong	1
shandong	1
is	1
a	1
good	1
space	1

echo "zhangjian come from shandong,shandong is a good space" | /root/zhangjian/test/mapper.py | /root/zhangjian/test/reducer.py 

Output:

a	1
come	1
from	1
good	1
is	1
shandong	2
space	1
zhangjian	1

 

To test the MapReduce job on real input, we generate three files: 1.txt, 2.txt and 3.txt. For convenience, each file simply repeats the sentence above many times: 1.txt has 9926 lines, 2.txt has 119112 lines, and 3.txt has 238224 lines.

Store these three files in /root/zhangjian/test/file:

[root@dev-slave1 file]# ls -l /root/zhangjian/test/file
total 19372
-rw-r--r-- 1 root root   536004 Sep 10 11:05 1.txt
-rw-r--r-- 1 root root  6432048 Sep 10 11:06 2.txt
-rw-r--r-- 1 root root 12864096 Sep 10 11:06 3.txt
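The files can be generated by repeating the sentence; a minimal sketch (the `write_repeated` helper is our own illustration, not part of the original setup):

```python
import os
import tempfile

SENTENCE = "zhangjian come from shandong,shandong is a good space\n"

def write_repeated(path, n, sentence=SENTENCE):
    """Write `sentence` n times to `path`; return the resulting size in bytes."""
    with open(path, "w") as f:
        f.write(sentence * n)
    return os.path.getsize(path)

# Each line is 54 bytes, so 9926 / 119112 / 238224 repetitions give the
# 536004 / 6432048 / 12864096 byte sizes shown in the listing above.
demo = os.path.join(tempfile.mkdtemp(), "1.txt")
print(write_repeated(demo, 9926))  # 536004
```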

 

Before we run the MapReduce job, we need to copy the local files into HDFS.

First, create a directory on HDFS to hold the source files to be processed and, later, the final output:

hadoop fs -mkdir -p /test_file

Then copy the local files to HDFS:

hadoop fs -copyFromLocal /root/zhangjian/test/file /test_file

The result looks like this:

[root@dev-slave1 file]# hadoop fs -ls /test_file
Found 1 items
drwxr-xr-x   - root supergroup          0 2015-09-10 15:12 /test_file/file
[root@dev-slave1 file]# hadoop fs -ls /test_file/file
Found 3 items
-rw-r--r--   1 root supergroup     536004 2015-09-10 11:35 /test_file/file/1.txt
-rw-r--r--   1 root supergroup    6432048 2015-09-10 11:35 /test_file/file/2.txt
-rw-r--r--   1 root supergroup   12864096 2015-09-10 11:35 /test_file/file/3.txt

 

At this point everything is ready, and we can run the Python MapReduce job on the Hadoop cluster with the following command:

hadoop jar /root/develop/src/hadoop/hadoop-tools/hadoop-streaming/target/hadoop-streaming-2.7.1.jar -mapper 'python /root/zhangjian/test/mapper.py' -file /root/zhangjian/test/mapper.py -reducer 'python /root/zhangjian/test/reducer.py' -file /root/zhangjian/test/reducer.py -input /test_file/file/* -output /test_file/file/file_output
Note that the two -file options are required; leaving them out causes the error: Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2

If the run above fails, delete the file_output directory from HDFS before rerunning, for example:

hadoop fs -rm -r /test_file/file/file_output

The job runs as follows:

15/09/10 15:12:18 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [/root/zhangjian/test/mapper.py, /root/zhangjian/test/reducer.py, /tmp/hadoop-unjar4712976301095870488/] [] /tmp/streamjob2552627823217601490.jar tmpDir=null
15/09/10 15:12:19 INFO client.RMProxy: Connecting to ResourceManager at dev-master/172.16.10.51:8032
15/09/10 15:12:19 INFO client.RMProxy: Connecting to ResourceManager at dev-master/172.16.10.51:8032
15/09/10 15:12:20 INFO mapred.FileInputFormat: Total input paths to process : 3
15/09/10 15:12:20 INFO mapreduce.JobSubmitter: number of splits:4
15/09/10 15:12:20 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1440156752555_0040
15/09/10 15:12:20 INFO impl.YarnClientImpl: Submitted application application_1440156752555_0040
15/09/10 15:12:21 INFO mapreduce.Job: The url to track the job: http://dev-master:8088/proxy/application_1440156752555_0040/
15/09/10 15:12:21 INFO mapreduce.Job: Running job: job_1440156752555_0040
15/09/10 15:12:27 INFO mapreduce.Job: Job job_1440156752555_0040 running in uber mode : false
15/09/10 15:12:27 INFO mapreduce.Job:  map 0% reduce 0%
15/09/10 15:12:38 INFO mapreduce.Job:  map 25% reduce 0%
15/09/10 15:12:39 INFO mapreduce.Job:  map 50% reduce 0%
15/09/10 15:12:41 INFO mapreduce.Job:  map 75% reduce 0%
15/09/10 15:12:42 INFO mapreduce.Job:  map 92% reduce 0%
15/09/10 15:12:43 INFO mapreduce.Job:  map 100% reduce 0%
15/09/10 15:13:06 INFO mapreduce.Job:  map 100% reduce 69%
15/09/10 15:13:09 INFO mapreduce.Job:  map 100% reduce 79%
15/09/10 15:13:12 INFO mapreduce.Job:  map 100% reduce 81%
15/09/10 15:13:15 INFO mapreduce.Job:  map 100% reduce 100%
15/09/10 15:13:15 INFO mapreduce.Job: Job job_1440156752555_0040 completed successfully
15/09/10 15:13:15 INFO mapreduce.Job: Counters: 50
	    File System Counters
		        FILE: Number of bytes read=33053586
		        FILE: Number of bytes written=66702385
		        FILE: Number of read operations=0
		        FILE: Number of large read operations=0
		        FILE: Number of write operations=0
		        HDFS: Number of bytes read=19832870
		        HDFS: Number of bytes written=101
		        HDFS: Number of read operations=15
		        HDFS: Number of large read operations=0
		        HDFS: Number of write operations=2
	    Job Counters 
		        Failed reduce tasks=1
		        Launched map tasks=4
		        Launched reduce tasks=2
		        Data-local map tasks=4
		        Total time spent by all maps in occupied slots (ms)=88718
		        Total time spent by all reduces in occupied slots (ms)=54844
		        Total time spent by all map tasks (ms)=44359
		        Total time spent by all reduce tasks (ms)=27422
		        Total vcore-seconds taken by all map tasks=44359
		        Total vcore-seconds taken by all reduce tasks=27422
		        Total megabyte-seconds taken by all map tasks=44359000
		        Total megabyte-seconds taken by all reduce tasks=27422000
	    Map-Reduce Framework
		        Map input records=367262
		        Map output records=3305358
		        Map output bytes=26442864
		        Map output materialized bytes=33053604
		        Input split bytes=380
		        Combine input records=0
		        Combine output records=0
		        Reduce input groups=8
		        Reduce shuffle bytes=33053604
		        Reduce input records=3305358
		        Reduce output records=8
		        Spilled Records=6610716
		        Shuffled Maps =4
		        Failed Shuffles=0
		        Merged Map outputs=4
		        GC time elapsed (ms)=305
		        CPU time spent (ms)=29160
		        Physical memory (bytes) snapshot=1400758272
		        Virtual memory (bytes) snapshot=7690555392
		        Total committed heap usage (bytes)=1355808768
	    Shuffle Errors
		        BAD_ID=0
		        CONNECTION=0
		        IO_ERROR=0
		        WRONG_LENGTH=0
		        WRONG_MAP=0
		        WRONG_REDUCE=0
	    File Input Format Counters 
		        Bytes Read=19832490
	    File Output Format Counters 
		        Bytes Written=101
15/09/10 15:13:15 INFO streaming.StreamJob: Output directory: /test_file/file/file_output
 
Check that the results were written to the HDFS directory:
[root@dev-slave1 file]# hadoop fs -ls /test_file/file
Found 4 items
-rw-r--r--   1 root supergroup     536004 2015-09-10 11:35 /test_file/file/1.txt
-rw-r--r--   1 root supergroup    6432048 2015-09-10 11:35 /test_file/file/2.txt
-rw-r--r--   1 root supergroup   12864096 2015-09-10 11:35 /test_file/file/3.txt
drwxr-xr-x   - root supergroup          0 2015-09-10 15:13 /test_file/file/file_output 
[root@dev-slave1 file]# hadoop fs -ls /test_file/file/file_output
Found 2 items
-rw-r--r--   1 root supergroup          0 2015-09-10 15:13 /test_file/file/file_output/_SUCCESS
-rw-r--r--   1 root supergroup        101 2015-09-10 15:13 /test_file/file/file_output/part-00000 
[root@dev-slave1 file]# hadoop fs -cat /test_file/file/file_output/part-00000
a	367262
come	367262
from	367262
good	367262
is	367262
shandong	734524
space	367262
zhangjian	367262
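These totals are easy to sanity-check: every input line is the same sentence, in which each word occurs once except shandong, which occurs twice, and the three files together contain 9926 + 119112 + 238224 lines (matching the Map input records=367262 counter above):

```python
lines = 9926 + 119112 + 238224  # total lines across 1.txt, 2.txt and 3.txt
print(lines)      # 367262 -- the count for every word occurring once per line
print(2 * lines)  # 734524 -- the count for 'shandong', occurring twice per line
```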

Download the results from HDFS to the local machine:

hadoop fs -copyToLocal /test_file/file/file_output/part-00000 /root/zhangjian/test_file/file/file_output