Although the Hadoop framework is written in Java, we may still want to implement Hadoop programs in other languages such as C++ or Python. The example given on the official Hadoop website is written in Jython and packaged into a jar file, which is clearly inconvenient. It does not have to be done that way: we can program against Hadoop directly in Python.
What do we want to do?
We will write a simple MapReduce program in C-Python, rather than a Jython program packaged into a jar.
Our example mimics WordCount in Python: it reads text files and counts how often each word occurs. The result is also written as text, with each line containing one word and its count, separated by a tab.
Prerequisites
Before writing this program you need a working Hadoop cluster, otherwise you will be stuck later on. If you have not set one up yet, the following tutorials walk you through it on Ubuntu Linux (the steps apply equally to other Linux and Unix distributions):
How to set up a single-node Hadoop cluster using the Hadoop Distributed File System (HDFS) on Ubuntu Linux
How to set up a multi-node Hadoop cluster using the Hadoop Distributed File System (HDFS) on Ubuntu Linux
MapReduce code in Python
The trick to writing MapReduce code in Python is to use Hadoop Streaming, which passes the data between the Map and Reduce phases through STDIN (standard input) and STDOUT (standard output). We simply read input with Python's sys.stdin and write output with sys.stdout; Hadoop Streaming takes care of everything else.
Map: mapper.py
Save the following code as /root/zhangjian/test/mapper.py (any location on the system will do). It reads data from STDIN, splits each line into words, and emits one line per word pairing the word with an occurrence count:
Note: make sure the script is executable (chmod +x /root/zhangjian/test/mapper.py).
#!/usr/bin/env python
import sys
import re

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    #words = line.split()
    words = re.split(r',| |:|\.', line)
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output)
        # what we output here will be the input for the Reduce step, i.e. the input for reducer.py
        # tab-delimited; the trivial word count is 1
        print '%s\t%s' % (word, 1)
This script does not compute the total number of times each word occurs; it immediately emits "<word> 1" even though <word> may appear many times in the input. The counting is left to the later Reduce step (i.e. the reducer program).
Reduce: reducer.py
Save the following code as /root/zhangjian/test/reducer.py. This script reads the results of mapper.py from STDIN, sums up how often each word occurred, and writes the results to STDOUT.
Again, make sure the script is executable: chmod +x /root/zhangjian/test/reducer.py
#!/usr/bin/env python
from operator import itemgetter
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
        word2count[word] = word2count.get(word, 0) + count
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        pass

# sort the words lexicographically
#
# this step is NOT required, we just do it so that our final output will look more like the official Hadoop
# word count examples
sorted_word2count = sorted(word2count.items(), key=itemgetter(0))

# write the results to STDOUT (standard output)
for word, count in sorted_word2count:
    print '%s\t%s' % (word, count)
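For comparison only (not part of this article's example): on the cluster, Hadoop delivers the reducer's input sorted by key, so a streaming reducer is often written without any dictionary, summing each contiguous run of identical words as it arrives. The hypothetical alternative_reducer.py below sketches that pattern; to test it locally you would have to pipe the mapper output through sort first, whereas reducer.py above works on unsorted input because it buffers all counts in word2count.
#!/usr/bin/env python
# alternative_reducer.py -- illustrative sketch only, not used anywhere else in this article.
# It assumes its input is sorted by word, which is what Hadoop guarantees for the
# reduce phase, so identical words arrive as one contiguous run.
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    # parse the word<TAB>count pairs produced by mapper.py
    try:
        word, count = line.strip().split('\t', 1)
        count = int(count)
    except ValueError:
        # malformed line; silently skip it, like reducer.py does
        continue
    if word == current_word:
        current_count += count
    else:
        # a new run of words starts; flush the finished one
        if current_word is not None:
            print '%s\t%s' % (current_word, current_count)
        current_word = word
        current_count = count

# flush the last word
if current_word is not None:
    print '%s\t%s' % (current_word, current_count)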
Testing the code
Before running the full MapReduce job, test mapper.py and reducer.py by hand, so that you do not end up with a job that returns no results and no clue why.
Test the Map and Reduce functionality:
echo "zhangjian come from shandong,shandong is a good space" | /root/zhangjian/test/mapper.py
Output:
zhangjian 1
come 1
from 1
shandong 1
shandong 1
is 1
a 1
good 1
space 1
echo "zhangjian come from shandong,shandong is a good space" | /root/zhangjian/test/mapper.py | /root/zhangjian/test/reducer.py
Output:
a 1
come 1
from 1
good 1
is 1
shandong 2
space 1
zhangjian 1
To test the MapReduce job on more data, we generate three files, 1.txt, 2.txt and 3.txt. For convenience, each file simply contains the sentence above repeated many times: 9926 lines in 1.txt, 119112 lines in 2.txt and 238224 lines in 3.txt.
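The article does not show how these files were created; a minimal sketch (a hypothetical make_test_files.py, assuming the directory /root/zhangjian/test/file already exists) could look like this:
#!/usr/bin/env python
# make_test_files.py -- hypothetical helper, not part of the tutorial.
# Writes the test sentence the requested number of times into each file,
# placing them in the directory used in the next step.
sentence = 'zhangjian come from shandong,shandong is a good space\n'
for name, lines in [('1.txt', 9926), ('2.txt', 119112), ('3.txt', 238224)]:
    with open('/root/zhangjian/test/file/' + name, 'w') as f:
        f.write(sentence * lines)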
Store the three files in /root/zhangjian/test/file:
[root@dev-slave1 file]# ls -l /root/zhangjian/test/file
total 19372
-rw-r--r-- 1 root root 536004 Sep 10 11:05 2.txt
-rw-r--r-- 1 root root 6432048 Sep 10 11:06 3.txt
-rw-r--r-- 1 root root 12864096 Sep 10 11:06 4.txt
Before we can run the MapReduce job, we need to copy the local files into HDFS.
First, create a directory on HDFS to hold the source files to be processed and the final output:
hadoop fs -mkdir -p /test_file
Then copy the local files to HDFS:
hadoop fs -copyFromLocal /root/zhangjian/test/file /test_file
The result looks like this:
[root@dev-slave1 file]# hadoop fs -ls /test_file
Found 1 items
drwxr-xr-x - root supergroup 0 2015-09-10 15:12 /test_file/file
[root@dev-slave1 file]# hadoop fs -ls /test_file/file
Found 3 items
-rw-r--r-- 1 root supergroup 536004 2015-09-10 11:35 /test_file/file/1.txt
-rw-r--r-- 1 root supergroup 6432048 2015-09-10 11:35 /test_file/file/2.txt
-rw-r--r-- 1 root supergroup 12864096 2015-09-10 11:35 /test_file/file/3.txt
With everything in place, we can now run the Python MapReduce job on the Hadoop cluster with the following command:
hadoop jar /root/develop/src/hadoop/hadoop-tools/hadoop-streaming/target/hadoop-streaming-2.7.1.jar -mapper 'python /root/zhangjian/test/mapper.py' -file /root/zhangjian/test/mapper.py -reducer 'python /root/zhangjian/test/reducer.py' -file /root/zhangjian/test/reducer.py -input /test_file/file/* -output /test_file/file/file_output
Note that the two -file options are required; they ship the scripts to the compute nodes, and leaving them out causes the error: Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
If the run above fails, delete the file_output directory from HDFS (for example with hadoop fs -rm -r /test_file/file/file_output) before running the job again.
The job run produces the following output:
15/09/10 15:12:18 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [/root/zhangjian/test/mapper.py, /root/zhangjian/test/reducer.py, /tmp/hadoop-unjar4712976301095870488/] [] /tmp/streamjob2552627823217601490.jar tmpDir=null
15/09/10 15:12:19 INFO client.RMProxy: Connecting to ResourceManager at dev-master/172.16.10.51:8032
15/09/10 15:12:19 INFO client.RMProxy: Connecting to ResourceManager at dev-master/172.16.10.51:8032
15/09/10 15:12:20 INFO mapred.FileInputFormat: Total input paths to process : 3
15/09/10 15:12:20 INFO mapreduce.JobSubmitter: number of splits:4
15/09/10 15:12:20 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1440156752555_0040
15/09/10 15:12:20 INFO impl.YarnClientImpl: Submitted application application_1440156752555_0040
15/09/10 15:12:21 INFO mapreduce.Job: The url to track the job: http://dev-master:8088/proxy/application_1440156752555_0040/
15/09/10 15:12:21 INFO mapreduce.Job: Running job: job_1440156752555_0040
15/09/10 15:12:27 INFO mapreduce.Job: Job job_1440156752555_0040 running in uber mode : false
15/09/10 15:12:27 INFO mapreduce.Job: map 0% reduce 0%
15/09/10 15:12:38 INFO mapreduce.Job: map 25% reduce 0%
15/09/10 15:12:39 INFO mapreduce.Job: map 50% reduce 0%
15/09/10 15:12:41 INFO mapreduce.Job: map 75% reduce 0%
15/09/10 15:12:42 INFO mapreduce.Job: map 92% reduce 0%
15/09/10 15:12:43 INFO mapreduce.Job: map 100% reduce 0%
15/09/10 15:13:06 INFO mapreduce.Job: map 100% reduce 69%
15/09/10 15:13:09 INFO mapreduce.Job: map 100% reduce 79%
15/09/10 15:13:12 INFO mapreduce.Job: map 100% reduce 81%
15/09/10 15:13:15 INFO mapreduce.Job: map 100% reduce 100%
15/09/10 15:13:15 INFO mapreduce.Job: Job job_1440156752555_0040 completed successfully
15/09/10 15:13:15 INFO mapreduce.Job: Counters: 50
File System Counters
FILE: Number of bytes read=33053586
FILE: Number of bytes written=66702385
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=19832870
HDFS: Number of bytes written=101
HDFS: Number of read operations=15
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Failed reduce tasks=1
Launched map tasks=4
Launched reduce tasks=2
Data-local map tasks=4
Total time spent by all maps in occupied slots (ms)=88718
Total time spent by all reduces in occupied slots (ms)=54844
Total time spent by all map tasks (ms)=44359
Total time spent by all reduce tasks (ms)=27422
Total vcore-seconds taken by all map tasks=44359
Total vcore-seconds taken by all reduce tasks=27422
Total megabyte-seconds taken by all map tasks=44359000
Total megabyte-seconds taken by all reduce tasks=27422000
Map-Reduce Framework
Map input records=367262
Map output records=3305358
Map output bytes=26442864
Map output materialized bytes=33053604
Input split bytes=380
Combine input records=0
Combine output records=0
Reduce input groups=8
Reduce shuffle bytes=33053604
Reduce input records=3305358
Reduce output records=8
Spilled Records=6610716
Shuffled Maps =4
Failed Shuffles=0
Merged Map outputs=4
GC time elapsed (ms)=305
CPU time spent (ms)=29160
Physical memory (bytes) snapshot=1400758272
Virtual memory (bytes) snapshot=7690555392
Total committed heap usage (bytes)=1355808768
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=19832490
File Output Format Counters
Bytes Written=101
15/09/10 15:13:15 INFO streaming.StreamJob: Output directory: /test_file/file/file_output
Check that the results were written and stored in the HDFS output directory:
[root@dev-slave1 file]# hadoop fs -ls /test_file/file
Found 4 items
-rw-r--r-- 1 root supergroup 536004 2015-09-10 11:35 /test_file/file/1.txt
-rw-r--r-- 1 root supergroup 6432048 2015-09-10 11:35 /test_file/file/2.txt
-rw-r--r-- 1 root supergroup 12864096 2015-09-10 11:35 /test_file/file/3.txt
drwxr-xr-x - root supergroup 0 2015-09-10 15:13 /test_file/file/file_output
[root@dev-slave1 file]# hadoop fs -ls /test_file/file/file_output
Found 2 items
-rw-r--r-- 1 root supergroup 0 2015-09-10 15:13 /test_file/file/file_output/_SUCCESS
-rw-r--r-- 1 root supergroup 101 2015-09-10 15:13 /test_file/file/file_output/part-00000
[root@dev-slave1 file]# hadoop fs -cat /test_file/file/file_output/part-00000
a 367262
come 367262
from 367262
good 367262
is 367262
shandong 734524
space 367262
zhangjian 367262
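These numbers are easy to verify: the three input files contain 9926 + 119112 + 238224 = 367262 copies of the test sentence in total, which matches the Map input records counter above. Every word occurs once per line except shandong, which occurs twice, so shandong ends up with 2 × 367262 = 734524. Each line also yields nine mapper output pairs, which accounts for the Map output records value of 9 × 367262 = 3305358.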
Download the results from HDFS to the local file system:
hadoop fs -copyToLocal /test_file/file/file_output/part-00000 /root/zhangjian//test_file/file/file_output