Although the Hadoop framework is written in Java, we may still want to implement Hadoop programs in other languages such as C++ or Python. The example given on the official Hadoop website is written in Jython and packaged into a jar file, which is clearly inconvenient. It does not have to be done that way: we can program against Hadoop directly in Python.
What do we want to do?
We will write a simple MapReduce program in C-Python, rather than a Jython program packaged into a jar.
Our example mimics WordCount in Python: it reads text files and counts how often each word occurs. The result is also written as text, with each line containing one word and its count, separated by a tab.
Prerequisites
Before writing this program you need a working Hadoop cluster, otherwise you will be stuck later on. If you have not set one up yet, the following tutorials walk you through it on Ubuntu Linux (the steps apply equally to other Linux and Unix distributions):
How to set up a single-node Hadoop cluster using the Hadoop Distributed File System (HDFS) on Ubuntu Linux
How to set up a multi-node Hadoop cluster using the Hadoop Distributed File System (HDFS) on Ubuntu Linux
MapReduce code in Python
The trick to writing MapReduce code in Python is to use Hadoop Streaming, which passes the data between the Map and Reduce phases through STDIN (standard input) and STDOUT (standard output). We simply read input with Python's sys.stdin and write output with sys.stdout; Hadoop Streaming takes care of everything else.
Map: mapper.py
Save the following code as /root/zhangjian/test/mapper.py (any location on the system will do). It reads data from STDIN, splits each line into words, and emits one line per word pairing the word with an occurrence count:
Note: make sure the script is executable (chmod +x /root/zhangjian/test/mapper.py).
#!/usr/bin/env python
import sys
import re

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    #words = line.split()
    words = re.split(r',| |:|\.', line)
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output)
        # what we output here will be the input for the Reduce step, i.e. the input for reducer.py
        # tab-delimited; the trivial word count is 1
        print '%s\t%s' % (word, 1)
This script does not compute the total number of times each word occurs; it immediately emits "<word> 1" even though <word> may appear many times in the input. The counting is left to the later Reduce step (i.e. the reducer program).
Reduce: reducer.py
Save the following code as /root/zhangjian/test/reducer.py. This script reads the results of mapper.py from STDIN, sums up how often each word occurred, and writes the results to STDOUT.
Again, make sure the script is executable: chmod +x /root/zhangjian/test/reducer.py
#!/usr/bin/env python
from operator import itemgetter
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
        word2count[word] = word2count.get(word, 0) + count
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        pass

# sort the words lexicographically
#
# this step is NOT required, we just do it so that our final output will look more like the official Hadoop
# word count examples
sorted_word2count = sorted(word2count.items(), key=itemgetter(0))

# write the results to STDOUT (standard output)
for word, count in sorted_word2count:
    print '%s\t%s' % (word, count)
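For comparison only (not part of this article's example): on the cluster, Hadoop delivers the reducer's input sorted by key, so a streaming reducer is often written without any dictionary, summing each contiguous run of identical words as it arrives. The hypothetical alternative_reducer.py below sketches that pattern; to test it locally you would have to pipe the mapper output through sort first, whereas reducer.py above works on unsorted input because it buffers all counts in word2count.
#!/usr/bin/env python
# alternative_reducer.py -- illustrative sketch only, not used anywhere else in this article.
# It assumes its input is sorted by word, which is what Hadoop guarantees for the
# reduce phase, so identical words arrive as one contiguous run.
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    # parse the word<TAB>count pairs produced by mapper.py
    try:
        word, count = line.strip().split('\t', 1)
        count = int(count)
    except ValueError:
        # malformed line; silently skip it, like reducer.py does
        continue
    if word == current_word:
        current_count += count
    else:
        # a new run of words starts; flush the finished one
        if current_word is not None:
            print '%s\t%s' % (current_word, current_count)
        current_word = word
        current_count = count

# flush the last word
if current_word is not None:
    print '%s\t%s' % (current_word, current_count)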
Testing the code
Before running the full MapReduce job, test mapper.py and reducer.py by hand, so that you do not end up with a job that returns no results and no clue why.
Test the Map and Reduce functionality:
echo "zhangjian come from shandong,shandong is a good space" | /root/zhangjian/test/mapper.py
Output:
zhangjian 1
come 1
from 1
shandong 1
shandong 1
is 1
a 1
good 1
space 1
echo "zhangjian come from shandong,shandong is a good space" | /root/zhangjian/test/mapper.py | /root/zhangjian/test/reducer.py
Output:
a 1
come 1
from 1
good 1
is 1
shandong 2
space 1
zhangjian 1
To test the MapReduce job on more data, we generate three files, 1.txt, 2.txt and 3.txt. For convenience, each file simply contains the sentence above repeated many times: 9926 lines in 1.txt, 119112 lines in 2.txt and 238224 lines in 3.txt.
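The article does not show how these files were created; a minimal sketch (a hypothetical make_test_files.py, assuming the directory /root/zhangjian/test/file already exists) could look like this:
#!/usr/bin/env python
# make_test_files.py -- hypothetical helper, not part of the tutorial.
# Writes the test sentence the requested number of times into each file,
# placing them in the directory used in the next step.
sentence = 'zhangjian come from shandong,shandong is a good space\n'
for name, lines in [('1.txt', 9926), ('2.txt', 119112), ('3.txt', 238224)]:
    with open('/root/zhangjian/test/file/' + name, 'w') as f:
        f.write(sentence * lines)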
Store the three files in /root/zhangjian/test/file:
[root@dev-slave1 file]# ls -l /root/zhangjian/test/file
total 19372
-rw-r--r-- 1 root root 536004 Sep 10 11:05 2.txt
-rw-r--r-- 1 root root 6432048 Sep 10 11:06 3.txt
-rw-r--r-- 1 root root 12864096 Sep 10 11:06 4.txt
Before we can run the MapReduce job, we need to copy the local files into HDFS.
First, create a directory on HDFS to hold the source files to be processed and the final output:
hadoop fs -mkdir -p /test_file
Then copy the local files to HDFS:
hadoop fs -copyFromLocal /root/zhangjian/test/file /test_file
The result looks like this:
[root@dev-slave1 file]# hadoop fs -ls /test_file
Found 1 items
drwxr-xr-x - root supergroup 0 2015-09-10 15:12 /test_file/file
[root@dev-slave1 file]# hadoop fs -ls /test_file/file
Found 3 items
-rw-r--r-- 1 root supergroup 536004 2015-09-10 11:35 /test_file/file/1.txt
-rw-r--r-- 1 root supergroup 6432048 2015-09-10 11:35 /test_file/file/2.txt
-rw-r--r-- 1 root supergroup 12864096 2015-09-10 11:35 /test_file/file/3.txt
With everything in place, we can now run the Python MapReduce job on the Hadoop cluster with the following command:
hadoop jar /root/develop/src/hadoop/hadoop-tools/hadoop-streaming/target/hadoop-streaming-2.7.1.jar -mapper 'python /root/zhangjian/test/mapper.py' -file /root/zhangjian/test/mapper.py -reducer 'python /root/zhangjian/test/reducer.py' -file /root/zhangjian/test/reducer.py -input /test_file/file/* -output /test_file/file/file_output
Note that the two -file options are required; they ship the scripts to the compute nodes, and leaving them out causes the error: Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
If the run above fails, delete the file_output directory from HDFS (for example with hadoop fs -rm -r /test_file/file/file_output) before running the job again.
The job run produces the following output:
15/09/10 15:12:18 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [/root/zhangjian/test/mapper.py, /root/zhangjian/test/reducer.py, /tmp/hadoop-unjar4712976301095870488/] [] /tmp/streamjob2552627823217601490.jar tmpDir=null
15/09/10 15:12:19 INFO client.RMProxy: Connecting to ResourceManager at dev-master/172.16.10.51:8032
15/09/10 15:12:19 INFO client.RMProxy: Connecting to ResourceManager at dev-master/172.16.10.51:8032
15/09/10 15:12:20 INFO mapred.FileInputFormat: Total input paths to process : 3
15/09/10 15:12:20 INFO mapreduce.JobSubmitter: number of splits:4
15/09/10 15:12:20 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1440156752555_0040
15/09/10 15:12:20 INFO impl.YarnClientImpl: Submitted application application_1440156752555_0040
15/09/10 15:12:21 INFO mapreduce.Job: The url to track the job: http://dev-master:8088/proxy/application_1440156752555_0040/
15/09/10 15:12:21 INFO mapreduce.Job: Running job: job_1440156752555_0040
15/09/10 15:12:27 INFO mapreduce.Job: Job job_1440156752555_0040 running in uber mode : false
15/09/10 15:12:27 INFO mapreduce.Job: map 0% reduce 0%
15/09/10 15:12:38 INFO mapreduce.Job: map 25% reduce 0%
15/09/10 15:12:39 INFO mapreduce.Job: map 50% reduce 0%
15/09/10 15:12:41 INFO mapreduce.Job: map 75% reduce 0%
15/09/10 15:12:42 INFO mapreduce.Job: map 92% reduce 0%
15/09/10 15:12:43 INFO mapreduce.Job: map 100% reduce 0%
15/09/10 15:13:06 INFO mapreduce.Job: map 100% reduce 69%
15/09/10 15:13:09 INFO mapreduce.Job: map 100% reduce 79%
15/09/10 15:13:12 INFO mapreduce.Job: map 100% reduce 81%
15/09/10 15:13:15 INFO mapreduce.Job: map 100% reduce 100%
15/09/10 15:13:15 INFO mapreduce.Job: Job job_1440156752555_0040 completed successfully
15/09/10 15:13:15 INFO mapreduce.Job: Counters: 50
File System Counters
FILE: Number of bytes read=33053586
FILE: Number of bytes written=66702385
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=19832870
HDFS: Number of bytes written=101
HDFS: Number of read operations=15
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Failed reduce tasks=1
Launched map tasks=4
Launched reduce tasks=2
Data-local map tasks=4
Total time spent by all maps in occupied slots (ms)=88718
Total time spent by all reduces in occupied slots (ms)=54844
Total time spent by all map tasks (ms)=44359
Total time spent by all reduce tasks (ms)=27422
Total vcore-seconds taken by all map tasks=44359
Total vcore-seconds taken by all reduce tasks=27422
Total megabyte-seconds taken by all map tasks=44359000
Total megabyte-seconds taken by all reduce tasks=27422000
Map-Reduce Framework
Map input records=367262
Map output records=3305358
Map output bytes=26442864
Map output materialized bytes=33053604
Input split bytes=380
Combine input records=0
Combine output records=0
Reduce input groups=8
Reduce shuffle bytes=33053604
Reduce input records=3305358
Reduce output records=8
Spilled Records=6610716
Shuffled Maps =4
Failed Shuffles=0
Merged Map outputs=4
GC time elapsed (ms)=305
CPU time spent (ms)=29160
Physical memory (bytes) snapshot=1400758272
Virtual memory (bytes) snapshot=7690555392
Total committed heap usage (bytes)=1355808768
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=19832490
File Output Format Counters
Bytes Written=101
15/09/10 15:13:15 INFO streaming.StreamJob: Output directory: /test_file/file/file_output
Check that the results were written and stored in the HDFS output directory:
[root@dev-slave1 file]# hadoop fs -ls /test_file/file
Found 4 items
-rw-r--r-- 1 root supergroup 536004 2015-09-10 11:35 /test_file/file/1.txt
-rw-r--r-- 1 root supergroup 6432048 2015-09-10 11:35 /test_file/file/2.txt
-rw-r--r-- 1 root supergroup 12864096 2015-09-10 11:35 /test_file/file/3.txt
drwxr-xr-x - root supergroup 0 2015-09-10 15:13 /test_file/file/file_output
[root@dev-slave1 file]# hadoop fs -ls /test_file/file/file_output
Found 2 items
-rw-r--r-- 1 root supergroup 0 2015-09-10 15:13 /test_file/file/file_output/_SUCCESS
-rw-r--r-- 1 root supergroup 101 2015-09-10 15:13 /test_file/file/file_output/part-00000
[root@dev-slave1 file]# hadoop fs -cat /test_file/file/file_output/part-00000
a 367262
come 367262
from 367262
good 367262
is 367262
shandong 734524
space 367262
zhangjian 367262
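These numbers are easy to verify: the three input files contain 9926 + 119112 + 238224 = 367262 copies of the test sentence in total, which matches the Map input records counter above. Every word occurs once per line except shandong, which occurs twice, so shandong ends up with 2 × 367262 = 734524. Each line also yields nine mapper output pairs, which accounts for the Map output records value of 9 × 367262 = 3305358.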
Download the results from HDFS to the local file system:
hadoop fs -copyToLocal /test_file/file/file_output/part-00000 /root/zhangjian//test_file/file/file_output