Scenario: the files, scripts, or configuration files a program needs are not on the Hadoop cluster, so they must first be distributed to the cluster before the computation can run.
Hadoop can distribute files and archives automatically; you only need to add the corresponding option (-file) when launching the Hadoop Streaming job.

When running a streaming job, use the -file option to specify the local files that need to be distributed.
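In its general form, the files to ship are simply appended to the streaming command, one -file option each; every listed file is copied into the working directory of each map and reduce task, so the scripts can open it by its bare name. A minimal sketch (the jar path and the angle-bracket names are placeholders):

$ hadoop jar hadoop-streaming-*.jar \
        -input <hdfs_input_path> \
        -output <hdfs_output_path> \
        -mapper <map_command> \
        -reducer <reduce_command> \
        -file <local_file_1> \
        -file <local_file_2>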

1. Local file distribution (-file)

1.1 Requirement: wordcount (count only the specified words: the, and, had)

Approach: the earlier wordcount counted every word in the text. Building on that, modify the program by adding a whitelist-style file, wordwhite, listing the only words to be counted. In the mapper, a word read from the input is emitted (and therefore passed on to the reducer) only if it appears in wordwhite; the reducer needs no changes.

1.2 Programs and files

  • wordwhite (the only words to count)
$ vim wordwhite 
the
and
had
  • mapper program
$ vim mapper.py 

#!/usr/bin/env python
"""Mapper: emit <word, 1> only for words that appear in the whitelist file."""

import sys

def read_wordwhite(path):
    # Load the whitelist (one word per line) into a set for fast membership tests.
    word_set = set()
    with open(path, 'r') as fd:
        for line in fd:
            word = line.strip()
            if word:
                word_set.add(word)
    return word_set

def mapper(wordwhite_path):
    word_set = read_wordwhite(wordwhite_path)

    for line in sys.stdin:
        # Emit "word<TAB>1" for every whitelisted word on the line.
        for word in line.strip().split():
            if word in word_set:
                print("%s\t%s" % (word, 1))

if __name__ == "__main__":
    # The whitelist file name is passed as the first command-line argument.
    if len(sys.argv) > 1:
        mapper(sys.argv[1])
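A quick local smoke test of the mapper (a sketch; run it from the directory that holds wordwhite):

$ echo "the cat and the dog had a nap" | python mapper.py wordwhite

This should print one "word<TAB>1" line per whitelisted occurrence: the, and, the, had.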
  • reducer program
$ vim reducer.py

#!/usr/bin/env python
"""Reducer: sum the counts for each word; input arrives sorted by key."""

import sys

def reducer():
    current_word = None
    word_sum = 0

    for line in sys.stdin:
        word_list = line.strip().split('\t')
        if len(word_list) < 2:
            continue
        word = word_list[0].strip()
        word_value = word_list[1].strip()

        # A new key means the previous word's count is complete: emit it.
        if current_word is None:
            current_word = word
        if current_word != word:
            print("%s\t%s" % (current_word, word_sum))
            current_word = word
            word_sum = 0
        word_sum += int(word_value)

    # Flush the last word, provided any input was seen at all.
    if current_word is not None:
        print("%s\t%s" % (current_word, word_sum))

if __name__ == "__main__":
    reducer()
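Before touching the cluster, the whole pipeline can be simulated locally, using sort to stand in for the shuffle (a sketch; assumes the input text and wordwhite are in the current directory):

$ cat The_Man_of_Property | python mapper.py wordwhite | sort -k1,1 | python reducer.py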
  • run_streaming script
$ vim runstreaming.sh

#!/bin/bash

HADOOP_CMD="/home/hadoop/app/hadoop/hadoop-2.6.0-cdh5.13.0/bin/hadoop"
STREAM_JAR_PATH="/home/hadoop/app/hadoop/hadoop-2.6.0-cdh5.13.0/share/hadoop/tools/lib/hadoop-streaming-2.6.0-cdh5.13.0.jar"

INPUT_FILE_PATH="/input/The_Man_of_Property"
OUTPUT_FILE_PATH="/output/wordcount/wordwhitetest"

# Submit the streaming job; -file ships mapper.py, reducer.py and wordwhite to every task
$HADOOP_CMD jar $STREAM_JAR_PATH \
                -input $INPUT_FILE_PATH \
                -output $OUTPUT_FILE_PATH \
                -mapper "python mapper.py wordwhite" \
                -reducer "python reducer.py" \
                -file ./mapper.py \
                -file ./reducer.py \
                -file ./wordwhite
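As the job log below notes, -file is deprecated in favor of the generic -files option. An equivalent sketch (generic options such as -files must come before the streaming-specific ones):

$HADOOP_CMD jar $STREAM_JAR_PATH \
                -files ./mapper.py,./reducer.py,./wordwhite \
                -input $INPUT_FILE_PATH \
                -output $OUTPUT_FILE_PATH \
                -mapper "python mapper.py wordwhite" \
                -reducer "python reducer.py"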
  • Run the job
    First, upload the test file The_Man_of_Property to HDFS and create the wordcount output directory:
$ hadoop fs -put ./The_Man_of_Property /input/
$ hadoop fs -mkdir /output/wordcount
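Optionally confirm the upload:

$ hadoop fs -ls /input/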

Note: this Hadoop environment is pseudo-distributed, running Hadoop 2.6.

$ ./runstreaming.sh 
18/01/26 13:30:27 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [./mapper.py, ./reducer.py, ./wordwhite, /tmp/hadoop-unjar7204532228900236640/] [] /tmp/streamjob7580948745512643345.jar tmpDir=null
18/01/26 13:30:29 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/01/26 13:30:29 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/01/26 13:30:31 INFO mapred.FileInputFormat: Total input paths to process : 1
18/01/26 13:30:31 INFO mapreduce.JobSubmitter: number of splits:2
18/01/26 13:30:32 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1516345010544_0008
18/01/26 13:30:32 INFO impl.YarnClientImpl: Submitted application application_1516345010544_0008
18/01/26 13:30:32 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1516345010544_0008/
18/01/26 13:30:32 INFO mapreduce.Job: Running job: job_1516345010544_0008
18/01/26 13:30:40 INFO mapreduce.Job: Job job_1516345010544_0008 running in uber mode : false
18/01/26 13:30:40 INFO mapreduce.Job:  map 0% reduce 0%
18/01/26 13:30:50 INFO mapreduce.Job:  map 50% reduce 0%
18/01/26 13:30:51 INFO mapreduce.Job:  map 100% reduce 0%
18/01/26 13:30:58 INFO mapreduce.Job:  map 100% reduce 100%
18/01/26 13:30:59 INFO mapreduce.Job: Job job_1516345010544_0008 completed successfully
18/01/26 13:30:59 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=73950
        FILE: Number of bytes written=582815
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=636501
        HDFS: Number of bytes written=27
        HDFS: Number of read operations=9
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters 
        Launched map tasks=2
        Launched reduce tasks=1
        Data-local map tasks=2
        Total time spent by all maps in occupied slots (ms)=12815
        Total time spent by all reduces in occupied slots (ms)=5251
        Total time spent by all map tasks (ms)=12815
        Total time spent by all reduce tasks (ms)=5251
        Total vcore-milliseconds taken by all map tasks=12815
        Total vcore-milliseconds taken by all reduce tasks=5251
        Total megabyte-milliseconds taken by all map tasks=13122560
        Total megabyte-milliseconds taken by all reduce tasks=5377024
    Map-Reduce Framework
        Map input records=2866
        Map output records=9243
        Map output bytes=55458
        Map output materialized bytes=73956
        Input split bytes=198
        Combine input records=0
        Combine output records=0
        Reduce input groups=3
        Reduce shuffle bytes=73956
        Reduce input records=9243
        Reduce output records=3
        Spilled Records=18486
        Shuffled Maps =2
        Failed Shuffles=0
        Merged Map outputs=2
        GC time elapsed (ms)=332
        CPU time spent (ms)=3700
        Physical memory (bytes) snapshot=707719168
        Virtual memory (bytes) snapshot=8333037568
        Total committed heap usage (bytes)=598736896
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters 
        Bytes Read=636303
    File Output Format Counters 
        Bytes Written=27
18/01/26 13:30:59 INFO streaming.StreamJob: Output directory: /output/wordcount/wordwhitetest
  • Check the results
$ hadoop fs -ls /output/wordcount/wordwhitetest/
Found 2 items
-rw-r--r--   1 centos supergroup          0 2018-01-26 13:30 /output/wordcount/wordwhitetest/_SUCCESS
-rw-r--r--   1 centos supergroup         27 2018-01-26 13:30 /output/wordcount/wordwhitetest/part-00000
$ hadoop fs -text /output/wordcount/wordwhitetest/part-00000
and 2573
had 1526
the 5144
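As a rough local cross-check, counting whitespace-separated tokens exactly equal to a whitelisted word (which mirrors the mapper's split()) should give matching numbers:

$ tr -s ' \t' '\n' < The_Man_of_Property | grep -xc 'the'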

This completes the wordcount restricted to the specified words.

2. Hadoop Streaming syntax reference
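For quick reference, the options used in this walkthrough are:

  • -input: HDFS path of the job input
  • -output: HDFS path for the job output (must not already exist)
  • -mapper / -reducer: the command each map/reduce task runs
  • -file: a local file to distribute to every task's working directory (deprecated in favor of the generic -files option, as shown in the job log above)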


This article is reposted from 巴利奇's blog on 51CTO. Original link: http://blog.51cto.com/balich/2065424