Prerequisites:

A Hadoop environment was set up on 192.168.129.35. That was finished this morning, so the details are omitted here.

For details, refer to the attached email: <technical> canton hadoop environment in 192.168.129.35

 

Step 1

Download and unzip Hadoop onto the local machine (Eclipse needs some of the JARs inside this Hadoop distribution as its runtime).

The distribution can be downloaded from the official site, http://hadoop.apache.org/ . I downloaded version 0.20.2 and unzipped it to D:\hadoop-0.20.2.

 

Step 2

Download the Hadoop Eclipse plugin and put it into the Eclipse dropins directory (or the plugins directory).

The plugin can be found under \\192.168.0.238\Canton\Software\Eclipse_Plugins\Hadoop_Eclipse

 

 

Step 3:

Restart Eclipse (or STS; I actually use the Spring Tool Suite, which behaves the same way).

In Window->Preferences, set the local Hadoop runtime so that it points to the directory you unzipped in Step 1.
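In case the screenshot does not come through: the setting lives on the plugin's own preference page. On my understanding of the 0.20.x plugin it looks roughly like this (the page name is from the plugin; the path is the one from Step 1):

```
Window -> Preferences -> Hadoop Map/Reduce
    Hadoop installation directory:  D:\hadoop-0.20.2
```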


 

Step 4:

Open the MapReduce view (Window->Show View->MapReduce Tools->Map/Reduce Locations)

and edit it. (This step is quite tricky; I got it wrong N times before everything was set correctly. The setup examples on the web all run against a Hadoop on the same machine, in which case the local account and the Hadoop account are identical. Here, by contrast, we are effectively connecting to the remote Hadoop server on 192.168.129.35, so the accounts naturally differ: my development machine uses charles.wang, while the remote 192.168.129.35 uses root.)

 

Configure the General panel as follows:
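The screenshot carries the actual values. For reference, in the 0.20.x plugin the General tab asks for a location name, the Map/Reduce Master and DFS Master host/port, and a user name. From the job log later in this note the NameNode is 192.168.129.35:9000; the JobTracker port below is only my assumption (9001 is the usual 0.20.x default):

```
Location name:      canton-hadoop          (any label; hypothetical)
Map/Reduce Master:  host 192.168.129.35, port 9001   (assumed)
DFS Master:         host 192.168.129.35, port 9000   (matches hdfs://192.168.129.35:9000 in the Step 8 log)
User name:          root
```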


 

In the Advanced parameters panel, keep the defaults except for the following, which need to change:

· dfs.data.dir set to /home/dcui/hadoop-0.20.2/tmp/dfs/data
· dfs.name.dir set to /home/dcui/hadoop-0.20.2/tmp/dfs/name
· dfs.name.edits.dir set to /home/dcui/hadoop-0.20.2/tmp/dfs/name
· dfs.replication set to 1
· hadoop.tmp.dir set to /home/dcui/hadoop-0.20.2/tmp
· hadoop.job.ugi set to root,Domain,Users,Remote,Desktop,Users,Users
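For orientation, the dfs.* and hadoop.tmp.dir parameters above are the same ones that would normally live in the server-side conf/*-site.xml files of Hadoop 0.20.x (hadoop.job.ugi is a client-side setting and has no server-side counterpart). A sketch, with the values copied from the list above and the file split being my assumption:

```xml
<!-- hdfs-site.xml (sketch; values from the list above) -->
<configuration>
  <property><name>dfs.data.dir</name><value>/home/dcui/hadoop-0.20.2/tmp/dfs/data</value></property>
  <property><name>dfs.name.dir</name><value>/home/dcui/hadoop-0.20.2/tmp/dfs/name</value></property>
  <property><name>dfs.name.edits.dir</name><value>/home/dcui/hadoop-0.20.2/tmp/dfs/name</value></property>
  <property><name>dfs.replication</name><value>1</value></property>
</configuration>

<!-- core-site.xml (sketch) -->
<configuration>
  <property><name>hadoop.tmp.dir</name><value>/home/dcui/hadoop-0.20.2/tmp</value></property>
</configuration>
```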

 

 

Step 5

In the Project Explorer you can now see the following. The second entry, /user/root/inputDir, is the Hadoop distributed file system directory we created this morning.


 

 

 

Step 6

Now for the HelloWorld program: create a project with the MapReduce Project wizard.

 


 

The complete project source code is in the attachment HadoopWordCountDemo.zip.

Honestly, I barely understand any of this; I just followed the API example and put together a toy program. It counts the occurrences of the keywords in our canton_codetemplate.xml.
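I will not reproduce the whole Hadoop job here (it is in the attached zip), but stripped of the Hadoop API, the map/reduce logic of such a WordCount boils down to tokenize-then-sum. A plain-Java sketch of that core logic (the class name and sample input are mine, not from the attachment):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

// A plain-Java sketch of what the WordCount job does, without Hadoop:
// the "map" phase tokenizes each line into words, the "reduce" phase
// sums the occurrence count per word.
public class WordCountSketch {

    // Equivalent of the mapper + reducer pipeline for one in-memory input.
    public static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        // StringTokenizer splits on whitespace, like the classic WordCount mapper.
        StringTokenizer itr = new StringTokenizer(text);
        while (itr.hasMoreTokens()) {
            // "reduce" step: sum the 1s emitted for each word.
            counts.merge(itr.nextToken(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical input; the real job reads canton_codetemplate.xml from HDFS.
        Map<String, Integer> counts = countWords("<name>foo</name> <name>bar</name>");
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            // key TAB value, the same shape TextOutputFormat writes.
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

The real mapper/reducer do the same thing, just expressed as Hadoop Mapper and Reducer classes over key/value pairs.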

 

 

Step 7:

Configure the run options as shown. The project takes two arguments: argument 1 is the location, in the Hadoop file system, of the file to be counted; argument 2 is the target output directory for the results. To make the run go smoothly I also set the VM arguments a bit larger.
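The screenshot holds the actual run configuration. Piecing together paths that appear elsewhere in this note (/user/root/inputDir from Step 5, canton_codetemplate.xml above, and the output path in the Step 8 log), the two program arguments were presumably something like:

```
hdfs://192.168.129.35:9000/user/root/inputDir/canton_codetemplate.xml
hdfs://192.168.129.35:9000/user/root/outputToThisFolder
```

Whether the hdfs:// prefix is required depends on the default file system configured; and for the VM arguments, something like -Xmx512m would be a typical way of "setting it a bit larger" (my guess, not the actual value).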

 

 

Step 8

The console output is shown below (finally! That took me an hour of debugging):

 

12/03/12 14:52:08 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=

12/03/12 14:52:08 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.

12/03/12 14:52:08 WARN mapred.JobClient: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).

12/03/12 14:52:13 INFO input.FileInputFormat: Total input paths to process : 1

12/03/12 14:52:13 INFO mapred.JobClient: Running job: job_local_0001

12/03/12 14:52:13 INFO input.FileInputFormat: Total input paths to process : 1

12/03/12 14:52:13 INFO mapred.MapTask: io.sort.mb = 100

12/03/12 14:52:13 INFO mapred.MapTask: data buffer = 79691776/99614720

12/03/12 14:52:13 INFO mapred.MapTask: record buffer = 262144/327680

12/03/12 14:52:13 INFO mapred.MapTask: Starting flush of map output

12/03/12 14:52:14 INFO mapred.JobClient:  map 0% reduce 0%

12/03/12 14:52:14 INFO mapred.MapTask: Finished spill 0

12/03/12 14:52:14 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting

12/03/12 14:52:14 INFO mapred.LocalJobRunner:

12/03/12 14:52:14 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.

12/03/12 14:52:14 INFO mapred.LocalJobRunner:

12/03/12 14:52:14 INFO mapred.Merger: Merging 1 sorted segments

12/03/12 14:52:14 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 3278 bytes

12/03/12 14:52:14 INFO mapred.LocalJobRunner:

12/03/12 14:52:14 INFO mapred.TaskRunner: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting

12/03/12 14:52:15 INFO mapred.LocalJobRunner:

12/03/12 14:52:15 INFO mapred.TaskRunner: Task attempt_local_0001_r_000000_0 is allowed to commit now

12/03/12 14:52:15 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to hdfs://192.168.129.35:9000/user/root/outputToThisFolder

12/03/12 14:52:15 INFO mapred.LocalJobRunner: reduce > reduce

12/03/12 14:52:15 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.

12/03/12 14:52:15 INFO mapred.JobClient:  map 100% reduce 100%

12/03/12 14:52:15 INFO mapred.JobClient: Job complete: job_local_0001

12/03/12 14:52:15 INFO mapred.JobClient: Counters: 14

12/03/12 14:52:15 INFO mapred.JobClient:   FileSystemCounters

12/03/12 14:52:15 INFO mapred.JobClient:     FILE_BYTES_READ=37570

12/03/12 14:52:15 INFO mapred.JobClient:     HDFS_BYTES_READ=7374

12/03/12 14:52:15 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=75880

12/03/12 14:52:15 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=2803

12/03/12 14:52:15 INFO mapred.JobClient:   Map-Reduce Framework

12/03/12 14:52:15 INFO mapred.JobClient:     Reduce input groups=119

12/03/12 14:52:15 INFO mapred.JobClient:     Combine output records=119

12/03/12 14:52:15 INFO mapred.JobClient:     Map input records=28

12/03/12 14:52:15 INFO mapred.JobClient:     Reduce shuffle bytes=0

12/03/12 14:52:15 INFO mapred.JobClient:     Reduce output records=119

12/03/12 14:52:15 INFO mapred.JobClient:     Spilled Records=238

12/03/12 14:52:15 INFO mapred.JobClient:     Map output bytes=4501

12/03/12 14:52:15 INFO mapred.JobClient:     Combine input records=209

12/03/12 14:52:15 INFO mapred.JobClient:     Map output records=209

12/03/12 14:52:15 INFO mapred.JobClient:     Reduce input records=119

 

Step 9:

Go to the Hadoop distributed file system to verify the output.

The command is: hadoop fs -ls /user/root/outputToThisFolder

We can see that the /user/root/outputToThisFolder directory in the Hadoop file system does indeed contain one file, part-r-00000.

 

Let's open it and look at its content.

The command is: hadoop fs -cat /user/root/outputToThisFolder/part-r-00000
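The screenshot shows the real content. For reference, part-r-00000 is written by TextOutputFormat, i.e. one key, a TAB, then its count per line; with purely hypothetical tokens it looks like:

```
</codetemplate>	1
keyword	12
```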


 

So the file does contain, keyword by keyword, the occurrence counts, as expected.