WordCount is the classic introductory example for learning Hadoop. The steps below compile, package, and run the WordCount program.
1. The WordCount.java source file is at the following location under the directory where Hadoop 1.0.4 was extracted:
src/examples/org/apache/hadoop/examples/WordCount.java
2. Create a dev folder and copy WordCount.java into dev/wordcount

ubuntu@ubuntu:~/dev/wordcount$ pwd
/home/ubuntu/dev/wordcount
ubuntu@ubuntu:~/dev/wordcount$ ls
bin  compile.txt  WordCount.java

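3. Create a bin folder under dev/wordcount and compile WordCount.java so that the generated class files go into bin. The later steps refer to the class simply as WordCount, which assumes the "package org.apache.hadoop.examples;" line has been removed from the copied file. With that line in place, javac would write the class files to bin/org/apache/hadoop/examples instead.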

javac -classpath /home/ubuntu/hadoop-1.0.4/hadoop-core-1.0.4.jar:/home/ubuntu/hadoop-1.0.4/lib/commons-cli-1.2.jar -d bin WordCount.java

4. From the bin folder, package the generated class files into a jar

jar -cvf WordCount.jar *.class

5. Create an input folder under bin and create two input files in it

ubuntu@ubuntu:~/dev/wordcount/bin/input$ ls
words-1.txt  words-2.txt
ubuntu@ubuntu:~/dev/wordcount/bin/input$ cat words-1.txt
i am a student!
how are you?
my name is lily.
ubuntu@ubuntu:~/dev/wordcount/bin/input$ cat words-2.txt
i am a student!
how are you?
she is lily
he is my brother
ubuntu@ubuntu:~/dev/wordcount/bin/input$

6. Create input and output folders on HDFS and upload the two input files to the input folder

ubuntu@ubuntu:~/dev/wordcount/bin$ hadoop fs -mkdir /tmp/input
ubuntu@ubuntu:~/dev/wordcount/bin$ hadoop fs -mkdir /tmp/output
ubuntu@ubuntu:~/dev/wordcount/bin/input$ hadoop fs -put words-1.txt /tmp/input
ubuntu@ubuntu:~/dev/wordcount/bin/input$ hadoop fs -put words-2.txt /tmp/input

7. Run the WordCount program. The output path /tmp/output/result must not exist beforehand; the job creates it.

ubuntu@ubuntu:~/dev/wordcount/bin$ hadoop jar WordCount.jar WordCount /tmp/input /tmp/output/result
13/01/24 08:09:37 INFO input.FileInputFormat: Total input paths to process : 2
13/01/24 08:09:38 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/01/24 08:09:38 WARN snappy.LoadSnappy: Snappy native library not loaded
13/01/24 08:09:38 INFO mapred.JobClient: Running job: job_201301240711_0003
13/01/24 08:09:39 INFO mapred.JobClient:  map 0% reduce 0%
13/01/24 08:10:13 INFO mapred.JobClient:  map 100% reduce 0%
13/01/24 08:10:34 INFO mapred.JobClient:  map 100% reduce 100%
13/01/24 08:10:39 INFO mapred.JobClient: Job complete: job_201301240711_0003
13/01/24 08:10:39 INFO mapred.JobClient: Counters: 29
13/01/24 08:10:39 INFO mapred.JobClient:   Job Counters
13/01/24 08:10:39 INFO mapred.JobClient:     Launched reduce tasks=1
13/01/24 08:10:39 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=56253
13/01/24 08:10:39 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/01/24 08:10:39 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/01/24 08:10:39 INFO mapred.JobClient:     Launched map tasks=2
13/01/24 08:10:39 INFO mapred.JobClient:     Data-local map tasks=2
13/01/24 08:10:39 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=18108
13/01/24 08:10:39 INFO mapred.JobClient:   File Output Format Counters
13/01/24 08:10:39 INFO mapred.JobClient:     Bytes Written=96
13/01/24 08:10:39 INFO mapred.JobClient:   FileSystemCounters
13/01/24 08:10:39 INFO mapred.JobClient:     FILE_BYTES_READ=251
13/01/24 08:10:39 INFO mapred.JobClient:     HDFS_BYTES_READ=320
13/01/24 08:10:39 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=65235
13/01/24 08:10:39 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=96
13/01/24 08:10:39 INFO mapred.JobClient:   File Input Format Counters
13/01/24 08:10:39 INFO mapred.JobClient:     Bytes Read=104
13/01/24 08:10:39 INFO mapred.JobClient:   Map-Reduce Framework
13/01/24 08:10:39 INFO mapred.JobClient:     Map output materialized bytes=257
13/01/24 08:10:39 INFO mapred.JobClient:     Map input records=7
13/01/24 08:10:39 INFO mapred.JobClient:     Reduce shuffle bytes=257
13/01/24 08:10:39 INFO mapred.JobClient:     Spilled Records=48
13/01/24 08:10:39 INFO mapred.JobClient:     Map output bytes=204
13/01/24 08:10:39 INFO mapred.JobClient:     CPU time spent (ms)=7650
13/01/24 08:10:39 INFO mapred.JobClient:     Total committed heap usage (bytes)=247275520
13/01/24 08:10:39 INFO mapred.JobClient:     Combine input records=25
13/01/24 08:10:39 INFO mapred.JobClient:     SPLIT_RAW_BYTES=216
13/01/24 08:10:39 INFO mapred.JobClient:     Reduce input records=24
13/01/24 08:10:39 INFO mapred.JobClient:     Reduce input groups=15
13/01/24 08:10:39 INFO mapred.JobClient:     Combine output records=24
13/01/24 08:10:39 INFO mapred.JobClient:     Physical memory (bytes) snapshot=301699072
13/01/24 08:10:39 INFO mapred.JobClient:     Reduce output records=15
13/01/24 08:10:39 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1129721856
13/01/24 08:10:39 INFO mapred.JobClient:     Map output records=25
ubuntu@ubuntu:~/dev/wordcount/bin$
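
Several of the counters in the log can be checked by hand against the two input files. The following is a minimal sketch in plain Java, with no Hadoop dependency. The class name CounterCheck and its methods are hypothetical helpers, not part of Hadoop. It assumes one map task per input file (consistent with "Launched map tasks=2"), a combiner that runs once per map task, and a trailing newline at the end of each file.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

// Hypothetical helper, not part of Hadoop: recomputes a few job counters by hand.
class CounterCheck {
    // Contents of words-1.txt and words-2.txt as shown in step 5.
    static final String[][] FILES = {
        {"i am a student!", "how are you?", "my name is lily."},
        {"i am a student!", "how are you?", "she is lily", "he is my brother"}
    };

    // Returns {mapInputRecords, mapOutputRecords, combineOutputRecords,
    //          reduceOutputRecords, bytesRead}.
    static long[] counters() {
        long lines = 0, tokens = 0, combined = 0, bytes = 0;
        Set<String> allKeys = new HashSet<>();
        for (String[] file : FILES) {
            // Assumption: one map task per file, combiner applied once per task.
            Set<String> keysInFile = new HashSet<>();
            for (String line : file) {
                lines++;                                                   // one map input record per line
                bytes += line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for the newline
                StringTokenizer it = new StringTokenizer(line);            // splits on whitespace only
                while (it.hasMoreTokens()) {
                    String word = it.nextToken();
                    tokens++;                                              // one map output record per token
                    keysInFile.add(word);
                    allKeys.add(word);
                }
            }
            combined += keysInFile.size(); // combiner emits one record per distinct key
        }
        return new long[] {lines, tokens, combined, allKeys.size(), bytes};
    }

    public static void main(String[] args) {
        long[] c = counters();
        System.out.println("Map input records=" + c[0]);
        System.out.println("Map output records=" + c[1]);
        System.out.println("Combine output records=" + c[2]);
        System.out.println("Reduce output records=" + c[3]);
        System.out.println("Bytes Read=" + c[4]);
    }
}
```

Under these assumptions the sketch yields 7, 25, 24, 15 and 104, the same values as in the log. The combiner drops only one record because "is" appears twice in words-2.txt, and every other word is unique within its file.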

8. View the results

ubuntu@ubuntu:~$ hadoop fs -ls /tmp/output/result
Found 3 items
-rw-r--r--   1 ubuntu supergroup          0 2013-01-24 08:10 /tmp/output/result/_SUCCESS
drwxr-xr-x   - ubuntu supergroup          0 2013-01-24 08:09 /tmp/output/result/_logs
-rw-r--r--   1 ubuntu supergroup         96 2013-01-24 08:10 /tmp/output/result/part-r-00000
ubuntu@ubuntu:~$ hadoop fs -cat /tmp/output/result/part-r-00000
a    2
am    2
are    2
brother    1
he    1
how    2
i    2
is    3
lily    1
lily.    1
my    2
name    1
she    1
student!    2
you?    2
ubuntu@ubuntu:~$
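
The result lists "lily" and "lily." as separate words, and keeps punctuation in "student!" and "you?", because the example's mapper splits each line with java.util.StringTokenizer, which by default breaks only on whitespace. The following is a minimal local sketch of the same counting logic in plain Java, without Hadoop. The class name LocalWordCount and its methods are hypothetical, and a TreeMap stands in for the sorted key order the reducer sees, which is an equivalent ordering only for ASCII keys like these.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical local stand-in for the job; it does not use any Hadoop API.
class LocalWordCount {
    // Counts whitespace-separated tokens; TreeMap keeps the keys sorted.
    static Map<String, Integer> count(String... lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer it = new StringTokenizer(line); // whitespace delimiters only
            while (it.hasMoreTokens()) {
                counts.merge(it.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Formats one "word<TAB>count" line per key, like the part-r-00000 file.
    static String format(Map<String, Integer> counts) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            sb.append(e.getKey()).append('\t').append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Lines of words-1.txt followed by the lines of words-2.txt.
        Map<String, Integer> counts = count(
            "i am a student!", "how are you?", "my name is lily.",
            "i am a student!", "how are you?", "she is lily", "he is my brother");
        System.out.print(format(counts));
    }
}
```

Run locally, this produces the same 15 lines as part-r-00000, and the formatted text is 96 bytes long, which agrees with the "Bytes Written=96" counter. Stripping punctuation or lowercasing would have to be added to the mapper if "lily" and "lily." should count as one word.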