MapReduce: Another Way to Write WordCount


zhongqi2513 2023-04-03 14:34:00


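© Copyright belongs to the author. This is an original work by 51CTO blog author zhongqi2513. Please contact the author for permission to republish; unauthorized use may be pursued legally.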

The default way to write and invoke MapReduce

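As we all know, Hadoop was one of the earliest open-source frameworks built for big data processing. After years of iteration, today's Hadoop has grown into a foundational platform with four core components:

1. HDFS: Hadoop's distributed file system

2. MapReduce: a programming framework for distributed computation

3. YARN: the resource scheduling system, in effect a distributed operating system

4. Common: the underlying support for the three components above, mainly providing basic utility packages and the RPC communication framework

Of these, MapReduce was the earliest tool for computing over massive data sets. It is a programming model whose computation is distributed and split into two stages. The first is the Mapper stage, which splits the job into tasks and runs them concurrently. The second is the Reducer stage, which aggregates the results of all first-stage tasks into the final output.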
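The two stages can be pictured with a small in-memory sketch that uses no Hadoop classes at all. It is only an illustration of the model, not how Hadoop executes a job: the class name LocalWordCountSketch and its methods are made up for this example, and the "shuffle" step that groups values by key is something the real framework does between the two stages.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process illustration of the map -> shuffle -> reduce flow.
class LocalWordCountSketch {

    /** Map stage: emit a (word, 1) pair for every non-empty token in one line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split(" ")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    /** Shuffle: group all emitted values by key (sorted by key). */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce stage: sum all values collected for one key. */
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    /** Runs the three steps in sequence over the given input lines. */
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line)); // in Hadoop, each input split is mapped by its own task
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffle(pairs).forEach((key, values) -> result.put(key, reduce(key, values)));
        return result;
    }

    public static void main(String[] args) {
        // prints {hadoop=1, hello=2, world=1}
        System.out.println(wordCount(List.of("hello world", "hello hadoop")));
    }
}
```

In a real job the map calls run in parallel on different nodes and the grouped data travels over the network, but the shape of the computation is the same.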

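Earlier, I introduced how to write WordCount, the introductory MapReduce program:

MapReduce--1--Introductory Program WordCount

That is the most ordinary way to run it.

I have also put together a tutorial series on MapReduce programming:

MapReduce Programming Case Study Series

If you are interested, keep following; it will be updated continuously.

Recently, however, I looked into how the shell commands of an HDFS cluster work. Looking at the underlying implementation of the hadoop fs command shows that running it simply executes the class org.apache.hadoop.fs.FsShell. So let's look at how this class is implemented:

Only the core code is shown: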

public class FsShell extends Configured implements Tool {

	public static void main(String argv[]) throws Exception {
		FsShell shell = newShellInstance();
		Configuration conf = new Configuration();
		conf.setQuietMode(false);
		shell.setConf(conf);
		int res;
		try {
			res = ToolRunner.run(shell, argv);
		} finally {
			shell.close();
		}
		System.exit(res);
	}

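	/**
	 * run (abridged: argument checks and error handling are omitted)
	 */
	@Override
	public int run(String argv[]) throws Exception {
		int exitCode = -1;
		// The first argument is the sub-command name, e.g. "-ls"
		String cmd = argv[0];

		/**
		* Look up the command instance through the command factory
		*/
		Command instance = commandFactory.getInstance(cmd);

		/**
		* Call the command instance's run method to do the actual work,
		* e.g. for: hadoop fs -ls /
		*/
		exitCode = instance.run(Arrays.copyOfRange(argv, 1, argv.length));
		return exitCode;
	}

	// Other non-essential code omitted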
}
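The lookup-then-run step inside FsShell.run can be sketched in miniature as follows. This is a hypothetical stand-in, not Hadoop's real Command or CommandFactory classes: the names MiniShellSketch, the "-echo" and "-false" commands, and the supplier-based registry are all assumptions made for the illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical miniature of the dispatch step; not Hadoop's actual classes.
class MiniShellSketch {

    /** A sub-command: receives the arguments after its own name and returns an exit code. */
    interface Command {
        int run(String[] args);
    }

    /** Maps command names to suppliers that create a fresh command instance per call. */
    static class CommandFactory {
        private final Map<String, Supplier<Command>> registry = new HashMap<>();

        void register(String name, Supplier<Command> supplier) {
            registry.put(name, supplier);
        }

        /** Returns a new instance for the name, or null if the name is unknown. */
        Command getInstance(String name) {
            Supplier<Command> supplier = registry.get(name);
            return supplier == null ? null : supplier.get();
        }
    }

    private static final CommandFactory FACTORY = new CommandFactory();

    static {
        // Two toy commands: one prints its arguments, one always fails.
        FACTORY.register("-echo", () -> args -> {
            System.out.println(String.join(" ", args));
            return 0;
        });
        FACTORY.register("-false", () -> args -> 1);
    }

    /** Mirrors the shape of the excerpt: pick the command by argv[0], pass it the rest. */
    static int run(String[] argv) {
        if (argv.length < 1) {
            System.err.println("usage: <command> [args...]");
            return -1;
        }
        Command instance = FACTORY.getInstance(argv[0]);
        if (instance == null) {
            System.err.println("unknown command: " + argv[0]);
            return -1;
        }
        return instance.run(Arrays.copyOfRange(argv, 1, argv.length));
    }

    public static void main(String[] args) {
        int exitCode = run(new String[] {"-echo", "hello", "fs"}); // prints "hello fs"
        System.out.println("exit code: " + exitCode);
    }
}
```

Registering commands by name keeps run itself small: adding a new sub-command only means adding one more entry to the factory.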

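The entry point of this class is the main method, and the core line in main is:

res = ToolRunner.run(shell, argv);

Now let's look at the implementation of this ToolRunner.run method: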

public static int run(Tool tool, String[] args) throws Exception {
	return run(tool.getConf(), tool, args);
}

Next, look at the implementation of the overloaded run method it calls:

public static int run(Configuration conf, Tool tool, String[] args) throws Exception {
	if (conf == null) {
		conf = new Configuration();
	}
	GenericOptionsParser parser = new GenericOptionsParser(conf, args);
	//set the configuration back, so that Tool can configure itself
	tool.setConf(conf);

	//get the args w/o generic hadoop args
	String[] toolArgs = parser.getRemainingArgs();
	return tool.run(toolArgs);
}

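The call chain above shows that ToolRunner.run(shell, argv) ends up calling shell.run(toolArgs), where toolArgs is argv with the generic Hadoop options already stripped out by GenericOptionsParser.

Put plainly, a utility class is used to invoke the current FsShell: execution enters through main, takes a detour through ToolRunner, and comes back to the run method of the FsShell instance.

So we can rewrite our first MapReduce program, WordCount, along the same lines:

Write a class WordCountMRToolRunner extends Configured implements Tool, then move the MapReduce driver code that used to live in main into the run method.

The main method still serves as the program entry point.

Here is the concrete implementation: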
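The round trip described above can be reproduced in a few lines without Hadoop on the classpath. The sketch below is an assumption-laden miniature: the Tool interface, the Map used as a stand-in for Configuration, and the RecordingTool class are invented for the example, and the parsing handles only the "-D key=value" form, whereas the real GenericOptionsParser supports more options.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical miniature of the ToolRunner round trip; not Hadoop's actual classes.
class MiniToolRunnerSketch {

    /** Stand-in for Hadoop's Tool; a plain Map plays the role of Configuration. */
    interface Tool {
        void setConf(Map<String, String> conf);
        Map<String, String> getConf();
        int run(String[] args);
    }

    /** Moves "-D key=value" pairs into the configuration, then delegates to tool.run(). */
    static int run(Tool tool, String[] args) {
        Map<String, String> conf = tool.getConf();
        if (conf == null) {
            conf = new HashMap<>();
        }
        List<String> remaining = new ArrayList<>();
        for (int i = 0; i < args.length; i++) {
            boolean isOption = args[i].equals("-D")
                    && i + 1 < args.length
                    && args[i + 1].contains("=");
            if (isOption) {
                String[] keyValue = args[++i].split("=", 2);
                conf.put(keyValue[0], keyValue[1]);
            } else {
                remaining.add(args[i]);
            }
        }
        tool.setConf(conf); // hand the populated configuration back to the tool
        return tool.run(remaining.toArray(new String[0]));
    }

    /** A tool that simply records which arguments it was given. */
    static class RecordingTool implements Tool {
        private Map<String, String> conf;
        List<String> seenArgs = new ArrayList<>();

        @Override
        public void setConf(Map<String, String> conf) {
            this.conf = conf;
        }

        @Override
        public Map<String, String> getConf() {
            return conf;
        }

        @Override
        public int run(String[] args) {
            seenArgs = List.of(args);
            return 0;
        }
    }

    public static void main(String[] args) {
        RecordingTool tool = new RecordingTool();
        int exitCode = run(tool, new String[] {"-D", "reduces=2", "input", "output"});
        System.out.println(exitCode);       // prints 0
        System.out.println(tool.getConf()); // prints {reduces=2}
        System.out.println(tool.seenArgs);  // prints [input, output]
    }
}
```

The detour is what makes the pattern useful: by the time the tool's own run method executes, the generic options have already been folded into its configuration and only the tool-specific arguments remain.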

package com.ghgj.mazh.mapreduce.wc.demo4;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * Author: 马中华
 * Date: 2017-10-28 14:48:06
 * 
 * Description:
 *   This way of writing WordCount is modeled on the hadoop fs shell command.
 */
public class WordCountMRToolRunner extends Configured implements Tool {

	/**
	 * Date: 2017-10-28 15:13:42
	 * 
	 * Program entry point.
	 */
	public static void main(String[] args) throws Exception {

		int run = ToolRunner.run(new WordCountMRToolRunner(), args);
		System.exit(run);
	}

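	/**
	 * Looking at the core line in main and at the implementation of ToolRunner.run shows that
	 * the run method of its first argument is eventually called. That first argument is an
	 * instance of this class, so main ends up executing this run method.
	 */
	@Override
	public int run(String[] args) throws Exception {
		// Use the Configuration that ToolRunner populated and injected via setConf(),
		// so that generic options such as -D key=value take effect.
		Configuration conf = getConf();
		Job job = Job.getInstance(conf);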

		job.setJarByClass(WordCountMRToolRunner.class);

		job.setMapperClass(WordCountMapper.class);
		job.setReducerClass(WordCountReducer.class);

		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);

		FileInputFormat.setInputPaths(job, new Path("d:\\bigdata\\wordcount\\input"));
		Path outputPath = new Path("d:\\bigdata\\wordcount\\output");
		FileSystem fs = FileSystem.get(conf);
		if (fs.exists(outputPath)) {
			fs.delete(outputPath, true);
		}
		FileOutputFormat.setOutputPath(job, outputPath);

		boolean waitForCompletion = job.waitForCompletion(true);
		return waitForCompletion ? 0 : 1;
	}

	/**
	 * Author: 马中华
	 * Date: 2017-10-28 14:47:19
	 * 
	 * Description:
	 *   The Mapper component: emits (word, 1) for each space-separated token.
	 */
	public static class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

		@Override
		protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

			String[] splits = value.toString().split(" ");
			for (int i = 0; i < splits.length; i++) {
				context.write(new Text(splits[i]), new IntWritable(1));
			}
		}
	}

	/**
	 * Author: 马中华
	 * Date: 2017-10-28 14:47:38
	 * 
	 * Description:
	 *   The Reducer component: sums the counts for each word.
	 */
	public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

		@Override
		protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {

			int result = 0;
			for (IntWritable iw : values) {
				result += iw.get();
			}
			context.write(key, new IntWritable(result));
		}
	}
}

I won't paste the run output here. In the end, the code does execute successfully.

Hope this helps everyone. ヾ(◍°∇°◍)ﾉﾞ