Five simple MapReduce example programs

An introduction to MapReduce

What is MapReduce?
It is a programming model for distributed computation.
A MapReduce program is divided into two phases: a map phase and a reduce phase.
The map phase is driven by a concrete program supplied by the framework, which the user does not have to write.
The reduce phase is likewise driven by a framework-supplied program that the user does not have to write.
The user only develops the data-processing logic methods that those two programs call:
The map-phase logic method: xxxMapper.map()
The reduce-phase logic method: xxxReducer.reduce()

The map phase

How does the framework's map program invoke the user-written map() method?
Each time the map program reads a line of data, it calls map() once, passing the line's starting byte offset as the key and the line's contents as the value, i.e. it calls map(key, value, context).
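As a rough illustration of that calling pattern (a simplified standalone sketch, not the actual Hadoop source; the class and method names below are made up for this note), the map side essentially does the following for every split it processes:

import java.util.Arrays;
import java.util.List;
import java.util.function.BiConsumer;

//Simplified sketch only: shows how (offset, line) pairs are fed to a map() method, one call per line.
public class MapCallSimulation {
	static void simulateMapCalls(List<String> lines, BiConsumer<Long, String> map) {
		long offset = 0;                              //byte offset of the current line
		for (String line : lines) {
			map.accept(offset, line);                 //one call to "map()" per line
			offset += line.getBytes().length + 1;     //+1 for the newline character
		}
	}
	public static void main(String[] args) {
		simulateMapCalls(Arrays.asList("hello world", "hello hadoop"),
				(offset, line) -> System.out.println(offset + " -> " + line));
	}
}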

The reduce phase

How does the framework's reduce program invoke the user-written reduce() method?
The reduce program receives the intermediate results output by the map program, and all records with the same key arrive at the same reduce instance. For example, reduce instance 0 might receive data like this:
A:1 A:1 A:1 C:1 C:1 C:1 X:1 X:1 X:1
The reduce program sorts the records it receives into groups by key, calls reduce() once per group, and passes the data to reduce(key, iterable values, context).
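To make the grouping concrete, here is a small standalone sketch (again not framework code; the class name is made up) that groups the sample data above by key and performs one "reduce call" per group:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

//Simplified sketch only: groups key:value pairs by key and processes each group once,
//the way the reduce side calls reduce(key, values, context) once per key group.
public class ReduceCallSimulation {
	public static void main(String[] args) {
		String data = "A:1 A:1 A:1 C:1 C:1 C:1 X:1 X:1 X:1";
		Map<String, List<Integer>> grouped = new TreeMap<String, List<Integer>>();
		for (String pair : data.split(" ")) {
			String[] kv = pair.split(":");
			if (!grouped.containsKey(kv[0])) {
				grouped.put(kv[0], new ArrayList<Integer>());
			}
			grouped.get(kv[0]).add(Integer.parseInt(kv[1]));
		}
		//one "reduce call" per key group: A=[1, 1, 1], C=[1, 1, 1], X=[1, 1, 1]
		for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
			int count = 0;
			for (int v : group.getValue()) {
				count += v;
			}
			System.out.println(group.getKey() + "\t" + count);
		}
	}
}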

How MapReduce programs run

How does a MapReduce program run?
A MapReduce program can run locally as a single-machine program.
More commonly, it should be submitted to YARN and run as a distributed program.
To do that, write a YARN client class (containing a main method) that:
specifies the path of the job's jar package;
specifies the job's Mapper and Reducer classes, plus the key/value output types of the map and reduce phases;
specifies the directory of the data the job will process;
specifies the directory for the job's output;
and finally submits the job to YARN's ResourceManager with a single call to waitForCompletion().

Launch command

Launching the client class:
It can be launched with java -cp, but that requires manually adding a large number of jars and configuration files to the classpath, so it is not recommended.

Instead, launch it with the hadoop jar pv.jar command; the hadoop command sets up the classpath automatically (it adds every jar and configuration file under the Hadoop installation directory to the classpath).

Environment preparation

Preparing the runtime environment for MR programs: configuring and starting the YARN cluster.
In the yarn-site.xml configuration file under /usr/hadoop/hadoop-2.7.7/etc/hadoop, configure the following:

<configuration>
	<property>
		<name>yarn.resourcemanager.hostname</name>
		<value>hadoop01</value>
	</property>
	<property>
		<name>yarn.nodemanager.aux-services</name>
		<value>mapreduce_shuffle</value>
	</property> 
</configuration>

When YARN starts, it also reads the machines listed in the slaves file; those machines become YARN's NodeManagers.
That way the NodeManagers and DataNodes are exactly the same machines.
mv mapred-site.xml.template mapred-site.xml
vi mapred-site.xml

<configuration>
	<property>
		<name>mapreduce.framework.name</name>
		<value>yarn</value>
	</property>
</configuration>

A MapReduce program can run in single-machine (local) mode; to make it run distributed, we hand it over to YARN, which is what this property does.

scp yarn-site.xml mapred-site.xml hadoop02:$PWD
……

Start the YARN cluster with start-yarn.sh, then check the processes with jps.

Starting YARN does not depend on starting HDFS, but a MapReduce program will certainly need to access data in HDFS, so after starting YARN also start HDFS with start-dfs.sh.

Program 1

Count the number of visits (page views) per user.

Map side:
Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
KEYIN is the starting byte offset of the line that the framework's program has read.
VALUEIN is the content of that line.

KEYOUT is the type of the KEY in the data that the user's logic method returns to the framework.
VALUEOUT is the type of the VALUE in the data that the user's logic method returns to the framework.
Plain Java types such as Long, String and Integer cannot be used directly in Hadoop, because this data is shipped across the network between machines by the framework, i.e. it is serialized and deserialized frequently, and Java's native serialization mechanism is very bulky, so Hadoop provides its own serialization mechanism:
Long    -> LongWritable
String  -> Text   (be careful to import from org.apache.hadoop.io)
Integer -> IntWritable
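A tiny throwaway demo of these wrapper types (it only assumes hadoop-common is on the classpath; the class name WritableDemo is made up for this note):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

//Demo of Hadoop's Writable wrapper types and how to get the plain Java values back out.
public class WritableDemo {
	public static void main(String[] args) {
		LongWritable offset = new LongWritable(0L);   //wraps a long
		Text word = new Text("hello");                //wraps a String (stored as UTF-8)
		IntWritable one = new IntWritable(1);         //wraps an int
		System.out.println(offset.get() + "\t" + word.toString() + "\t" + one.get());
	}
}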

public class PvMapper extends Mapper<LongWritable, Text, Text, IntWritable>{
	
	/**
	 * The MR framework calls this map() method once for every line of data it reads
	 */
	@Override
	protected void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {
		//get one line of input
		String line = value.toString();
		String[] split = line.split(" ");
		//extract the IP address
		String ip = split[0];
		
		//return the result to the framework through the context
		context.write(new Text(ip), new IntWritable(1));
	}
}

Reduce side:
KEYIN and VALUEIN correspond to the key and value types output by the map phase.
KEYOUT and VALUEOUT are the key and value types of the results produced by the user's reduce-phase logic.

public class PvReducer extends Reducer<Text, IntWritable, Text, IntWritable>{
	/**
	 * After the MR framework's reduce side has gathered a group of records with the same key, it calls this reduce() method once for that group
	 */
	@Override
	protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
		 int count = 0;
		 for(IntWritable value:values){
			 count += value.get();
		 }
		 context.write(key, new IntWritable(count));
	}
}

Job side:
The JobSubmitter class is really a YARN client. Its job is to submit our MapReduce jar to YARN, which then distributes the jar to many NodeManagers for execution.

public class JobSubmitter {
	public static void main(String[] args) throws Exception {
		
		//create a job object that encapsulates the task information
		Configuration conf = new Configuration();
		Job job = Job.getInstance(conf);
		
		job.setJar("/root/pv.jar");
		
		job.setMapperClass(PvMapper.class);
		job.setReducerClass(PvReducer.class);
		
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(IntWritable.class);
		
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
			
		//which component to read the input with (import from the lib.input package)
		job.setInputFormatClass(TextInputFormat.class);
		//tell the component where to read from
		FileInputFormat.setInputPaths(job, new Path(""));
		
		job.setOutputFormatClass(TextOutputFormat.class);
		//tell the component where to write the results
		FileOutputFormat.setOutputPath(job, new Path(""));
		
		//submit the job to YARN, which hands it to the NodeManagers for execution
		//passing true prints the cluster's progress information on the client
		boolean res = job.waitForCompletion(true);
		System.exit(res?0:1);
	}
}

Program 2

Count the number of occurrences of each word.

Map side:

public class WcMapper extends Mapper<LongWritable, Text, Text, IntWritable>{
	@Override
	protected void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {
		String line = value.toString();
		String[] words = line.split(" ");
		for (String word : words) {
			context.write(new Text(word), new IntWritable(1));
		}
	}
}

Reduce side:

public class WcReducer extends Reducer<Text, IntWritable, Text, IntWritable>{
	@Override
	protected void reduce(Text key, Iterable<IntWritable> values,
			Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
		int count = 0;
		for (IntWritable value : values) {
			count += value.get();
		}
		context.write(key, new IntWritable(count));
		
	}
}

Job side:

public class WcJobSubmitter {
	public static void main(String[] args) throws Exception {
		
		Configuration conf = new Configuration();
		Job job = Job.getInstance(conf);
		
		job.setJar("/root/pv.jar");
		
		job.setMapperClass(WcMapper.class);
		job.setReducerClass(WcReducer.class);
		
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(IntWritable.class);
		
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
			
		job.setInputFormatClass(TextInputFormat.class);
		FileInputFormat.setInputPaths(job, new Path("/wc/input"));
		
		job.setOutputFormatClass(TextOutputFormat.class);
		FileOutputFormat.setOutputPath(job, new Path("/wc/output"));
		
		boolean res = job.waitForCompletion(true);
		System.exit(res?0:1);
	}
}

1. Package the project as a jar and upload it to the Hadoop host.
2. Create the input directory on HDFS: hadoop fs -mkdir -p /wc/input
3. Upload the file to be counted to that HDFS directory: hadoop fs -put qingshu.txt /wc/input
4. Run the job: hadoop jar pv.jar cn.jixiang.mr.wc.WcJobSubmitter
5. View the results: hadoop fs -ls /wc/output and hadoop fs -cat /wc/output/part-r-00000

Supplement:
//set the number of reduce tasks

job.setNumReduceTasks(4);

With 4 reduce tasks, the output directory will contain four result files, part-r-00000 through part-r-00003.

Program 3

Sum each phone number's upstream and downstream traffic.
Data in Hadoop is serialized and deserialized frequently,
so the custom type FlowBean must implement Hadoop's serialization interface (Writable).

Map side:

public class FlowSumMapper extends Mapper<LongWritable, Text, Text, FlowBean>{
	@Override
	protected void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {
		String line = value.toString();
		String[] split = line.split("\t");
		String phone = split[1].trim();
		long upflow = Long.parseLong(split[split.length-3]);
		long downflow = Long.parseLong(split[split.length-2]);
		
		FlowBean flowBean = new FlowBean(upflow, downflow);
		context.write(new Text(phone), flowBean);
	}
}

Reduce side:

public class FlowSumReducer extends Reducer<Text, FlowBean, Text, FlowBean>{
	@Override
	protected void reduce(Text key, Iterable<FlowBean> values, Reducer<Text, FlowBean, Text, FlowBean>.Context context)
			throws IOException, InterruptedException {
		long upflowSum = 0;
		long downflowSum = 0;
		for (FlowBean flowBean : values) {
			upflowSum += flowBean.getUpflow();
			downflowSum += flowBean.getDownflow();
		}
		context.write(key, new FlowBean(upflowSum, downflowSum));
	}
}

Bean:

public class FlowBean implements Writable{
	private long upflow;
	private long downflow;
	private long sumflow;
	
	//note: explicitly define a no-argument constructor (the framework needs it when deserializing)
	public FlowBean() {
		super();
	}
	public FlowBean(long upflow, long downflow) {
		super();
		this.upflow = upflow;
		this.downflow = downflow;
		this.sumflow = upflow+downflow;
	}
	public long getUpflow() {
		return upflow;
	}
	public void setUpflow(long upflow) {
		this.upflow = upflow;
	}
	public long getDownflow() {
		return downflow;
	}
	public void setDownflow(long downflow) {
		this.downflow = downflow;
	}
	
	public long getSumflow() {
		return sumflow;
	}
	public void setSumflow(long sumflow) {
		this.sumflow = sumflow;
	}
	/**
	 * Method called by the Hadoop serialization framework when deserializing; fields must be read in the same order they are written
	 * @throws IOException 
	 */
	public void readFields(DataInput in) throws IOException {
		this.upflow = in.readLong();
		this.downflow = in.readLong();
		this.sumflow = in.readLong();
	}	
	/**
	 * Method called by the Hadoop serialization framework when serializing
	 * @throws IOException 
	 */
	public void write(DataOutput out) throws IOException {
		out.writeLong(upflow);
		out.writeLong(downflow);
		out.writeLong(sumflow);
	}

	@Override
	public String toString() {
		return upflow+"\t"+downflow+"\t"+sumflow;
	}
}

Job side:

public class FlowSumJobSubmit {
	public static void main(String[] args) throws Exception {
		//create a job object that encapsulates the task information
		Configuration conf = new Configuration();
		Job job = Job.getInstance(conf);
		
		//job.setJar("/root/pv.jar");
		job.setJarByClass(FlowSumJobSubmit.class);
		
		job.setMapperClass(FlowSumMapper.class);
		job.setReducerClass(FlowSumReducer.class);
		
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(FlowBean.class);
		
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(FlowBean.class);
			
		//which component to read the input with (import from the lib.input package)
		job.setInputFormatClass(TextInputFormat.class);
		//tell the component where to read from
		FileInputFormat.setInputPaths(job, new Path(args[0]));
		
		job.setOutputFormatClass(TextOutputFormat.class);
		//tell the component where to write the results
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		
		//submit the job to YARN, which hands it to the NodeManagers for execution
		//passing true prints the cluster's progress information on the client
		boolean res = job.waitForCompletion(true);
		System.exit(res?0:1);
	}
}

Program 4

(An improvement on Program 3: per-province traffic statistics.)
All records for the same province must go to the same reduce task. Up to now the data has been distributed by the default rule key.hashCode() % numReduceTasks, so we replace that rule with a custom Partitioner.
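For comparison, the default rule we are replacing behaves essentially like the sketch below (modeled on Hadoop's standard HashPartitioner; the class name here is made up):

import org.apache.hadoop.mapreduce.Partitioner;

//Sketch of the default partitioning rule: hash the key, strip the sign bit, mod by the reducer count.
public class DefaultLikePartitioner<K, V> extends Partitioner<K, V> {
	@Override
	public int getPartition(K key, V value, int numReduceTasks) {
		return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
	}
}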

public class ProvincePartitioner extends Partitioner<Text, FlowBean>{

	private static HashMap<String,Integer> provinceCode = new HashMap<>();
	static{
		provinceCode.put("135", 0);
		provinceCode.put("136", 1);
		provinceCode.put("137", 2);
		provinceCode.put("138", 3);
	}
	
	@Override
	public int getPartition(Text key, FlowBean value, int numPartitions) {
		Integer code = provinceCode.get(key.toString().substring(0,3));
		return code==null?4:code;
	}
}

Rewritten job class:

public class JobSubmitter {
	public static void main(String[] args) throws Exception {
		
		//create a job object that encapsulates the task information
		Configuration conf = new Configuration();
		Job job = Job.getInstance(conf);
		
		job.setJar("/root/pv.jar");
		
		job.setMapperClass(FlowSumMapper.class);
		job.setReducerClass(FlowSumReducer.class);
		
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(FlowBean.class);
		
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(FlowBean.class);
			
		//which component to read the input with (import from the lib.input package)
		job.setInputFormatClass(TextInputFormat.class);
		//tell the component where to read from
		FileInputFormat.setInputPaths(job, new Path(args[0]));
		
		job.setOutputFormatClass(TextOutputFormat.class);
		//tell the component where to write the results
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		
		job.setPartitionerClass(ProvincePartitioner.class);
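		//ProvincePartitioner returns partition numbers 0 to 4, so args[2] should be 5 here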
		job.setNumReduceTasks(Integer.parseInt(args[2]));
		
		//submit the job to YARN, which hands it to the NodeManagers for execution
		//passing true prints the cluster's progress information on the client
		boolean res = job.waitForCompletion(true);
		System.exit(res?0:1);
	}
}

Program 5

(For each word, record which files it appears in and how many times in each.)
Step one uses word-filename as the key and the word's count as the value;
step two uses the word as the key and filename-count as the value. For example, if the word hello appears 3 times in a file a.txt, step one emits key hello-a.txt with value 3, and step two turns that into key hello with value a.txt-->3.
Step one:

public class IndexStepOne {
	public static class IndexStepOneMapper extends Mapper<LongWritable, Text, Text, IntWritable>{
		String fileName;
		/**
		 * When a map task instance runs our custom mapper class, it first calls this setup() method, and it is called exactly once
		 */
		@Override
		protected void setup(Context context)
				throws IOException, InterruptedException {
			//we want to produce   key: word-filename   value: 1
			FileSplit inputSplit = (FileSplit) context.getInputSplit();
			fileName = inputSplit.getPath().getName();
			
		}		
		@Override
		protected void map(LongWritable key, Text value,Context context)
				throws IOException, InterruptedException {
			String line = value.toString();
			String[] words = line.split(" ");
			for (String word : words) {
				context.write(new Text(word+"-"+fileName), new IntWritable(1));
			}			
		}
		
		/**
		 * After a map task instance has finished processing the whole input split it is responsible for, it calls this cleanup() method once.
		 */
		@Override
		protected void cleanup(Mapper<LongWritable, Text, Text, IntWritable>.Context context)
				throws IOException, InterruptedException {
			
		}
	}
	
	public static class IndexStepOneReducer extends Reducer<Text, IntWritable, Text, IntWritable>{
		@Override
		protected void reduce(Text key, Iterable<IntWritable> values,
				Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
			int count = 0;
			for (IntWritable value : values) {
				count += value.get();
			}
			context.write(key, new IntWritable(count));
		}
	}
	
	public static void main(String[] args) throws Exception {

		
		//create a job object that encapsulates the task information
		Configuration conf = new Configuration();
		Job job = Job.getInstance(conf);
		
		job.setJarByClass(IndexStepOne.class);
		
		job.setMapperClass(IndexStepOneMapper.class);
		job.setReducerClass(IndexStepOneReducer.class);
		
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(IntWritable.class);
		
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
			
		//which component to read the input with (import from the lib.input package)
		job.setInputFormatClass(TextInputFormat.class);
		//tell the component where to read from
		FileInputFormat.setInputPaths(job, new Path(args[0]));
		
		job.setOutputFormatClass(TextOutputFormat.class);
		//tell the component where to write the results
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		
		//reset the partitioning rule here if needed
		//job.setPartitionerClass(ProvincePartitioner.class);
		job.setNumReduceTasks(Integer.parseInt(args[2]));
		
		//submit the job to YARN, which hands it to the NodeManagers for execution
		//passing true prints the cluster's progress information on the client
		boolean res = job.waitForCompletion(true);
		System.exit(res?0:1);
	
	}
}

Step two:

public class IndexStepSecond {
	public static class IndexStepSecondMapper extends Mapper<LongWritable, Text, Text, Text>{
		@Override
		protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, Text>.Context context)
				throws IOException, InterruptedException {
			
			String line = value.toString();
			String[] split = line.split("-");
			String word = split[0];
			String[] temp = split[1].split("\t");
			String fileName = temp[0];
			String count = temp[1];
			context.write(new Text(word), new Text(fileName+"-->"+count));
		}
	}
	
	public static class IndexStepSecondReducer extends Reducer<Text, Text, Text, Text>{
		@Override
		protected void reduce(Text key, Iterable<Text> values, Reducer<Text, Text, Text, Text>.Context context)
				throws IOException, InterruptedException {
			StringBuilder sb = new StringBuilder();
			for (Text value : values) {
				sb.append(value.toString()).append(" ");
			}
			context.write(key, new Text(sb.toString()));
		}
	}
	
	public static void main(String[] args) throws Exception {

		//create a job object that encapsulates the task information
		Configuration conf = new Configuration();
		Job job = Job.getInstance(conf);
		
		job.setJarByClass(IndexStepSecond.class);
		
		job.setMapperClass(IndexStepSecondMapper.class);
		job.setReducerClass(IndexStepSecondReducer.class);
		
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(Text.class);
		
		//if the map output key/value types are exactly the same as the reduce output key/value types, the two lines above can be omitted
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(Text.class);
			
		//which component to read the input with (import from the lib.input package)
		//job.setInputFormatClass(TextInputFormat.class);
		//tell the component where to read from
		FileInputFormat.setInputPaths(job, new Path(args[0]));
		
		//job.setOutputFormatClass(TextOutputFormat.class);
		//tell the component where to write the results
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		
		//reset the partitioning rule here if needed
		//job.setPartitionerClass(ProvincePartitioner.class);
		job.setNumReduceTasks(Integer.parseInt(args[2]));
		
		//submit the job to YARN, which hands it to the NodeManagers for execution
		//passing true prints the cluster's progress information on the client
		boolean res = job.waitForCompletion(true);
		System.exit(res?0:1);

	}
}