The MapReduce Programming Model: A Basic Implementation Approach

Reposted

mob6454cc7c0428 2024-04-04 09:27:22


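The output of Map is a list of (key, value) pairs.
The input of Reduce is also a list of (key, value) pairs (in Hadoop itself, these pairs arrive grouped by key).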
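Core ideas of MapReduce
Divide and conquer: split the work first, then merge the results.
MapReduce is a distributed computing framework provided by Hadoop.
1. A job is executed in two phases
Phase 1: the map phase (3 machines in this example)
Each map task reads the task data on its own node, processes it, and uses the key's hashCode % n (n being the number of reduce tasks) to decide where each output record is written.

Phase 2: the reduce phase (2 machines in this example)
Each reduce task uses its task id to pick up the matching intermediate files produced by the map tasks, and aggregates them into the global result.
2. Moving computation is cheaper than moving data (process the data locally instead of sending it over the network)
MapReduce can run locally as well as on YARN. Running on YARN means running distributed: map tasks and reduce tasks can run on different machines, either on machines you specify by hand or on machines that YARN assigns.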
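The two phases described above can be tried out without a cluster. The following is a minimal in-memory sketch, not Hadoop's implementation: the class name `LocalWordCount`, its methods, and the sample splits are all hypothetical and chosen for illustration. It uses `Math.floorMod` instead of a plain `%` so that a negative `hashCode` still maps to a valid reducer index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LocalWordCount {
    // Partition rule from the text: hashCode % n, kept non-negative with floorMod.
    static int partition(String key, int numReducers) {
        return Math.floorMod(key.hashCode(), numReducers);
    }

    // Map phase for one split: emit (word, 1) into the bucket of the reducer that owns the word.
    static void map(List<String> lines, List<List<String[]>> buckets) {
        for (String line : lines) {
            for (String word : line.split(" ")) {
                if (word.isEmpty()) {
                    continue; // consecutive spaces produce empty tokens
                }
                buckets.get(partition(word, buckets.size())).add(new String[] {word, "1"});
            }
        }
    }

    // Reduce phase for one bucket: sum the values of each key.
    static Map<String, Integer> reduce(List<String[]> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String[] kv : pairs) {
            counts.merge(kv[0], Integer.parseInt(kv[1]), Integer::sum);
        }
        return counts;
    }

    // Run every map task, then every reduce task, and merge the reducer outputs.
    static Map<String, Integer> run(List<List<String>> splits, int numReducers) {
        List<List<String[]>> buckets = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            buckets.add(new ArrayList<>());
        }
        for (List<String> split : splits) {
            map(split, buckets);
        }
        Map<String, Integer> total = new TreeMap<>();
        for (List<String[]> bucket : buckets) {
            // Each key lives in exactly one bucket, so putAll never overwrites a count.
            total.putAll(reduce(bucket));
        }
        return total;
    }

    public static void main(String[] args) {
        // Three "map machines", two "reduce machines", as in the description above.
        List<List<String>> splits = List.of(
            List.of("a b a"), List.of("b c"), List.of("a c c"));
        System.out.println(run(splits, 2)); // prints {a=3, b=2, c=3}
    }
}
```

Because all pairs with the same key land in the same bucket, each reducer can compute final counts for its keys without seeing the other reducer's data.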


Code implementation

Map phase:

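import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Word count, map side.
 *
 * Each task handles its own byte range, selected by its task id:
 * 1. read the data from HDFS;
 * 2. process it line by line: line.split(" ") yields words, and each word becomes (word, 1),
 *    e.g. (a,1) (a,1) (b,1);
 * 3. use the word's hashCode % (number of reduce tasks) to decide which output file gets the pair.
 */
public class MapTask {
    // Bytes per line terminator. The original assumes Windows line endings (\r\n);
    // set this to 1 if the input file uses \n.
    private static final int LINE_END_BYTES = 2;

    public static void main(String[] args) throws Exception {
        // 1. Read the four command-line arguments
        String path = args[0];                 // path of the data to process
        long start = Long.parseLong(args[1]);  // start offset
        long length = Long.parseLong(args[2]); // length of this task's range
        String taskId = args[3];               // task id
        // 2. Read the data range that belongs to this task
        // 2.1 Get an HDFS client
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(new URI("hdfs://doit01:9000"), conf, "root");
        // Create two output streams, one per reduce task (0 and 1)
        FSDataOutputStream out0 = fs.create(new Path("/data/wc/map_output/res_m_" + taskId + "_0")); // e.g. res_m_0_0
        FSDataOutputStream out1 = fs.create(new Path("/data/wc/map_output/res_m_" + taskId + "_1")); // e.g. res_m_0_1
        // 2.2 Open the input and jump to the start of this task's range
        FSDataInputStream fin = fs.open(new Path(path));
        fin.seek(start);
        // 2.3 Wrap it in a character stream for line-based reading
        BufferedReader br = new BufferedReader(new InputStreamReader(fin, StandardCharsets.UTF_8));
        long consumed = 0; // bytes consumed from this task's range so far
        // Discard the first (possibly partial) line; the previous task reads it.
        // Its bytes still count toward the range, otherwise lines would be read twice.
        if (start != 0) {
            String skipped = br.readLine();
            if (skipped != null) {
                consumed += skipped.getBytes(StandardCharsets.UTF_8).length + LINE_END_BYTES;
            }
        }
        String line;
        // Keep reading while the next line starts inside (or exactly at the end of) the range,
        // so the line that crosses the boundary is handled by this task.
        while (consumed <= length && (line = br.readLine()) != null) {
            consumed += line.getBytes(StandardCharsets.UTF_8).length + LINE_END_BYTES;
            for (String word : line.split(" ")) {
                if (word.isEmpty()) {
                    continue;
                }
                // write() emits plain text; writeUTF() would prepend a 2-byte length
                // that corrupts the first word of each record when read back line by line.
                byte[] record = (word + "\t" + 1 + "\n").getBytes(StandardCharsets.UTF_8);
                // floorMod keeps the partition index non-negative for negative hash codes
                if (Math.floorMod(word.hashCode(), 2) == 0) {
                    out0.write(record); // goes to reduce task 0
                } else {
                    out1.write(record); // goes to reduce task 1
                }
            }
        }
        // Release resources
        out0.close();
        out1.close();
        br.close();
        fin.close();
        fs.close();
    }
}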
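The `start` and `length` arguments that MapTask receives have to be computed by whoever launches the tasks, and the "discard the first line, read past the end of the range" rule only works if every line ends up in exactly one task. The following in-memory sketch illustrates both points under stated assumptions: `SplitDemo`, `planSplits`, and `readSplit` are hypothetical names, the data is a byte array rather than an HDFS file, and lines end with a single `\n`.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SplitDemo {
    // Cut [0, fileLength) into consecutive {start, length} ranges of at most splitSize bytes.
    static List<long[]> planSplits(long fileLength, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            splits.add(new long[] {start, Math.min(splitSize, fileLength - start)});
        }
        return splits;
    }

    // Return the lines owned by one range: skip the first (possibly partial) line
    // unless start == 0, then read every line that begins at or before the range end.
    static List<String> readSplit(byte[] data, long start, long length) {
        List<String> lines = new ArrayList<>();
        int offset = (int) start;
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(data, offset, data.length - offset),
                StandardCharsets.UTF_8))) {
            long consumed = 0;
            if (start != 0) {
                String skipped = br.readLine();
                if (skipped == null) {
                    return lines;
                }
                consumed += skipped.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for "\n"
            }
            String line;
            while (consumed <= length && (line = br.readLine()) != null) {
                consumed += line.getBytes(StandardCharsets.UTF_8).length + 1;
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "aa bb\ncc\ndd ee ff\ngg\n".getBytes(StandardCharsets.UTF_8);
        for (long[] split : planSplits(data.length, 5)) {
            System.out.println(split[0] + "+" + split[1] + " -> " + readSplit(data, split[0], split[1]));
        }
    }
}
```

With a split size of 5 bytes, some ranges own two lines and some own none, but concatenating the results of all ranges yields each of the four lines exactly once. The key detail is that the bytes of the skipped line count toward the range; leaving them out lets a task run past its boundary and read lines that the next task also reads.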
Reduce phase:
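import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

/**
 * Word count, reduce side.
 * Reads the map output files that belong to this task and computes the final counts.
 * @author ThinkPad
 */
public class ReduceTask {

    public static void main(String[] args) throws Exception {
        Map<String, Integer> map = new HashMap<>();

        String taskId = args[0]; // 0 or 1
        // Get an HDFS client
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(new URI("hdfs://doit01:9000"), conf, "root");
        // Iterate over all files in the directory that holds the map output
        RemoteIterator<LocatedFileStatus> listFiles = fs.listFiles(new Path("/data/wc/map_output"), false);
        while (listFiles.hasNext()) {
            LocatedFileStatus file = listFiles.next();
            Path path = file.getPath();
            String name = path.getName();
            // The file name suffix tells which reduce task owns the file: res_m_<mapId>_<reduceId>
            if (name.endsWith("_" + taskId)) {
                FSDataInputStream fin = fs.open(path);
                BufferedReader br = new BufferedReader(new InputStreamReader(fin, StandardCharsets.UTF_8));
                String line;
                while ((line = br.readLine()) != null) {
                    String[] split = line.split("\t"); // word \t count
                    if (split.length < 2) {
                        continue; // skip malformed lines
                    }
                    // Add the emitted count (always 1 here) to the running total for the word
                    map.merge(split[0], Integer.parseInt(split[1].trim()), Integer::sum);
                }
                br.close();
                fin.close();
            }
        }

        // All results are now in the map; write them out as plain text, one word per line.
        // write() is used instead of writeUTF(), which would prepend a 2-byte length to each record.
        FSDataOutputStream out = fs.create(new Path("/data/wc/reduce_out/res_r_" + taskId));
        for (Entry<String, Integer> entry : map.entrySet()) {
            out.write((entry.getKey() + "\t" + entry.getValue() + "\n").getBytes(StandardCharsets.UTF_8));
        }
        out.flush();
        out.close();
        fs.close();
    }
}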
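The reduce logic itself (pick the files whose name ends with this task's id, then sum the counts per word) can be checked in isolation. Below is a small sketch that replaces HDFS with an in-memory map from file name to lines; `ReduceSketch`, `ownedBy`, and the sample file contents are hypothetical, while the `res_m_<mapId>_<reduceId>` naming follows the map-side code.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReduceSketch {
    // A reduce task owns every map output file whose name ends with "_<taskId>".
    static boolean ownedBy(String fileName, String taskId) {
        return fileName.endsWith("_" + taskId);
    }

    // Sum the counts of all "word<TAB>count" lines in the files owned by taskId.
    static Map<String, Integer> reduce(Map<String, List<String>> files, String taskId) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<String>> file : files.entrySet()) {
            if (!ownedBy(file.getKey(), taskId)) {
                continue;
            }
            for (String line : file.getValue()) {
                String[] kv = line.split("\t");
                if (kv.length != 2) {
                    continue; // skip malformed lines
                }
                counts.merge(kv[0], Integer.parseInt(kv[1]), Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Output of two map tasks; "b" and "d" hash to partition 0, "a" and "c" to partition 1.
        Map<String, List<String>> files = Map.of(
            "res_m_0_0", List.of("b\t1", "b\t1"),
            "res_m_0_1", List.of("a\t1"),
            "res_m_1_0", List.of("b\t1", "d\t1"),
            "res_m_1_1", List.of("a\t1", "c\t1"));
        System.out.println(reduce(files, "0")); // prints {b=3, d=1}
        System.out.println(reduce(files, "1")); // prints {a=2, c=1}
    }
}
```

Matching on `"_" + taskId` rather than on the bare id is a small safeguard: with ten or more reduce tasks, a bare suffix match on `0` would also select files meant for task `10`.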
