1. Common elements (plain select)

select columns: the selected fields become the Map output value. The Reducer does no processing; it simply iterates over the group and emits every element.
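To make this concrete, here is a minimal sketch (the class names, the projected column indices and the \001 field delimiter are illustrative assumptions): the Mapper puts the selected columns into the output value, and the Reducer just passes every element through.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SelectJob {
    public static class SelectMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String[] words = value.toString().split("\001");
            // the selected columns become the map output value; the input byte offset serves as a throwaway key
            context.write(key, new Text(words[0] + "\t" + words[1]));
        }
    }

    public static class SelectReducer extends Reducer<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void reduce(LongWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            // no processing: iterate the group and emit every element unchanged
            for (Text line : values) {
                context.write(line, NullWritable.get());
            }
        }
    }
}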

2. order by: global sort

  1. order by: the column being sorted on is used as the Map output key; MapReduce automatically partitions and sorts by that key.
  2. Global: a single Reducer, which is the default anyway (see the driver sketch after the code below).
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    String line = value.toString();
    String[] words = line.split("\001");
    // 1) the order by column is extracted and used as the key
    // 2) select *: the whole line (all columns) is used as the value
    context.write(new Text(words[0]), new Text(line));
}

@Override
protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
    // the Reducer emits every element of the group
    for (Text line : values) {
        context.write(key, line);
    }
}
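A minimal driver sketch for this job (OrderByMapper/OrderByReducer are illustrative names standing for the map()/reduce() above, and the paths come from the command line); the key point is the single reducer:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class OrderByDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "order-by");
        job.setJarByClass(OrderByDriver.class);
        job.setMapperClass(OrderByMapper.class);      // the map() shown above
        job.setReducerClass(OrderByReducer.class);    // the reduce() shown above
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(1);                     // one reducer sees every key, so its sorted input is the global order
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}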

Sample output:

"P172E10-SN-01""20150130172902677003014654832192XASJ1"
					"P172E10-SN-01""20150130203618099002901408624547XASJ1"
					"P172E10-SN-01""20150130203726033002901408624247XASJ1"
					"P172E10-SN-01""20150130203851342002901408624846XASJ1"
					"P172E10-SN-01""20150130204711874002901408624669XASJ1"
					"P839N31""20150131152252599003014654832409XASJ1"
					"P839N31""20150131173954764003014654832951XASJ1"
					"P839N31""20150131200923817000040229744422XASJ1"
					"P839N31""20150131204500648000040229744318XASJ1"
					"P172E10-SN-01""20150131205316663000040229744636XASJ1"
					"P839N31""20150131205316663000040229744636XASJ1"
					"MOBILENAME""TESTRECORDID"
					"MOBILENAME""TESTRECORDID"

3. distribute by + order by: partial (per-partition) sort

  1. distribute by: a custom Partitioner class that partitions by the distribute by key.
public static class UdfPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text val, int numPartitions) {
        int partn = 0;
        // the distribute by column is the first field of the value (the whole line)
        String mname = val.toString().split("\001")[0];
        if (mname.equals("\"P172E10-SN-01\"")) partn = 1;
        if (mname.equals("\"P839N31\"")) partn = 2;
        return partn;
    }
}
  2. Register the partitioner class on the job:
job.setPartitionerClass(UdfPartitioner.class);
  3. Set the number of reducers on the job: it must cover every possible value of the distribute by key.
job.setNumReduceTasks(3); // there may be more reducers than distribute by key values, but never fewer; otherwise some map output has no partition to receive it and the job fails.
  4. Open question: how can the number of partitions be determined automatically? Perhaps with a second MR job?
C:\Windows\System32>hdfs dfs -cat /outputOrder095143/part-r-00000
"TESTRECORDID"  "MOBILENAME""TESTRECORDID"
"TESTRECORDID"  "MOBILENAME""TESTRECORDID"

C:\Windows\System32>hdfs dfs -cat /outputOrder095143/part-r-00001
"20150130172902677003014654832192XASJ1" "P172E10-SN-01""20150130172902677003014654832192XASJ1"
"20150130203618099002901408624547XASJ1" "P172E10-SN-01""20150130203618099002901408624547XASJ1"
"20150130203726033002901408624247XASJ1" "P172E10-SN-01""20150130203726033002901408624247XASJ1"
"20150130203851342002901408624846XASJ1" "P172E10-SN-01""20150130203851342002901408624846XASJ1"
"20150130204711874002901408624669XASJ1" "P172E10-SN-01""20150130204711874002901408624669XASJ1"
"20150131205316663000040229744636XASJ1" "P172E10-SN-01""20150131205316663000040229744636XASJ1"

4. distinct

  1. The distinct column is used as the map output key; the value can simply be NullWritable.
public static class WCLocalMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] words = line.split("\001");
        // the distinct column becomes the map output key; the value is NullWritable
        context.write(new Text(words[0]), NullWritable.get());
    }
}
  2. The reducer just emits the key; the value can again be NullWritable.
public static class WCLocalReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
        // emit the key directly; MapReduce has already collapsed duplicates into one group
        context.write(key, NullWritable.get());
    }
}
  3. The generic input/output types of the Mapper and Reducer must match the classes configured on the Job (see the driver sketch below).
job.setMapOutputValueClass(NullWritable.class);
job.setOutputValueClass(NullWritable.class);
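For completeness, a driver sketch for the distinct job; the mapper/reducer names are the ones above, everything else (job name, paths) is illustrative. Note how every set*Class call mirrors the generics of WCLocalMapper and WCLocalReducer:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DistinctDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "distinct");
        job.setJarByClass(DistinctDriver.class);
        job.setMapperClass(WCLocalMapper.class);
        job.setReducerClass(WCLocalReducer.class);
        // must match Mapper<LongWritable,Text,Text,NullWritable>
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        // must match Reducer<Text,NullWritable,Text,NullWritable>
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}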

5. count(distinct)

Method 1: all records share the same map output key, so everything goes to one reduce call.

Drawback: no parallelism in the reduce phase, so it is inefficient.

public static class WCLocalMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] words = line.split(",");
        // every record is emitted under the same map output key
        context.write(new Text("key"), new Text(words[6]));
    }
}

public static class WCLocalReducer extends Reducer<Text, Text, NullWritable, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // deduplicate with a HashSet
        HashSet<String> hs = new HashSet<String>();
        for (Text line : values) {
            hs.add(line.toString());
        }
        // emit the number of elements in the HashSet
        context.write(NullWritable.get(), new IntWritable(hs.size()));
    }
}

Method 2: two jobs, first output the distinct values, then count them.

  1. distinct: the distinct column is the map output key; the value can be NullWritable (as in section 4).
  2. count: use the first job's output directory as the second job's input directory, run a single reducer, and count the values that arrive for the key (a sketch of the second job follows the chaining snippet below).
  3. Chaining the two jobs:
if (job.waitForCompletion(true)) {
    System.exit(job2.waitForCompletion(true) ? 0 : 1);
}
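A sketch of what the second (count) job might look like, assuming job 1 wrote one distinct value per line; the class names are illustrative:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CountJob {
    public static class CountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // every distinct value from job 1 contributes a count of 1 under a single shared key
            context.write(new Text("count"), ONE);
        }
    }

    public static class CountReducer extends Reducer<Text, IntWritable, NullWritable, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            // with a single reducer this sum is the global count(distinct)
            context.write(NullWritable.get(), new IntWritable(sum));
        }
    }
}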

6. reducer-join

  1. Job: add the different input sources with FileInputFormat.addInputPath().
FileInputFormat.addInputPath(job, new Path("hdfs://localhost:9000/user/hive/warehouse/testdb.db/testtab_small/*"));
FileInputFormat.addInputPath(job, new Path("hdfs://localhost:9000/user/hive/warehouse/testdb.db/testtab_small2/*"));
  2. Mapper: determine which table a record comes from by its file path, emit the shared join key (the column may differ per table), and tag the value with its source so both tables' rows are mixed under the same key.
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    String line = value.toString();
    String[] words = line.split("\001");

    // rows from different tables go to the same key; an "A#"/"B#" prefix on the value
    // records which table each row came from, based on the input file path
    String fpath = ((FileSplit) context.getInputSplit()).getPath().toString();
    // test "testtab_small2" first, because "testtab_small" is a prefix of it
    if (fpath.indexOf("testtab_small2") > 0) {
        context.write(new Text(words[1]), new Text("B#".concat(words[0])));
    } else if (fpath.indexOf("testtab_small") > 0) {
        context.write(new Text(words[1]), new Text("A#".concat(words[0])));
    }
}
  3. Reducer: split the values into two lists according to their source tag, then emit the cartesian product of the two lists.
protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
    // two lists, one per source table
    ArrayList<String> listA = new ArrayList<String>();
    ArrayList<String> listB = new ArrayList<String>();
    for (Text line : values) {
        // the source tag is a 2-character prefix, so substring(2) strips it
        if (line.toString().indexOf("A#") > -1) {
            listA.add(line.toString().substring(2));
        }
        if (line.toString().indexOf("B#") > -1) {
            listB.add(line.toString().substring(2));
        }
    }

    // emit the cartesian product of the two lists
    for (String strA : listA) {
        for (String strB : listB) {
            context.write(key, new Text(strA.concat("\001").concat(strB)));
        }
    }
}
  4. Note: inside one reduce() call all values share the same key, i.e. they are exactly the A-table and B-table rows that match on the join key, so the join result for that key is their cartesian product.

7. mapper-join

  1. Use case: one side of the join is a small file that fits into each node's memory.
  2. Uses Hadoop's distributed cache: job.addCacheFile(new URI(...)) pulls the small file from HDFS to the task nodes, where it is then loaded into memory.
job.addCacheFile(new URI("hdfs://localhost:9000/app/MDic.txt"));
// equivalent to the -files command-line option, which adds files to the distributed cache so they can be copied to the task nodes.
  3. Use Mapper.setup() to read the cached file:
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    // MDic.txt sits in the task's working directory because it was added to the distributed cache
    BufferedReader bReader = new BufferedReader(new InputStreamReader(new FileInputStream("MDic.txt")));
    List<String> list = IOUtils.readLines(bReader);
    String[] dicWords = null;
    for (String string : list) {
        dicWords = string.split("\t");
        mDic.put(dicWords[0], dicWords[1]);   // mDic: an in-memory Map<String,String> field of the Mapper
    }
}
// setup() runs once per task before the map() calls, so it is the place for initialization.
  4. The join is done directly in the mapper by looking the key up in the in-memory map and concatenating the result; no reducer is needed (a fuller map() sketch follows the line below).
context.write(new Text(words[0]+"\t"+mDic.get(words[0])+"\t"+words[1]),NullWritable.get());
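For context, a sketch of the full map() around that write (the column indices and the mDic field are taken from the snippets above; the null check for missing dictionary entries is an added assumption giving inner-join semantics):

@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    String[] words = value.toString().split("\001");
    // look the join key up in the in-memory dictionary loaded by setup()
    String dicValue = mDic.get(words[0]);
    if (dicValue != null) {   // assumption: skip rows with no dictionary match
        context.write(new Text(words[0] + "\t" + dicValue + "\t" + words[1]), NullWritable.get());
    }
}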

8. Subqueries

Two chained jobs, just like count(distinct) method 2: the inner query runs as job 1, and job 2 takes job 1's output directory as its input.

9. count(distinct cola) group by colb

select dealid, count(distinct uid) num from order group by dealid;

When there is only one distinct column (leaving aside a map-side hash group-by), it is enough to combine the GroupBy column and the Distinct column into the map output key, rely on MapReduce's shuffle sort, use the GroupBy column alone as the reduce key, and keep the last key seen (LastKey) in the reduce phase to complete the deduplication.
The idea behind this: MapReduce automatically reduces the dealid+uid combinations to unique, sorted values, so a reducer that simply emits dealid:uid already produces the per-group deduplicated values; by itself, though, that does not yet give the count.
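A sketch of that idea (all class names are illustrative, and the partitioner and grouping comparator are implied by the description above rather than spelled out in it): the map key is dealid + \001 + uid, partitioning and grouping use only the dealid part, and the reducer counts how often the uid part changes, which is exactly the LastKey trick.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;

public class CountDistinctGroupBy {

    // map key = dealid \001 uid, no value needed
    public static class CDMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String[] cols = value.toString().split(",");   // assumed layout: dealid in column 0, uid in column 1
            context.write(new Text(cols[0] + "\001" + cols[1]), NullWritable.get());
        }
    }

    // partition by dealid only, so every uid of a dealid reaches the same reducer
    public static class DealidPartitioner extends Partitioner<Text, NullWritable> {
        @Override
        public int getPartition(Text key, NullWritable val, int numPartitions) {
            String dealid = key.toString().split("\001")[0];
            return (dealid.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

    // group by dealid only, so one reduce() call sees all (sorted) dealid+uid keys of a dealid
    public static class DealidGroupingComparator extends WritableComparator {
        protected DealidGroupingComparator() { super(Text.class, true); }
        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            String da = a.toString().split("\001")[0];
            String db = b.toString().split("\001")[0];
            return da.compareTo(db);
        }
    }

    // LastKey trick: the uids arrive sorted, so count how often they change
    public static class CDReducer extends Reducer<Text, NullWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
            String dealid = key.toString().split("\001")[0];
            String lastUid = null;
            int num = 0;
            for (NullWritable v : values) {
                // the key object is updated by the framework as we iterate over the group's values
                String uid = key.toString().split("\001")[1];
                if (!uid.equals(lastUid)) {
                    num++;
                    lastUid = uid;
                }
            }
            context.write(new Text(dealid), new IntWritable(num));
        }
    }
}

The extra driver wiring is job.setPartitionerClass(DealidPartitioner.class) and job.setGroupingComparatorClass(DealidGroupingComparator.class); the default sort on the full composite key then guarantees that the uids of each dealid arrive in order.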

PS: my first instinct was to use the group-by column as the map output key and then count the distinct values inside the reducer (with a HashSet, as in section 5 method 1). But that forces the reducer to hold all distinct values in memory, so it is less efficient than the approach above.