(1) About MapReduce

MapReduce is well suited to huge data sets whose records have little dependence on one another. The map operation breaks the raw data apart with some user-defined processing and emits the pieces as intermediate results; Hadoop then sorts those intermediate results during the shuffle; finally, the reduce operation receives the intermediate results, aggregates them, and writes the output to files. Since both the input and the final output live on HDFS, this also shows that in Hadoop, HDFS is the foundation on which MapReduce is built. The map and reduce flow can be described with the following figure:

[Figure: the map and reduce process]

Someone once explained MapReduce with this analogy:

We want to count all the books in the library. You count up shelf #1, I count up shelf #2. That’s map. The more people we get, the faster it goes.

Now we get together and add our individual counts. That’s reduce.
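Purely to illustrate that analogy (and not part of the sorting code below), here is a minimal counting sketch in the same mapreduce API used throughout this post; the class names ShelfCountMapper and ShelfCountReducer are invented for this example. The map phase emits (book, 1) for every book it sees, and the reduce phase sums the counts for each title:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// "You count shelf #1, I count shelf #2": each map call handles one chunk of the input.
class ShelfCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // one token per book title; emit (title, 1) for every occurrence
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            context.write(new Text(itr.nextToken()), ONE);
        }
    }
}

// "Now we get together and add our individual counts": reduce sums the partial counts per title.
class ShelfCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}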

(2) Data Preparation

Upload the text to be sorted to HDFS and put it into an input folder. In a terminal, run: hadoop dfs -mkdir input

Assuming the data file data.txt sits in /home/leozhang/testdata on the local disk, run in a terminal: cd /home/leozhang/testdata; hadoop dfs -put data.txt input/

(3) Sorting Approach

The idea is borrowed from quicksort. For an ascending sort, after each partition step every element to the left of the pivot is less than or equal to the pivot and every element to the right is greater than or equal to it. With N pivots, the map phase splits the data into N+1 intervals; if the number of reducers is also N+1, the data from each interval can be sent to the reducer responsible for that interval. Hadoop's shuffle then sorts each of the N+1 partitions automatically, so the reduce operation only has to write the intermediate results it receives straight to a file. Because the reducers themselves are ordered by interval, concatenating their outputs gives the fully sorted data.

From this, the steps for sorting a large data set with Hadoop are:

1) Sample the data to be sorted;

2) Sort the sample and produce the pivots (say the pivots come out as 3, 9 and 11);

3) For each input record, the map phase works out which pair of pivots it falls between and sends it to the corresponding reducer (with the pivots above, the intervals are <3, [3,9), [9,11) and >=11, handled by reducer0, reducer1, reducer2 and reducer3 respectively; see the sketch right after this list);

4) Each reducer writes the data it receives straight to its output.
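
To make step 3 concrete, here is a tiny standalone sketch, not part of the original code, of how a value is mapped to a reducer index given the pivots 3, 9 and 11. It uses plain ints purely for illustration; the MapReduce code below compares Text values instead:

public class IntervalExample {

    // Return the index of the interval (and hence the reducer) a value falls into,
    // given pivots in ascending order: index i means pivots[i-1] <= value < pivots[i].
    static int intervalIndex(int value, int[] pivots) {
        int i = 0;
        while (i < pivots.length && value >= pivots[i]) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        int[] pivots = {3, 9, 11};
        System.out.println(intervalIndex(2, pivots));  // 0 -> reducer0, interval < 3
        System.out.println(intervalIndex(5, pivots));  // 1 -> reducer1, interval [3, 9)
        System.out.println(intervalIndex(10, pivots)); // 2 -> reducer2, interval [9, 11)
        System.out.println(intervalIndex(42, pivots)); // 3 -> reducer3, interval >= 11
    }
}

The real job does exactly this lookup inside its partitioner, against pivots loaded from the output of the sampling job.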

(4) A Simple Implementation

Sampling is done by RandomSelectMapper and RandomSelectReducer, data partitioning by ReducerPartition, and the sorted output by SortMapper and SortReducer. The execution order is: RandomSelectMapper -> RandomSelectReducer -> SortMapper -> SortReducer.

This implementation has never felt quite right to me, especially the data-partitioning part; I would be glad to hear how others would do it. The full code can be downloaded here.

1) The pivots are selected randomly:


package MRTEST.Sort;

import java.io.IOException;
import java.util.Random;
import java.util.StringTokenizer;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RandomSelectMapper
        extends Mapper<Object, Text, Text, Text> {

    // number of tokens seen so far by this mapper
    private static int currentSize = 0;
    private Random random = new Random();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());

        while (itr.hasMoreTokens()) {
            currentSize++;
            // emit the k-th token seen so far as a pivot candidate with probability 1/k,
            // so later tokens are sampled more and more sparsely
            if (random.nextInt(currentSize) == 0) {
                Text v = new Text(itr.nextToken());
                context.write(v, v);
            } else {
                itr.nextToken();
            }
        }
    }
}




Sorting of the pivots is handled by Hadoop itself: the shuffle delivers the sampled keys to the reducer already sorted and grouped, so the reducer only has to write each one out once:


package MRTEST.Sort;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class RandomSelectReducer
        extends Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // keys arrive sorted and grouped: write each distinct pivot exactly once,
        // with a null key so that only the pivot value appears in the output file
        for (Text data : values) {
            context.write(null, data);
            break;
        }
    }
}




2) SortMapper simply reads the data and re-emits it:


package MRTEST.Sort;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SortMapper
        extends Mapper<Object, Text, Text, Text> {

    public void map(Object key, Text values, Context context)
            throws IOException, InterruptedException {
        // emit every token as both key and value; the partitioner decides which reducer gets it
        StringTokenizer itr = new StringTokenizer(values.toString());

        while (itr.hasMoreTokens()) {
            Text v = new Text(itr.nextToken());
            context.write(v, v);
        }
    }
}




Data is routed to the appropriate reducer by a custom partitioner:


package MRTEST.Sort;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class ReducerPartition
        extends Partitioner<Text, Text> {

    public int getPartition(Text key, Text value, int numPartitions) {
        // look the value up against the pivots to decide which interval (reducer) it belongs to
        return HadoopUtil.getReducerId(value, numPartitions);
    }
}
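
ReducerPartition above and SortDriver below both rely on a HadoopUtil class that is not shown in this post (it is part of the downloadable code). Purely as an assumption based on how it is called — a static pivots list, readPartition, getReducerId and delete — a minimal sketch might look like this:

package MRTEST.Sort;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;

public class HadoopUtil {

    // pivots read from the output of the pivot-selection job, kept in ascending order
    public static List<String> pivots = new ArrayList<String>();

    // read the part-r-00000 file produced by RandomSelectReducer, one pivot per line
    public static void readPartition(Configuration conf, Path partitionFile) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(partitionFile)));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.length() > 0) {
                    pivots.add(line);
                }
            }
        } finally {
            reader.close();
        }
    }

    // map a value to the index of the interval it falls into: reducer i gets
    // values v with pivots[i-1] <= v < pivots[i]
    public static int getReducerId(Text value, int numPartitions) {
        String v = value.toString();
        int i = 0;
        while (i < pivots.size() && v.compareTo(pivots.get(i)) >= 0) {
            i++;
        }
        // clamp as a safety net in case fewer reducers are configured than intervals
        return Math.min(i, numPartitions - 1);
    }

    // delete an HDFS path recursively if it exists
    public static void delete(Configuration conf, Path path) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(path)) {
            fs.delete(path, true);
        }
    }
}

Note that Text values compare lexicographically, so for numeric data the input would need zero padding or a custom comparison for the final output to be in numeric order.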




Finally, SortReducer writes out the results:


package MRTEST.Sort;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SortReducer
        extends Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // keys arrive already sorted within this reducer; write everything out,
        // keeping duplicates (one output line per occurrence)
        for (Text data : values) {
            context.write(key, data);
        }
    }
}




3) The jobs are wired together by SortDriver:

package MRTEST.Sort;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class SortDriver {

    // job 1: sample the input and write the pivots to the partition directory
    public static void runPivotSelect(Configuration conf,
                                      Path input,
                                      Path output) throws IOException, ClassNotFoundException, InterruptedException {

        Job job = new Job(conf, "get pivot");
        job.setJarByClass(SortDriver.class);
        job.setMapperClass(RandomSelectMapper.class);
        job.setReducerClass(RandomSelectReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, input);
        FileOutputFormat.setOutputPath(job, output);

        if (!job.waitForCompletion(true)) {
            System.exit(2);
        }
    }

    // job 2: partition the data by pivot interval and let the shuffle sort each partition
    public static void runSort(Configuration conf,
                               Path input,
                               Path partition,
                               Path output) throws IOException, ClassNotFoundException, InterruptedException {

        Job job = new Job(conf, "sort");
        job.setJarByClass(SortDriver.class);
        job.setMapperClass(SortMapper.class);
        job.setCombinerClass(SortReducer.class);
        job.setPartitionerClass(ReducerPartition.class);
        job.setReducerClass(SortReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // load the pivots produced by the first job; N pivots define N+1 intervals/reducers
        HadoopUtil.readPartition(conf, new Path(partition.toString() + "/part-r-00000"));
        job.setNumReduceTasks(HadoopUtil.pivots.size() + 1);
        FileInputFormat.addInputPath(job, input);
        FileOutputFormat.setOutputPath(job, output);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();

        if (otherArgs.length != 3) {
            System.err.println("Usage: sort <input> <partition> <output>");
            System.exit(2);
        }

        Path input = new Path(otherArgs[0]);
        Path partition = new Path(otherArgs[1]);
        Path output = new Path(otherArgs[2]);

        // clear previous outputs so the jobs can be rerun
        HadoopUtil.delete(conf, partition);
        HadoopUtil.delete(conf, output);

        SortDriver.runPivotSelect(conf, input, partition);
        SortDriver.runSort(conf, input, partition, output);
    }
}

(5) Packaging and Testing

On the master machine, in Eclipse click File -> Export, choose Java -> JAR file, and click Next. Tick the files you want to package in the tree on the left, click Next, then Next again, optionally pick the application entry point under Main class, and finally click Finish. You end up with a jar file, for example Sort.jar.

Change into the directory containing Sort.jar and run in a terminal: hadoop jar Sort.jar input partition output

(6) Checking the Results

The execution of all jobs can be tracked at http://localhost:50030.

To view the results on HDFS, run in a terminal: hadoop dfs -cat output/*, or pull the files from HDFS to the local disk and inspect them there: hadoop dfs -get output output