Counting Words with Hadoop

Reposted

mob64ca14116c53 2023-08-14 22:17:47


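WordCount is Hadoop's introductory program, much like "hello world" in other programming languages. The program is short, but it is not trivial. Implementing it in several different ways is a good way to deepen your understanding of MapReduce. Below are notes from my recent study of Hadoop: I use WordCount to tie together secondary sort, in-map aggregation, and the task workflow, for future reference.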

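Method 1: the conventional WordCount

The map step splits each line of text into words, and the reduce step counts how often each word occurs. Code omitted.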
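Since the code for Method 1 is omitted, the data flow can be illustrated with a minimal plain-Java sketch. It is not Hadoop code: the class name `WordCountSketch` and its two methods are hypothetical stand-ins for the map phase and for the shuffle plus reduce phase, and it assumes words are separated by spaces.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the WordCount data flow (no Hadoop dependency).
class WordCountSketch {
    // "Map" phase: split one line on spaces and emit a (word, 1) pair per token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split(" +")) {
            if (!token.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    // "Shuffle + reduce" phase: pairs with the same key are brought together
    // and their values are summed. TreeMap keeps the keys in sorted order.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(reduce(map("a b c a a"))); // prints {a=3, b=1, c=1}
    }
}
```

In a real job the two phases would live in a Mapper and a Reducer subclass, and the framework would do the grouping between them.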

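Method 2 (the code is actually wrong, but the mistake is a classic one): sort the output by word frequency. A custom word class holds the word's name, its frequency, and so on; its compareTo sorts by frequency, while grouping is done by word name.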

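The code is as follows:

1. Define the class Word, which implements WritableComparable and has two attributes: the word's name and its frequency. compareTo sorts Word instances by frequency.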

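<span style="font-size:14px;">public class Word implements WritableComparable<Word> {
    String wordname; // the word itself
    int wordfreq;    // how often the word occurs

    @Override
    public int compareTo(Word other) {
        // Sort Word instances by frequency, ascending.
        // Integer.compare avoids the overflow risk of subtracting the two values.
        return Integer.compare(wordfreq, other.getWordfreq());
    }
    (constructors, get/set, toString, and the write/readFields serialization methods omitted)
}</span>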
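The listing above leaves out the serialization methods that any Writable has to provide. The following is only a sketch of what they might look like for these two fields. `WordRecord` is a hypothetical stand-in for Word that does not implement the Hadoop interface, so it compiles with the JDK alone; the `write` and `readFields` signatures are the ones the Writable interface declares, and `roundTrip` is a helper added purely for the demonstration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for Word: same two fields, no Hadoop dependency.
class WordRecord {
    String wordname = "";
    int wordfreq;

    // Hadoop instantiates Writables reflectively, so a no-arg constructor is needed.
    WordRecord() { }

    WordRecord(String wordname, int wordfreq) {
        this.wordname = wordname;
        this.wordfreq = wordfreq;
    }

    // Serialize the fields in a fixed order.
    void write(DataOutput out) throws IOException {
        out.writeUTF(wordname);
        out.writeInt(wordfreq);
    }

    // Read the fields back in exactly the order they were written.
    void readFields(DataInput in) throws IOException {
        wordname = in.readUTF();
        wordfreq = in.readInt();
    }

    // Demo helper: serialize to a byte array and deserialize into a fresh instance.
    static WordRecord roundTrip(WordRecord original) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            original.write(new DataOutputStream(buffer));
            WordRecord copy = new WordRecord();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The important property is that `readFields` mirrors `write` field by field; if the order differs, keys are silently corrupted during the shuffle.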

2. Define the group comparator. Its compare method compares word names in order to group Word instances.

<span style="font-size:14px;">public static class Wcgroup extends WritableComparator {
        protected Wcgroup() {
            // true: let WritableComparator create Word instances for the comparison
            super(Word.class, true);
        }
        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            Word lhs = (Word) a;
            Word rhs = (Word) b;
            // Group by word name only.
            return lhs.getWordname().compareTo(rhs.getWordname());
        }
    }</span>

3. Define the map. Split each input line on spaces, normalize each word, create a Word instance (with frequency 1), and emit it.

<span style="font-size:14px;">public static class Wcordermap extends Mapper<LongWritable, Text, Word, IntWritable> {
        Word outputKey;
        IntWritable outputValue = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String[] words = StringUtils.split(value.toString(), ' ');
            for (String thisword : words) {
                if (thisword.endsWith(".") || thisword.endsWith(",") || thisword.endsWith(":") || thisword.endsWith("!") || thisword.endsWith("?")) {
                    // Normalize the word: strip one trailing period, comma, colon, exclamation mark, or question mark.
                    thisword = thisword.substring(0, thisword.length() - 1);
                }
                outputKey = new Word(thisword, 1); // Create a Word instance with frequency 1 and emit it.
                context.write(outputKey, outputValue);
            }
        }
    }</span>
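The mapper's normalization removes at most one trailing punctuation character, so a token such as "what?!" would become "what?". As a hedged variation, the sketch below strips every trailing character from the same set. `WordNormalizer` and `normalize` are hypothetical names that do not appear in the original code.

```java
// Hypothetical helper: strips all trailing punctuation, not just one character.
class WordNormalizer {
    // Same punctuation set the mapper checks with endsWith.
    private static final String TRAILING = ".,:!?";

    static String normalize(String token) {
        int end = token.length();
        // Walk backwards while the last remaining character is punctuation.
        while (end > 0 && TRAILING.indexOf(token.charAt(end - 1)) >= 0) {
            end--;
        }
        return token.substring(0, end);
    }
}
```

A token made up only of punctuation normalizes to the empty string, so a caller would probably want to skip empty results instead of emitting them as words.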

4. Define the reduce. Count the records in each Word group and store the result in the key as that word's frequency.

<span style="font-size:14px;">public static class Wcorderreduce extends Reducer<Word, IntWritable, Word, IntWritable> {
        IntWritable outputValue = new IntWritable(0);

        @Override
        protected void reduce(Word key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sumcount = 0;
            // Count the records in this group; each record stands for one occurrence.
            for (IntWritable ignored : values) {
                sumcount++;
            }
            key.setWordfreq(sumcount); // Store the group size in the key as the word's frequency.
            outputValue.set(sumcount);
            context.write(key, outputValue);
        }
    }</span>

Program input: a text file containing "a b c a a"

Program output: (a,1), (b,1), (c,1), (a,2)

Problem: the word "a" is not fully grouped. The first "a" is not grouped with the fourth and fifth "a".

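Cause: the group comparator has no "global" view. It does not gather all records in a reduce task that compare as equal. It only compares the current record of the sorted data with the next one. If the two are not equal, the next record is taken to belong to a different group, and reduce is called with the records of the current group. Because every key has frequency 1 at sort time, sorting by frequency does nothing to place identical words next to each other. See http://wenda.chinahadoop.cn/question/2579 for details.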
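The adjacent-only comparison can be reproduced with a small plain-Java simulation. This is a sketch under stated assumptions, not Hadoop's implementation: `AdjacentGroupingSketch` is a hypothetical name, and the sort is assumed to be stable so that keys with equal frequency keep their input order (Hadoop makes no such promise for equal keys).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Simulates "sort by frequency, then group adjacent records by name".
class AdjacentGroupingSketch {
    static final class Key {
        final String name;
        final int freq;
        Key(String name, int freq) { this.name = name; this.freq = freq; }
    }

    // Returns one "(word,count)" string per reduce() call that would be triggered.
    static List<String> run(String line) {
        List<Key> keys = new ArrayList<>();
        for (String token : line.trim().split(" +")) {
            if (!token.isEmpty()) {
                keys.add(new Key(token, 1)); // map output: every key has frequency 1
            }
        }
        // Sort phase: compareTo only looks at freq, so all keys compare as equal.
        // List.sort is stable, so the input order is preserved (sketch assumption).
        keys.sort(Comparator.comparingInt((Key k) -> k.freq));

        List<String> reduceCalls = new ArrayList<>();
        Key previous = null;
        int groupSize = 0;
        for (Key current : keys) {
            // Grouping phase: only the previous and the current record are compared.
            if (previous != null && !previous.name.equals(current.name)) {
                reduceCalls.add("(" + previous.name + "," + groupSize + ")");
                groupSize = 0;
            }
            previous = current;
            groupSize++;
        }
        if (previous != null) {
            reduceCalls.add("(" + previous.name + "," + groupSize + ")");
        }
        return reduceCalls;
    }

    public static void main(String[] args) {
        System.out.println(run("a b c a a")); // prints [(a,1), (b,1), (c,1), (a,2)]
    }
}
```

The first "a" is closed off as its own group as soon as "b" follows it, which is the split seen in the program output above.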

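Solution: sorting the WordCount output by frequency cannot be achieved by changing compareTo, because the sort order produced by compareTo also determines how records are grouped.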

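Method 3: sort the output by word frequency using two MapReduce jobs. The first job is the one from Method 1 and produces the (word, frequency) counts. The second job sorts the results of the first.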

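The main code is as follows:

1. Following Method 1, compute the word count, output the (word, frequency) results, and store them in HDFS. Omitted here.

2. Run a map over the result of step 1 that emits the frequency as the key and the word as the value.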

<span style="font-size:14px;">public static class Wc2map extends Mapper<LongWritable, Text, IntWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String[] words = StringUtils.split(value.toString(), '\t'); // The first job's output separates columns with \t.
            if (words.length < 2 || words[0].isEmpty()) { // Skip lines without both columns (the author's guard against reading the empty "_SUCCESS" file).
                return;
            }
            Text outputValue = new Text(words[0]); // The word becomes the value.
            IntWritable outputKey = new IntWritable(Integer.parseInt(words[1])); // The frequency becomes the key.
            context.write(outputKey, outputValue); // Emit the pair.
        }
}</span>

3. Run the default Reducer. Its input is (frequency, word) pairs, and its output is sorted by frequency.

<span style="font-size:14px;">job2.setReducerClass(Reducer.class);</span>

Program input: a text file containing "a b c a a"

Program output: the first MapReduce job outputs (a,3), (b,1), (c,1). The second MapReduce job outputs (1,b), (1,c), (3,a).

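Note: if the second job runs only the map, with setNumReduceTasks(0), the output is not sorted by frequency, because the sort happens in the shuffle that feeds the reducers and is skipped when there are none.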
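What the second job does to the first job's output can be sketched in plain Java. This is a simulation, not the Hadoop job itself: `SortByFrequencySketch` is a hypothetical name, the list sort stands in for the shuffle, and the order of words with equal frequency comes from the sketch's stable sort, which Hadoop does not guarantee for values sharing a key.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Simulates the second job: swap (word, frequency) to (frequency, word), then sort by key.
class SortByFrequencySketch {
    static List<String> run(List<String> firstJobOutput) {
        List<String[]> swapped = new ArrayList<>();
        for (String line : firstJobOutput) {
            String[] cols = line.split("\t");
            if (cols.length < 2 || cols[0].isEmpty()) {
                continue; // same guard as the mapper: ignore lines without both columns
            }
            swapped.add(new String[] { cols[1], cols[0] }); // (frequency, word)
        }
        // Stand-in for the shuffle: order the pairs by their numeric key.
        swapped.sort(Comparator.comparingInt((String[] pair) -> Integer.parseInt(pair[0])));

        // Stand-in for the identity Reducer: write every pair out unchanged.
        List<String> output = new ArrayList<>();
        for (String[] pair : swapped) {
            output.add(pair[0] + "\t" + pair[1]);
        }
        return output;
    }
}
```

For the input lines "a\t3", "b\t1", and "c\t1", the sketch returns "1\tb", "1\tc", "3\ta", mirroring the second job's output listed above.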

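Method 4: sort the output by word frequency using a PriorityQueue. The final (word, frequency) pairs are put into a PriorityQueue, which orders them by frequency.

The main code is as follows:

1. Define the class Word, which implements WritableComparable. compareTo orders instances by frequency, and equals decides equality by word name.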

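<span style="font-size:14px;">public class Word implements WritableComparable<Word> {
    private String wordname;
    private int wordfreq;

    @Override
    public int compareTo(Word o) {
        // Ascending by frequency, so PriorityQueue.poll() returns the rarest word first,
        // matching the program output listed at the end of this method.
        return Integer.compare(wordfreq, o.getWordfreq());
    }
    public void incresefreq() {
        wordfreq++;
    }
    @Override
    public boolean equals(Object obj) {
        // Two Words are equal when their names match; ArrayList.indexOf relies on this.
        if (!(obj instanceof Word)) {
            return false;
        }
        return wordname.equals(((Word) obj).getWordname());
    }
    @Override
    public int hashCode() {
        return wordname.hashCode(); // keep hashCode consistent with equals
    }
    (constructors, get/set, toString, and the write/readFields serialization methods omitted)
}</span>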

2. Define the Mapper class. Build a Word from each word's name and store it in an ArrayList. If the Word is not yet in the ArrayList, insert it. If it is already there, increment its wordfreq by 1.

<span style="font-size:14px;">public static class Wcinmapmapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private ArrayList<Word> allword;
        private PriorityQueue<Word> queue;
        private Text outputKey = new Text();
        private IntWritable outputValue = new IntWritable();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] words = StringUtils.split(value.toString().trim(), ' ');
            for (String thisword : words) {
                if (thisword.endsWith(".") || thisword.endsWith(",") || thisword.endsWith(":") || thisword.endsWith("!") || thisword.endsWith("?")) {
                    thisword = thisword.substring(0, thisword.length() - 1);
                }
                Word wordclass = new Word(thisword, 1); // Build a Word.
                // indexOf uses Word.equals, so it returns the index of the Word with the same name, or -1 if there is none.
                int wordindex = allword.indexOf(wordclass);
                if (wordindex == -1) {
                    allword.add(wordclass); // Not seen yet: insert it into the ArrayList.
                } else {
                    // Already present: increment wordfreq on the stored instance in place.
                    allword.get(wordindex).incresefreq();
                }
            }
        }</span>

3. In the Mapper's cleanup, add the ArrayList to a PriorityQueue so that the Word instances are ordered by wordfreq.

<span style="font-size:14px;">@Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            queue = new PriorityQueue<Word>();
            queue.addAll(allword); // Put every Word in the queue, which orders them by compareTo.

            // Drain the queue. Loop on isEmpty() rather than counting up to queue.size():
            // size() shrinks with every poll(), so a counter would stop about halfway.
            while (!queue.isEmpty()) {
                Word head = queue.poll(); // Take the head of the queue and write it out.
                outputKey.set(head.getWordname());
                outputValue.set(head.getWordfreq());
                context.write(outputKey, outputValue);
            }
        }

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            allword = new ArrayList<Word>();
        }
    }</span>

Program input: a text file containing "a b c a a"

Program output: (b,1), (c,1), (a,3)

Note: with large data volumes, the PriorityQueue can consume a lot of memory.
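The aggregate-then-sort pattern of Method 4 can be tried outside Hadoop with the plain-Java sketch below. It swaps in a HashMap for the ArrayList, which is a different technique from the original: a hash lookup avoids the linear indexOf scan for every word. `InMapperAggregationSketch` is a hypothetical name, and breaking ties by word is an assumption added only to make the output deterministic.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Simulates in-mapper aggregation followed by a PriorityQueue drain in cleanup.
class InMapperAggregationSketch {
    static List<String> run(List<String> lines) {
        // State accumulated across map() calls: word -> count.
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String token : line.trim().split(" +")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }

        // Order by count ascending; ties are broken by word (sketch assumption).
        Comparator<Map.Entry<String, Integer>> byCountThenWord = (x, y) -> {
            int byCount = Integer.compare(x.getValue(), y.getValue());
            return byCount != 0 ? byCount : x.getKey().compareTo(y.getKey());
        };
        PriorityQueue<Map.Entry<String, Integer>> queue = new PriorityQueue<>(byCountThenWord);
        queue.addAll(counts.entrySet());

        // Stand-in for cleanup(): poll until the queue is empty.
        List<String> emitted = new ArrayList<>();
        while (!queue.isEmpty()) {
            Map.Entry<String, Integer> head = queue.poll();
            emitted.add("(" + head.getKey() + "," + head.getValue() + ")");
        }
        return emitted;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("a b c a a");
        System.out.println(run(lines)); // prints [(b,1), (c,1), (a,3)]
    }
}
```

The memory concern is the same as in the original: every distinct word seen by the task stays in memory until cleanup runs.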
