A `sequential` option decides whether the test runs locally; here we only cover the MapReduce path.
The source code is as follows (note `Map<String, List<String>>` for the parsed arguments):

private boolean runMapReduce(Map<String, List<String>> parsedArgs)
    throws IOException, InterruptedException, ClassNotFoundException {
  Path model = new Path(getOption("model"));
  HadoopUtil.cacheFiles(model, getConf());
  // the output key is the expected value, the output value are the scores for all the labels
  Job testJob = prepareJob(getInputPath(), getOutputPath(), SequenceFileInputFormat.class,
      BayesTestMapper.class, Text.class, VectorWritable.class, SequenceFileOutputFormat.class);
  boolean complementary = parsedArgs.containsKey("testComplementary");
  testJob.getConfiguration().set(COMPLEMENTARY, String.valueOf(complementary));
  boolean succeeded = testJob.waitForCompletion(true);
  return succeeded;
}
First the model is loaded from the training output and instantiated, which is simply reading back the vectors that were written during training. testJob uses only a map phase, shown below:
protected void map(Text key, VectorWritable value, Context context)
    throws IOException, InterruptedException {
  Vector result = classifier.classifyFull(value.get());
  // the key is the expected value
  context.write(new Text(key.toString().split("/")[1]), new VectorWritable(result));
}
The output key is the category text; the value is the input vector's score for each class.
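The mapper's `key.toString().split("/")[1]` relies on the SequenceFile keys having the form `/label/doc-id` (as produced by Mahout's vectorization step); splitting on `/` leaves an empty first element, so index 1 is the label. A minimal sketch of that extraction, with a hypothetical key:

```java
public class KeyLabelDemo {
    // "/sport/doc42".split("/") yields ["", "sport", "doc42"],
    // so index 1 is the expected label, matching the mapper above.
    static String extractLabel(String key) {
        return key.split("/")[1];
    }

    public static void main(String[] args) {
        System.out.println(extractLabel("/sport/doc42")); // prints "sport"
    }
}
```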
classifier.classifyFull() computes the score of the input vector for every label:
public Vector classifyFull(Vector instance) {
  Vector score = model.createScoringVector();
  for (int label = 0; label < model.numLabels(); label++) {
    score.set(label, getScoreForLabelInstance(label, instance));
  }
  return score;
}
getScoreForLabelInstance, shown below, sums the feature scores of the instance under the given label:
protected double getScoreForLabelInstance(int label, Vector instance) {
  double result = 0.0;
  Iterator<Element> elements = instance.iterateNonZero();
  while (elements.hasNext()) {
    Element e = elements.next();
    result += e.get() * getScoreForLabelFeature(label, e.index());
  }
  return result;
}
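The loop above is effectively a sparse dot product: for each non-zero feature of the instance, its value is multiplied by the per-label feature weight and accumulated. A plain-Java sketch, with a `Map` standing in for Mahout's sparse `Vector` and hypothetical weights:

```java
import java.util.HashMap;
import java.util.Map;

public class SparseScoreDemo {
    // Mirrors result += e.get() * getScoreForLabelFeature(label, e.index()),
    // iterating only the instance's non-zero entries.
    static double score(Map<Integer, Double> instance, Map<Integer, Double> labelFeatureScores) {
        double result = 0.0;
        for (Map.Entry<Integer, Double> e : instance.entrySet()) {
            result += e.getValue() * labelFeatureScores.getOrDefault(e.getKey(), 0.0);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, Double> instance = new HashMap<>();
        instance.put(3, 2.0);  // feature 3 occurs with weight 2.0
        instance.put(7, 1.0);
        Map<Integer, Double> weights = new HashMap<>();
        weights.put(3, -1.5);  // illustrative per-label log-weights
        weights.put(7, -0.5);
        System.out.println(score(instance, weights)); // 2.0*-1.5 + 1.0*-0.5 = -3.5
    }
}
```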
getScoreForLabelFeature has two variants.
1. Standard Bayes: log[(Wᵢ + αᵢ) / (ΣWᵢ + αᵢ·N)], where Wᵢ is the feature's weight under the label, ΣWᵢ the label's total weight, αᵢ the smoothing parameter, and N the number of features.
public double getScoreForLabelFeature(int label, int feature) {
  NaiveBayesModel model = getModel();
  return computeWeight(model.weight(label, feature), model.labelWeight(label),
      model.alphaI(), model.numFeatures());
}

public static double computeWeight(double featureLabelWeight, double labelWeight,
    double alphaI, double numFeatures) {
  double numerator = featureLabelWeight + alphaI;
  double denominator = labelWeight + alphaI * numFeatures;
  return Math.log(numerator / denominator);
}
2. Complementary Bayes, which scores a label using the weight mass of all classes other than that one.
// complementary bayes
public double getScoreForLabelFeature(int label, int feature) {
  NaiveBayesModel model = getModel();
  return computeWeight(model.featureWeight(feature), model.weight(label, feature),
      model.totalWeightSum(), model.labelWeight(label), model.alphaI(), model.numFeatures());
}

public static double computeWeight(double featureWeight, double featureLabelWeight,
    double totalWeight, double labelWeight, double alphaI, double numFeatures) {
  double numerator = featureWeight - featureLabelWeight + alphaI;
  double denominator = totalWeight - labelWeight + alphaI * numFeatures;
  return -Math.log(numerator / denominator);
}
The last step is analyze: for each key, the maximum of the score vector is found and compared against the expected label index, which yields the confusion matrix.
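That analyze step boils down to an argmax over each score vector followed by a confusion-matrix update at (expected, predicted). A minimal sketch with hypothetical labels and scores:

```java
public class AnalyzeDemo {
    // Index of the maximum score = the predicted label index.
    static int argmax(double[] scores) {
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) best = i;
        }
        return best;
    }

    public static void main(String[] args) {
        String[] labels = {"sport", "tech"};
        int[][] confusion = new int[2][2];     // rows: expected, cols: predicted
        double[] scores = {-3.5, -1.2};        // per-label scores for one document
        int expected = 0;                      // the document's true label: "sport"
        int predicted = argmax(scores);
        confusion[expected][predicted]++;      // this document lands in cell (sport, tech)
        System.out.println("predicted " + labels[predicted]); // prints "predicted tech"
    }
}
```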
http://hnote.org/big-data/mahout/mahout-testnaivebayesdriver-testnb