A `sequential` option decides whether the test runs locally; here we only cover the MapReduce path.
The source code is as follows (note `Map<String, List<String>>` for the parsed arguments):

private boolean runMapReduce(Map<String, List<String>> parsedArgs)
    throws IOException, InterruptedException, ClassNotFoundException {
  Path model = new Path(getOption("model"));
  HadoopUtil.cacheFiles(model, getConf());
  // the output key is the expected value, the output value are the scores for all the labels
  Job testJob = prepareJob(getInputPath(), getOutputPath(), SequenceFileInputFormat.class,
      BayesTestMapper.class, Text.class, VectorWritable.class, SequenceFileOutputFormat.class);
  boolean complementary = parsedArgs.containsKey("testComplementary");
  testJob.getConfiguration().set(COMPLEMENTARY, String.valueOf(complementary));
  boolean succeeded = testJob.waitForCompletion(true);
  return succeeded;
}
First the model is loaded from the training output and instantiated, which is simply reading back the vectors that were written during training. testJob uses only a map phase, shown below:
protected void map(Text key, VectorWritable value, Context context)
    throws IOException, InterruptedException {
  Vector result = classifier.classifyFull(value.get());
  // the key is the expected value
  context.write(new Text(key.toString().split("/")[1]), new VectorWritable(result));
}
The output key is the category text; the value is the input vector's score for each class.
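The mapper's `key.toString().split("/")[1]` relies on the SequenceFile keys having the form `/label/doc-id` (as produced by Mahout's vectorization step); splitting on `/` leaves an empty first element, so index 1 is the label. A minimal sketch of that extraction, with a hypothetical key:

```java
public class KeyLabelDemo {
    // "/sport/doc42".split("/") yields ["", "sport", "doc42"],
    // so index 1 is the expected label, matching the mapper above.
    static String extractLabel(String key) {
        return key.split("/")[1];
    }

    public static void main(String[] args) {
        System.out.println(extractLabel("/sport/doc42")); // prints "sport"
    }
}
```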
classifier.classifyFull() computes the score of the input vector for every label:
public Vector classifyFull(Vector instance) {
  Vector score = model.createScoringVector();
  for (int label = 0; label < model.numLabels(); label++) {
    score.set(label, getScoreForLabelInstance(label, instance));
  }
  return score;
}
getScoreForLabelInstance, shown below, sums the feature scores of the instance under the given label:
protected double getScoreForLabelInstance(int label, Vector instance) {
  double result = 0.0;
  Iterator<Element> elements = instance.iterateNonZero();
  while (elements.hasNext()) {
    Element e = elements.next();
    result += e.get() * getScoreForLabelFeature(label, e.index());
  }
  return result;
}
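The loop above is effectively a sparse dot product: for each non-zero feature of the instance, its value is multiplied by the per-label feature weight and accumulated. A plain-Java sketch, with a `Map` standing in for Mahout's sparse `Vector` and hypothetical weights:

```java
import java.util.HashMap;
import java.util.Map;

public class SparseScoreDemo {
    // Mirrors result += e.get() * getScoreForLabelFeature(label, e.index()),
    // iterating only the instance's non-zero entries.
    static double score(Map<Integer, Double> instance, Map<Integer, Double> labelFeatureScores) {
        double result = 0.0;
        for (Map.Entry<Integer, Double> e : instance.entrySet()) {
            result += e.getValue() * labelFeatureScores.getOrDefault(e.getKey(), 0.0);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, Double> instance = new HashMap<>();
        instance.put(3, 2.0);  // feature 3 occurs with weight 2.0
        instance.put(7, 1.0);
        Map<Integer, Double> weights = new HashMap<>();
        weights.put(3, -1.5);  // illustrative per-label log-weights
        weights.put(7, -0.5);
        System.out.println(score(instance, weights)); // 2.0*-1.5 + 1.0*-0.5 = -3.5
    }
}
```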
getScoreForLabelFeature has two variants.
1. Standard Bayes: log[(Wᵢ + αᵢ) / (ΣWᵢ + αᵢ·N)], where Wᵢ is the feature's weight under the label, ΣWᵢ the label's total weight, αᵢ the smoothing parameter, and N the number of features.
public double getScoreForLabelFeature(int label, int feature) {
  NaiveBayesModel model = getModel();
  return computeWeight(model.weight(label, feature), model.labelWeight(label),
      model.alphaI(), model.numFeatures());
}

public static double computeWeight(double featureLabelWeight, double labelWeight,
    double alphaI, double numFeatures) {
  double numerator = featureLabelWeight + alphaI;
  double denominator = labelWeight + alphaI * numFeatures;
  return Math.log(numerator / denominator);
}
2. Complementary Bayes, which scores a label using the weight mass of all classes other than that one.
// complementary bayes
public double getScoreForLabelFeature(int label, int feature) {
  NaiveBayesModel model = getModel();
  return computeWeight(model.featureWeight(feature), model.weight(label, feature),
      model.totalWeightSum(), model.labelWeight(label), model.alphaI(), model.numFeatures());
}

public static double computeWeight(double featureWeight, double featureLabelWeight,
    double totalWeight, double labelWeight, double alphaI, double numFeatures) {
  double numerator = featureWeight - featureLabelWeight + alphaI;
  double denominator = totalWeight - labelWeight + alphaI * numFeatures;
  return -Math.log(numerator / denominator);
}
The last step is analyze: for each key, the maximum of the score vector is found and compared against the expected label index, which yields the confusion matrix.
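That analyze step boils down to an argmax over each score vector followed by a confusion-matrix update at (expected, predicted). A minimal sketch with hypothetical labels and scores:

```java
public class AnalyzeDemo {
    // Index of the maximum score = the predicted label index.
    static int argmax(double[] scores) {
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) best = i;
        }
        return best;
    }

    public static void main(String[] args) {
        String[] labels = {"sport", "tech"};
        int[][] confusion = new int[2][2];     // rows: expected, cols: predicted
        double[] scores = {-3.5, -1.2};        // per-label scores for one document
        int expected = 0;                      // the document's true label: "sport"
        int predicted = argmax(scores);
        confusion[expected][predicted]++;      // this document lands in cell (sport, tech)
        System.out.println("predicted " + labels[predicted]); // prints "predicted tech"
    }
}
```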
http://hnote.org/big-data/mahout/mahout-testnaivebayesdriver-testnb