Spark TF-IDF

Reposted

mob6454cc7416d1 2023-10-26 23:31:15


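Whatever I use or learn, I forget, look up, and forget again. I would rather write it down here, where others can benefit from it too.

When implementing machine learning algorithms with Spark on a corpus or dataset of Chinese text, the text first has to be converted into a format such as Vector or LabeledPoint. TF-IDF is the tool for that.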

What is TF-IDF

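TF (Term Frequency): how often a word or phrase occurs in a given document. Its formula is:

$TF_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$

where the numerator is the number of times the word $t_i$ occurs in document $d_j$, and the denominator is the total number of occurrences of all words in that document.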

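IDF (Inverse Document Frequency): the fewer documents in the corpus contain a given word (the word whose TF*IDF value you want), the larger its IDF, which indicates that the word discriminates well between categories. The IDF of a word is obtained by dividing the total number of documents by the number of documents that contain the word, then taking the base-10 logarithm of the quotient:

$IDF_{i} = \log_{10}\frac{|D|}{|\{ j : t_{i} \in d_{j} \}|}$

where $|D|$ is the total number of documents in the corpus and $|\{ j : t_{i} \in d_{j} \}|$ is the number of documents that contain the word $t_i$.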

The TF-IDF value is then the product of TF and IDF:

$TFIDF_{i,j} = TF_{i,j} \times IDF_{i}$
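The three formulas above can be checked by hand with a few lines of plain Scala. This is only a minimal sketch of the textbook definitions, not what Spark computes; the object name `TfIdfSketch`, the pre-segmented corpus, and the choice to return 0.0 for a word that occurs in no document are all assumptions made for illustration.

```scala
object TfIdfSketch {
  // TF: occurrences of `term` in `doc` divided by the total number of tokens in `doc`.
  def tf(term: String, doc: Seq[String]): Double =
    if (doc.isEmpty) 0.0 else doc.count(_ == term).toDouble / doc.size

  // IDF with a base-10 logarithm, as in the formula above.
  // Returns 0.0 when no document contains the term (a choice made for this sketch
  // to avoid dividing by zero).
  def idf(term: String, corpus: Seq[Seq[String]]): Double = {
    val docFreq = corpus.count(_.contains(term))
    if (docFreq == 0) 0.0 else math.log10(corpus.size.toDouble / docFreq)
  }

  // TF-IDF is the product of the two.
  def tfIdf(term: String, doc: Seq[String], corpus: Seq[Seq[String]]): Double =
    tf(term, doc) * idf(term, corpus)

  def main(args: Array[String]): Unit = {
    // Hypothetical, already segmented corpus used only for illustration.
    val corpus = Seq(
      Seq("祖国"),
      Seq("中华人民共和国", "万岁"),
      Seq("祖国", "万岁"),
      Seq("中华", "文明", "万岁")
    )
    corpus.zipWithIndex.foreach { case (doc, j) =>
      doc.distinct.foreach { t =>
        println(s"doc $j, term $t: ${tfIdf(t, doc, corpus)}")
      }
    }
  }
}
```

A word that appears in every document gets an IDF of 0 under this definition, so its TF-IDF is 0 regardless of how often it occurs.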

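In practice, however, not every word or phrase needs a TF-IDF value. Words such as “的”, “我们”, and “在” are particles, prepositions, or short phrases that are treated as stop words, and they have to be filtered out with a dedicated tool. For Chinese word segmentation I use ANSJ, which largely meets my needs and is fairly accurate. ANSJ can be downloaded from: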

https://oss.sonatype.org/content/repositories/releases/org/ansj/ansj_seg/

https://oss.sonatype.org/content/repositories/releases/org/nlpcn/nlp-lang/

Code for Chinese word segmentation with ANSJ:

import java.io.InputStream
import java.util

import org.ansj.domain.Result
import org.ansj.recognition.impl.StopRecognition
import org.ansj.splitWord.analysis.ToAnalysis
import org.ansj.util.MyStaticValue
import org.apache.spark.{SparkConf, SparkContext}
import org.nlpcn.commons.lang.tire.domain.{Forest, Value}
import org.nlpcn.commons.lang.tire.library.Library
import org.nlpcn.commons.lang.util.IOUtil

class ChineseSegment extends Serializable {

  @transient private val sparkConf: SparkConf = new SparkConf().setAppName("chinese segment")
  @transient private val sparkContext: SparkContext = SparkContext.getOrCreate(sparkConf)

  private val stopLibRecog = new StopLibraryRecognition
  private val stopLib: util.ArrayList[String] = stopLibRecog.stopLibraryFromHDFS(sparkContext)
  private val selfStopRecognition: StopRecognition = stopLibRecog.stopRecognitionFilter(stopLib)

  private val dicUserLibrary = new DicUserLibrary
  @transient private val aListDicLibrary: util.ArrayList[Value] = dicUserLibrary.getUserLibraryList(sparkContext)
  @transient private val dirLibraryForest: Forest = Library.makeForest(aListDicLibrary)

  /** Chinese word segmentation and pattern recognition */
  def cNSeg(comment : String) : String = {

    val result: Result = ToAnalysis.parse(comment,dirLibraryForest).recognition(selfStopRecognition)
    result.toStringWithOutNature(" ")
  }


}


/** Stop-dictionary recognition:
  * Format: word  stop-word type [may be empty], separated by a Tab
  * For example:
  * #
  * v nature
  * .*了 regex
  *
  * */

class StopLibraryRecognition extends Serializable {

  def stopRecognitionFilter(arrayList: util.ArrayList[String]): StopRecognition ={

    MyStaticValue.isQuantifierRecognition = true // merge numerals with measure words

    val stopRecognition = new StopRecognition

    // Drop prepositions (p), interjections (e), conjunctions (c), pronouns (r), particles (u), strings (x), and onomatopoeia (o)
    stopRecognition.insertStopNatures("p", "e", "c", "r", "u", "x", "o")

    stopRecognition.insertStopNatures("w")  // drop punctuation

    // Drop tokens that start with a Chinese numeral and are at most three characters long; longer tokens are kept
    stopRecognition.insertStopRegexes("^一.{0,2}","^二.{0,2}","^三.{0,2}","^四.{0,2}","^五.{0,2}",
      "^六.{0,2}","^七.{0,2}","^八.{0,2}","^九.{0,2}","^十.{0,2}")

    stopRecognition.insertStopNatures("null") // drop empty tokens

    stopRecognition.insertStopRegexes(".{0,1}")  // drop tokens of at most one character

    stopRecognition.insertStopRegexes("^[a-zA-Z]{1,}")  // drop tokens made up only of English letters

    stopRecognition.insertStopWords(arrayList)  // add the stop words

    stopRecognition.insertStopRegexes("^[0-9]+") // drop tokens made up only of digits

    stopRecognition.insertStopRegexes("[^a-zA-Z0-9\\u4e00-\\u9fa5]+")  // drop tokens with no Chinese characters, English letters, or digits

    stopRecognition
  }


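  /** Stop-word file format, one word per line:
  导演
  上映
  终于
  加载
  中国*/
  def stopLibraryFromHDFS(sparkContext: SparkContext): util.ArrayList[String] ={
    /** Read the data in stop.dic (method two):
      * when running on a cluster, put the stop-word data on HDFS so that every node in the cluster can access the stop dictionary */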
    val stopLib: Array[String] = sparkContext.textFile("hdfs://zysdmaster000:8020/data/library/stop.dic").collect()
    val arrayList: util.ArrayList[String] = new util.ArrayList[String]()
    for (i<- 0 until stopLib.length)arrayList.add(stopLib(i))

    arrayList

  }
}


/** User-defined dictionary:
  * Format: word  part of speech  frequency
  * The word, part of speech, and frequency are separated by a Tab
  * For example:
  * 足球王者        define  1513
  * 妈妈咪呀2       define  1514
  * 黄金兄弟        define  1515
  * 江湖儿女        define  1516
  * 一生有你        define  1517
  *
  * */
class DicUserLibrary extends Serializable {

  def getUserLibraryList(sparkContext: SparkContext): util.ArrayList[Value] = {
    /** Read the data in userLibrary.dic (method two):
      * when running on a cluster, put the userLibrary data on HDFS so that every node in the cluster can access the user library */
    val va: Array[String] = sparkContext.textFile("hdfs://zysdmaster000:8020/data/library/userLibrary.dic").collect()
    val arrayList: util.ArrayList[Value] = new util.ArrayList[Value]()
    for (i <- 0 until va.length)arrayList.add(new Value(va(i)))
    arrayList
  }
}

The segmentation output:

[Figure: segmentation results]
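The regular-expression rules passed to insertStopRegexes can be tried out without ANSJ or a cluster. The sketch below is plain Scala and rests on one assumption: that each pattern must match the whole token, which is what the comments in stopRecognitionFilter imply. The object name `StopFilterSketch` and its helper methods are hypothetical and not part of ANSJ; part-of-speech filtering and the stop-word list are left out.

```scala
import java.util.regex.Pattern

object StopFilterSketch {
  private val chineseNumerals = Seq("一", "二", "三", "四", "五", "六", "七", "八", "九", "十")

  // The same regular expressions as in stopRecognitionFilter.
  private val stopPatterns: Seq[Pattern] = (
    chineseNumerals.map(d => s"^$d.{0,2}") ++
      Seq(
        ".{0,1}",                          // at most one character
        "^[a-zA-Z]{1,}",                   // only English letters
        "^[0-9]+",                         // only digits
        "[^a-zA-Z0-9\\u4e00-\\u9fa5]+"     // no Chinese characters, letters, or digits
      )
  ).map(p => Pattern.compile(p))

  // Assumption: a token is dropped when any pattern matches the whole token.
  def isStop(token: String): Boolean =
    stopPatterns.exists(_.matcher(token).matches())

  def keep(tokens: Seq[String]): Seq[String] = tokens.filterNot(isStop)

  def main(args: Array[String]): Unit = {
    val tokens = Seq("祖国", "的", "一个", "一生有你", "abc", "2023", "！！")
    println(keep(tokens).mkString(" "))
  }
}
```

Under that assumption, a four-character title such as “一生有你” survives the numeral rule because the pattern allows at most two characters after the numeral.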

After segmentation, the TF value of each word has to be computed. The HashingTF class is used here, and the resulting TF values are:

[Figure: TF values produced by HashingTF]

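Taking the result for label 1 as an example: 262144 is the number of buckets in the hash table, 198759 is the hash index of “祖国”, and 1.0 is the number of times “祖国” occurs.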
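The idea behind those three numbers is the hashing trick: each word is hashed to one of a fixed number of buckets, and the counts per bucket form a sparse vector. The following is a minimal sketch in plain Scala, assuming Scala's built-in `MurmurHash3.stringHash` as the hash function. Spark uses its own hash implementation, so the bucket indices here are not expected to reproduce 198759. `HashingTfSketch` and its methods are hypothetical names.

```scala
import scala.util.hashing.MurmurHash3

object HashingTfSketch {
  // 262144 = 2^18, the bucket count seen in the output above.
  val numFeatures: Int = 262144

  // Map a term to a non-negative bucket index in [0, n).
  def bucket(term: String, n: Int = numFeatures): Int = {
    val raw = MurmurHash3.stringHash(term) % n
    if (raw < 0) raw + n else raw
  }

  // Sparse term-frequency vector: bucket index -> raw count.
  // Two different terms that collide in one bucket have their counts added together.
  def termFrequencies(tokens: Seq[String], n: Int = numFeatures): Map[Int, Double] =
    tokens.groupBy(t => bucket(t, n)).map { case (idx, ts) => idx -> ts.size.toDouble }

  def main(args: Array[String]): Unit = {
    val tf = termFrequencies(Seq("祖国", "万岁", "祖国"))
    println(s"($numFeatures, ${tf.toSeq.sortBy(_._1)})")
  }
}
```

Note that the values are raw counts, not counts divided by document length, which is why the output above shows 1.0 for a word that occurs once.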

The TF-IDF values are obtained from the TF values with the IDF and IDFModel classes. The result is:

[Figure: TF-IDF values produced by IDFModel]

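Taking the result for label 1 as an example: 262144 is the number of buckets in the hash table, 198759 is the hash index of “祖国”, and 0.8472978603872037 is the computed TF-IDF value of “祖国”. The code for the whole program: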

import org.apache.spark.ml.feature.{HashingTF, IDF, IDFModel, Tokenizer}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
object TestML {

  def data(sparkContext: SparkContext): DataFrame ={

    val sqlContext = new SQLContext(sparkContext)
    import sqlContext.implicits._

    val chineseSegment = new ChineseSegment

    val data = sparkContext.parallelize(Seq(
      (1,"我爱我的祖国"),
      (2,"中华人民共和国万岁"),
      (3,"祖国万岁"),
      (4,"中华文明万岁"),
      (5,"我是小学生"),
      (6,"中华人民共和国正在雄起"))
    ).map{x =>
      val str = chineseSegment.cNSeg(x._2)
      (x._1,str)
    }.toDF("id","context")

    data

  }

  def computeTFIDF(dataFrame: DataFrame): Unit ={

    // Split the segmented text into an array of words
    val tokenizer: Tokenizer = new Tokenizer().setInputCol("context").setOutputCol("words")
    val wordData: DataFrame = tokenizer.transform(dataFrame)

    // Compute TF from the segmented words
    val hashingTF: HashingTF = new HashingTF().setInputCol("words").setOutputCol("tfvalues")
    val tfDFrame: DataFrame = hashingTF.transform(wordData)
//    tfDFrame.select("id","words","tfvalues").foreach(println)

    // Fit IDF on the TF values
    val idf: IDF = new IDF().setInputCol("tfvalues").setOutputCol("tfidfValues")
    val idfModel: IDFModel = idf.fit(tfDFrame)
    val tfidfDFrame: DataFrame = idfModel.transform(tfDFrame)

    // TF-IDF for each comment; collect first so that the rows print on the driver
    tfidfDFrame.select("id","words", "tfidfValues").collect().foreach(println)
  }

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("testml")
    val sparkContext = new SparkContext(sparkConf)
    val dataFrame = data(sparkContext)
    computeTFIDF(dataFrame)
  }
}
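The value 0.8472978603872037 reported for “祖国” does not come from the base-10 formula given earlier. Spark's IDF uses a smoothed natural logarithm, log((m + 1) / (df + 1)), where m is the number of documents and df the number of documents containing the term. The sketch below reproduces the number under the assumption that “祖国” survives segmentation in two of the six sample sentences (ids 1 and 3) and has a raw term count of 1.0; `SparkIdfSketch` is a hypothetical name.

```scala
object SparkIdfSketch {
  // Smoothed IDF with a natural logarithm: log((m + 1) / (df + 1)).
  def sparkIdf(numDocs: Long, docFreq: Long): Double =
    math.log((numDocs + 1.0) / (docFreq + 1.0))

  def main(args: Array[String]): Unit = {
    val tf = 1.0                 // raw count of the term in document 1
    val idf = sparkIdf(6, 2)     // 6 documents, 2 of them assumed to contain the term
    println(tf * idf)            // approximately 0.8473
  }
}
```

With this smoothing, a term that occurs in every document still gets an IDF of 0, and a term that occurs in none does not cause a division by zero.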

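The above computes TF-IDF values for Chinese words with Spark MLlib. A TF-IDF value expresses how important a word (or phrase) is to one document within a collection or corpus, but TF-IDF is not a universal measure of a word's importance in a document.

For example:

Let x be the number of documents that contain the word. The larger x is, the smaller the IDF and hence the smaller the TF-IDF value, and the less the word represents the document. The reverse also holds.

But when the word occurs in only one or a few documents and in none of the others, it can only represent the documents that contain it, not the ones that do not. TF-IDF therefore does not account for differences in a feature word's frequency across categories of documents (or across individual documents). The main problems with TF-IDF are: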

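1. It ignores how a feature word is distributed across categories.

A well-chosen feature word occurs often in one category and rarely in the others, so it differs strongly between categories. TF-IDF cannot tell whether a feature word is evenly distributed across the categories.

2. It ignores how a feature word is distributed among the documents within a single category.

Within one category, a feature word that is evenly distributed across the documents reflects the category well. If it occurs in only a few documents and not in the rest, it does not represent the category, however large its TF-IDF value is.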

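Therefore, for feature selection or dimensionality reduction on text features, methods such as chi-square or information gain are a better choice.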
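As an illustration of the chi-square approach, the statistic for one word and one category can be computed from a 2x2 table of document counts. This is a minimal sketch of the standard formula, not the author's method and not a Spark API; `ChiSquareSketch` and the parameter names are hypothetical.

```scala
object ChiSquareSketch {
  // a: documents in the category that contain the word
  // b: documents outside the category that contain the word
  // c: documents in the category that do not contain the word
  // d: documents outside the category that do not contain the word
  // chi^2 = N * (a*d - b*c)^2 / ((a+b) * (c+d) * (a+c) * (b+d))
  def chiSquare(a: Long, b: Long, c: Long, d: Long): Double = {
    val n = (a + b + c + d).toDouble
    val denom = (a + b).toDouble * (c + d) * (a + c) * (b + d)
    // An empty row or column makes the statistic undefined; this sketch returns 0.0.
    if (denom == 0.0) 0.0
    else n * math.pow(a.toDouble * d - b.toDouble * c, 2) / denom
  }

  def main(args: Array[String]): Unit = {
    // A word found in every document of one category and in none of the other.
    println(chiSquare(10, 0, 0, 10))
    // A word spread evenly over both categories carries no category information.
    println(chiSquare(5, 5, 5, 5))
  }
}
```

Unlike TF-IDF, the statistic depends on the category labels, so a word concentrated in one category scores high while an evenly spread word scores 0.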
