Introduction to TF-IDF

TF(t, d) = (number of times term t appears in document d) / (total number of terms in document d)

IDF(t) = log( total number of documents in the corpus / (number of documents containing term t + 1) )

TF-IDF(t, d) = TF(t, d) × IDF(t)

The formulas above define the TF-IDF algorithm. Let's start from a concrete example. Suppose we have a long article, "Bee Farming in China" (《中国的蜜蜂养殖》), and we want to extract its keywords programmatically.

One obvious idea is to find the words that occur most often: if a word is important, it should appear many times in the article. So we count each word's term frequency (TF).

You can probably guess the result: the most frequent words are "的", "是", "在" (roughly "of", "is", "in") and other extremely common function words. These are called stop words: words that contribute nothing to the result and must be filtered out.

Suppose we filter them all out and keep only the words that carry real meaning. We then run into another problem: we may find that "中国" (China), "蜜蜂" (bee), and "养殖" (farming) each occur the same number of times. Does that mean they are equally important as keywords?

Clearly not. "中国" is a very common word, while "蜜蜂" and "养殖" are comparatively rare. If all three appear the same number of times in an article, it is reasonable to conclude that "蜜蜂" and "养殖" matter more than "中国"; in other words, they should rank ahead of "中国" in the keyword ordering.

So we need an importance-adjusting factor that measures how common a word is. If a word is relatively rare overall but appears many times in this particular article, it very likely reflects what the article is about and is exactly the kind of keyword we want.

In statistical terms, on top of the term frequency we assign each word an "importance" weight. The most common words ("的", "是", "在") get the smallest weight, moderately common words ("中国") get a smaller weight, and rarer words ("蜜蜂", "养殖") get a larger weight. This weight is called the inverse document frequency (IDF), and it is inversely related to how common the word is.

Once we know a word's term frequency (TF) and inverse document frequency (IDF), multiplying the two gives its TF-IDF value. The more important a word is to the article, the larger its TF-IDF value, so the top-ranked words are the article's keywords.

Computing TF-IDF

1. Compute the term frequency (TF)

TF = number of times the term appears in the document

Since articles vary in length, the term frequency is normalized so that different articles can be compared:

TF = (number of times the term appears in the document) / (total number of terms in the document)

or

TF = (number of times the term appears in the document) / (number of occurrences of the document's most frequent term)
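
For concreteness, here is a minimal Scala sketch of these two normalizations. It is not part of the Spark example later in the post; the names termCounts, tfByTotal, and tfByMax are made up purely for illustration.

// Minimal sketch: term frequency for one tokenized document.
def termCounts(doc: Seq[String]): Map[String, Int] =
  doc.groupBy(identity).mapValues(_.size).toMap

// Variant 1: occurrences of the term / total number of terms in the document
def tfByTotal(doc: Seq[String]): Map[String, Double] =
  termCounts(doc).mapValues(_.toDouble / doc.size).toMap

// Variant 2: occurrences of the term / occurrences of the most frequent term
def tfByMax(doc: Seq[String]): Map[String, Double] = {
  val counts = termCounts(doc)
  val maxCount = counts.values.max.toDouble
  counts.mapValues(_ / maxCount).toMap
}

// Example: tfByTotal(Seq("bee", "bee", "china", "farming"))
//   => Map(bee -> 0.5, china -> 0.25, farming -> 0.25)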

2. Compute the inverse document frequency (IDF)

This requires a corpus that models how the language is actually used.

IDF = log( total number of documents in the corpus / (number of documents containing the term + 1) )

The more common a word is, the larger the denominator and the smaller the IDF, approaching 0. The +1 in the denominator avoids a zero denominator (the case where no document contains the word). log means taking the logarithm of the resulting value.
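
A minimal Scala sketch of this formula over a small in-memory corpus (again not part of the Spark example; the name idf and the corpus representation are assumptions for illustration). It uses a base-10 logarithm to match the worked example below, whereas Spark MLlib's IDF uses the natural logarithm.

// Minimal sketch: IDF over an in-memory corpus, each document already tokenized.
def idf(corpus: Seq[Seq[String]]): Map[String, Double] = {
  val totalDocs = corpus.size.toDouble
  // Document frequency: in how many documents each term appears at least once
  val docFreq = corpus.flatMap(_.distinct).groupBy(identity).mapValues(_.size)
  // The +1 in the denominator follows the formula above and guards against a zero denominator
  docFreq.mapValues(df => math.log10(totalDocs / (df + 1))).toMap
}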

3. Compute TF-IDF

TF-IDF = TF × IDF

As you can see, TF-IDF is proportional to how often a word appears in the document and inversely proportional to how often it appears across the language as a whole. The automatic keyword-extraction algorithm is therefore straightforward: compute the TF-IDF of every word in the document, sort in descending order, and take the top few words.
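
A compact, self-contained sketch of that procedure in plain Scala (a toy in-memory version only; all names are illustrative, and the full Spark implementation appears later in this post):

// Toy sketch: rank the terms of one document by TF-IDF and keep the top k.
def topKeywords(doc: Seq[String], corpus: Seq[Seq[String]], k: Int): Seq[(String, Double)] = {
  val totalDocs = corpus.size.toDouble
  val docFreq = corpus.flatMap(_.distinct).groupBy(identity).mapValues(_.size)
  val counts = doc.groupBy(identity).mapValues(_.size)
  val scored = counts.map { case (term, count) =>
    val tf  = count.toDouble / doc.size                                  // normalized term frequency
    val idf = math.log10(totalDocs / (docFreq.getOrElse(term, 0) + 1))   // inverse document frequency
    (term, tf * idf)
  }
  scored.toSeq.sortBy(-_._2).take(k)                                     // descending by TF-IDF, top k
}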

Returning to "Bee Farming in China": suppose the article is 1,000 words long and "中国", "蜜蜂", and "养殖" each appear 20 times, so each has a term frequency (TF) of 0.02. A Google search shows roughly 25 billion (250亿) pages containing "的"; take that as the total number of Chinese web pages. Pages containing "中国" number about 6.23 billion, pages containing "蜜蜂" about 48.4 million, and pages containing "养殖" about 97.3 million. Their IDF and TF-IDF values are then:


Word             Pages containing the word (×100 million)    IDF      TF-IDF
中国 (China)      62.3                                         0.603    0.0121
蜜蜂 (bee)        0.484                                        2.713    0.0543
养殖 (farming)    0.973                                        2.410    0.0482
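
To make one row concrete: using base-10 logarithms, and with the +1 in the denominator negligible at this scale (the counts are in units of 100 million pages),

IDF(蜜蜂) = log10(250 / 0.484) ≈ 2.713
TF-IDF(蜜蜂) = 0.02 × 2.713 ≈ 0.0543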

As the table shows, "蜜蜂" has the highest TF-IDF, "养殖" comes second, and "中国" is lowest. (If we also computed the TF-IDF of "的", it would be extremely close to 0.) So if we may pick only one word, "蜜蜂" is the keyword of this article.

Beyond automatic keyword extraction, TF-IDF is useful in many other settings. In information retrieval, for example, we can compute the TF-IDF of each search term ("中国", "蜜蜂", "养殖") in every document and sum them to get a per-document score; the document with the highest score is the one most relevant to the query.
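
A hedged sketch of that retrieval idea (the names are illustrative; it assumes each document already comes with a term-to-TF-IDF map, e.g. built from output like that of the Spark code below):

// Toy sketch: score each document by the sum of the TF-IDF values of the query terms,
// then return the documents in descending order of that score.
def rankByQuery(query: Seq[String],
                docTfidf: Seq[(String, Map[String, Double])]): Seq[(String, Double)] =
  docTfidf.map { case (docId, weights) =>
    (docId, query.map(term => weights.getOrElse(term, 0.0)).sum)
  }.sortBy(-_._2)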

The strengths of TF-IDF are that it is simple, fast, and gives results that match intuition fairly well. Its weakness is that measuring a word's importance purely by frequency is not comprehensive: sometimes an important word does not occur often. The algorithm also ignores word position, treating words near the beginning and words near the end as equally important, which is not accurate. (One fix is to give extra weight to the first paragraph of the article and to the first sentence of each paragraph, as sketched below.)
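
One hedged sketch of that positional fix, simplified to boost only the first paragraph (the factor 1.5 and the helper name weightedCounts are assumptions, not from the original):

// Toy sketch: give terms from the first paragraph extra weight before computing TF.
// paragraphs: the document split into paragraphs, each already tokenized.
def weightedCounts(paragraphs: Seq[Seq[String]],
                   firstParagraphBoost: Double = 1.5): Map[String, Double] =
  paragraphs.zipWithIndex.flatMap { case (tokens, i) =>
    val w = if (i == 0) firstParagraphBoost else 1.0
    tokens.map(t => (t, w))
  }.groupBy(_._1).mapValues(_.map(_._2).sum).toMap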

TF-IDF in Spark: a Worked Example

Official example link

http://spark.apache.org/docs/2.2.2/mllib-feature-extraction.html#tf-idf

Corpus download link

https://download.csdn.net/download/a805814077/14935841

Maven dependencies

<properties>
    <scala.version>2.11.8</scala.version>
    <spark.version>2.2.2</spark.version>
    <hadoop.version>2.7.6</hadoop.version>
</properties>
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-mllib_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
</dependencies>

TFIDF.scala

package ml

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg
import org.apache.spark.mllib.linalg.SparseVector
import org.apache.spark.rdd.RDD

/**
  * @Author Daniel
  * @Description Spark TF-IDF implementation
  *              Input file format: (file name, content)
  *              Output format: (file name, top 20 feature keywords)
  *              Approach:
  *              Implement TF-IDF with Spark MLlib. Because MLlib outputs the TF-IDF of every
  *              word in an article, the format needs converting, so zip is then used to
  *              assemble the final result.
  **/
object TFIDF {
  def main(args: Array[String]): Unit = {
    run()
  }

  def initialize(content: String): Seq[String] = {
    // English stop-word list
    val stopWords = Set("very", "ourselves", "am", "doesn", "through", "me", "against", "up", "just", "her", "ours",
      "couldn", "because", "is", "isn", "it", "only", "in", "such", "too", "mustn", "under", "their",
      "if", "to", "my", "himself", "after", "why", "while", "can", "each", "itself", "his", "all", "once",
      "herself", "more", "our", "they", "hasn", "on", "ma", "them", "its", "where", "did", "ll", "you",
      "didn", "nor", "as", "now", "before", "those", "yours", "from", "who", "was", "m", "been", "will",
      "into", "same", "how", "some", "of", "out", "with", "s", "being", "t", "mightn", "she", "again", "be",
      "by", "shan", "have", "yourselves", "needn", "and", "are", "o", "these", "further", "most", "yourself",
      "having", "aren", "here", "he", "were", "but", "this", "myself", "own", "we", "so", "i", "does", "both",
      "when", "between", "d", "had", "the", "y", "has", "down", "off", "than", "haven", "whom", "wouldn",
      "should", "ve", "over", "themselves", "few", "then", "hadn", "what", "until", "won", "no", "about",
      "any", "that", "for", "shouldn", "don", "do", "there", "doing", "an", "or", "ain", "hers", "wasn",
      "weren", "above", "a", "at", "your", "theirs", "below", "other", "not", "re", "him", "during", "which")
    // .r converts the string into a Regex; "[^0-9]*" matches strings made up entirely of non-digit characters
    val regexNum = "[^0-9]*".r
    // \\W+ is equivalent to [^A-Za-z0-9_], i.e. it matches any character that is not a letter (either case), digit, or underscore
    content.split("\\W+")
      .map(_.toLowerCase)
      // Keep only tokens that contain no digits
      .filter(regexNum.pattern.matcher(_).matches)
      // Drop stop words
      .filterNot(stopWords.contains)
      // Keep only words longer than 2 characters
      .filter(_.length > 2)
      .toSeq
  }

  def run() {
    val conf = new SparkConf()
      .setAppName("TFIDF")
      .setMaster("local[*]")
      // Overwrite the output directory if it already exists
      .set("spark.hadoop.validateOutputSpecs", "false")
    val sc = new SparkContext(conf)
    val path = "src/main/resources/electronics"
    /*
      A quick note on textFile vs. wholeTextFiles:
      textFile:
         designed for reading large files (it can read multiple files because the map output is written as multiple files);
         each line of a file becomes one element of the RDD, so data can be processed record by record within each partition.
      wholeTextFiles:
         designed for reading many small files across nested directories, so files are read whole rather than split into lines;
         each file becomes one record, returning pairs of the form [(key, val), (key, val), ...] where key is the file path and val is the file content.
     */
    val textRdd = sc.wholeTextFiles(path)
    // File names
    val titles = textRdd.map(_._1)
    // File contents, tokenized and cleaned
    val contents = textRdd.map(_._2).map(initialize)
    val hashingTF = new HashingTF()
    // Record the mapping from each word's hashed feature index to the word itself
    val mapWords = contents.flatMap(x => x).map(w => (hashingTF.indexOf(w), w)).collect.toMap
    // Transform the tokenized contents into term-frequency vectors
    val tf = hashingTF.transform(contents)
    // Create a broadcast variable for the index-to-word map
    val bcWords = tf.context.broadcast(mapWords)
    // Cache the RDD (it is reused by both IDF fit and transform)
    tf.cache()
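    // minDocFreq = 2: terms that appear in fewer than 2 documents get an IDF of 0 and are effectively ignored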
    val idf = new IDF(2).fit(tf)
    // TF-IDF weights
    val tfidf: RDD[linalg.Vector] = idf.transform(tf)
    val r = tfidf.map {
      case SparseVector(size, indices, values) =>
        // Use mapWords (via the broadcast variable) to turn each index back into its word; format: (word, tfidf weight)
        val words = indices.map(index => bcWords.value.getOrElse(index, "null"))
        // Take the top 20 words by TF-IDF weight
        words.zip(values).sortBy(-_._2).take(20).toSeq
    }
    // titles and r are both produced by map over the same RDD, so their partitioning is preserved and they can be zipped, pairing each title with its list of (word, tfidf weight)
    titles.zip(r).saveAsTextFile("output")
  }


}

Partial output

(file:/C:/project/scala/src/main/resources/electronics/53968,WrappedArray((ctstateu,13.770169644534128), (receiver,13.073202832782895), (remote,11.337378995885143), (remotes,9.755477241259648), (contained,9.755477241259648), (wcsub,9.180113096356086), (uses,8.522570080106352), (purdue,8.369182880139757), (ecn,8.369182880139757), (dynamo,8.369182880139757), (carrier,7.558252663923429), (specs,6.982888519019867), (manufacturer,6.536601416391448), (ground,6.4075243741163055), (detector,6.2862751304834354), (shack,6.171958302803538), (control,5.863656943149022), (build,5.3610280865872095), (different,5.084727409625575), (infra,4.877738620629824)))
(file:/C:/project/scala/src/main/resources/electronics/53972,WrappedArray((oversampling,41.310508933602385), (player,20.922957200349394), (filter,20.02376147809605), (netcom,17.892278182497815), (times,17.892278182497815), (mcmahan,17.467651987455334), (filtering,16.12176304097048), (filters,15.116505327846857), (samples,14.633215861889472), (interpolation,13.770169644534128), (dave,13.073202832782895), (higher,12.572550260966871), (kolstad,11.337378995885143), (method,10.234204655509192), (rate,9.804902124587171), (spec,9.180113096356086), (lets,9.180113096356086), (frequency,8.655925367818853), (points,8.06088152048524), (cpu,8.06088152048524)))
(file:/C:/project/scala/src/main/resources/electronics/53981,WrappedArray((current,13.99148539474994), (temp,8.733825993727667), (thomas,8.06088152048524), (voltage,7.827165237934379), (case,5.596594157899976), (must,5.4365087425529035), (rms,4.877738620629824), (voltatge,4.877738620629824), (designated,4.877738620629824), (maximum,4.877738620629824), (stay,4.877738620629824), (bandgap,4.877738620629824), (berkeley,4.590056548178043), (uclink,4.590056548178043), (federal,4.590056548178043), (ripple,4.590056548178043), (acollins,4.590056548178043), (vdc,4.366912996863833), (limiter,4.366912996863833), (limiting,4.366912996863833)))
(file:/C:/project/scala/src/main/resources/electronics/53983,WrappedArray((sound,16.78978247369993), (adcom,14.633215861889472), (better,9.251157052673149), (greg,8.733825993727667), (improved,4.877738620629824), (factory,4.877738620629824), (replacing,4.877738620629824), (gather,4.877738620629824), (mods,4.877738620629824), (dumped,4.877738620629824), (class,4.1845914400698785), (appear,4.03044076024262), (folks,4.03044076024262), (art,4.03044076024262), (cover,4.03044076024262), (etc,3.974733725467319), (agree,3.8969093676180977), (necessary,3.8969093676180977), (bottom,3.8969093676180977), (mean,3.7791263319617143)))
(file:/C:/project/scala/src/main/resources/electronics/53984,WrappedArray((octave,39.02190896503859), (noise,27.77381236261592), (pink,20.922957200349394), (plumpe,19.510954482519296), (bands,13.770169644534128), (power,11.31389421296813), (constant,9.755477241259648), (oasys,9.755477241259648), (navy,9.755477241259648), (mil,8.369182880139757), (wayne,8.06088152048524), (band,6.982888519019867), (network,6.536601416391448), (thus,6.4075243741163055), (amount,6.2862751304834354), (per,6.063823860262986), (white,5.863656943149022), (frequency,5.770616911879236), (audio,5.681713386737568), (david,5.28829279824546)))