Spark with ClickHouse: Spark association rule algorithms

Reposted

mob64ca140a1f7c 2023-10-01 09:11:34

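The idea behind association rule algorithms is to find the frequent itemsets first and then derive strong association rules from them.
Some basic concepts, for a rule A -> B:
1. Confidence: P(B|A), the probability that B occurs among the events in which A occurs, that is, P(AB)/P(A). Market basket example: milk ⇒ bread.
2. Support: P(A ∩ B), the probability that A and B occur together.
Suppose the support is 3% and the confidence is 40%.
A support of 3% means that 3% of customers buy both milk and bread.
A confidence of 40% means that 40% of the customers who buy milk also buy bread.
3. If A contains k items, A is called a k-itemset. A k-itemset that meets the minimum support threshold is called a frequent k-itemset.
4. A rule that meets both the minimum support threshold and the minimum confidence threshold is called a strong rule.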
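
The two definitions above can be checked on a handful of baskets. The following is a minimal sketch in plain Scala (no Spark needed); the object name `BasketMetrics`, the helper names, and the five toy baskets are assumptions made for illustration and do not come from the original article.

```scala
object BasketMetrics {
  // Each transaction is the set of items in one basket.
  type Transaction = Set[String]

  // support(X): fraction of transactions that contain every item of X.
  def support(transactions: Seq[Transaction], items: Set[String]): Double =
    transactions.count(t => items.subsetOf(t)).toDouble / transactions.size

  // confidence(A => B) = support(A and B together) / support(A), i.e. P(B | A).
  def confidence(transactions: Seq[Transaction], a: Set[String], b: Set[String]): Double =
    support(transactions, a ++ b) / support(transactions, a)

  // Hypothetical toy data: five baskets.
  val baskets: Seq[Transaction] = Seq(
    Set("milk", "bread"),
    Set("milk", "bread", "eggs"),
    Set("milk"),
    Set("bread"),
    Set("eggs")
  )
}
```

With these baskets, milk and bread appear together in 2 of 5 transactions, so the support is 2/5; milk appears in 3 transactions, so the confidence of milk ⇒ bread is 2/3.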

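The idea of the Apriori algorithm

(The strong rules it produces must satisfy the given minimum support and minimum confidence.)

Apriori derives k-itemsets from (k-1)-itemsets. First it finds the set of frequent 1-itemsets, denoted L1. L1 is used to find L2, the set of frequent 2-itemsets, L2 is used to find L3, and so on until no frequent k-itemset can be found. Finding each Lk requires one full scan of the database, which is the algorithm's biggest drawback.

The core of the algorithm is the join step and the prune step. The join step is a self-join of L(k-1): two (k-1)-itemsets are joined only if their first k-2 items are identical, and the items are kept in lexicographic order. The prune step relies on the property that every non-empty subset of a frequent itemset must also be frequent. Conversely, if some non-empty subset of a candidate is not frequent, the candidate cannot be frequent either, so it can be removed from Ck (the set of candidate k-itemsets).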
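
The join and prune steps can be sketched as follows. This is an illustrative plain-Scala version, not code from any library; the object name `AprioriCandidates` and its method names are assumptions. Itemsets are represented as lexicographically sorted sequences so that "the first k-2 items" is well defined.

```scala
object AprioriCandidates {
  // An itemset is a lexicographically sorted sequence of item names.
  type Itemset = Seq[String]

  // Join step: combine two frequent (k-1)-itemsets that agree on their
  // first k-2 items; the ordering check avoids producing duplicates.
  def join(prev: Set[Itemset]): Set[Itemset] =
    for {
      a <- prev
      b <- prev
      if a.init == b.init && a.last < b.last
    } yield a :+ b.last

  // Prune step: keep a candidate only if every (k-1)-subset, obtained by
  // dropping one item, is itself frequent.
  def prune(candidates: Set[Itemset], prev: Set[Itemset]): Set[Itemset] =
    candidates.filter { c =>
      c.indices.forall(i => prev.contains(c.patch(i, Nil, 1)))
    }

  // Ck = prune(join(L(k-1)))
  def candidates(prev: Set[Itemset]): Set[Itemset] = prune(join(prev), prev)
}
```

For example, with L2 = {AB, AC, BC, BD} the join step yields ABC and BCD. BCD is pruned because its subset CD is not in L2, so C3 = {ABC}. The support of each remaining candidate still has to be counted in a database scan before it is accepted into L3.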

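The classic example below walks through the steps of the Apriori algorithm:

[Figure: worked example of the Apriori algorithm; image not preserved]

The example above only computes the support of the frequent itemsets; it does not compute their confidence.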

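Basic concepts

1. Items and itemsets
This is a set concept. Each product in a shopping basket is an item, and a collection of items is an itemset. For example, {beer, diapers} is a 2-itemset.
2. Association rules
Association rules express associations hidden in the data, for example that customers who buy diapers often also buy beer. The strength of an association is controlled and evaluated with three measures: support, confidence, and lift.
3. Support
Support is the probability that {X, Y} appears among all transactions, that is, the probability that a transaction contains both X and Y:
support(X ⇒ Y) = (number of transactions containing both X and Y) / (total number of transactions)
With a minimum threshold of 5%, {diapers, beer} has a support of 800/10000 = 8%, which meets the threshold, so it becomes a frequent itemset and the rule is kept. {diapers, bread} has a support of 100/10000 = 1% and is discarded.
4. Confidence
Confidence is the probability that the consequent Y occurs given that the antecedent X has occurred:
confidence(X ⇒ Y) = P(Y | X) = support(X ⇒ Y) / support(X)
This is the second threshold for producing strong association rules, and it measures how reliable the rule is in terms of "quality". As with support, a minimum confidence threshold (mincon) is set for further filtering.
With a minimum confidence threshold of 70%, take {diapers, beer} as an example: the confidence that a customer who buys diapers also buys beer is 800/1000 = 80%, so the rule is kept; the confidence that a customer who buys beer also buys diapers is 800/2000 = 40%, so that rule is discarded.
5. Lift
Lift is the ratio between the probability that a transaction contains Y given that it contains X and the probability that a transaction contains Y without that condition: lift(X ⇒ Y) = confidence(X ⇒ Y) / support(Y), for example confidence(artichok=>cracker) / support(cracker). Like confidence, lift measures the reliability of a rule, and it can be seen as a complement to confidence.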
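
The three measures can be computed directly from the counts used in the diaper and beer example (10000 transactions, 1000 with diapers, 2000 with beer, 800 with both). The sketch below is plain Scala; the object name `LiftExample` and the field names are assumptions made for illustration.

```scala
object LiftExample {
  // Counts taken from the diaper/beer example in the text.
  val total = 10000.0   // all transactions
  val diapers = 1000.0  // transactions containing diapers
  val beer = 2000.0     // transactions containing beer
  val both = 800.0      // transactions containing diapers and beer

  // support({diapers, beer}) = 800 / 10000
  val support: Double = both / total

  // confidence(diapers => beer) = 800 / 1000
  val confDiapersToBeer: Double = both / diapers

  // confidence(beer => diapers) = 800 / 2000
  val confBeerToDiapers: Double = both / beer

  // lift(diapers => beer) = confidence(diapers => beer) / support(beer)
  val liftDiapersToBeer: Double = confDiapersToBeer / (beer / total)
}
```

Here the lift works out to 0.8 / 0.2 = 4. A lift above 1 suggests that the two items appear together more often than they would if they were bought independently.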

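The FPGrowth algorithm

1) Scan the transaction database D once. Collect the set of frequent items F and their supports. Sort F in descending order of support; the result is the frequent item list L.
2) Create the root node of the FP-tree and label it "null". For each transaction Trans in D, do the following: select the frequent items in Trans and sort them in the order of L. Let the sorted frequent item list be [p | P], where p is the first element and P is the list of remaining elements. Call insert_tree([p | P], T). The procedure works as follows. If T has a child N such that N.item-name = p.item-name, increment N's count by 1; otherwise create a new node N, set its count to 1, link it to its parent T, and link it through the node-link structure to the nodes with the same item-name. If P is non-empty, call insert_tree(P, N) recursively.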
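
The two passes described above can be sketched in plain Scala. This is a simplified, single-machine illustration and not Spark's FPTree class; the names `FPTreeSketch`, `Node`, `Tree`, `header`, and `build` are assumptions, and ties in support are broken alphabetically, which the text does not specify.

```scala
import scala.collection.mutable

object FPTreeSketch {
  // One FP-tree node: an item name, a count, and children keyed by item name.
  final class Node(val item: String, val parent: Option[Node]) {
    var count: Long = 0L
    val children: mutable.Map[String, Node] = mutable.Map.empty
  }

  final class Tree {
    val root = new Node("null", None)
    // Node links: for each item name, all nodes in the tree that carry it.
    val header: mutable.Map[String, mutable.ArrayBuffer[Node]] = mutable.Map.empty

    // insert_tree([p | P], T): reuse or create the child for p, bump its
    // count, then recurse on the remaining items P below that child.
    def insert(items: List[String], t: Node): Unit = items match {
      case Nil => ()
      case p :: rest =>
        val n = t.children.getOrElseUpdate(p, {
          val created = new Node(p, Some(t))
          header.getOrElseUpdate(p, mutable.ArrayBuffer.empty) += created
          created
        })
        n.count += 1
        insert(rest, n)
    }
  }

  def build(transactions: Seq[Seq[String]], minCount: Int): Tree = {
    // Pass 1: count items, keep the frequent ones, order by descending count.
    val counts = transactions.flatMap(_.distinct).groupBy(identity)
      .map { case (item, occurrences) => item -> occurrences.size }
    val rank: Map[String, Int] = counts.filter(_._2 >= minCount).toSeq
      .sortBy { case (item, c) => (-c, item) }
      .map(_._1).zipWithIndex.toMap
    // Pass 2: insert each transaction's frequent items in the order of L.
    val tree = new Tree
    transactions.foreach { t =>
      val ordered = t.distinct.filter(rank.contains).sortBy(item => rank(item))
      tree.insert(ordered.toList, tree.root)
    }
    tree
  }
}
```

For the transactions (a, b), (b, c), (b, a, d) with a minimum count of 2, only b (count 3) and a (count 2) are frequent, so the tree is a single path: null, then b with count 3, then a with count 2.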

Worked example

[Figures: seven images illustrating the worked example; images not preserved]

Source code analysis

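// FPGrowth.run: entry point that mines frequent itemsets from an RDD of transactions.
def run[Item: ClassTag](data: RDD[Array[Item]]): FPGrowthModel[Item] = {
  if (data.getStorageLevel == StorageLevel.NONE) {
    logWarning("Input data is not cached.")
  }
  val count = data.count()
  // Convert the relative minSupport into an absolute number of transactions.
  val minCount = math.ceil(minSupport * count).toLong
  val numParts = if (numPartitions > 0) numPartitions else data.partitions.length
  val partitioner = new HashPartitioner(numParts)
  // Step 1: frequent single items, sorted by descending count.
  val freqItems = genFreqItems(data, minCount, partitioner)
  // Step 2: frequent itemsets, mined from FP-trees built per partition.
  val freqItemsets = genFreqItemsets(data, minCount, freqItems, partitioner)
  new FPGrowthModel(freqItemsets)
}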

// Counts every item across all transactions, keeps the items whose count is at
// least minCount, and returns them sorted by descending count.
private def genFreqItems[Item: ClassTag](
    data: RDD[Array[Item]],
    minCount: Long,
    partitioner: Partitioner): Array[Item] = {
  data.flatMap { t =>
    // A transaction must not contain the same item twice.
    val uniq = t.toSet
    if (t.length != uniq.size) {
      throw new SparkException(s"Items in a transaction must be unique but got ${t.toSeq}.")
    }
    t
  }.map(v => (v, 1L))
    .reduceByKey(partitioner, _ + _)
    .filter(_._2 >= minCount)
    .collect()
    .sortBy(-_._2)
    .map(_._1)
}

// Builds one FP-tree per partition from conditional transactions and extracts
// the frequent itemsets from each tree.
private def genFreqItemsets[Item: ClassTag](
    data: RDD[Array[Item]],
    minCount: Long,
    freqItems: Array[Item],
    partitioner: Partitioner): RDD[FreqItemset[Item]] = {
  // Rank 0 is the most frequent item; the trees store ranks instead of items.
  val itemToRank = freqItems.zipWithIndex.toMap
  data.flatMap { transaction =>
    genCondTransactions(transaction, itemToRank, partitioner)
  }.aggregateByKey(new FPTree[Int], partitioner.numPartitions)(
    (tree, transaction) => tree.add(transaction, 1L),
    (tree1, tree2) => tree1.merge(tree2))
  .flatMap { case (part, tree) =>
    // The predicate restricts extraction to ranks that hash to this partition.
    tree.extract(minCount, x => partitioner.getPartition(x) == part)
  }.map { case (ranks, count) =>
    // Map the ranks back to the original items.
    new FreqItemset(ranks.map(i => freqItems(i)).toArray, count)
  }
}
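
The listing above calls genCondTransactions without showing it. The sketch below illustrates what a helper of that kind might do, under the assumption that each partition should receive the prefix of the rank-sorted transaction that ends at the last item it owns. It is a simplified plain-Scala illustration, not the Spark source; `CondTransactionsSketch`, `condTransactions`, and the `partitionOf` function (standing in for `partitioner.getPartition`) are hypothetical names.

```scala
object CondTransactionsSketch {
  // Returns, for each partition id, the conditional transaction (a prefix of
  // the rank-sorted transaction) that this partition needs to build its tree.
  def condTransactions(
      transaction: Seq[String],
      itemToRank: Map[String, Int],
      partitionOf: Int => Int): Map[Int, Seq[Int]] = {
    // Drop infrequent items and replace each item by its rank (0 = most frequent).
    val ranks = transaction.flatMap(item => itemToRank.get(item)).sorted
    // Walk from the least frequent item backwards. The first time a partition
    // is seen, it receives the prefix that ends at that item.
    ranks.indices.reverse.foldLeft(Map.empty[Int, Seq[Int]]) { (out, i) =>
      val part = partitionOf(ranks(i))
      if (out.contains(part)) out else out + (part -> ranks.take(i + 1))
    }
  }
}
```

For example, with ranks a = 0, b = 1, c = 2 and a hypothetical partition function rank % 2, the transaction (c, x, a, b) becomes the rank list (0, 1, 2). Partition 0 receives (0, 1, 2) and partition 1 receives (0, 1); the infrequent item x is dropped.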

// FPGrowthModel.generateAssociationRules: derives rules from the mined frequent itemsets.
def generateAssociationRules(confidence: Double): RDD[AssociationRules.Rule[Item]] = {
  val associationRules = new AssociationRules(confidence)
  associationRules.run(freqItemsets)
}

// AssociationRules.run: generates rules with a single item as the consequent.
def run[Item: ClassTag](freqItemsets: RDD[FreqItemset[Item]]): RDD[Rule[Item]] = {
  // For candidate rule X => Y, generate (X, (Y, freq(X union Y)))
  val candidates = freqItemsets.flatMap { itemset =>
    val items = itemset.items
    items.flatMap { item =>
      // Split the itemset into one consequent item and the remaining antecedent.
      items.partition(_ == item) match {
        case (consequent, antecedent) if !antecedent.isEmpty =>
          Some((antecedent.toSeq, (consequent.toSeq, itemset.freq)))
        case _ => None
      }
    }
  }

  // Join to get (X, ((Y, freq(X union Y)), freq(X))), generate rules, and filter by confidence
  candidates.join(freqItemsets.map(x => (x.items.toSeq, x.freq)))
    .map { case (antecedent, ((consequent, freqUnion), freqAntecedent)) =>
      new Rule(antecedent.toArray, consequent.toArray, freqUnion, freqAntecedent)
    }.filter(_.confidence >= minConfidence)
}
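
The same candidate generation and confidence filter can be sketched without Spark. The version below is a local illustration on an in-memory map, not the library implementation; `RulesSketch`, its `Rule` case class, and `rules` are assumed names.

```scala
object RulesSketch {
  // A rule X => y with a single-item consequent, as in the listing above.
  final case class Rule(antecedent: Set[String], consequent: String, confidence: Double)

  def rules(freq: Map[Set[String], Long], minConfidence: Double): Seq[Rule] =
    for {
      (items, freqUnion) <- freq.toSeq
      if items.size > 1                  // a rule needs a non-empty antecedent
      y <- items.toSeq                   // try every item as the consequent
      x = items - y                      // the remaining items form the antecedent
      freqX <- freq.get(x).toSeq         // plays the role of the join on the antecedent
      conf = freqUnion.toDouble / freqX  // confidence = freq(X union Y) / freq(X)
      if conf >= minConfidence
    } yield Rule(x, y, conf)
}
```

With the counts {a}: 15, {b}: 35, {a, b}: 12 and a minimum confidence of 0.8, the candidate a ⇒ b has confidence 12/15 = 0.8 and is kept, while b ⇒ a has confidence 12/35 and is dropped.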

Examples

FP-growth:

import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

// Assumes a spark-shell session, where sc is the predefined SparkContext.
val data = sc.textFile("data/mllib/sample_fpgrowth.txt")

// One transaction per line, items separated by a single space.
val transactions: RDD[Array[String]] = data.map(s => s.trim.split(' '))

val fpg = new FPGrowth()
  .setMinSupport(0.2)
  .setNumPartitions(10)
val model = fpg.run(transactions)

// Print every frequent itemset with its absolute frequency.
model.freqItemsets.collect().foreach { itemset =>
  println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
}

// Print every rule whose confidence is at least minConfidence.
val minConfidence = 0.8
model.generateAssociationRules(minConfidence).collect().foreach { rule =>
  println(
    rule.antecedent.mkString("[", ",", "]")
      + " => " + rule.consequent.mkString("[", ",", "]")
      + ", " + rule.confidence)
}

Association Rules:

import org.apache.spark.mllib.fpm.AssociationRules
import org.apache.spark.mllib.fpm.FPGrowth.FreqItemset

// Assumes a spark-shell session, where sc is the predefined SparkContext.
// Frequent itemsets with their absolute frequencies, supplied directly.
val freqItemsets = sc.parallelize(Seq(
  new FreqItemset(Array("a"), 15L),
  new FreqItemset(Array("b"), 35L),
  new FreqItemset(Array("a", "b"), 12L)
))

val ar = new AssociationRules()
  .setMinConfidence(0.8)
val results = ar.run(freqItemsets)

// Print each rule as [antecedent=>consequent],confidence.
results.collect().foreach { rule =>
  println("[" + rule.antecedent.mkString(",")
    + "=>"
    + rule.consequent.mkString(",") + "]," + rule.confidence)
}
