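Neighborhood-based algorithms are the most basic algorithms in personalized recommender systems. They have been studied in depth in academia and are widely used in industry. They fall into two classes: user-based collaborative filtering and item-based collaborative filtering. This article focuses on item-based collaborative filtering and ALS-based collaborative filtering.
I. Item-based collaborative filtering
1. Basic idea
ItemCF analyzes users' historical behavior records to compute the similarity between items: if most users who like item A also like item B, then A and B are considered similar to some degree. This makes it easy to give a reasonable explanation for a recommendation. For example, if you have bought "Introduction to Data Mining", you may be recommended "Machine Learning".
2. Similarity measures
How should the similarity between items be measured? Commonly used measures include co-occurrence similarity, Euclidean distance, the Pearson correlation coefficient, cosine similarity, and Jaccard similarity, as described below.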
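2.1 Co-occurrence similarity
The co-occurrence similarity is computed as follows, where N(x) denotes the set of users who like item x:
$$W(x, y) = \frac{|N(x) \cap N(y)|}{|N(x)|}$$
In the formula, the denominator is the number of users who like item x, and the numerator is the number of users interested in both x and y. The formula can therefore be read as the probability that a user interested in x is also interested in y (similar to association rules).
However, this formula has a problem: if item y is popular and liked by many people, W(x, y) becomes large, close to 1. As a result, every item ends up highly similar to popular items.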
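2.2 Improved co-occurrence similarity
To counter the influence of popular items on co-occurrence similarity, the improved formula penalizes the weight of item y, which makes popular items less likely to appear similar to many items. With N(x) denoting the set of users who like item x, the improved formula is:
$$W(x, y) = \frac{|N(x) \cap N(y)|}{\sqrt{|N(x)|\,|N(y)|}}$$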
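The effect of the popularity penalty can be checked on a small example. The sketch below is illustrative only: the item names and the `likedBy` data are made up, and the object and function names are not part of the article's code.

```scala
object CooccurrenceSketch {
  // Hypothetical toy data: item -> ids of the users who liked it.
  val likedBy: Map[String, Set[Int]] = Map(
    "niche"   -> Set(1, 2),
    "popular" -> Set(1, 2, 3, 4, 5, 6, 7, 8)
  )

  // Plain co-occurrence: |N(x) ∩ N(y)| / |N(x)|
  def cooccurrence(x: String, y: String): Double =
    (likedBy(x) & likedBy(y)).size.toDouble / likedBy(x).size

  // Improved co-occurrence: |N(x) ∩ N(y)| / sqrt(|N(x)| * |N(y)|)
  def improvedCooccurrence(x: String, y: String): Double =
    (likedBy(x) & likedBy(y)).size / math.sqrt(likedBy(x).size.toDouble * likedBy(y).size)

  def main(args: Array[String]): Unit = {
    println(cooccurrence("niche", "popular"))         // 1.0
    println(improvedCooccurrence("niche", "popular")) // 0.5
  }
}
```

Because every user of the niche item also likes the popular one, the plain measure saturates at 1.0, while the penalized measure drops to 2 / sqrt(2 * 8) = 0.5.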
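2.3 Euclidean distance
In mathematics, the Euclidean distance or Euclidean metric is the "ordinary" (straight-line) distance between two points in Euclidean space. With this distance, Euclidean space becomes a metric space. The associated norm is called the Euclidean norm. Older literature refers to it as the Pythagorean metric. The formula is:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$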
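As a quick illustration, the distance between two rating vectors can be computed as below. This is a sketch with hypothetical names; the `similarity` mapping 1 / (1 + d) is one common convention for turning a distance into a similarity and is an assumption here, not something the article specifies.

```scala
object EuclideanSketch {
  // Straight-line distance between two equally long vectors.
  def distance(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.size == ys.size, "vectors must have the same length")
    math.sqrt(xs.zip(ys).map { case (x, y) => (x - y) * (x - y) }.sum)
  }

  // Assumed convention: map a distance in [0, inf) to a similarity in (0, 1].
  def similarity(xs: Seq[Double], ys: Seq[Double]): Double =
    1.0 / (1.0 + distance(xs, ys))
}
```

For example, `EuclideanSketch.distance(Seq(0.0, 0.0), Seq(3.0, 4.0))` evaluates to 5.0.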
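2.4 Pearson correlation coefficient
The Pearson correlation coefficient is the correlation coefficient from probability theory, with values in [-1, +1]. When it is greater than zero the two variables are positively correlated; when it is less than zero they are negatively correlated. For two vectors of n paired ratings, the formula is:
$$\rho(X, Y) = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n \sum x_i^2 - \left(\sum x_i\right)^2}\,\sqrt{n \sum y_i^2 - \left(\sum y_i\right)^2}}$$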
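The Pearson correlation can be computed from a handful of running sums, which is what makes it convenient to aggregate in a distributed job. The following is a minimal sketch on plain Scala collections; the object and function names are hypothetical.

```scala
object PearsonSketch {
  // Pearson correlation from the count, the dot product, the sums, and the sums of squares.
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.size == ys.size, "vectors must have the same length")
    val n = xs.size.toDouble
    val dot = xs.zip(ys).map { case (x, y) => x * y }.sum
    val sumX = xs.sum
    val sumY = ys.sum
    val sqX = xs.map(x => x * x).sum
    val sqY = ys.map(y => y * y).sum
    val denominator = math.sqrt(n * sqX - sumX * sumX) * math.sqrt(n * sqY - sumY * sumY)
    // A constant vector has zero variance; return 0 rather than dividing by zero.
    if (denominator == 0) 0.0 else (n * dot - sumX * sumY) / denominator
  }
}
```

Two perfectly aligned vectors such as (1, 2, 3) and (2, 4, 6) give a value of approximately 1, reversed ones give approximately -1, and a constant vector gives 0.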
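2.5 Cosine similarity
Cosine similarity is the cosine of the angle between two vectors in multidimensional space, with values in [-1, 1]. The larger the value, the smaller the angle, and the more similar the two vectors. The formula is:
$$\cos(X, Y) = \frac{\sum x_i y_i}{\sqrt{\sum x_i^2}\,\sqrt{\sum y_i^2}}$$
This formula considers only users' ratings, so highly rated items are likely to rank first regardless of any other information about the items.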
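A small sketch makes the interpretation concrete: parallel vectors score close to 1, orthogonal vectors score 0, and opposite vectors score -1. The names below are hypothetical, and returning 0 for a zero vector is an assumption made to avoid division by zero.

```scala
object CosineSketch {
  // Cosine of the angle between two equally long vectors.
  def cosine(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.size == ys.size, "vectors must have the same length")
    val dot = xs.zip(ys).map { case (x, y) => x * y }.sum
    val normX = math.sqrt(xs.map(x => x * x).sum)
    val normY = math.sqrt(ys.map(y => y * y).sum)
    // Assumed convention: a zero vector is treated as having similarity 0.
    if (normX == 0 || normY == 0) 0.0 else dot / (normX * normY)
  }
}
```

Note that (1, 2) and (2, 4) are judged almost identical even though one user rates twice as high as the other: the measure looks only at direction, not magnitude.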
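2.6 Improved cosine similarity
The improved cosine similarity also takes into account the number of individuals shared by the two vectors, the size of vector X, and the size of vector Y. With N(X) denoting the set of users who rated X, the formula, written to match the implementation in section 4.2.5, is:
$$\text{sim}(X, Y) = \frac{(X \cdot Y)\,|N(X) \cap N(Y)|}{\|X\|\,\|Y\|\,|N(X)|\,\log_{10}(10 + |N(Y)|)}$$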
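2.7 Jaccard similarity
This measure ignores the rating values and considers only how many individuals the two sets share. The formula is:
$$J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$$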
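On sets of user ids the measure is a one-liner. The sketch below uses hypothetical names, and returning 0 for two empty sets is an assumption.

```scala
object JaccardSketch {
  // Size of the intersection divided by the size of the union.
  def jaccard[A](a: Set[A], b: Set[A]): Double = {
    val unionSize = (a | b).size
    // Assumed convention: two empty sets have similarity 0.
    if (unionSize == 0) 0.0 else (a & b).size.toDouble / unionSize
  }
}
```

For example, `JaccardSketch.jaccard(Set(1, 2, 3), Set(2, 3, 4))` shares 2 of 4 distinct users and evaluates to 0.5.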
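3. Formula for predicting user ratings
The similarity measures yield an item-to-item similarity matrix Score(i, p). The predicted rating of user u for item p is then the similarity-weighted average of the user's existing ratings:
$$\text{pred}(u, p) = \frac{\sum_{i \in \text{ratedItems}(u)} \text{Score}(i, p)\, r_{u,i}}{\sum_{i \in \text{ratedItems}(u)} \text{Score}(i, p)}$$
Here u is the user, p is the item, ratedItems is the set of items the user has rated, and r is the set of ratings the user has given to those items.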
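A worked example may help. The sketch below predicts a rating as the similarity-weighted average of the ratings a user has already given; the data, the object name, and the `predict` signature are hypothetical and not taken from the article's code.

```scala
object PredictionSketch {
  // ratings: item -> rating given by one user; sim(i, p): similarity between items i and p.
  // Returns None when no rated item has any similarity to the target.
  def predict(ratings: Map[Int, Double], sim: (Int, Int) => Double, target: Int): Option[Double] = {
    val weighted = ratings.toSeq.map { case (item, r) => (sim(item, target), r) }
    val weightSum = weighted.map(_._1).sum
    if (weightSum == 0) None
    else Some(weighted.map { case (s, r) => s * r }.sum / weightSum)
  }

  def main(args: Array[String]): Unit = {
    val ratings = Map(1 -> 4.0, 2 -> 2.0)
    val scores = Map((1, 3) -> 0.75, (2, 3) -> 0.25).withDefaultValue(0.0)
    println(predict(ratings, (i, p) => scores((i, p)), 3)) // Some(3.5)
  }
}
```

Item 1 is three times as similar to the target as item 2, so the prediction (0.75 * 4 + 0.25 * 2) / 1.0 = 3.5 sits closer to the rating of item 1.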
4. Code implementation
4.1 Environment and dependencies
Java 1.8.0_172 + Scala 2.11.8 + Spark 2.3.1
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.3.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.3.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-hive -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.11</artifactId>
<version>2.3.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-mllib -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_2.11</artifactId>
<version>2.3.1</version>
</dependency>
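4.2 Item similarity measure code
4.2.1 Improved co-occurrence similarity
/**
 * Improved co-occurrence similarity.
 * Co-occurrence similarity = numRatersPairs / numRaters: the number of users interested in both A and B
 * divided by the number of users interested in A. It describes how likely a user who likes A (numRaters)
 * is to be interested in B (numRatersPair), but when B is a popular item the value approaches 1.
 * Improved co-occurrence similarity = numRatersPairs / sqrt(numRaters * numRatersPair), which penalizes
 * the weight of item B and makes a popular item less likely to appear similar to many items.
 * @param numRatersPairs number of users who rated both A and B
 * @param numRaters number of users who rated A
 * @param numRatersPair number of users who rated B
 * @return the improved co-occurrence similarity
 */
def cooccurrence(numRatersPairs: Long, numRaters: Long, numRatersPair: Long): Double = {
  numRatersPairs / math.sqrt(numRaters * numRatersPair)
}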
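4.2.2 Pearson correlation coefficient
/**
 * Pearson correlation coefficient = covariance of the two variables / product of their standard deviations.
 * @param size number of users who rated both items
 * @param dotProduct sum of rating * ratingPair over those users
 * @param ratingSum sum of the first item's ratings
 * @param ratingPairSum sum of the second item's ratings
 * @param ratingNorm sum of the first item's squared ratings
 * @param ratingNormPair sum of the second item's squared ratings
 * @return the correlation, or 0 when the denominator is 0
 */
def correlation(size: Double, dotProduct: Double, ratingSum: Double, ratingPairSum: Double, ratingNorm: Double, ratingNormPair: Double): Double = {
  val numerator = size * dotProduct - ratingSum * ratingPairSum
  val denominator = math.sqrt(size * ratingNorm - ratingSum * ratingSum) * math.sqrt(size * ratingNormPair - ratingPairSum * ratingPairSum)
  if (denominator == 0) 0 else numerator / denominator
}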
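4.2.3 Improved Pearson correlation coefficient
/**
 * Regularized correlation: shrinks the raw correlation toward a prior when few users rated both items.
 * @param size number of users who rated both items
 * @param dotProduct sum of rating * ratingPair over those users
 * @param ratingSum sum of the first item's ratings
 * @param ratingPairSum sum of the second item's ratings
 * @param ratingNorm sum of the first item's squared ratings
 * @param ratingNormPair sum of the second item's squared ratings
 * @param virtualCount number of virtual observations given to the prior
 * @param priorCorrelation prior correlation to shrink toward
 * @return the regularized correlation
 */
def regularCorrelation(size: Double, dotProduct: Double, ratingSum: Double, ratingPairSum: Double, ratingNorm: Double, ratingNormPair: Double, virtualCount: Double, priorCorrelation: Double): Double = {
  val unregularizedCorrelation = correlation(size, dotProduct, ratingSum, ratingPairSum, ratingNorm, ratingNormPair)
  val w = size / (size + virtualCount)
  w * unregularizedCorrelation + (1 - w) * priorCorrelation
}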
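The shrinkage step can be isolated to see how strongly a small overlap pulls the correlation toward the prior. This is a sketch with a hypothetical `shrink` helper. The article does not state the values of `PRIOR_COUNT` and `PRIOR_CORRELATION`; the values 10 and 0 used below are inferred from the run output in 4.4.1, where a pair with size 5 and corr 0.7484551991837488 has regCorr 0.24948506639458293.

```scala
object ShrinkageSketch {
  // Weighted blend of the observed correlation and the prior:
  // the weight of the observation grows with the number of co-raters.
  def shrink(corr: Double, size: Double, virtualCount: Double, prior: Double): Double = {
    val w = size / (size + virtualCount)
    w * corr + (1 - w) * prior
  }

  def main(args: Array[String]): Unit = {
    // Assumed settings (inferred, not stated in the article): virtualCount = 10, prior = 0.
    println(shrink(0.7484551991837488, 5, 10, 0))  // about 0.2495: only 5 co-raters, heavy shrinkage
    println(shrink(0.7484551991837488, 90, 10, 0)) // about 0.6736: 90 co-raters, light shrinkage
  }
}
```

With 5 co-raters the observed correlation keeps only a third of its weight, which is why the regCorr column in the output is so much smaller in magnitude than the corr column.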
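4.2.4 Cosine similarity
/**
 * Cosine similarity.
 * @param dotProduct sum of rating * ratingPair over users who rated both items
 * @param ratingNorm L2 norm of the first item's ratings
 * @param ratingNormPair L2 norm of the second item's ratings
 * @return the cosine similarity
 */
def cosineSimilarity(dotProduct: Double, ratingNorm: Double, ratingNormPair: Double): Double = {
  dotProduct / (ratingNorm * ratingNormPair)
}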
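4.2.5 Improved cosine similarity
/**
 * Improved cosine similarity.
 * Takes into account the number of individuals shared by the two vectors and the sizes of vectors A and B.
 * @param dotProduct sum of rating * ratingPair over users who rated both items
 * @param ratingNorm L2 norm of the first item's ratings
 * @param ratingNormPair L2 norm of the second item's ratings
 * @param numPairs number of users who rated both items
 * @param num number of users who rated the first item
 * @param numPair number of users who rated the second item
 * @return the improved cosine similarity
 */
def improvedCosineSimilarity(dotProduct: Double, ratingNorm: Double, ratingNormPair: Double, numPairs: Long, num: Long, numPair: Long): Double = {
  dotProduct * numPairs / (ratingNorm * ratingNormPair * num * math.log10(10 + numPair))
}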
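4.2.6 Jaccard similarity
/**
 * Jaccard similarity.
 * @param size number of users who rated both items
 * @param numRaters number of users who rated the first item
 * @param numRatersPair number of users who rated the second item
 * @return the Jaccard similarity
 */
def jaccardSimilarity(size: Double, numRaters: Double, numRatersPair: Double): Double = {
  size / (numRaters + numRatersPair - size)
}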
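4.3 User rating prediction code
4.3.1 Computing the item similarity matrix
This step computes, in order, the improved co-occurrence similarity, the Pearson coefficient, the improved Pearson coefficient, the cosine similarity, the improved cosine similarity, and the Jaccard similarity, then combines them into a weighted score with the custom coefficients coef = (0.1, 0.1, 0.1, 0.2, 0.3, 0.1). The code treats smaller cosine and improved cosine values as indicating greater similarity, so both enter the weighted score with a negative sign. Under the standard definition a larger cosine means greater similarity, so this sign choice in effect penalizes pairs with a high cosine similarity.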
import org.apache.spark.sql.functions
import org.apache.spark.sql.expressions.Window
import spark.implicits._
val rating = spark.read.textFile(path).map(parseRating(_)).toDF()
rating.show(10, false)
// Rank each user's ratings and keep ranks 1 to 10.
// Note: orderBy("rating") sorts ascending, so this keeps each user's 10 lowest ratings;
// order by functions.desc("rating") to keep the 10 highest instead.
val userRecs = rating.select($"userId", $"movieId", $"rating", functions.row_number().over(Window.partitionBy("userId").orderBy("rating")).alias("rank"))
  .filter($"rank" <= 10)
userRecs.show(10, false)
// Number of users who rated each item; item2manyUser has the format (movieId, numRaters)
// rating.groupBy($"movieId").pivot("rating").count().show(false)
val item2manyUser = rating.groupBy($"movieId").count().toDF("movieId", "numRaters").coalesce(defaultParallelism)
item2manyUser.show(10, false)
// Attach the rater count to every rating; ratingsWithSize and ratingsWithSizePair have the format
// (movieId, userId, rating, timestamp, numRaters)
val ratingsWithSize = rating.join(item2manyUser, "movieId").coalesce(defaultParallelism)
ratingsWithSize.show(10, false)
val ratingsWithSizePair = ratingsWithSize.toDF("movieIdPair", "userId", "ratingPair", "timestampPair", "numRatersPair")
// Self-join on userId to pair up the items each user rated;
// movieId < movieIdPair drops self-pairs and keeps each unordered pair once
val ratingPairs = ratingsWithSize.join(ratingsWithSizePair, "userId").where($"movieId" < $"movieIdPair")
  .selectExpr("userId", "movieId", "rating", "numRaters", "movieIdPair", "ratingPair", "numRatersPair", "rating * ratingPair as product", "pow(rating,2) as ratingPow", "pow(ratingPair,2) as ratingPairPow")
  .coalesce(defaultParallelism)
ratingPairs.show(10, false)
// Aggregate the statistics needed by the similarity measures for each item pair
val vectorCals = ratingPairs.groupBy("movieId", "movieIdPair")
  .agg(functions.count("userId").alias("size"),
    functions.sum("product").alias("dotProduct"),
    functions.sum("rating").alias("ratingSum"),
    functions.sum("ratingPair").alias("ratingPairSum"),
    functions.sum("ratingPow").alias("ratingPowSum"),
    functions.sum("ratingPairPow").alias("ratingPairPowSum"),
    functions.first("numRaters").alias("numRaters"),
    functions.first("numRatersPair").alias("numRatersPair"))
  // .agg(Map("userId"->"count","product"->"sum","rating"->"sum","ratingPair"->"sum","ratingPow"->"sum","ratingPairPow"->"sum","numRaters"->"first","numRatersPair"->"first"))
  // .toDF("movieId","movieIdPair","size","dotProduct","ratingSum","ratingPairSum","ratingPowSum","ratingPairPowSum","numRaters","numRatersPair")
  .coalesce(defaultParallelism)
vectorCals.show(10, false)
// Compute the similarity measures for each item pair (improved co-occurrence, Pearson, improved Pearson,
// cosine, improved cosine, Jaccard) and their weighted score
val similar = vectorCals.map(row => {
  val movieId = row.getAs[Int]("movieId")
  val movieIdPair = row.getAs[Int]("movieIdPair")
  val size = row.getAs[Long]("size")
  val dotProduct = row.getAs[Double]("dotProduct")
  val ratingSum = row.getAs[Double]("ratingSum")
  val ratingPairSum = row.getAs[Double]("ratingPairSum")
  val ratingPowSum = row.getAs[Double]("ratingPowSum")
  val ratingPairPowSum = row.getAs[Double]("ratingPairPowSum")
  val numRaters = row.getAs[Long]("numRaters")
  val numRatersPair = row.getAs[Long]("numRatersPair")
  val cooc = cooccurrence(size, numRaters, numRatersPair)
  val corr = correlation(size, dotProduct, ratingSum, ratingPairSum, ratingPowSum, ratingPairPowSum)
  val regCorr = regularCorrelation(size, dotProduct, ratingSum, ratingPairSum, ratingPowSum, ratingPairPowSum, PRIOR_COUNT, PRIOR_CORRELATION)
  val cos = cosineSimilarity(dotProduct, math.sqrt(ratingPowSum), math.sqrt(ratingPairPowSum))
  val impCos = improvedCosineSimilarity(dotProduct, math.sqrt(ratingPowSum), math.sqrt(ratingPairPowSum), size, numRaters, numRatersPair)
  val jac = jaccardSimilarity(size, numRaters, numRatersPair)
  // Weighted score; cos and impCos are subtracted because this implementation
  // treats smaller cosine values as more similar
  val score = coef(0)*cooc + coef(1)*corr + coef(2)*regCorr - coef(3)*cos - coef(4)*impCos + coef(5)*jac
  (movieId, movieIdPair, cooc, corr, regCorr, cos, impCos, jac, score)
}).toDF("movieId", "movieIdPair", "cooc", "corr", "regCorr", "cos", "impCos", "jac", "score")
similar.show(10, false)
// Mirror the half matrix so that similarities are available for all item pairs.
// Note: union matches columns by position, not by name, so the renamed copy and `similar`
// contribute identical rows and each pair appears twice in the same orientation;
// unionByName (available since Spark 2.3) matches columns by name instead.
val similarities = similar.withColumnRenamed("movieId", "movieIdRe")
  .withColumnRenamed("movieIdPair", "movieId")
  .withColumnRenamed("movieIdRe", "movieIdPair")
  .union(similar)
  .repartition(defaultParallelism)
similarities.show(10, false)
// Sum each measure per (movieId, movieIdPair); because every pair appears twice above,
// each sum is twice the pair's similarity
val ItemPairSim = similarities.groupBy("movieId", "movieIdPair").agg(
  functions.sum("cooc").alias("coocSim"),
  functions.sum("corr").alias("corrSim"),
  functions.sum("regCorr").alias("regCorrSim"),
  functions.sum("cos").alias("cosSim"),
  functions.sum("impCos").alias("impCosSim"),
  functions.sum("jac").alias("jacSim"),
  functions.sum("score").alias("scores")
)
val simCols = Array("coocSim", "corrSim", "regCorrSim", "cosSim", "impCosSim", "jacSim", "scores")
// For each measure, print the 10 top-ranked neighbors of movie 15
simCols.foreach(simCol => {
  simCol match {
    case "coocSim" => println("Co-occurrence similarity:")
    case "corrSim" => println("Pearson correlation coefficient:")
    case "regCorrSim" => println("Improved Pearson correlation coefficient:")
    case "cosSim" => println("Cosine similarity:")
    case "impCosSim" => println("Improved cosine similarity:")
    case "jacSim" => println("Jaccard similarity:")
    case _ => println("Weighted similarity:")
  }
  // cosSim and impCosSim are ranked ascending (this implementation treats smaller as more similar);
  // all other measures are ranked descending
  val itemPairsCol = if (simCol == "cosSim" || simCol == "impCosSim") {
    ItemPairSim.select($"movieId", $"movieIdPair", functions.row_number().over(Window.partitionBy("movieId").orderBy(simCol)).alias("rank"))
      .filter($"rank" <= 10)
  } else {
    ItemPairSim.select($"movieId", $"movieIdPair", functions.row_number().over(Window.partitionBy("movieId").orderBy(functions.desc(simCol))).alias("rank"))
      .filter($"rank" <= 10)
  }
  println(itemPairsCol.select("movieId", "movieIdPair").where($"movieId" === 15).collectAsList())
})
ItemPairSim.where($"movieId" === 15).show(10, false)
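4.3.2 Computing users' predicted ratings for items
Following section 3 (predicting user ratings), the 10 items with the highest predicted score for each user are taken as the final recommendations.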
// Join user ratings with the item pairs
val userRating = rating.join(similar, "movieId")
  .selectExpr("userId", "movieId", "movieIdPair", "cooc", "cooc * rating as coocMeasure",
    "corr", "corr * rating as corrMeasure", "regCorr", "regCorr * rating as regCorrMeasure",
    "cos", "cos * rating as cosMeasure", "impCos", "impCos * rating as impCosMeasure", "jac", "jac * rating as jacMeasure", "score", "score * rating as scoreMeasure")
  .coalesce(defaultParallelism)
userRating.show(10, false)
// Score every candidate item (movieIdPair) for every user.
// Note: each score below is sum(similarity) / sum(similarity * rating), i.e. the reciprocal
// of the similarity-weighted average rating from section 3.
val userScore = userRating.groupBy("userId", "movieIdPair")
  .agg(functions.sum("cooc").alias("coocSum"),
    functions.sum("coocMeasure").alias("coocMeasureSum"),
    functions.sum("corr").alias("corrSum"),
    functions.sum("corrMeasure").alias("corrMeasureSum"),
    functions.sum("regCorr").alias("regCorrSum"),
    functions.sum("regCorrMeasure").alias("regCorrMeasureSum"),
    functions.sum("cos").alias("cosSum"),
    functions.sum("cosMeasure").alias("cosMeasureSum"),
    functions.sum("impCos").alias("impCosSum"),
    functions.sum("impCosMeasure").alias("impCosMeasureSum"),
    functions.sum("jac").alias("jacSum"),
    functions.sum("jacMeasure").alias("jacMeasureSum"),
    functions.sum("score").alias("score"),
    functions.sum("scoreMeasure").alias("scoreMeasure")
  )
  .selectExpr("userId", "movieIdPair", "coocSum/coocMeasureSum as coocScore",
    "corrSum/corrMeasureSum as corrScore", "regCorrSum/regCorrMeasureSum as regCorrScore",
    "cosSum/cosMeasureSum as cosScore", "impCosSum/impCosMeasureSum as impCosScore",
    "jacSum/jacMeasureSum as jacScore", "score/scoreMeasure as scores")
  .coalesce(defaultParallelism)
// Keep the RANKS items with the highest `scores` for each user
val userRanks = userScore
  .select($"userId", $"movieIdPair", $"scores", functions.row_number().over(Window.partitionBy("userId").orderBy(functions.desc("scores"))).alias("rank"))
  .filter($"rank" <= RANKS)
// Collect each user's recommendations as "movieId:score" strings
val userRecommend = userRanks.select($"userId", functions.concat_ws(":", $"movieIdPair", $"scores").alias("recommend"))
  .groupBy("userId")
  .agg(functions.collect_set("recommend"))
userRecommend.show(10, false)
4.4 Results
4.4.1 Computing the item similarity matrix
+------+-------+------+----------+
|userId|movieId|rating|timestamp |
+------+-------+------+----------+
|0 |2 |3.0 |1424380312|
|0 |3 |1.0 |1424380312|
|0 |5 |2.0 |1424380312|
|0 |9 |4.0 |1424380312|
|0 |11 |1.0 |1424380312|
|0 |12 |2.0 |1424380312|
|0 |15 |1.0 |1424380312|
|0 |17 |1.0 |1424380312|
|0 |19 |1.0 |1424380312|
|0 |21 |1.0 |1424380312|
+------+-------+------+----------+
only showing top 10 rows
+------+-------+------+----+
|userId|movieId|rating|rank|
+------+-------+------+----+
|28 |1 |1.0 |1 |
|28 |3 |1.0 |2 |
|28 |6 |1.0 |3 |
|28 |7 |1.0 |4 |
|28 |14 |1.0 |5 |
|28 |15 |1.0 |6 |
|28 |17 |1.0 |7 |
|28 |20 |1.0 |8 |
|28 |27 |1.0 |9 |
|28 |29 |1.0 |10 |
+------+-------+------+----+
only showing top 10 rows
+-------+---------+
|movieId|numRaters|
+-------+---------+
|31 |15 |
|85 |18 |
|65 |11 |
|53 |12 |
|78 |14 |
|34 |11 |
|81 |16 |
|28 |12 |
|76 |11 |
|26 |14 |
+-------+---------+
only showing top 10 rows
+-------+------+------+----------+---------+
|movieId|userId|rating|timestamp |numRaters|
+-------+------+------+----------+---------+
|2 |0 |3.0 |1424380312|19 |
|3 |0 |1.0 |1424380312|13 |
|5 |0 |2.0 |1424380312|13 |
|9 |0 |4.0 |1424380312|16 |
|11 |0 |1.0 |1424380312|12 |
|12 |0 |2.0 |1424380312|17 |
|15 |0 |1.0 |1424380312|19 |
|17 |0 |1.0 |1424380312|13 |
|19 |0 |1.0 |1424380312|17 |
|21 |0 |1.0 |1424380312|17 |
+-------+------+------+----------+---------+
only showing top 10 rows
+------+-------+------+---------+-----------+----------+-------------+-------+---------+-------------+
|userId|movieId|rating|numRaters|movieIdPair|ratingPair|numRatersPair|product|ratingPow|ratingPairPow|
+------+-------+------+---------+-----------+----------+-------------+-------+---------+-------------+
|28 |0 |3.0 |16 |1 |1.0 |13 |3.0 |9.0 |1.0 |
|28 |0 |3.0 |16 |2 |4.0 |19 |12.0 |9.0 |16.0 |
|28 |0 |3.0 |16 |3 |1.0 |13 |3.0 |9.0 |1.0 |
|28 |0 |3.0 |16 |6 |1.0 |20 |3.0 |9.0 |1.0 |
|28 |0 |3.0 |16 |7 |1.0 |16 |3.0 |9.0 |1.0 |
|28 |0 |3.0 |16 |12 |5.0 |17 |15.0 |9.0 |25.0 |
|28 |0 |3.0 |16 |13 |2.0 |16 |6.0 |9.0 |4.0 |
|28 |0 |3.0 |16 |14 |1.0 |18 |3.0 |9.0 |1.0 |
|28 |0 |3.0 |16 |15 |1.0 |19 |3.0 |9.0 |1.0 |
|28 |0 |3.0 |16 |17 |1.0 |13 |3.0 |9.0 |1.0 |
+------+-------+------+---------+-----------+----------+-------------+-------+---------+-------------+
only showing top 10 rows
+-------+-----------+----+----------+---------+-------------+------------+----------------+---------+-------------+
|movieId|movieIdPair|size|dotProduct|ratingSum|ratingPairSum|ratingPowSum|ratingPairPowSum|numRaters|numRatersPair|
+-------+-----------+----+----------+---------+-------------+------------+----------------+---------+-------------+
|3 |57 |4 |8.0 |4.0 |8.0 |4.0 |20.0 |13 |12 |
|3 |89 |3 |8.0 |3.0 |8.0 |3.0 |24.0 |13 |11 |
|27 |65 |5 |33.0 |11.0 |11.0 |37.0 |35.0 |15 |11 |
|36 |83 |8 |18.0 |14.0 |12.0 |30.0 |32.0 |18 |14 |
|52 |58 |8 |32.0 |15.0 |12.0 |43.0 |26.0 |14 |15 |
|58 |81 |10 |23.0 |15.0 |18.0 |33.0 |48.0 |15 |16 |
|63 |81 |9 |24.0 |15.0 |16.0 |31.0 |44.0 |16 |16 |
|7 |55 |8 |14.0 |13.0 |9.0 |35.0 |11.0 |16 |19 |
|18 |68 |10 |56.0 |26.0 |19.0 |82.0 |51.0 |15 |19 |
|18 |95 |8 |32.0 |19.0 |15.0 |57.0 |35.0 |15 |17 |
+-------+-----------+----+----------+---------+-------------+------------+----------------+---------+-------------+
only showing top 10 rows
+-------+-----------+-------------------+--------------------+--------------------+------------------+-------------------+-------------------+--------------------+
|movieId|movieIdPair|cooc |corr |regCorr |cos |impCos |jac |score |
+-------+-----------+-------------------+--------------------+--------------------+------------------+-------------------+-------------------+--------------------+
|3 |57 |0.32025630761017426|0.0 |0.0 |0.8944271909999159|0.20500872816969473|0.19047619047619047|-0.1702671877946361 |
|3 |89 |0.2508726030021272 |0.0 |0.0 |0.9428090415820635|0.1645501000890719 |0.14285714285714285|-0.18426814947149298|
|27 |65 |0.3892494720807615 |0.7484551991837488 |0.24948506639458293 |0.9170205237216019|0.23118215648843166|0.23809523809523808|-0.06642073030589295|
|36 |83 |0.5039526306789696 |-0.3418817293789138 |-0.15194743527951723|0.5809475019311126|0.18707200893898498|0.3333333333333333 |-0.10463208979919748|
|52 |58 |0.5520524474738834 |0.8708635721768008 |0.38705047652302255 |0.9570377672873267|0.3912032853853935 |0.38095238095238093|-0.05158141326523652|
|58 |81 |0.6454972243679028 |-0.3125381539589969 |-0.15626907697949846|0.577896743774047 |0.2722768569470682 |0.47619047619047616|-0.08435531125789389|
|63 |81 |0.5625 |-0.27602622373694163|-0.1307492638753934 |0.6498364332886588|0.25833206982242696|0.391304347826087 |-0.11363358680047596|
|7 |55 |0.4588314677411235 |-0.17937400083354382|-0.07972177814824169|0.7135060680126758|0.2439507128147666 |0.2962962962962963 |-0.1366535993117721 |
|18 |68 |0.5923488777590923 |0.45057755628547236 |0.22528877814273618 |0.8659563730239938|0.3947654807460637 |0.4166666666666667 |-0.08146606427655445|
|18 |95 |0.5009794328681196 |-0.40119438904232335|-0.1783086173521437 |0.7164378605434321|0.26694834804228634|0.3333333333333333 |-0.1645577672073404 |
+-------+-----------+-------------------+--------------------+--------------------+------------------+-------------------+-------------------+--------------------+
only showing top 10 rows
+-----------+-------+-------------------+--------------------+--------------------+------------------+-------------------+-------------------+--------------------+
|movieIdPair|movieId|cooc |corr |regCorr |cos |impCos |jac |score |
+-----------+-------+-------------------+--------------------+--------------------+------------------+-------------------+-------------------+--------------------+
|25 |51 |0.5976143046671968 |-0.45014069095231984|-0.22507034547615992|0.6819309069874763|0.32975864603850563|0.4166666666666667 |-0.1597401150518419 |
|47 |87 |0.3450327796711771 |0.0 |0.0 |0.9233805168766386|0.2359033677994821 |0.20833333333333334|-0.17927716908138797|
|7 |96 |0.5809475019311126 |-0.48176685558046484|-0.22820535264337807|0.5507685172519937|0.22161701434423176|0.4090909090909091 |-0.1077230965647595 |
|56 |76 |0.5118906968889915 |-0.6454972243679029 |-0.26579297473972474|0.7105597124064275|0.22128206127090347|0.3333333333333333 |-0.1817698444177535 |
|32 |43 |0.4472135954999579 |0.090075469822209 |0.03377830118332838 |0.8549090976340066|0.30577460131716266|0.2857142857142857 |-0.14846460612854348|
|10 |54 |0.3757345746510897 |0.0 |0.0 |0.9525793444156805|0.2662018889308347 |0.23076923076923078|-0.18664913194343136|
|12 |75 |0.3500700210070024 |-0.5103103630798287 |-0.1701034543599429 |0.5011148285857957|0.10979158531474496|0.20833333333333334|-0.12452815428819287|
|9 |66 |0.375 |0.0 |0.0 |0.8725028717782317|0.23123303162285763|0.23076923076923078|-0.1602166376886575 |
|32 |34 |0.26111648393354675|0.0 |0.0 |0.8164965809277261|0.15437994744510897|0.15 |-0.15350165202572327|
|86 |92 |0.44095855184409843|0.16666666666666669 |0.06862745098039216 |0.903696114115064 |0.2546257899447296 |0.28 |-0.13350169285731595|
+-----------+-------+-------------------+--------------------+--------------------+------------------+-------------------+-------------------+--------------------+
only showing top 10 rows
Co-occurrence similarity:
[[15,5], [15,7], [15,4], [15,6], [15,14], [15,9], [15,12], [15,10], [15,2], [15,1]]
Pearson correlation coefficient:
[[15,8], [15,12], [15,7], [15,4], [15,11], [15,3], [15,2], [15,1], [15,0], [15,9]]
Improved Pearson correlation coefficient:
[[15,12], [15,7], [15,4], [15,8], [15,11], [15,3], [15,2], [15,1], [15,0], [15,9]]
Cosine similarity:
[[15,9], [15,10], [15,1], [15,13], [15,6], [15,2], [15,0], [15,11], [15,7], [15,12]]
Improved cosine similarity:
[[15,0], [15,13], [15,2], [15,11], [15,10], [15,9], [15,1], [15,6], [15,12], [15,3]]
Jaccard similarity:
[[15,5], [15,7], [15,6], [15,4], [15,14], [15,9], [15,12], [15,10], [15,2], [15,1]]
Weighted similarity:
[[15,7], [15,12], [15,4], [15,9], [15,2], [15,1], [15,8], [15,6], [15,14], [15,11]]
+-------+-----------+------------------+--------------------+--------------------+------------------+------------------+-------------------+--------------------+
|movieId|movieIdPair|coocSim |corrSim |regCorrSim |cosSim |impCosSim |jacSim |scores |
+-------+-----------+------------------+--------------------+--------------------+------------------+------------------+-------------------+--------------------+
|15 |14 |1.2977713690461004|-0.4633323746250466 |-0.25272674979547993|1.6979054399120357|0.7740279743049593|0.96 |-0.32161825581133757|
|15 |12 |1.224112744964246 |0.663663648395968 |0.34763333963598325 |1.6799278063066676|0.7433079855995841|0.88 |-0.159436983641589 |
|15 |1 |1.1453125733564 |-0.25 |-0.11842105263157894|1.5320646925708532|0.725288309546159 |0.782608695652174 |-0.28978854017510147|
|15 |5 |1.399826478546711 |-0.43852900965351466|-0.22970567172326958|1.7172593257387583|0.993618416740892 |1.0476190476190477 |-0.35885440092921705|
|15 |6 |1.3337718577107005|-0.5564866749122019 |-0.3145359466895054 |1.6412198797244364|0.7294819353921144|1.0 |-0.3008136329516223 |
|15 |2 |1.1578947368421053|-0.1437770309379179 |-0.07531177811033796|1.6470588235294117|0.6520525690591884|0.8148148148148148 |-0.26818397968129093|
|15 |8 |0.8671099695241199|0.8164965809277261 |0.2721655269759087 |1.6853174301284732|0.8231672678073898|0.47619047619047616|-0.2931983633870409 |
|15 |11 |0.9271726499455306|0.0 |0.0 |1.6630436812405998|0.6633685326776863|0.5833333333333334 |-0.32223536439020606|
|15 |10 |1.1846977555181846|-0.6965260331469925 |-0.34826301657349623|1.5212174611483278|0.6934808277609235|0.8333333333333334 |-0.33163020331150633|
|15 |7 |1.3764944032233704|0.6112274566280462 |0.3333967945243888 |1.6711454971746993|0.8570574663543986|1.0434782608695652 |-0.15053882172976585|
+-------+-----------+------------------+--------------------+--------------------+------------------+------------------+-------------------+--------------------+
only showing top 10 rows
4.4.2 Computing users' predicted ratings for items
+------+-------+-----------+-------------------+-------------------+----+-----------+-------+--------------+------------------+------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+
|userId|movieId|movieIdPair|cooc |coocMeasure |corr|corrMeasure|regCorr|regCorrMeasure|cos |cosMeasure |impCos |impCosMeasure |jac |jacMeasure |score |scoreMeasure |
+------+-------+-----------+-------------------+-------------------+----+-----------+-------+--------------+------------------+------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+
|29 |3 |57 |0.32025630761017426|0.32025630761017426|0.0 |0.0 |0.0 |0.0 |0.8944271909999159|0.8944271909999159|0.20500872816969473|0.20500872816969473|0.19047619047619047|0.19047619047619047|-0.1702671877946361|-0.1702671877946361|
|28 |3 |57 |0.32025630761017426|0.32025630761017426|0.0 |0.0 |0.0 |0.0 |0.8944271909999159|0.8944271909999159|0.20500872816969473|0.20500872816969473|0.19047619047619047|0.19047619047619047|-0.1702671877946361|-0.1702671877946361|
|26 |3 |57 |0.32025630761017426|0.32025630761017426|0.0 |0.0 |0.0 |0.0 |0.8944271909999159|0.8944271909999159|0.20500872816969473|0.20500872816969473|0.19047619047619047|0.19047619047619047|-0.1702671877946361|-0.1702671877946361|
|22 |3 |57 |0.32025630761017426|0.6405126152203485 |0.0 |0.0 |0.0 |0.0 |0.8944271909999159|1.7888543819998317|0.20500872816969473|0.41001745633938946|0.19047619047619047|0.38095238095238093|-0.1702671877946361|-0.3405343755892722|
|21 |3 |57 |0.32025630761017426|0.32025630761017426|0.0 |0.0 |0.0 |0.0 |0.8944271909999159|0.8944271909999159|0.20500872816969473|0.20500872816969473|0.19047619047619047|0.19047619047619047|-0.1702671877946361|-0.1702671877946361|
|17 |3 |57 |0.32025630761017426|0.32025630761017426|0.0 |0.0 |0.0 |0.0 |0.8944271909999159|0.8944271909999159|0.20500872816969473|0.20500872816969473|0.19047619047619047|0.19047619047619047|-0.1702671877946361|-0.1702671877946361|
|14 |3 |57 |0.32025630761017426|0.9607689228305227 |0.0 |0.0 |0.0 |0.0 |0.8944271909999159|2.6832815729997477|0.20500872816969473|0.6150261845090842 |0.19047619047619047|0.5714285714285714 |-0.1702671877946361|-0.5108015633839083|
|13 |3 |57 |0.32025630761017426|0.32025630761017426|0.0 |0.0 |0.0 |0.0 |0.8944271909999159|0.8944271909999159|0.20500872816969473|0.20500872816969473|0.19047619047619047|0.19047619047619047|-0.1702671877946361|-0.1702671877946361|
|9 |3 |57 |0.32025630761017426|0.32025630761017426|0.0 |0.0 |0.0 |0.0 |0.8944271909999159|0.8944271909999159|0.20500872816969473|0.20500872816969473|0.19047619047619047|0.19047619047619047|-0.1702671877946361|-0.1702671877946361|
|8 |3 |57 |0.32025630761017426|0.6405126152203485 |0.0 |0.0 |0.0 |0.0 |0.8944271909999159|1.7888543819998317|0.20500872816969473|0.41001745633938946|0.19047619047619047|0.38095238095238093|-0.1702671877946361|-0.3405343755892722|
+------+-------+-----------+-------------------+-------------------+----+-----------+-------+--------------+------------------+------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+
only showing top 10 rows+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|userId|collect_set(recommend) |
+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|28 |[57:0.6823095186850562, 81:0.7074190948255319, 82:0.672283640715388, 12:0.636297470883653, 92:0.647256075399604, 38:0.6347904579454706, 49:0.6254644038108939, 89:0.6941355224210544, 80:0.6154978795267612, 40:0.6156027460005857]|
|26 |[3:1.0, 73:0.5968684282344038, 6:0.6647065655763444, 18:0.5799619983007535, 1:1.0, 2:1.0, 21:0.5799845694216877, 5:0.8498976850090274, 7:0.5942352082433142, 4:1.0] |
|27 |[9:1.0, 18:1.0, 10:1.0, 3:1.0, 13:1.0, 11:1.0, 8:1.0, 6:1.0, 2:1.0, 4:1.0] |
|12 |[7:1.0, 14:0.8059733660850584, 5:1.0, 3:1.0, 10:0.8293788719613415, 15:0.8349960681296311, 16:0.7700888581918046, 6:1.0, 4:1.0, 13:0.8092037847999061] |
|22 |[21:0.7304053132054968, 14:0.7261359695697873, 22:0.7789954217249625, 3:1.0, 15:0.7439260230313952, 24:0.7289206702134335, 18:0.7695442506930248, 2:1.0, 1:1.0, 17:0.7492600967934236] |
|1 |[56:0.7346282753379684, 21:0.761510986584304, 68:0.7437219860490558, 20:0.7619575246355602, 77:0.7191257119486413, 4:0.7228603942815985, 19:0.7429016915005017, 28:0.7400829561035526, 86:0.717171677819262, 62:0.7218697862031331]|
|13 |[14:0.8650749933539622, 7:0.8879325873718314, 3:1.0, 8:0.904542655291971, 5:0.903713357971985, 12:0.893690744991781, 2:1.0, 1:1.0, 11:0.9145180452679533, 4:1.0] |
|6 |[40:0.8284063212375322, 43:0.8278553890046102, 12:0.8405217298206186, 25:0.8331053495744748, 14:0.9206849112675594, 42:0.856004729785749, 61:0.8413345199040396, 1:1.0, 2:1.0, 58:0.8419910885809283] |
|16 |[28:0.7777501297286809, 5:1.0, 3:1.0, 22:0.7936015217150886, 51:0.767085271204479, 45:0.7314911580677125, 18:0.767613647131584, 21:0.8171576034323195, 24:0.8137831859252895, 4:1.0] |
|3 |[22:0.7082578422647425, 7:1.0, 5:1.0, 3:1.0, 8:0.7233338418664426, 6:1.0, 1:1.0, 2:1.0, 4:1.0, 18:0.7594820850748755] |
+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
only showing top 10 rows
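II. ALS-based collaborative filtering
1. Basic idea
ALS infers each user's preferences from the ratings that all users have given to products, and recommends suitable products accordingly. Unlike user-based or item-based collaborative filtering, which predict ratings and make recommendations by computing similarities, it makes predictions through matrix factorization.
2. Solving with alternating least squares (ALS)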
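Each row of the user rating matrix represents a user $i$, each column represents an item $j$, and each element $a_{ij}$ is that user's rating of that item. The core assumption of ALS is that the rating matrix A is approximately low-rank, that is, an m * n rating matrix A can be approximated by the product of two small matrices $U_{m \times k}$ and $V_{n \times k}$ with $k \ll \min(m, n)$:
$$A_{m \times n} \approx U_{m \times k} V_{n \times k}^{T}$$
The rating matrix $A$ can thus be represented by the user preference feature matrix $U$ and the product feature matrix $V$.
To find low-rank matrices U and V that approximate A as closely as possible, minimize the squared-error loss over the observed ratings, where $u_i$ is row $i$ of U and $v_j$ is row $j$ of V:
$$L(U, V) = \sum_{i, j} \left(a_{ij} - u_i^{T} v_j\right)^2$$
The loss function usually needs a regularization term to avoid overfitting. With L2 regularization it becomes:
$$L(U, V) = \sum_{i, j} \left(a_{ij} - u_i^{T} v_j\right)^2 + \lambda \left(\sum_i \|u_i\|^2 + \sum_j \|v_j\|^2\right)$$
This turns collaborative filtering into an optimization problem, which is solved by alternating least squares (ALS): fix V and solve for U as a least-squares problem, then fix U and solve for V, and repeat.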
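To make the alternation concrete, here is a minimal sketch of ALS for the special case k = 1, where each least-squares subproblem has a closed-form scalar solution. It is a toy on plain Scala collections, not how Spark's `ALS` is implemented; the object name, the function names, and the all-ones initialization are assumptions made for illustration.

```scala
object AlsRankOneSketch {
  // An observed rating as (userIndex, itemIndex, rating); missing entries are simply absent.
  type Obs = (Int, Int, Double)

  // Regularized squared-error loss for the rank-1 model a_ij ≈ u_i * v_j.
  def loss(obs: Seq[Obs], u: Array[Double], v: Array[Double], lambda: Double): Double = {
    val err = obs.map { case (i, j, a) => math.pow(a - u(i) * v(j), 2) }.sum
    err + lambda * (u.map(x => x * x).sum + v.map(x => x * x).sum)
  }

  // Least-squares update of the factors indexed by `key`, holding `fixed` constant.
  // Closed form for rank 1: x = sum(a * f) / (lambda + sum(f * f)).
  private def solve(obs: Seq[Obs], size: Int, fixed: Array[Double], lambda: Double,
                    key: Obs => Int, other: Obs => Int): Array[Double] = {
    val grouped = obs.groupBy(key)
    Array.tabulate(size) { idx =>
      val rows = grouped.getOrElse(idx, Seq.empty)
      val num = rows.map(o => o._3 * fixed(other(o))).sum
      val den = lambda + rows.map(o => fixed(other(o)) * fixed(other(o))).sum
      if (den == 0) 0.0 else num / den
    }
  }

  // Alternate between the two updates for a fixed number of iterations.
  def fit(obs: Seq[Obs], numUsers: Int, numItems: Int, lambda: Double, iters: Int): (Array[Double], Array[Double]) = {
    var u = Array.fill(numUsers)(1.0)
    var v = Array.fill(numItems)(1.0)
    for (_ <- 0 until iters) {
      u = solve(obs, numUsers, v, lambda, _._1, _._2) // fix V, solve for U
      v = solve(obs, numItems, u, lambda, _._2, _._1) // fix U, solve for V
    }
    (u, v)
  }
}
```

Each update minimizes the loss exactly over one block of variables, so the loss never increases from one iteration to the next. For a 2 x 3 rating matrix with rows (1, 2, 3) and (2, 4, ?), the product `u(1) * v(2)` of the fitted factors serves as the prediction for the missing entry.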
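3. Code implementation
The environment and dependencies are the same as above.
3.1 Building the ALS collaborative filtering model and predicting
import org.apache.spark.ml.recommendation.ALS
val als = new ALS()
  .setMaxIter(5)
  .setRegParam(0.01)
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")
val model = als.fit(training)
// Drop rows whose prediction is NaN (users or items unseen during training) so the metric stays defined
model.setColdStartStrategy("drop")
val predictions = model.transform(test)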
3.2 Model evaluation
import org.apache.spark.ml.evaluation.RegressionEvaluator
val evaluator = new RegressionEvaluator()
  .setMetricName("rmse")
  .setLabelCol("rating")
  .setPredictionCol("prediction")
val rmse = evaluator.evaluate(predictions)
println(s"Root-mean-square error = $rmse")
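For reference, the metric reported by the evaluator can be reproduced by hand on a few (actual, predicted) pairs. The sketch below is independent of Spark and uses hypothetical names.

```scala
object RmseSketch {
  // Root-mean-square error over (actual, predicted) pairs.
  def rmse(pairs: Seq[(Double, Double)]): Double = {
    require(pairs.nonEmpty, "at least one pair is required")
    val squaredErrors = pairs.map { case (actual, predicted) => math.pow(actual - predicted, 2) }
    math.sqrt(squaredErrors.sum / pairs.size)
  }
}
```

For example, `RmseSketch.rmse(Seq((3.0, 1.0), (1.0, 3.0)))` is sqrt((4 + 4) / 2) = 2.0. Because errors are squared before averaging, a few large misses raise the RMSE more than many small ones.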
3.3 Recommendation lists
// Generate the top 10 movie recommendations for each user
val userRecs = model.recommendForAllUsers(10)
userRecs.show(10, false)
// Generate the top 10 user recommendations for each movie
val movieRecs = model.recommendForAllItems(10)
movieRecs.show(10, false)
// Generate the top 10 movie recommendations for a specified set of users
val users = ratings.select(als.getUserCol).distinct().limit(3)
val userSubsetRecs = model.recommendForUserSubset(users, 10)
userSubsetRecs.show(10, false)
// Generate the top 10 user recommendations for a specified set of movies
val movies = ratings.select(als.getItemCol).distinct().limit(3)
val movieSubSetRecs = model.recommendForItemSubset(movies, 10)
movieSubSetRecs.show(5, false)
// One rating record, parsed from a line of the form userId::movieId::rating::timestamp
case class Rating(userId: Int, movieId: Int, rating: Float, timestamp: Long)
def parseRating(str: String): Rating = {
  val fields = str.split("::")
  assert(fields.size == 4)
  Rating(fields(0).toInt, fields(1).toInt, fields(2).toFloat, fields(3).toLong)
}
4. Results
Root-mean-square error = 1.927040678568387
+------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|userId|recommendations |
+------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|28 |[[92, 5.190725], [81, 4.86317], [4, 4.7198944], [69, 4.33059], [29, 4.2829266], [89, 4.2616525], [76, 4.1987777], [96, 4.184493], [7, 4.1226907], [2, 4.0815644]] |
|26 |[[51, 6.058535], [30, 5.7161875], [94, 5.0318503], [88, 4.9334908], [7, 4.8731565], [24, 4.6401963], [55, 4.472246], [53, 4.236663], [77, 4.1913238], [68, 4.061715]] |
|27 |[[38, 4.2577467], [46, 4.063789], [30, 3.945347], [18, 3.7809763], [23, 3.7044308], [17, 3.3986986], [69, 3.2548237], [27, 3.2114499], [1, 3.1994212], [83, 3.1642542]] |
|12 |[[25, 5.5946026], [46, 5.5124335], [17, 5.11423], [35, 5.0823307], [64, 5.0337462], [27, 4.959539], [43, 4.428263], [1, 4.240394], [94, 3.9515357], [31, 3.9412215]] |
|22 |[[53, 5.5407586], [75, 5.1068535], [46, 5.0826797], [22, 5.047842], [74, 4.9705014], [52, 4.8736644], [88, 4.8529797], [87, 4.850308], [30, 4.643672], [51, 4.5106797]] |
|1 |[[62, 3.5494595], [10, 3.4318256], [68, 3.4186132], [49, 3.3213418], [92, 3.2935987], [85, 3.0727763], [77, 2.9407728], [9, 2.8703325], [39, 2.7454283], [55, 2.435999]]|
|13 |[[32, 4.720694], [69, 4.131916], [93, 3.9207232], [96, 3.6651685], [62, 3.439321], [4, 3.4145846], [74, 3.3329194], [53, 3.194853], [30, 2.9827442], [92, 2.9585018]] |
|6 |[[25, 4.8636093], [58, 3.809049], [62, 3.5082598], [43, 3.4347136], [40, 3.1993797], [37, 3.172602], [92, 3.0829635], [64, 3.0454423], [52, 2.998352], [95, 2.9680624]] |
|16 |[[90, 5.2693963], [85, 4.881312], [54, 4.7646246], [51, 4.623986], [1, 4.4009867], [33, 3.4036496], [68, 3.381634], [94, 3.1365643], [47, 3.060517], [10, 2.9913652]] |
|3 |[[51, 5.0008397], [88, 3.9850216], [24, 3.3814678], [57, 3.186227], [97, 3.106503], [94, 3.0213935], [74, 3.0056891], [76, 2.9751751], [29, 2.9741306], [87, 2.9737644]]|
+------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
only showing top 10 rows
+-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|movieId|recommendations |
+-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|31 |[[12, 3.9412215], [8, 3.2569442], [7, 2.8316283], [6, 2.5838141], [15, 2.1348748], [22, 2.0737545], [21, 2.0215204], [25, 1.8854598], [29, 1.4713752], [14, 1.3172331]]|
|85 |[[16, 4.881312], [17, 4.2706757], [14, 4.2134066], [7, 3.6982865], [1, 3.0727763], [15, 2.7946656], [19, 2.5757656], [6, 2.2734714], [20, 2.2645016], [3, 2.1871538]] |
|65 |[[23, 4.6813827], [20, 4.5017395], [25, 3.6871593], [14, 3.652615], [22, 3.0776486], [7, 2.9361606], [6, 2.4191651], [5, 2.20444], [0, 2.1689785], [3, 2.026729]] |
|53 |[[22, 5.5407586], [21, 4.9554963], [8, 4.9435635], [24, 4.7219305], [26, 4.236663], [13, 3.194853], [5, 2.8868222], [20, 2.8620806], [27, 2.6194685], [28, 2.595715]] |
|78 |[[5, 1.3975992], [23, 1.3725746], [25, 1.2985932], [18, 1.2398205], [6, 1.1964377], [29, 1.1187494], [7, 1.1098578], [2, 1.0951111], [24, 1.0589423], [13, 1.04663]] |
|34 |[[14, 4.7571654], [23, 4.189182], [2, 4.0316715], [18, 3.776127], [28, 3.2931669], [25, 2.9498177], [20, 2.885182], [3, 2.7920961], [13, 2.5485854], [0, 2.4903808]] |
|81 |[[28, 4.86317], [11, 4.0060267], [23, 3.2592351], [18, 3.1237376], [14, 2.7114706], [13, 2.612024], [10, 2.527891], [2, 2.2649016], [9, 2.2159412], [24, 2.1401815]] |
|28 |[[12, 2.1206188], [7, 2.0479355], [6, 1.9925008], [15, 1.5918586], [25, 1.5583891], [8, 1.5504173], [14, 1.2316033], [0, 1.2297935], [29, 1.2055482], [5, 1.1394241]] |
|76 |[[28, 4.1987777], [14, 3.4017448], [10, 3.2845893], [3, 2.9751751], [0, 2.9261422], [12, 2.8340385], [18, 2.7923949], [7, 2.764282], [6, 2.289724], [16, 2.189875]] |
|26 |[[12, 3.3199286], [11, 3.3151531], [15, 2.6134777], [29, 2.49589], [0, 2.153885], [25, 2.1172814], [27, 1.9949272], [18, 1.3374854], [20, 1.3024286], [7, 1.1732591]] |
+-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
only showing top 10 rows

+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|userId|recommendations |
+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|28 |[[92, 5.190725], [81, 4.86317], [4, 4.7198944], [69, 4.33059], [29, 4.2829266], [89, 4.2616525], [76, 4.1987777], [96, 4.184493], [7, 4.1226907], [2, 4.0815644]] |
|26 |[[51, 6.058535], [30, 5.7161875], [94, 5.0318503], [88, 4.9334908], [7, 4.8731565], [24, 4.6401963], [55, 4.472246], [53, 4.236663], [77, 4.1913238], [68, 4.061715]] |
|27 |[[38, 4.2577467], [46, 4.063789], [30, 3.945347], [18, 3.7809763], [23, 3.7044308], [17, 3.3986986], [69, 3.2548237], [27, 3.2114499], [1, 3.1994212], [83, 3.1642542]]|
+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------++-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|movieId|recommendations |
+-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|31 |[[12, 3.9412215], [8, 3.2569442], [7, 2.8316283], [6, 2.5838141], [15, 2.1348748], [22, 2.0737545], [21, 2.0215204], [25, 1.8854598], [29, 1.4713752], [14, 1.3172331]]|
|85 |[[16, 4.881312], [17, 4.2706757], [14, 4.2134066], [7, 3.6982865], [1, 3.0727763], [15, 2.7946656], [19, 2.5757656], [6, 2.2734714], [20, 2.2645016], [3, 2.1871538]] |
|65 |[[23, 4.6813827], [20, 4.5017395], [25, 3.6871593], [14, 3.652615], [22, 3.0776486], [7, 2.9361606], [6, 2.4191651], [5, 2.20444], [0, 2.1689785], [3, 2.026729]] |
+-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
References
https://glassywing.github.io/2018/04/10/spark-itemcf/