Rand Index and a Python Implementation of the Adjusted Rand Index

Reposted

mob6454cc6dcf7f 2023-12-18 22:44:39


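Rand index (RI)

RI takes values in [0, 1]; a larger value means the clustering agrees more closely with the ground truth.

When class labels are available, a clustering can be scored much like a classifier, with analogues of precision and recall.

Let U be the external reference, i.e. the true labels (true_label), and let V be the clustering result. Define four statistics over all pairs of data points: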

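Symbol	Meaning	Plain-language reading	Decision
TP / a	Number of point pairs that are in the same class in U and in the same cluster in V	Similar samples placed in the same cluster (same, same)	Correct
TN / d	Number of point pairs that are in different classes in U and in different clusters in V	Dissimilar samples placed in different clusters (different, different)	Correct
FP / c	Number of point pairs that are in different classes in U but in the same cluster in V	Dissimilar samples placed in the same cluster (different, same)	Wrong
FN / b	Number of point pairs that are in the same class in U but in different clusters in V	Similar samples placed in different clusters (same, different)	Wrong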

	Same Cluster	Different Cluster	Sum U
Same Class	TP / a	FN / b	a+b
Different Class	FP / c	TN / d	c+d
Sum V	a+c	b+d	a+b+c+d

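RI is the fraction of correct decisions, so $RI = \frac{TP + TN}{TP + FP + FN + TN} = \frac{a + d}{a + b + c + d}$
The denominator $a + b + c + d$ is the total number of point pairs, $\binom{n}{2} = \frac{n(n-1)}{2}$ for $n$ data points.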
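The pair counting above can be sketched directly in Python. This is a minimal illustration, not a library implementation: the helper names `pair_counts` and `rand_index` are made up for this example, the loop is quadratic in the number of points, and it assumes at least two data points.

```python
from itertools import combinations


def pair_counts(labels_true, labels_pred):
    """Return (a, b, c, d) = (TP, FN, FP, TN) counted over all point pairs."""
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_class = labels_true[i] == labels_true[j]
        same_cluster = labels_pred[i] == labels_pred[j]
        if same_class and same_cluster:
            a += 1  # TP: same class, same cluster
        elif same_class:
            b += 1  # FN: same class, different clusters
        elif same_cluster:
            c += 1  # FP: different classes, same cluster
        else:
            d += 1  # TN: different classes, different clusters
    return a, b, c, d


def rand_index(labels_true, labels_pred):
    """RI = (a + d) / (a + b + c + d); assumes at least two points."""
    a, b, c, d = pair_counts(labels_true, labels_pred)
    return (a + d) / (a + b + c + d)


labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]
print(pair_counts(labels_true, labels_pred))  # (2, 4, 1, 8)
print(rand_index(labels_true, labels_pred))   # 10/15, about 0.667
```

The four counts add up to 15, which is the number of pairs that can be formed from 6 points.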

Adjusted Rand index (ARI)

Why introduce the ARI? The problem with RI is that for two random partitions its value is not a constant close to 0. Hubert and Arabie proposed the adjusted Rand index in 1985. It assumes a generalized hypergeometric distribution as the model of randomness: the two partitions are drawn at random, subject to the number of data points in each class and in each cluster being fixed.

To compute it, first build the contingency table. Each entry $n_{ij}$ is the number of documents that fall in both class $X_i$ and cluster $Y_j$; the ARI can then be computed from this table.

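Contingency table (rows are classes $X_i$, columns are clusters $Y_j$; $a_i$ and $b_j$ are the row and column sums, and $n$ is the total number of data points):

	Y_1	Y_2	...	Y_s	Sum
X_1	n_11	n_12	...	n_1s	a_1
X_2	n_21	n_22	...	n_2s	a_2
...	...	...	...	...	...
X_r	n_r1	n_r2	...	n_rs	a_r
Sum	b_1	b_2	...	b_s	n

$ARI = \frac{RI - E[RI]}{\max(RI) - E[RI]}$

$ARI = \frac{\sum_{ij}\binom{n_{ij}}{2} - \left[\sum_i\binom{a_i}{2}\sum_j\binom{b_j}{2}\right] / \binom{n}{2}}{\frac{1}{2}\left[\sum_i\binom{a_i}{2} + \sum_j\binom{b_j}{2}\right] - \left[\sum_i\binom{a_i}{2}\sum_j\binom{b_j}{2}\right] / \binom{n}{2}}$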
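As an illustration of how the ARI follows from the contingency table, here is a minimal from-scratch sketch in pure Python. It uses the standard pair-counting form of the ARI, where $n_{ij}$ are the table entries and $a_i$, $b_j$ are the row and column sums. The function name `adjusted_rand_index` is made up for this example, and returning 1.0 in the degenerate cases is a convention assumed here, not something stated in the article.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """ARI computed from the contingency table of two labelings."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # assumed convention: no pairs to disagree on

    contingency = Counter(zip(labels_true, labels_pred))  # n_ij
    row_sums = Counter(labels_true)                       # a_i (class sizes)
    col_sums = Counter(labels_pred)                       # b_j (cluster sizes)

    sum_nij = sum(comb(v, 2) for v in contingency.values())
    sum_a = sum(comb(v, 2) for v in row_sums.values())
    sum_b = sum(comb(v, 2) for v in col_sums.values())

    expected = sum_a * sum_b / comb(n, 2)  # expected index under random labeling
    max_index = (sum_a + sum_b) / 2        # largest value the index can take
    if max_index == expected:
        return 1.0  # assumed convention for degenerate partitions
    return (sum_nij - expected) / (max_index - expected)


print(adjusted_rand_index([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2]))  # 8/33, about 0.2424
```

Using `Counter` keeps the table sparse, so only the class and cluster combinations that actually occur are stored.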

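Advantages and disadvantages

Advantages:
1. For any number of clusters and any number of samples, the ARI of a random clustering is very close to 0.
2. Values lie in [-1, 1]. A negative value means the agreement is worse than expected by chance, and the closer to 1 the better.
3. It can be used to compare different clustering algorithms.
Disadvantages:
1. ARI requires ground-truth labels.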
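The first advantage can be checked with a small simulation. The sketch below draws two independent random labelings many times and averages RI and ARI, both computed from the contingency table in pure Python. The helper name `ri_and_ari`, the seed, and the sizes (200 points, 4 labels, 100 trials) are arbitrary choices for this illustration. With 4 equally likely labels, a pair agrees with probability 1/16 + 9/16 = 0.625, so the mean RI should sit near that value while the mean ARI should sit near 0.

```python
import random
from collections import Counter
from math import comb


def ri_and_ari(u, v):
    """Return (RI, ARI) for two labelings, via contingency-table pair counts."""
    total = comb(len(u), 2)
    s_ij = sum(comb(c, 2) for c in Counter(zip(u, v)).values())
    s_a = sum(comb(c, 2) for c in Counter(u).values())
    s_b = sum(comb(c, 2) for c in Counter(v).values())
    # agreeing pairs = TP + TN = total + 2*s_ij - s_a - s_b
    ri = (total + 2 * s_ij - s_a - s_b) / total
    expected = s_a * s_b / total
    ari = (s_ij - expected) / ((s_a + s_b) / 2 - expected)
    return ri, ari


rng = random.Random(0)         # fixed seed so the run is repeatable
n, k, trials = 200, 4, 100     # arbitrary sizes chosen for this sketch
ris, aris = [], []
for _ in range(trials):
    u = [rng.randrange(k) for _ in range(n)]  # random "true" labels
    v = [rng.randrange(k) for _ in range(n)]  # independent random clustering
    ri, ari = ri_and_ari(u, v)
    ris.append(ri)
    aris.append(ari)

mean_ri = sum(ris) / trials
mean_ari = sum(aris) / trials
print(f"mean RI  = {mean_ri:.3f}")   # far from 0 even though labels are random
print(f"mean ARI = {mean_ari:.3f}")  # close to 0
```

This is the chance correction at work: RI rewards the many "different, different" pairs that random labelings produce anyway, while ARI subtracts that expected agreement.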

Python code:

This uses the adjusted_rand_score function from sklearn.metrics.

from sklearn import metrics
labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

# Basic usage
score = metrics.adjusted_rand_score(labels_true, labels_pred)
print(score)  # ≈ 0.2424

# Independent of the label names: the same partition with renamed clusters
labels_pred = [1, 1, 0, 0, 3, 3]
score = metrics.adjusted_rand_score(labels_true, labels_pred)
print(score)  # ≈ 0.2424

# Symmetric: swapping the arguments gives the same score
score = metrics.adjusted_rand_score(labels_pred, labels_true)
print(score)  # ≈ 0.2424

# A perfect match scores 1.0, the best possible value
labels_pred = labels_true[:]
score = metrics.adjusted_rand_score(labels_true, labels_pred)
print(score)  # 1.0

# Unrelated labelings give a score that is negative or close to 0
labels_true = [0, 1, 2, 0, 3, 4, 5, 1]
labels_pred = [1, 1, 0, 0, 2, 2, 2, 2]
score = metrics.adjusted_rand_score(labels_true, labels_pred)
print(score)  # ≈ -0.1290
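
As a hand check of the first example (labels_true = [0, 0, 0, 1, 1, 1], labels_pred = [0, 0, 1, 1, 2, 2]), the score can be worked out from the contingency table. Here $n_{ij}$ are the table entries, $a_i$ the class sizes and $b_j$ the cluster sizes; the nonzero entries are 2, 1, 1, 2, the class sizes are 3 and 3, and the cluster sizes are 2, 2, 2.

```latex
\sum_{ij}\binom{n_{ij}}{2} = \binom{2}{2} + \binom{1}{2} + \binom{1}{2} + \binom{2}{2} = 2
\qquad
\sum_i \binom{a_i}{2} = 3 + 3 = 6
\qquad
\sum_j \binom{b_j}{2} = 1 + 1 + 1 = 3
\qquad
\binom{6}{2} = 15

ARI = \frac{2 - \frac{6 \cdot 3}{15}}{\frac{6 + 3}{2} - \frac{6 \cdot 3}{15}}
    = \frac{0.8}{3.3}
    = \frac{8}{33}
    \approx 0.2424
```

This matches the value printed for the first three calls in the code above.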