Table of Contents
- 1. Minkowski Distance
- p=1: Manhattan Distance
- p=2: Euclidean Distance
- Standardized Euclidean Distance
- p→∞: Chebyshev Distance
- 2. Cosine Similarity
- Adjusted Cosine Similarity
- 3. Pearson Correlation Coefficient
- 4. Mahalanobis Distance
- 5. Jaccard Distance
- 6. Bray-Curtis Distance
- 7. Spearman's Rank Correlation Coefficient
- 8. Kendall Rank Correlation Coefficient
- 9. Edit Distance and Hamming Distance
- References
1. Minkowski Distance
Applicable when:
- every dimension takes continuous values
- the Minkowski distance ignores differences in scale between dimensions, so all dimensions should carry the same meaning and units
The Minkowski distance is a family of metrics: depending on the value of p, it reduces to each of the distances below.
LaTeX: y=\left( \sum_{i=1}^{n}{\left| x_{i}-y_{i} \right|^{p}} \right)^{\frac{1}{p}}
Python implementation:
import numpy as np
import pandas as pd

def minkowski_distance(x, y, p):
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def minkowski_distance_2(x, y, p):
    from scipy.spatial.distance import pdist
    X = np.vstack([x, y])
    return pdist(X, 'minkowski', p=p)[0]

if __name__ == '__main__':
    df = pd.DataFrame(np.random.randint(0, 50, size=(100, 2)), columns=['x', 'y'])
    print(minkowski_distance(df.x, df.y, p=5))
    print(minkowski_distance_2(df.x, df.y, p=5))
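As a quick check (with made-up toy vectors, not the random DataFrame above), the general Minkowski implementation reduces to the special cases covered next: p=1 matches the city-block metric, p=2 the Euclidean metric, and a large p approaches the maximum coordinate difference, i.e. the Chebyshev distance:

```python
import numpy as np
from scipy.spatial.distance import pdist

def minkowski(x, y, p):
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

# toy vectors chosen for illustration; |x - y| = [4, 3, 0]
x = np.array([0.0, 3.0, 7.0])
y = np.array([4.0, 0.0, 7.0])
X = np.vstack([x, y])

print(minkowski(x, y, 1), pdist(X, 'cityblock')[0])   # both 7.0
print(minkowski(x, y, 2), pdist(X, 'euclidean')[0])   # both 5.0
print(minkowski(x, y, 64), np.max(np.abs(x - y)))     # ≈ 4.0, the Chebyshev distance
```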
    # p=1 gives the Manhattan distance, p=2 the Euclidean distance
p=1: Manhattan Distance

(Figure: the yellow and orange lines are Manhattan distances; the red and blue lines are equivalent Manhattan paths; the green line is the Euclidean distance.)
Intuition: the shortest distance from X to Y when only horizontal and vertical moves are allowed.
LaTeX: y=\sum_{i=1}^{n}{\left| x_{i}-y_{i} \right|}
Python implementation:
import pandas as pd
import numpy as np

def manhattan_distance(x, y):
    return np.sum(np.abs(x - y))

def manhattan_distance_2(x, y):
    from scipy.spatial.distance import pdist
    X = np.vstack([x, y])
    return pdist(X, 'cityblock')[0]

if __name__ == '__main__':
    df = pd.DataFrame(np.random.randint(0, 50, size=(100, 2)), columns=['x', 'y'])
    print(manhattan_distance(df.x, df.y))
    print(manhattan_distance_2(df.x, df.y))
p=2: Euclidean Distance
As the green line in the figure above shows, the Euclidean distance is the square root of the sum of squared coordinate differences, i.e. the straight-line distance from x to y.
Intuition: the straight-line distance between X and Y.
LaTeX: y=\sqrt{\sum_{i=1}^{n}{\left| x_{i}-y_{i} \right|^{2}}}
Python implementation:
import pandas as pd
import numpy as np

def euclidean_distance(x, y):
    return np.sqrt(np.sum(np.square(x - y)))

def euclidean_distance_2(x, y):
    from scipy.spatial.distance import pdist
    X = np.vstack([x, y])
    return pdist(X)[0]  # the default metric is 'euclidean'

if __name__ == '__main__':
    df = pd.DataFrame(np.random.randint(0, 50, size=(100, 2)), columns=['x', 'y'])
    print(euclidean_distance(df.x, df.y))
    print(euclidean_distance_2(df.x, df.y))
Standardized Euclidean Distance
scipy doc:https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.seuclidean.html
Each dimension is first divided by its standard deviation s_k, which removes the effect of differing scales between dimensions.
LaTeX: D=\sqrt{\sum_{i=1}^{n}{\left( \frac{x_{i}-y_{i}}{s_{k}} \right)^{2}}}
Python implementation:
import pandas as pd
import numpy as np
from scipy.spatial.distance import pdist

def standardized_euclidean_distance(x, y):
    # sk is the per-dimension sample variance, matching scipy's default V
    sk = np.var(np.vstack([x, y]), axis=0, ddof=1)
    return np.sqrt(((x - y) ** 2 / sk).sum())

def standardized_euclidean_distance_2(x, y):
    return pdist(np.vstack([x, y]), 'seuclidean')[0]

if __name__ == '__main__':
    df = pd.DataFrame(np.random.randint(0, 50, size=(10, 2)), columns=['x', 'y'])
    print(standardized_euclidean_distance(df.x, df.y))
    print(standardized_euclidean_distance_2(df.x, df.y))
p→∞: Chebyshev Distance

Intuition: the minimum number of moves from X to Y when horizontal, vertical, and diagonal moves are all allowed (like a chess king).
LaTeX: y=\max_{i}\left( \left| x_{i}-y_{i} \right| \right)
Python implementation:
import numpy as np
import pandas as pd

def chebyshev_distance(x, y):
    return np.max(np.abs(x - y))

def chebyshev_distance_2(x, y):
    from scipy.spatial.distance import pdist
    X = np.vstack([x, y])
    return pdist(X, 'chebyshev')[0]

if __name__ == '__main__':
    df = pd.DataFrame(np.random.randint(0, 50, size=(100, 2)), columns=['x', 'y'])
    print(chebyshev_distance(df.x, df.y))
    print(chebyshev_distance_2(df.x, df.y))
2. Cosine Similarity
Intuition: the cosine of the angle between two vectors pointing into a high-dimensional space.
LaTeX: y=\frac{\sum_{i=1}^{n}{\left( x_{i}\times y_{i} \right)}}{\sqrt{\sum_{i=1}^{n}{x_{i}^{2}}}\times \sqrt{\sum_{i=1}^{n}{y_{i}^{2}}}}
Python implementation:
import numpy as np
import pandas as pd

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def cosine_similarity_2(x, y):
    from scipy.spatial.distance import pdist
    # scipy's 'cosine' metric is a distance: 1 - cosine similarity
    return 1 - pdist(np.vstack([x, y]), 'cosine')[0]

if __name__ == '__main__':
    df = pd.DataFrame(np.random.randint(0, 50, size=(100, 2)), columns=['x', 'y'])
    print(cosine_similarity(df.x, df.y))
    print(cosine_similarity_2(df.x, df.y))
Adjusted Cosine Similarity
An example:
- User A rates harshly: movie 1 put them to sleep, 1 point; movie 2 is a great film, 3 points; movie 3 is average, 2 points.
- User B rates generously: movie 1 also put them to sleep, 4 points; movie 2 is a great film, 5 points; movie 3 is average, 4.5 points.
- A and B actually have similar tastes, yet cosine_similarity([1, 3, 2], [4, 5, 4.5]) is about 0.95 rather than the expected 1, and with more dimensions the deviation can grow. Adjusted cosine similarity corrects for this by subtracting each user's mean rating before taking the cosine.
import numpy as np

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def adjusted_cosine_similarity(x, y):
    # center each vector on its own mean, then take the cosine
    x = np.asarray(x) - np.mean(x)
    y = np.asarray(y) - np.mean(y)
    return cosine_similarity(x, y)

if __name__ == '__main__':
    print(cosine_similarity([1, 3, 2], [4, 5, 4.5]))           # ≈ 0.95
    print(adjusted_cosine_similarity([1, 3, 2], [4, 5, 4.5]))  # 1.0
3. Pearson Correlation Coefficient
This coefficient measures the degree of linear correlation between two data sets. Its range is [-1, 1]: values above 0 indicate positive correlation (1 means perfect positive linear correlation), and values below 0 indicate negative correlation (-1 means perfect negative linear correlation).
LaTeX: y=\frac{\mathrm{Cov}\left( X,Y \right)}{\sqrt{D\left( X \right)}\sqrt{D\left( Y \right)}}
Python implementation:
import numpy as np
import pandas as pd

def pearson_correlation(x, y):
    # Pearson correlation is the cosine similarity of mean-centered vectors
    x_mean = x - np.mean(x)
    y_mean = y - np.mean(y)
    return np.dot(x_mean, y_mean) / (np.linalg.norm(x_mean) * np.linalg.norm(y_mean))

def pearson_correlation_2(x, y):
    X = np.vstack([x, y])
    return np.corrcoef(X)[0][1]

if __name__ == '__main__':
    df = pd.DataFrame(np.random.randint(0, 50, size=(100, 2)), columns=['x', 'y'])
    print(pearson_correlation(df.x, df.y))
    print(pearson_correlation_2(df.x, df.y))
4. Mahalanobis Distance
LaTeX: D_{M}\left( X,Y \right)=\sqrt{\left( X-Y \right)^{T}S^{-1}\left( X-Y \right)}
where S is the covariance matrix of the data; when S is the identity matrix, this reduces to the Euclidean distance.
Python implementation:
import numpy as np
import pandas as pd

def mahalanobis_distance(x, y):
    X = np.vstack([x, y])
    XT = X.T
    S = np.cov(X)          # covariance matrix of the two dimensions
    SI = np.linalg.inv(S)  # inverse of the covariance matrix
    n = XT.shape[0]
    d1 = []
    # pairwise Mahalanobis distance between all samples
    for i in range(0, n):
        for j in range(i + 1, n):
            delta = XT[i] - XT[j]
            d = np.sqrt(np.dot(np.dot(delta, SI), delta.T))
            d1.append(d)
    return d1

def mahalanobis_distance_2(x, y):
    from scipy.spatial.distance import pdist
    return pdist(np.vstack([x, y]).T, 'mahalanobis')

if __name__ == '__main__':
    df = pd.DataFrame(np.random.randint(0, 50, size=(100, 2)), columns=['x', 'y'])
    print(mahalanobis_distance(df.x, df.y))
    print(mahalanobis_distance_2(df.x, df.y))
5. Jaccard Distance
The Jaccard similarity coefficient of two sets A and B is the size of their intersection divided by the size of their union; the Jaccard distance is 1 minus the Jaccard similarity coefficient.
LaTeX: d_{J}\left( A,B \right)=\frac{\left| A\cup B \right|-\left| A\cap B \right|}{\left| A\cup B \right|}
Python implementation:
from scipy.spatial.distance import pdist
import pandas as pd
import numpy as np

def jaccard_distance(x, y):
    # proportion of disagreeing entries among those where either vector is nonzero
    up = np.double(np.bitwise_and((x != y), np.bitwise_or(x != 0, y != 0)).sum())
    down = np.double(np.bitwise_or(x != 0, y != 0).sum())
    return up / down

def jaccard_distance_2(x, y):
    X = np.vstack([x, y])
    return pdist(X, 'jaccard')[0]

if __name__ == '__main__':
    df = pd.DataFrame(np.random.randint(0, 20, size=(3, 2)), columns=['x', 'y'])
    print(jaccard_distance(df.x, df.y))
    print(jaccard_distance_2(df.x, df.y))
6. Bray-Curtis Distance
Applicable when the values of X and Y are non-negative.
Meaning: commonly used in ecology and environmental science to roughly estimate the dissimilarity between samples. It is computed as the sum of the absolute differences between X and Y, divided by the sum of all values of X and Y. The range is [0, 1]; the closer to 0, the smaller the difference between the samples.
LaTeX: y=\frac{\sum_{i=1}^{n}{\left| x_{i}-y_{i} \right|}}{\sum_{i=1}^{n}{x_{i}}+\sum_{i=1}^{n}{y_{i}}}
Python implementation:
import numpy as np
from scipy.spatial.distance import pdist
import pandas as pd

def bray_curtis_distance(x, y):
    up = np.sum(np.abs(y - x))
    down = np.sum(x) + np.sum(y)
    return up / down

def bray_curtis_distance_2(x, y):
    X = np.vstack([x, y])
    return pdist(X, 'braycurtis')[0]

if __name__ == '__main__':
    df = pd.DataFrame(np.random.randint(0, 50, size=(100, 2)), columns=['x', 'y'])
    print(bray_curtis_distance(df.x, df.y))
    print(bray_curtis_distance_2(df.x, df.y))
7. Spearman's Rank Correlation Coefficient
Meaning: Pearson's correlation coefficient is strongly affected by outliers, whereas Spearman's coefficient, by working on ranks, removes much of that influence, so it is applicable in a wider range of situations. The drawback is that differences between values are reflected only through rank differences: with very little data, the squared rank-difference terms carry little information and the coefficient performs less well.
Like Pearson's coefficient, its range is [-1, 1]: values above 0 indicate positive correlation, values below 0 negative correlation, and values closer to 1 indicate stronger correlation.
YouTube video explanation: https:///watch?v=DE58QuNKA-c
LaTeX: D=1-\frac{6\sum_{i=1}^{n}{d_{i}^{2}}}{n^{3}-n}, where d_i is the difference between the ranks of x_i and y_i
Python implementation:
import pandas as pd
import numpy as np

def spearman_rank_correlation(x, y):
    from scipy.stats import spearmanr
    r, p = spearmanr(x, y)
    return r, p

def spearman_rank_correlation_2(dataframe: pd.DataFrame) -> pd.DataFrame:
    return dataframe.corr('spearman')

if __name__ == '__main__':
    df = pd.DataFrame(np.random.randint(0, 50, size=(100, 2)), columns=['x', 'y'])
    print(spearman_rank_correlation(df.x, df.y)[0])  # index 0 is the coefficient; index 1 is the p-value
    print(spearman_rank_correlation_2(df))
8. Kendall Rank Correlation Coefficient
LaTeX: \tau =\frac{P-Q}{\frac{1}{2}n\left( n-1 \right)}, where P is the number of concordant pairs and Q the number of discordant pairs (this is tau-a; scipy's kendalltau computes the tie-corrected tau-b by default)
Python implementation:
import pandas as pd
import numpy as np

def kendall_correlation_coefficient(x, y):
    from scipy.stats import kendalltau
    return kendalltau(x, y)[0]  # index 0 is the coefficient; index 1 is the p-value

def kendall_correlation_coefficient_2(dataframe: pd.DataFrame) -> pd.DataFrame:
    return dataframe.corr('kendall')

if __name__ == '__main__':
    df = pd.DataFrame(np.random.randint(0, 50, size=(100, 2)), columns=['x', 'y'])
    print(kendall_correlation_coefficient(df.x, df.y))
    print(kendall_correlation_coefficient_2(df))
9. Edit Distance and Hamming Distance
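A minimal sketch of both string metrics (function names and test strings are my own, not from the article): the Hamming distance counts the positions at which two equal-length sequences differ, while the edit (Levenshtein) distance is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another, computed here with the standard dynamic-programming recurrence.

```python
import numpy as np

def hamming_distance(s, t):
    # number of positions at which two equal-length sequences differ
    if len(s) != len(t):
        raise ValueError('Hamming distance requires equal-length inputs')
    return sum(a != b for a, b in zip(s, t))

def edit_distance(s, t):
    # Levenshtein distance: dp[i, j] = minimum edits turning s[:i] into t[:j]
    m, n = len(s), len(t)
    dp = np.zeros((m + 1, n + 1), dtype=int)
    dp[:, 0] = np.arange(m + 1)  # delete all of s
    dp[0, :] = np.arange(n + 1)  # insert all of t
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i, j] = min(dp[i - 1, j] + 1,         # deletion
                           dp[i, j - 1] + 1,         # insertion
                           dp[i - 1, j - 1] + cost)  # substitution
    return int(dp[m, n])

if __name__ == '__main__':
    print(hamming_distance('10110', '11010'))  # 2
    print(edit_distance('kitten', 'sitting'))  # 3
```

Note that the Hamming distance only makes sense for equal-length sequences, while the edit distance handles sequences of different lengths.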
References
- SciPy distance computations documentation: https://docs.scipy.org/doc/scipy/reference/spatial.distance.html#distance-computations-scipy-spatial-distance
- A summary of common distance measures: https://zhuanlan.zhihu.com/p/58819850
- Measuring Distance: https:///Chris3606/GoRogue/wiki/Measuring-Distance
- Mahalanobis distance in metric learning: https://www.jianshu.com/p/5706a108a0c6