I. Introduction to the scipy.spatial module

The most important submodule in scipy.spatial is arguably distance, the distance computation module.

from scipy import spatial

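Distance computation
Matrix distance functions
Each row of the matrix argument is one observation, and the result is the metric distance between every pair of rows. (Distance matrix computation from a collection of raw observation vectors stored in a rectangular array.)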

Vector distance functions (distance functions between two vectors u and v)
Computing distances over a large collection of vectors is inefficient for these functions. Use pdist for this purpose.

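The arguments should be vectors, i.e. of shape (n,). An array of shape (1, n) has historically also been accepted, because squeeze is applied to drop the length-1 dimension, but newer SciPy releases may be stricter about this, so passing a 1-D array is the safe choice. A genuinely multi-dimensional input, with at least two dimensions larger than 1, raises an error.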

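e.g. spatial.distance.correlation(u, v)    # correlation distance between u and v, i.e. 1 minus the Pearson correlation coefficient (centered cosine)

Note: if u and v each contain only one element, or if all elements of either vector are equal (so the denominator norm(u - u.mean()) is 0), the correlation is undefined and nan is returned.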
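A minimal sketch of the relationship between correlation distance and the Pearson coefficient. The vectors u and v are made-up illustration data, and np.corrcoef is used only as an independent cross-check:

```python
import numpy as np
from scipy.spatial import distance

# Illustrative 1-D inputs (assumed data, not from the original post)
u = np.array([1.0, 2.0, 3.0, 4.0])
v = np.array([2.0, 1.0, 4.0, 3.0])

# correlation() returns a distance, not the coefficient itself
d = distance.correlation(u, v)

# Pearson correlation coefficient computed independently with NumPy
r = np.corrcoef(u, v)[0, 1]

print(d, 1.0 - r)  # the two values should agree
```

So to recover the coefficient from the distance, use 1 - d.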

braycurtis(u, v)    Computes the Bray-Curtis distance between two 1-D arrays.
 canberra(u, v)    Computes the Canberra distance between two 1-D arrays.
 chebyshev(u, v)    Computes the Chebyshev distance.
 cityblock(u, v)    Computes the City Block (Manhattan) distance.
 correlation(u, v)    Computes the correlation distance between two 1-D arrays.
 cosine(u, v)    Computes the Cosine distance between 1-D arrays.
 dice(u, v)    Computes the Dice dissimilarity between two boolean 1-D arrays.
 euclidean(u, v)    Computes the Euclidean distance between two 1-D arrays.
 hamming(u, v)    Computes the Hamming distance between two 1-D arrays.
 jaccard(u, v)    Computes the Jaccard-Needham dissimilarity between two boolean 1-D arrays.
 kulsinski(u, v)    Computes the Kulsinski dissimilarity between two boolean 1-D arrays.
 mahalanobis(u, v, VI)    Computes the Mahalanobis distance between two 1-D arrays.
 matching(u, v)    Computes the Matching dissimilarity between two boolean 1-D arrays.
 minkowski(u, v, p)    Computes the Minkowski distance between two 1-D arrays.
 rogerstanimoto(u, v)    Computes the Rogers-Tanimoto dissimilarity between two boolean 1-D arrays.
 russellrao(u, v)    Computes the Russell-Rao dissimilarity between two boolean 1-D arrays.
 seuclidean(u, v, V)    Returns the standardized Euclidean distance between two 1-D arrays.
 sokalmichener(u, v)    Computes the Sokal-Michener dissimilarity between two boolean 1-D arrays.
 sokalsneath(u, v)    Computes the Sokal-Sneath dissimilarity between two boolean 1-D arrays.
 sqeuclidean(u, v)    Computes the squared Euclidean distance between two 1-D arrays.
 wminkowski(u, v, p, w)    Computes the weighted Minkowski distance between two 1-D arrays.
 yule(u, v)    Computes the Yule dissimilarity between two boolean 1-D arrays.
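As a quick illustration of the two-vector functions listed above, here is a small sketch on two orthogonal unit vectors. The inputs and the results dictionary are illustrative assumptions; only functions from the list are called:

```python
import numpy as np
from scipy.spatial import distance

# Two orthogonal unit vectors (illustrative data)
u = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, 1.0, 0.0])

results = {
    "euclidean": distance.euclidean(u, v),      # sqrt(1 + 1)
    "sqeuclidean": distance.sqeuclidean(u, v),  # 1 + 1
    "cityblock": distance.cityblock(u, v),      # |1| + |1|
    "chebyshev": distance.chebyshev(u, v),      # max(|1|, |1|)
    "cosine": distance.cosine(u, v),            # 1 - 0, the vectors are orthogonal
    "minkowski_p3": distance.minkowski(u, v, p=3),
}

for name, value in results.items():
    print(name, value)
```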


[Distance and similarity computation]
scipy.spatial.distance.pdist(X, metric='euclidean', p=2, w=None, V=None, VI=None)
pdist(X[, metric, p, w, V, VI])    Pairwise distances between observations in n-dimensional space. The larger the distance, the lower the similarity. (This is the signature from the SciPy 0.14 reference; newer releases take p, w, V and VI as keyword arguments only.)
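The result of pdist is a condensed 1-D vector rather than a square matrix. A small sketch, on made-up points, of how its entries line up with the row pairs (i, j), i < j:

```python
from itertools import combinations

import numpy as np
from scipy.spatial.distance import euclidean, pdist

# Four illustrative 2-D observations, one per row
X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0], [0.0, 1.0]])
n = X.shape[0]

y = pdist(X, metric="euclidean")

# The condensed vector holds one entry per unordered pair:
# (0,1), (0,2), (0,3), (1,2), (1,3), (2,3)
pairs = list(combinations(range(n), 2))
by_hand = np.array([euclidean(X[i], X[j]) for i, j in pairs])

print(len(y), n * (n - 1) // 2)
print(np.allclose(y, by_hand))
```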

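Note: when converting distances to similarities, keep in mind that pdist does not compute the distance of an observation to itself (it is 0 by definition). So first expand the result to a dense matrix with dist = spatial.distance.squareform(dist), then compute the similarity as 1 - dist. This only makes sense for metrics such as cosine or correlation, where 1 - distance is a similarity.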
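The conversion described above could look like the following sketch for cosine distance. X is made-up data and the variable names are illustrative:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three illustrative observations, one per row
X = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])

# Condensed cosine distances, expanded to a dense matrix with a zero diagonal
dist = squareform(pdist(X, metric="cosine"))

# Cosine similarity; the zero diagonal becomes a diagonal of ones
sim = 1.0 - dist
print(sim)
```

Doing 1 - dist on the condensed vector would also work element-wise, but the self-similarities on the diagonal would be missing.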

metric:

1 You can compute distances with a function of your own. Y = pdist(X, f) computes the distance between all pairs of vectors in X using the user-supplied 2-arity function f.

For example, the Euclidean distance can be computed like this:

dm = pdist(X, lambda u, v: np.sqrt(((u-v)**2).sum()))

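However, when SciPy already provides the metric, do not compute it as dm = pdist(X, sokalsneath): that passes the Python function sokalsneath, which is then called C(n, 2) times, once per pair. Use the optimized C version instead by passing the metric name as a string: dm = pdist(X, 'sokalsneath').

As another example, the cause-effect value between every pair of rows of a matrix can be computed like this: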
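A small sketch comparing the three ways of specifying a metric. It uses 'cityblock' rather than 'sokalsneath' purely for simplicity, and random data as a stand-in:

```python
import numpy as np
from scipy.spatial.distance import cityblock, pdist

# Random stand-in data: 50 observations with 4 features each
rng = np.random.default_rng(0)
X = rng.random((50, 4))

# Preferred: metric name as a string, dispatched to the compiled implementation
by_name = pdist(X, "cityblock")

# Also valid, but passes a Python callable instead of a name
by_function = pdist(X, cityblock)

# A user-supplied lambda is called once for every pair of rows
by_lambda = pdist(X, lambda u, v: np.abs(u - v).sum())

print(np.allclose(by_name, by_function), np.allclose(by_name, by_lambda))
```

All three give the same numbers; the difference is only in speed, which can be checked with timeit on larger inputs.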

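def causal_effect(m):
    # effect(u, v) is not symmetric; pdist only evaluates it for row pairs i < j,
    # and squareform mirrors those values into the lower triangle.
    effect = lambda u, v: u.dot(v) / u.sum() - (1 - u).dot(v) / (1 - u).sum()
    return spatial.distance.squareform(spatial.distance.pdist(m, metric=effect))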


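2 What is computed here is the pairwise distance, not the similarity. For example, after computing the cosine distance, the similarity is 1 - cosine distance. This follows from the definition: cosine distance = 1 - u·v / (||u|| ||v||).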

Y = pdist(X, 'euclidean')    # d = sqrt((x1-x2)^2 + (y1-y2)^2 + (z1-z2)^2)
Y = pdist(X, 'minkowski', p=p)

scipy.spatial.distance.cdist(XA, XB, metric='euclidean', p=2, V=None, VI=None, w=None)
cdist(XA, XB[, metric, p, V, VI, w])    Computes distance between each pair of the two collections of inputs.

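The simplest form of XA and XB is a 2-D array holding a single vector, i.e. of shape (1, n). They must be 2-D, otherwise you get ValueError: XA must be a 2-dimensional array. In that case the result is simply the metric distance between the two vectors.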
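A minimal sketch of cdist on two small, made-up collections, followed by the single-vector case described above:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two illustrative collections with 2 and 3 observations respectively
XA = np.array([[0.0, 0.0], [1.0, 0.0]])
XB = np.array([[0.0, 1.0], [3.0, 4.0], [1.0, 0.0]])

# D[i, j] is the distance between XA[i] and XB[j]; D has shape (2, 3)
D = cdist(XA, XB, metric="euclidean")
print(D.shape)

# Two single vectors must first be reshaped to 2-D arrays of shape (1, n)
u = np.array([3.0, 4.0])
v = np.array([0.0, 0.0])
single = cdist(u.reshape(1, -1), v.reshape(1, -1), metric="euclidean")
print(single)  # a 1 x 1 array
```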

scipy.spatial.distance.squareform(X, force='no', checks=True)
squareform(X[, force, checks])    Converts a vector-form distance vector to a square-form distance matrix, and vice-versa.

In other words, it converts the vector form of the distances into a dense matrix, and back.

Note: the distance matrix X must be symmetric and its diagonal must be zero.
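A small round-trip sketch for squareform. The condensed vector y is made-up and stands for three observations:

```python
import numpy as np
from scipy.spatial.distance import squareform

# Condensed distances for 3 observations, in the order d(0,1), d(0,2), d(1,2)
y = np.array([1.0, 2.0, 3.0])

# 1-D input is expanded to a symmetric matrix with a zero diagonal
D = squareform(y)
print(D)

# 2-D input is condensed back to the vector form
back = squareform(D)
print(back)
```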

皮皮blog

Matrix distance computation examples
Example 1

 import scipy.spatial.distance as dis    # alias used below (assumed; the original session does not show it)
 x
 array([[0, 2, 3],
        [2, 0, 6],
        [3, 6, 0]])
 y=dis.pdist(x)
 y
 array([ 4.12310563,  5.83095189,  8.54400375])
 z=dis.squareform(y)
 z
 array([[ 0.        ,  4.12310563,  5.83095189],
        [ 4.12310563,  0.        ,  8.54400375],
        [ 5.83095189,  8.54400375,  0.        ]])
 type(z)
 numpy.ndarray
 type(y)
 numpy.ndarray

Example 2
 print(sim)
 print(spatial.distance.cdist(sim[0].reshape((1, 2)), sim[1].reshape((1, 2)), metric='cosine'))
 print(spatial.distance.pdist(sim, metric='cosine'))
 [[-2.85 -0.45]
  [-2.5   1.04]]
 [[ 0.14790689]]
 [ 0.14790689]
Checking the validity of distance matrices
 Predicates for checking the validity of distance matrices, both condensed and redundant. Also contained in this module are functions for computing the number of observations in a distance matrix.
 is_valid_dm(D[, tol, throw, name, warning])    Returns True if input array is a valid distance matrix.
 is_valid_y(y[, warning, throw, name])    Returns True if the input array is a valid condensed distance matrix.
 num_obs_dm(d)    Returns the number of original observations that correspond to a square, redundant distance matrix.
 num_obs_y(Y)    Returns the number of original observations that correspond to a condensed distance matrix.
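A short sketch exercising the four helpers above on a made-up set of points. The variable names are illustrative:

```python
import numpy as np
from scipy.spatial.distance import (
    is_valid_dm, is_valid_y, num_obs_dm, num_obs_y, pdist, squareform,
)

# Four illustrative observations (corners of the unit square)
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

y = pdist(X)       # condensed form, length 4 * 3 / 2 = 6
D = squareform(y)  # redundant (square) form, shape (4, 4)

valid_y = is_valid_y(y)
valid_dm = is_valid_dm(D)
n_from_y = num_obs_y(y)
n_from_dm = num_obs_dm(D)

# Breaking the symmetry makes the matrix invalid
bad = D.copy()
bad[0, 1] = 9.0
bad_is_valid = is_valid_dm(bad)

print(valid_y, valid_dm, n_from_y, n_from_dm, bad_is_valid)
```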
 ref: Distance computations (scipy.spatial.distance); Spatial algorithms and data structures (scipy.spatial)
scipy-ref-0.14.0-p933
 ---------------------

 

II. Computing various distances in Python

from scipy.spatial.distance import pdist, squareform
The usage of each metric is annotated below, following the API documentation:

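 1. Y = pdist(X, 'euclidean')
 Computes the Euclidean distance between the samples (rows) of array X. The return value Y is the condensed distance matrix, a 1-D vector (the same applies to the items below).
 2. Y = pdist(X, 'minkowski', p=p)
 Computes the Minkowski distance between the samples.
 3. Y = pdist(X, 'cityblock')
 Computes the Manhattan distance between the samples.
 4. Y = pdist(X, 'seuclidean', V=None)
 Computes the standardized Euclidean distance between the samples. V is the variance vector, where V[i] is the variance of the i-th component; if omitted, it is computed automatically.
 5. Y = pdist(X, 'sqeuclidean')
 Computes the squared Euclidean distance between the samples.
 6. Y = pdist(X, 'cosine')
 Computes the cosine distance between the samples. The formula is: 1 - u·v / (||u||_2 ||v||_2)
 7. Y = pdist(X, 'correlation')
 Computes the correlation distance between the samples.
 8. Y = pdist(X, 'hamming')
 Computes the Hamming distance between the samples.
 9. Y = pdist(X, 'jaccard')
 Computes the Jaccard distance between the samples.
 10. Y = pdist(X, 'chebyshev')
 Computes the Chebyshev distance between the samples.
 11. Y = pdist(X, 'canberra')
 Computes the Canberra distance between the samples.
 12. Y = pdist(X, 'mahalanobis', VI=None)
 Computes the Mahalanobis distance between the samples.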
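For the two metrics with extra parameters, the following sketch compares the automatic defaults against explicitly supplied values. It assumes that the default V is the per-column sample variance (ddof=1) and that the default VI is the inverse of the sample covariance matrix; the random data is just a stand-in:

```python
import numpy as np
from scipy.spatial.distance import pdist

# Random stand-in data: 10 observations, 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))

# Assumed defaults: sample variance per column, inverse sample covariance
V = np.var(X, axis=0, ddof=1)
VI = np.linalg.inv(np.cov(X.T))

d_seu_default = pdist(X, "seuclidean")
d_seu_explicit = pdist(X, "seuclidean", V=V)

d_mah_default = pdist(X, "mahalanobis")
d_mah_explicit = pdist(X, "mahalanobis", VI=VI)

print(np.allclose(d_seu_default, d_seu_explicit))
print(np.allclose(d_mah_default, d_mah_explicit))
```

Passing V or VI explicitly is useful when the variance or covariance should come from a different (for example, larger) data set than the rows being compared.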
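There are many more, less commonly used distances that are not listed one by one here; see the scipy.spatial.distance documentation for the full list.
Besides the named metrics, this function also accepts a lambda expression, as follows: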
dm = pdist(X, lambda u, v: np.sqrt(((u-v)**2).sum()))
2. After obtaining the condensed matrix, one more step is needed, namely:
Y=scipy.spatial.distance.squareform(X, force='no', checks=True)
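Here X is the condensed matrix Y mentioned above. force works as in MATLAB: if force is 'tovector' or 'tomatrix', the input is treated as a distance matrix or a distance vector, respectively.
checks controls whether X is verified to be symmetric with a zero diagonal. Leave it at True unless you already know that X - X.T is small and diag(X) is close to zero, in which case it can be set to False to skip the check. The return value Y is a distance matrix in which Y[i,j] is the distance between sample i and sample j.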
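A small sketch of the force and checks parameters. The data is made-up, and the tiny asymmetry is introduced deliberately to show what checks=False skips:

```python
import numpy as np
from scipy.spatial.distance import squareform

# Condensed distances for 3 observations (illustrative values)
y = np.array([1.0, 2.0, 3.0])

# force states the intended direction explicitly
D = squareform(y, force="tomatrix")
y_back = squareform(D, force="tovector")

# A matrix that is only approximately symmetric, e.g. due to rounding
noisy = D.copy()
noisy[0, 1] += 1e-12

# With the default checks=True the asymmetry is expected to be rejected
try:
    squareform(noisy)
    passed_strict_check = True
except ValueError:
    passed_strict_check = False

# With checks=False the validation is skipped and the conversion goes through
y_noisy = squareform(noisy, checks=False)

print(passed_strict_check, y_noisy)
```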