



$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2$$

Here $\bar{x}$ is the sample mean, which reflects the overall magnitude of the n observations.
$s$ is the sample standard deviation, which reflects how dispersed the sample is. The larger the standard deviation, the more spread out the data.
$s^2$ is the sample variance, the square of $s$.
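As a quick check of the definitions above, here is a minimal sketch that computes the three statistics by hand and with NumPy. The data values are made up for illustration, and the sketch assumes the n - 1 (unbiased) normalization; note that `np.var` and `np.std` default to `ddof=0`, so `ddof=1` has to be passed explicitly.

```python
import numpy as np

# Illustrative data (not from the original article)
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = x.size

mean = x.sum() / n                        # sample mean
var = ((x - mean) ** 2).sum() / (n - 1)   # sample variance, n - 1 in the denominator
std = np.sqrt(var)                        # sample standard deviation

# NumPy equivalents; ddof=1 selects the n - 1 normalization
mean_np = np.mean(x)
var_np = np.var(x, ddof=1)
std_np = np.std(x, ddof=1)
```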


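However, the standard deviation and the variance are both defined for a one-dimensional array, i.e. a 1 x d array. The single row of that array represents one random variable (think of it as an attribute), such as salary, and each column is one observation. If an object has several attributes, i.e. several random variables, we get a var_num x d array. Each row of that array is one random variable (attribute), and each column is one observed sample across all of those attributes. To analyze such an object, it is not enough to split it into separate 1 x d arrays and compute each standard deviation; we also need to look at the relationships among the random variables themselves, such as the relationship between salary and age, or between salary and skill level.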




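Covariance (population definition and sample estimate):

$$\operatorname{Cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big]$$

$$\operatorname{Cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)$$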

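The correlation coefficient $\rho_{XY}$ is directly related to the covariance:

$$\rho_{XY} = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y}$$

From this formula, the correlation coefficient $\rho_{XY}$ is itself a special kind of covariance. It is the covariance of X and Y after standardization, $X^* = \frac{X - E[X]}{\sigma_X}$ and $Y^* = \frac{Y - E[Y]}{\sigma_Y}$. Both $X^*$ and $Y^*$ have variance 1 and expectation 0, so:

$$\operatorname{Cov}(X^*, Y^*) = E[X^* Y^*] - E[X^*]E[Y^*] = \frac{E\big[(X - E[X])(Y - E[Y])\big]}{\sigma_X \sigma_Y} = \rho_{XY}$$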
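The identity above can be checked numerically. The following is a small sketch on synthetic data (the random seed, sample size, and the 0.6 slope are arbitrary choices for illustration): it standardizes both variables and compares their covariance with the value returned by `np.corrcoef`.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed, arbitrary choice
x = rng.normal(size=200)
y = 0.6 * x + rng.normal(size=200)      # y is correlated with x by construction

# Standardize with the sample standard deviation (ddof=1) so that the
# normalization matches np.cov's default (N - 1) denominator.
xs = (x - x.mean()) / x.std(ddof=1)
ys = (y - y.mean()) / y.std(ddof=1)

cov_std = np.cov(xs, ys)[0, 1]          # covariance of the standardized data
rho = np.corrcoef(x, y)[0, 1]           # correlation coefficient of the raw data
```

Up to floating-point error, `cov_std` and `rho` are the same number.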





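$$c_{ij} = \operatorname{Cov}(X_i, X_j), \quad i, j = 1, \dots, \mathrm{var\_num}$$


$$\Sigma = \begin{pmatrix} c_{11} & c_{12} & \cdots & c_{1k} \\ c_{21} & c_{22} & \cdots & c_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ c_{k1} & c_{k2} & \cdots & c_{kk} \end{pmatrix}, \quad k = \mathrm{var\_num}$$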

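The covariance matrix can be computed conveniently with a matrix product. Let the original data array be $X$, of shape var_num x d. First compute the mean of each random variable (each row) of $X$, then subtract that mean from every entry of the row to obtain the centered array $\tilde{X}$. Writing the covariance matrix as $\Sigma$, we have:

$$\Sigma = \frac{1}{d-1}\,\tilde{X}\tilde{X}^{T}$$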


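import numpy as np

a = np.array([[1, 2, 3], [4, 5, 7]])           # 2 variables (rows), 3 observations (columns)
cov1 = np.cov(a)                               # reference result from NumPy
mean_a = np.mean(a, axis=1, keepdims=True)     # per-variable mean, shape (2, 1)
tmpa = a - mean_a                              # centered data
cov2 = np.matmul(tmpa, tmpa.T) / (tmpa.shape[1] - 1)   # same values as cov1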

[Figure: printed output of cov1 and cov2.]



numpy.cov(m, y=None, rowvar=True, bias=False, ddof=None, fweights=None, aweights=None, *, dtype=None) estimates the covariance matrix of the given data, optionally using observation weights.


  • m:array_like

A 1-D or 2-D array containing multiple variables and observations. Each row of m represents a variable, and each column a single observation of all those variables. Also see rowvar below.


  • y:array_like,optional

An additional set of variables and observations. y has the same form as that of m.

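If m.shape = (var_num, obs_num), then y must match m in its second dimension, the number of observations: y.shape[1] must also equal obs_num. Internally, the two data sets are first concatenated and the covariance is then computed on the result.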


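a = np.array([[1,2,3],[4,5,7]])        # 3 observations per variable
b = np.array([[1,2,3,4],[4,5,6,7]])    # 4 observations per variable
cov = np.cov(a,b)   # raises ValueError: the observation counts (3 vs 4) do not match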


[Figure: result of running the snippet above.]


if y is not None:
    y = array(y, copy=False, ndmin=2, dtype=dtype)
    if not rowvar and y.shape[0] != 1:
        y = y.T
    X = np.concatenate((X, y), axis=0)
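To illustrate the concatenation step, here is a minimal sketch in which the second data set has the same number of observations as the first. The array `b` is a hypothetical data set invented for this example. The result of `np.cov(a, b)` should match what you get by stacking the rows yourself, and a mismatched observation count should fail inside the concatenation.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 7]])     # 2 variables, 3 observations
b = np.array([[1, 2, 4], [3, 3, 8]])     # hypothetical: 2 more variables, 3 observations

cov_ab = np.cov(a, b)                                   # 4 x 4 covariance matrix
cov_stacked = np.cov(np.concatenate((a, b), axis=0))    # same thing, stacked by hand

# A second data set with 4 observations cannot be concatenated with a
try:
    np.cov(a, np.array([[1, 2, 3, 4], [4, 5, 6, 7]]))
    mismatch_raises = False
except ValueError:
    mismatch_raises = True
```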


  • bias: bool, optional

Default normalization (False) is by (N - 1), where N is the number of observations given (unbiased estimate). If bias is True, then normalization is by N. These values can be overridden by using the keyword ddof in numpy versions >= 1.5


  • rowvar : bool, optional

If rowvar is True (default), then each row represents a
variable, with observations in the columns. Otherwise, the relationship
is transposed: each column represents a variable, while the rows
contain observations.
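A short sketch of the rowvar switch, using a made-up array laid out with observations in rows (the layout typical of tabular data). Passing `rowvar=False` should give the same result as transposing the array first.

```python
import numpy as np

# Illustrative data: 3 observations (rows) of 2 variables (columns)
data = np.array([[1.0, 4.0],
                 [2.0, 5.0],
                 [3.0, 7.0]])

cov_cols = np.cov(data, rowvar=False)   # treat each column as a variable
cov_rows = np.cov(data.T)               # equivalent: transpose, then use the default
```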


  • ddof : int, optional

If not None the default value implied by bias is overridden.
Note that ddof=1 will return the unbiased estimate, even if both
fweights and aweights are specified, and ddof=0 will return
the simple average. See the notes for the details. The default value
is None.

.. versionadded:: 1.5

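ddof stands for "delta degrees of freedom": without weights, the normalization divides by N - ddof, where N is the number of observations. See the annotated source for details.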
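The interaction between bias and ddof can be seen in a small sketch. It reuses the 2 x 3 array from the earlier example; the variable names are chosen here for illustration.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 7]])
n = a.shape[1]                                # number of observations

cov_unbiased = np.cov(a)                      # default: divide by N - 1
cov_biased = np.cov(a, bias=True)             # divide by N
cov_ddof0 = np.cov(a, ddof=0)                 # same as bias=True
cov_override = np.cov(a, bias=True, ddof=1)   # an explicit ddof takes precedence over bias
```

The biased and unbiased estimates differ only by the factor (N - 1) / N.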

  • fweights : array_like, int, optional

1-D array of integer frequency weights; the number of times each
observation vector should be repeated.

.. versionadded:: 1.10


  • aweights : array_like, optional

1-D array of observation vector weights. These relative weights are
typically large for observations considered “important” and smaller for
observations considered less “important”. If ddof=0 the array of
weights can be used to assign probabilities to observation vectors.

.. versionadded:: 1.10


  • Return:
  • out: ndarray: The covariance matrix of the variables.
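The two weight parameters above are easiest to understand through an example. The sketch below uses made-up weights: it checks that integer fweights behave like physically repeating each observation, and that aweights with `ddof=0` behave like probabilities attached to the observations.

```python
import numpy as np

a = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 7.0]])   # 2 variables, 3 observations

# fweights: observation i is counted f[i] times
f = np.array([1, 2, 3])
cov_f = np.cov(a, fweights=f)
cov_rep = np.cov(np.repeat(a, f, axis=1))          # repeat the columns by hand

# aweights with ddof=0: weights act as probabilities (hypothetical values)
p = np.array([0.2, 0.3, 0.5])
cov_p = np.cov(a, aweights=p, ddof=0)
mu = a @ p                                         # probability-weighted mean of each variable
centered = a - mu[:, None]
cov_manual = (centered * p) @ centered.T           # sum_k p_k * x_k * x_k^T
```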




    if ddof is not None and ddof != int(ddof):   # ddof must be integer-valued
        raise ValueError(
            "ddof must be integer")

    # Handles complex arrays too
    m = np.asarray(m)       # so m may be a list, a list of tuples, a tuple,
                            # a tuple of tuples, a tuple of lists, or an ndarray
    if m.ndim > 2:          # at most two dimensions are allowed
        raise ValueError("m has more than 2 dimensions")

    if y is None:           # without y, the result dtype is the promotion of
                            # m's dtype and np.float64
        dtype = np.result_type(m, np.float64)
    else:                   # with y: check its dimensionality first, then promote the dtypes
        y = np.asarray(y)
        if y.ndim > 2:
            raise ValueError("y has more than 2 dimensions")
        dtype = np.result_type(m, y, np.float64)

    X = array(m, ndmin=2, dtype=dtype)
    if not rowvar and X.shape[0] != 1:  # rowvar=False: transpose so rows are variables
        X = X.T
    if X.shape[0] == 0:
        return np.array([]).reshape(0, 0)
    if y is not None:                    # process y the same way
        y = array(y, copy=False, ndmin=2, dtype=dtype)
        if not rowvar and y.shape[0] != 1:  # transpose y as well when rowvar=False
            y = y.T
        X = np.concatenate((X, y), axis=0)  # stack y's variables below m's

    if ddof is None:            # ddof not given: derive it from bias
        if bias == 0:           # bias=False (the default) -> ddof=1, unbiased
            ddof = 1
        else:                   # bias=True -> ddof=0
            ddof = 0

    # Get the product of frequencies and weights
    w = None
    if fweights is not None:
        fweights = np.asarray(fweights, dtype=float)
        if not np.all(fweights == np.around(fweights)):
            # compare each value with its rounded value: every entry must
            # be a whole number, otherwise raise
            raise TypeError(
                "fweights must be integer")
        if fweights.ndim > 1:  # must be one-dimensional
            raise RuntimeError(
                "cannot handle multidimensional fweights")
        if fweights.shape[0] != X.shape[1]: # must match the number of observations
            raise RuntimeError(
                "incompatible numbers of samples and fweights")
        if any(fweights < 0):   # must all be non-negative
            raise ValueError(
                "fweights cannot be negative")
        w = fweights        # start with w = fweights
    if aweights is not None:
        aweights = np.asarray(aweights, dtype=float)
        if aweights.ndim > 1:
            raise RuntimeError(
                "cannot handle multidimensional aweights")
        if aweights.shape[0] != X.shape[1]:
            raise RuntimeError(
                "incompatible numbers of samples and aweights")
        if any(aweights < 0):
            raise ValueError(
                "aweights cannot be negative")
        if w is None:
            w = aweights    # no fweights: w is just aweights
        else:
            w *= aweights   # otherwise w = fweights * aweights

    avg, w_sum = average(X, axis=1, weights=w, returned=True)
    # Average along axis 1: the mean of each random variable (row) over all
    # of its observations, weighted by w.
    # w_sum is the sum of the weights (identical for every row).
    w_sum = w_sum[0]

    # Determine the normalization
    if w is None:       # no weights: number of observations (columns) minus ddof
        fact = X.shape[1] - ddof
    elif ddof == 0:     # weights given and ddof == 0: the divisor is simply w_sum
        fact = w_sum
    elif aweights is None:  # only fweights given, ddof != 0:
        # w_sum is the number of observations after repetition, so use w_sum - ddof
        fact = w_sum - ddof
    else:   # aweights given (with or without fweights) and ddof != 0:
        # subtract ddof times the w-weighted average of aweights from w_sum.
        # With all aweights equal to 1 this reduces to the previous case.
        # ddof stands for "delta degrees of freedom".
        fact = w_sum - ddof*sum(w*aweights)/w_sum

    if fact <= 0:
        warnings.warn("Degrees of freedom <= 0 for slice",
                      RuntimeWarning, stacklevel=3)
        fact = 0.0

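    X -= avg[:, None]   # center X by subtracting each row's mean
    if w is None:
        X_T = X.T
    else:
        X_T = (X*w).T   # apply the weights to each observation
    c = dot(X, X_T.conj())  # X times the conjugate transpose of (weighted) X;
                            # for real data this is just the transpose
    c *= np.true_divide(1, fact)    # divide by fact
    return c.squeeze()  # drop axes of length 1 from c and return it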
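To make the normalization branch concrete, the sketch below re-implements the weighted path of the source in a few lines and compares it with `np.cov`. The function name `weighted_cov` and all data values are invented for this example; it is a simplified reading of the code above (real-valued input, both weights supplied, no validation), not a replacement for it.

```python
import numpy as np

def weighted_cov(X, fweights, aweights, ddof=1):
    """Hypothetical helper mirroring the weighted branch of the source above."""
    X = np.asarray(X, dtype=float)
    aweights = np.asarray(aweights, dtype=float)
    w = np.asarray(fweights, dtype=float) * aweights   # w = fweights * aweights
    avg = np.average(X, axis=1, weights=w)             # weighted mean of each row
    w_sum = w.sum()
    if ddof == 0:
        fact = w_sum
    else:
        # w_sum minus ddof times the w-weighted average of aweights
        fact = w_sum - ddof * np.sum(w * aweights) / w_sum
    Xc = X - avg[:, None]                              # centered data
    return (Xc * w) @ Xc.T / fact

# Illustrative data and weights
a = np.array([[1.0, 2.0, 3.0, 5.0], [4.0, 5.0, 7.0, 6.0]])
f = np.array([1, 2, 1, 3])
aw = np.array([0.5, 1.0, 2.0, 1.5])

manual = weighted_cov(a, f, aw, ddof=1)
reference = np.cov(a, fweights=f, aweights=aw, ddof=1)
```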

