18.1 The Probabilistic Latent Semantic Analysis Model

[Supplementary explanation] A "soft category" is a category in the sense of soft clustering. Soft clustering is a classification approach in which a word or a document may be assigned to two or more categories at once.

[Supplementary note] The word simplex in Figure 18.5 is the set of points in the nonnegative orthant whose Manhattan ($L_1$) distance from the origin is $1$, i.e., whose coordinates are nonnegative and sum to $1$.
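In symbols (a standard definition, stated here for concreteness), the word simplex over a vocabulary of $M$ words is

$$\Big\{ (p_1, p_2, \cdots, p_M) : p_i \ge 0, \ i=1,2,\cdots,M, \ \sum_{i=1}^M p_i = 1 \Big\}$$

For example, for $M=3$ it is the triangle with vertices $(1,0,0)$, $(0,1,0)$ and $(0,0,1)$.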

The Multinomial Distribution

In a random experiment, suppose event $A$ occurs with probability $p$, so that the probability that $A$ does not occur is $q = 1-p$. In $n$ independent repeated trials, let $x$ denote the number of times $A$ occurs; then the probability that $A$ occurs exactly $k$ times is

$$P(x=k) = C_n^k \, p^k \, q^{n-k}$$

and $x$ is said to follow the binomial distribution with parameter $p$.

The multinomial distribution generalizes the binomial distribution: instead of two complementary events, each trial distinguishes among $N$ possible events.

In a random experiment, suppose event $A_i$ occurs with probability $p_i$ ($i=1,2,\cdots,N$). In $n$ independent repeated trials, let $x_i$ denote the number of times $A_i$ occurs; then the probability that $A_1$ occurs $k_1$ times, $A_2$ occurs $k_2$ times, ..., and $A_N$ occurs $k_N$ times is

$$P(x_1=k_1, x_2=k_2, \cdots, x_N=k_N) = \frac{n!}{k_1! \, k_2! \cdots k_N!} \, p_1^{k_1} p_2^{k_2} \cdots p_N^{k_N}$$

and $x = (x_1, x_2, \cdots, x_N)$ is said to follow the multinomial distribution with parameters $(p_1, p_2, \cdots, p_N)$, where $p_i \ge 0$, $p_1 + p_2 + \cdots + p_N = 1$, each $k_i$ is a nonnegative integer, and $k_1 + k_2 + \cdots + k_N = n$.
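As a quick numerical sketch (the helper `multinomial_pmf` is our own), the probability above can be evaluated directly; the binomial case is recovered by taking $N=2$:

from math import factorial

def multinomial_pmf(ks, ps):
    """P(x_1=k_1, ..., x_N=k_N) over n = sum(ks) independent trials."""
    n = sum(ks)
    coef = factorial(n)
    for k in ks:
        coef //= factorial(k)  # exact: the multinomial coefficient is an integer
    prob = coef
    for k, p in zip(ks, ps):
        prob *= p ** k
    return prob

# Binomial special case: n=10 trials, P(x=3) with p=0.4
print(multinomial_pmf([3, 7], [0.4, 0.6]))      # ~0.2150
# Three events with probabilities (0.2, 0.3, 0.5), n=5 trials
print(multinomial_pmf([1, 2, 2], [0.2, 0.3, 0.5]))  # 5!/(1!2!2!) * 0.2 * 0.09 * 0.25 = 0.135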

18.2 The Algorithm of Probabilistic Latent Semantic Analysis

[Supplementary note] The log-likelihood function in the book is the log-likelihood obtained after the $P(d_j)$ term has already been dropped.

[Supplementary note] The part inside the braces of Equation (18.9) is the complete-data log-likelihood with $P(d_j)$ not yet dropped; it is written in the form of Equation (18.9) so that the term $\sum_{j=1}^N n(d_j) \log P(d_j)$ is isolated as a pure constant that can then be discarded.

Derivation of the Log-Likelihood Function of the PLSA Generative Model

The log-likelihood function of the PLSA (generative) model is

$$L = \sum_{i=1}^M \sum_{j=1}^N n(w_i,d_j) \log P(w_i,d_j)$$

Substituting Equation (18.2) gives

$$\begin{aligned} L & = \sum_{i=1}^M \sum_{j=1}^N n(w_i,d_j) \log \bigg[ P(d_j) \sum_{k=1}^K P(z_k|d_j) P(w_i|z_k) \bigg] \\ & = \sum_{i=1}^M \sum_{j=1}^N n(w_i,d_j) \bigg[ \log P(d_j) + \log \sum_{k=1}^K P(z_k|d_j) P(w_i|z_k) \bigg] \\ & = \sum_{j=1}^N n(d_j) \bigg[ \log P(d_j) + \sum_{i=1}^M \frac{n(w_i,d_j)}{n(d_j)} \log \sum_{k=1}^K P(z_k|d_j) P(w_i|z_k) \bigg] \end{aligned}$$

This is the part inside the braces of Equation (18.9) in the book. Since $\sum_{j=1}^N n(d_j) \log P(d_j)$ is a constant that does not affect the maximization of the log-likelihood, it can be dropped, giving

$$L = \sum_{i=1}^M \sum_{j=1}^N n(w_i,d_j) \log \bigg[ \sum_{k=1}^K P(z_k|d_j) P(w_i|z_k) \bigg]$$

which is the log-likelihood function given in the book.
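As a minimal sketch (the helper `loglik` and its argument layout are our own), this final expression can be evaluated directly from a word-document count matrix and the two conditional probability tables:

import numpy as np

def loglik(X, P_w_z, P_z_d):
    """L = sum_{i,j} n(w_i,d_j) * log( sum_k P(w_i|z_k) P(z_k|d_j) )."""
    # X: (M, N) count matrix; P_w_z: (M, K); P_z_d: (K, N)
    joint = P_w_z @ P_z_d   # (M, N) entry (i, j) is sum_k P(w_i|z_k) P(z_k|d_j)
    mask = X > 0            # terms with n(w_i,d_j) = 0 contribute nothing
    return np.sum(X[mask] * np.log(joint[mask]))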

Derivation of Equation (18.10) (the Q Function of the EM Algorithm for the PLSA Generative Model)

For the PLSA (generative) model, the complete data is $(w,z,d)$, with $P(w,z,d) = P(w|z)P(z|d)$ after the constant $P(d)$ has been dropped as above; the observed data is $(w,d)$, and the conditional distribution of the latent variable is $P(z|w,d)$. The complete-data log-likelihood is therefore

$$\begin{aligned} \log P(w,z,d|\theta) & = \sum_{i=1}^M \sum_{j=1}^N \sum_{k=1}^K n(w_i,d_j) \log P(w_i,z_k,d_j) \\ & = \sum_{i=1}^M \sum_{j=1}^N n(w_i,d_j) \sum_{k=1}^K \log \big[ P(w_i|z_k) P(z_k|d_j) \big] \end{aligned}$$

Hence

$$\begin{aligned} Q(\theta,\theta^{(t)}) & = E_Z\big[\log P(w,z,d|\theta) \mid w, d, \theta^{(t)}\big] \\ & = E_Z \Bigg\{ \sum_{i=1}^M \sum_{j=1}^N n(w_i,d_j) \sum_{k=1}^K \log \big[ P(w_i|z_k) P(z_k|d_j) \big] \Bigg\} \\ & = \sum_{i=1}^M \sum_{j=1}^N n(w_i,d_j) \, E_Z \Bigg\{ \sum_{k=1}^K \log \big[ P(w_i|z_k) P(z_k|d_j) \big] \Bigg\} \\ & = \sum_{i=1}^M \sum_{j=1}^N n(w_i,d_j) \sum_{k=1}^K P(z_k|w_i,d_j) \log \big[ P(w_i|z_k) P(z_k|d_j) \big] \end{aligned}$$

which is Equation (18.10) in the book.
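A numerical sketch of Equation (18.10) (the helper `q_function` and its argument layout are our own conventions; it assumes strictly positive parameters so the logarithm is defined). The posterior $P(z_k|w_i,d_j)$ is computed under the current parameters $\theta^{(t)}$, while the log term uses the parameters $\theta$ being optimized:

import numpy as np

def q_function(X, P_w_z, P_z_d, P_w_z_new, P_z_d_new):
    """Q(theta, theta^(t)): posterior from current params, log from new params."""
    # E-step posterior P(z_k|w_i,d_j) under theta^(t), shape (M, N, K)
    post = P_w_z[:, None, :] * P_z_d.T[None, :, :]
    post /= post.sum(axis=2, keepdims=True)
    # log[ P(w_i|z_k) P(z_k|d_j) ] under theta, same shape
    log_term = np.log(P_w_z_new[:, None, :] * P_z_d_new.T[None, :, :])
    return np.sum(X[:, :, None] * post * log_term)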

Derivation of Equations (18.12) and (18.13) (the M Step of the EM Algorithm for the PLSA Generative Model)

Setting the partial derivatives of the Lagrangian $\Lambda$ with respect to $P(w_i|z_k)$ and $P(z_k|d_j)$ to zero yields the following system of equations:

$$\sum_{j=1}^N n(w_i,d_j) P(z_k|w_i,d_j) - \tau_k P(w_i|z_k) = 0, \quad i=1,2,\cdots,M; \quad k=1,2,\cdots,K \tag{1}$$

$$\sum_{i=1}^M n(w_i,d_j) P(z_k|w_i,d_j) - \rho_j P(z_k|d_j) = 0, \quad j=1,2,\cdots,N; \quad k=1,2,\cdots,K \tag{2}$$

Summing Equation (1) over $i=1,2,\cdots,M$ and using the constraint $\sum_{i=1}^M P(w_i|z_k)=1$ gives

$$\tau_k = \frac{\sum_{i=1}^M \sum_{j=1}^N n(w_i,d_j) P(z_k|w_i,d_j)}{\sum_{i=1}^M P(w_i|z_k)} = \sum_{i=1}^M \sum_{j=1}^N n(w_i,d_j) P(z_k|w_i,d_j) \tag{3}$$

Similarly, summing Equation (2) over $k=1,2,\cdots,K$ and using the constraint $\sum_{k=1}^K P(z_k|d_j)=1$ gives

$$\rho_j = \frac{\sum_{i=1}^M \sum_{k=1}^K n(w_i,d_j) P(z_k|w_i,d_j)}{\sum_{k=1}^K P(z_k|d_j)} = \sum_{i=1}^M n(w_i,d_j) \sum_{k=1}^K P(z_k|w_i,d_j) = \sum_{i=1}^M n(w_i,d_j) = n(d_j) \tag{4}$$
Substituting Equation (3) into Equation (1) gives

$$P(w_i|z_k) = \frac{\sum_{j=1}^N n(w_i,d_j) P(z_k|w_i,d_j)}{\sum_{m=1}^M \sum_{j=1}^N n(w_m,d_j) P(z_k|w_m,d_j)} \tag{5}$$

which is Equation (18.12) in the book.

Substituting Equation (4) into Equation (2) gives

$$P(z_k|d_j) = \frac{\sum_{i=1}^M n(w_i,d_j) P(z_k|w_i,d_j)}{n(d_j)} \tag{6}$$

which is Equation (18.13) in the book.
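The loop-based reference implementation below follows Equations (18.12) and (18.13) directly. As a compact vectorized sketch of the same updates (our own layout: `X` is the $M \times N$ count matrix and `post[i, j, k]` stores the E-step posterior $P(z_k|w_i,d_j)$):

import numpy as np

def m_step(X, post):
    """One M step: Eq. (18.12) for P(w_i|z_k), Eq. (18.13) for P(z_k|d_j)."""
    weighted = X[:, :, None] * post            # n(w_i,d_j) * P(z_k|w_i,d_j), (M, N, K)
    P_w_z = weighted.sum(axis=1)               # numerator of (18.12), shape (M, K)
    P_w_z /= P_w_z.sum(axis=0, keepdims=True)  # normalize over the vocabulary
    P_z_d = weighted.sum(axis=0).T             # numerator of (18.13), shape (K, N)
    P_z_d /= X.sum(axis=0, keepdims=True)      # divide by n(d_j)
    return P_w_z, P_z_d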

EM Algorithm for PLSA Parameter Estimation (Plain Python Implementation)

import numpy as np

def em_for_plsa(X, K, max_iter=100, random_state=0):
    """EM algorithm for estimating the parameters of the PLSA model

    :param X: word-document co-occurrence matrix
    :param K: number of topics
    :param max_iter: maximum number of iterations
    :param random_state: random seed
    :return: P(w_i|z_k) and P(z_k|d_j)
    """
    n_features, n_samples = X.shape

    # Compute n(d_j), the total word count of each document
    N = [np.sum(X[:, j]) for j in range(n_samples)]

    # Initialize the parameters P(w_i|z_k) and P(z_k|d_j) at random
    np.random.seed(random_state)
    P1 = np.random.random((n_features, K))  # P(w_i|z_k)
    P2 = np.random.random((K, n_samples))  # P(z_k|d_j)

    for _ in range(max_iter):
        # E step: posterior P(z_k|w_i,d_j) under the current parameters
        P = np.zeros((n_features, n_samples, K))
        for i in range(n_features):
            for j in range(n_samples):
                for k in range(K):
                    P[i][j][k] = P1[i][k] * P2[k][j]
                P[i][j] /= np.sum(P[i][j])

        # M step: update P(w_i|z_k) by Eq. (18.12)
        for k in range(K):
            for i in range(n_features):
                P1[i][k] = np.sum([X[i][j] * P[i][j][k] for j in range(n_samples)])
            P1[:, k] /= np.sum(P1[:, k])

        # M step: update P(z_k|d_j) by Eq. (18.13)
        for k in range(K):
            for j in range(n_samples):
                P2[k][j] = np.sum([X[i][j] * P[i][j][k] for i in range(n_features)]) / N[j]

    return P1, P2

[Test] Exercise 18.3 (Example 17.2.2)

if __name__ == "__main__":
    X = np.array([[0, 0, 1, 1, 0, 0, 0, 0, 0],
                  [0, 0, 0, 0, 0, 1, 0, 0, 1],
                  [0, 1, 0, 0, 0, 0, 0, 1, 0],
                  [0, 0, 0, 0, 0, 0, 1, 0, 1],
                  [1, 0, 0, 0, 0, 1, 0, 0, 0],
                  [1, 1, 1, 1, 1, 1, 1, 1, 1],
                  [1, 0, 1, 0, 0, 0, 0, 0, 0],
                  [0, 0, 0, 0, 0, 0, 1, 0, 1],
                  [0, 0, 0, 0, 0, 2, 0, 0, 1],
                  [1, 0, 1, 0, 0, 0, 0, 1, 0],
                  [0, 0, 0, 1, 1, 0, 0, 0, 0]])

    np.set_printoptions(precision=2, suppress=True)
    R1, R2 = em_for_plsa(X, 3)

    print(R1)
    # [[0.   0.15 0.  ]
    #  [0.15 0.   0.  ]
    #  [0.   0.   0.4 ]
    #  [0.15 0.   0.  ]
    #  [0.08 0.08 0.  ]
    #  [0.23 0.31 0.4 ]
    #  [0.   0.15 0.  ]
    #  [0.15 0.   0.  ]
    #  [0.23 0.   0.  ]
    #  [0.   0.15 0.2 ]
    #  [0.   0.15 0.  ]]

    print(R2)
    # [[0. 0. 0. 0. 0. 1. 1. 0. 1.]
    #  [1. 0. 1. 1. 1. 0. 0. 0. 0.]
    #  [0. 1. 0. 0. 0. 0. 0. 1. 0.]]