18.1 Probabilistic Latent Semantic Analysis Model
[Note] A "soft category" is a category in soft clustering. Soft clustering is a classification approach in which a word or a document may be assigned to two or more categories.
[Note] The word simplex in Figure 18.5 is the plane formed by the points in the domain (nonnegative coordinates) whose Manhattan distance from the origin is 1.
The multinomial distribution
In a random experiment, suppose event $A$ occurs with probability $p$; then the probability that $A$ does not occur is $q = 1 - p$.
In $n$ independent repeated trials, let $x$ denote the number of times $A$ occurs. The probability that $A$ occurs exactly $k$ times is

$$P(x=k) = C_n^k \, p^k \, q^{n-k}$$

and $x$ is then said to follow a binomial distribution with parameters $n$ and $p$.
The multinomial distribution generalizes the binomial distribution: instead of two complementary events, each trial may result in one of $N$ events.
In a random experiment, suppose event $A_i$ occurs with probability $p_i$ ($i=1,2,\cdots,N$). In $n$ independent repeated trials, let $x_i$ denote the number of times $A_i$ occurs. The probability that $A_1$ occurs $k_1$ times, $A_2$ occurs $k_2$ times, ..., $A_N$ occurs $k_N$ times is

$$P(x_1=k_1, x_2=k_2, \cdots, x_N=k_N) = \frac{n!}{k_1! \, k_2! \cdots k_N!} \, p_1^{k_1} p_2^{k_2} \cdots p_N^{k_N}$$

and $x = (x_1, x_2, \cdots, x_N)$ is then said to follow a multinomial distribution with parameters $(p_1, p_2, \cdots, p_N)$. Here $p_i \ge 0$ with $p_1 + p_2 + \cdots + p_N = 1$, and each $k_i$ is a nonnegative integer with $k_1 + k_2 + \cdots + k_N = n$.
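As a quick sanity check of the definition above, the following minimal sketch (plain Python; the helper name `multinomial_pmf` is my own, not from the book) computes the multinomial pmf directly from the formula, verifies that it reduces to the binomial pmf when $N = 2$, and that it sums to 1 over all outcomes:

```python
import math


def multinomial_pmf(ks, ps):
    """P(x_1=k_1, ..., x_N=k_N) for n = sum(ks) independent trials."""
    n = sum(ks)
    coef = math.factorial(n)
    for k in ks:
        coef //= math.factorial(k)  # n! / (k_1! k_2! ... k_N!), exact integer division
    prob = float(coef)
    for k, p in zip(ks, ps):
        prob *= p ** k
    return prob


# with N = 2 events, this is exactly the binomial pmf C_n^k p^k q^{n-k}
n, p = 5, 0.3
binom = [multinomial_pmf((k, n - k), (p, 1 - p)) for k in range(n + 1)]

# the pmf sums to 1 over all (k_1, k_2, k_3) with k_1 + k_2 + k_3 = n
ps = (0.2, 0.3, 0.5)
total = sum(multinomial_pmf((k1, k2, 4 - k1 - k2), ps)
            for k1 in range(5) for k2 in range(5 - k1))
```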
18.2 Algorithms for Probabilistic Latent Semantic Analysis
[Note] The log-likelihood function here is the log-likelihood after the $P(d_j)$ term has already been dropped.
[Note] The expression inside the braces of Eq. (18.9) is the log-likelihood of the complete data before $P(d_j)$ is dropped; it is written in the form of Eq. (18.9) precisely so that the term $\sum_{j=1}^N n(d_j) \log P(d_j)$ becomes a pure constant and can be discarded.
Derivation of the log-likelihood of the pLSA generative model
The log-likelihood of the probabilistic latent semantic analysis (generative) model is

$$L = \sum_{i=1}^M \sum_{j=1}^N n(w_i,d_j) \log P(w_i,d_j)$$
Substituting Eq. (18.2) gives

$$\begin{aligned} L & = \sum_{i=1}^M \sum_{j=1}^N n(w_i,d_j) \log \bigg[ P(d_j) \sum_{k=1}^K P(z_k|d_j) P(w_i|z_k) \bigg] \\ & = \sum_{i=1}^M \sum_{j=1}^N n(w_i,d_j) \bigg[ \log P(d_j) + \log \sum_{k=1}^K P(z_k|d_j) P(w_i|z_k) \bigg] \\ & = \sum_{j=1}^N n(d_j) \bigg[ \log P(d_j) + \sum_{i=1}^M \frac{n(w_i,d_j)}{n(d_j)} \log \sum_{k=1}^K P(z_k|d_j) P(w_i|z_k) \bigg] \end{aligned}$$
This is exactly the expression inside the braces of Eq. (18.9) in the book. Since $\sum_{j=1}^N n(d_j) \log P(d_j)$ is a constant that does not affect the maximization of the log-likelihood, it can be dropped, giving
$$L = \sum_{i=1}^M \sum_{j=1}^N n(w_i,d_j) \log \bigg[ \sum_{k=1}^K P(z_k|d_j) P(w_i|z_k) \bigg]$$

which is the log-likelihood function given in the book.
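To make the final log-likelihood concrete, here is a minimal numeric sketch (the counts and parameter values are hypothetical toy numbers, not from the book) that evaluates $L$ for given $n(w_i,d_j)$, $P(w_i|z_k)$, and $P(z_k|d_j)$:

```python
import numpy as np

# hypothetical toy values: M=2 words, N=2 documents, K=2 topics
n = np.array([[2., 0.],
              [1., 3.]])            # counts n(w_i, d_j)
P_w_z = np.array([[0.7, 0.2],
                  [0.3, 0.8]])      # P(w_i|z_k), each column sums to 1
P_z_d = np.array([[0.6, 0.1],
                  [0.4, 0.9]])      # P(z_k|d_j), each column sums to 1

# P(w_i|d_j) = sum_k P(w_i|z_k) P(z_k|d_j), i.e. a matrix product
P_w_d = P_w_z @ P_z_d
# L = sum_ij n(w_i,d_j) log P(w_i|d_j)
L = np.sum(n * np.log(P_w_d))
# L ≈ -2.94 for these toy numbers
```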
Derivation of Eq. (18.10) (the Q function of the EM algorithm for the generative pLSA model)
For the probabilistic latent semantic analysis (generative) model, the complete data is $(w,z,d)$ with $P(w,z,d) = P(w|z)P(z|d)$ (the factor $P(d)$ having already been dropped), and the posterior of the latent topic is $P(z|w,d)$. The complete-data log-likelihood is
$$\begin{aligned} \log P(w,z,d|\theta) & = \sum_{i=1}^M \sum_{j=1}^N \sum_{k=1}^K n(w_i,d_j) \log P(w_i,z_k,d_j) \\ & = \sum_{i=1}^M \sum_{j=1}^N n(w_i,d_j) \sum_{k=1}^K \log \big[ P(w_i|z_k) P(z_k|d_j) \big] \end{aligned}$$
Taking the expectation over $Z$ under the posterior computed with the current parameters then gives

$$\begin{aligned} Q(\theta,\theta^{(i)}) & = E_Z[\log P(w,z,d|\theta) \mid w,d,\theta^{(i)}] \\ & = E_Z \Bigg\{ \sum_{i=1}^M \sum_{j=1}^N n(w_i,d_j) \sum_{k=1}^K \log \big[ P(w_i|z_k) P(z_k|d_j) \big] \Bigg\} \\ & = \sum_{i=1}^M \sum_{j=1}^N n(w_i,d_j) \sum_{k=1}^K P(z_k|w_i,d_j) \log \big[ P(w_i|z_k) P(z_k|d_j) \big] \end{aligned}$$

This is Eq. (18.10) in the book.
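Evaluating this Q function requires the posterior $P(z_k|w_i,d_j)$ under the current parameters, obtained by normalizing $P(w_i|z_k)P(z_k|d_j)$ over $k$. A minimal sketch with hypothetical toy parameters (not from the book):

```python
import numpy as np

# hypothetical current parameters: M=2 words, N=2 documents, K=2 topics
P_w_z = np.array([[0.7, 0.2],
                  [0.3, 0.8]])   # P(w_i|z_k)
P_z_d = np.array([[0.6, 0.1],
                  [0.4, 0.9]])   # P(z_k|d_j)

# joint[i, j, k] = P(w_i|z_k) P(z_k|d_j)
joint = P_w_z[:, None, :] * P_z_d.T[None, :, :]
# posterior P(z_k|w_i,d_j): normalize over the topic axis k
post = joint / joint.sum(axis=2, keepdims=True)
# e.g. post[0, 0] = [0.42, 0.08] / 0.5 = [0.84, 0.16]
```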
Derivation of Eqs. (18.12) and (18.13) (the M-step of the EM algorithm for the generative pLSA model)
Setting the partial derivatives of the Lagrangian $\Lambda$ with respect to $P(w_i|z_k)$ and $P(z_k|d_j)$ to zero yields the system of equations
$$\sum_{j=1}^N n(w_i,d_j) P(z_k|w_i,d_j) - \tau_k P(w_i|z_k) = 0, \quad i=1,2,\cdots,M; \quad k=1,2,\cdots,K \tag{1}$$

$$\sum_{i=1}^M n(w_i,d_j) P(z_k|w_i,d_j) - \rho_j P(z_k|d_j) = 0, \quad j=1,2,\cdots,N; \quad k=1,2,\cdots,K \tag{2}$$
Summing Eq. (1) over $i=1,2,\cdots,M$ and using the constraint $\sum_{i=1}^M P(w_i|z_k)=1$ gives
$$\tau_k = \frac{\sum_{i=1}^M \sum_{j=1}^N n(w_i,d_j) P(z_k|w_i,d_j)}{\sum_{i=1}^M P(w_i|z_k)} = \sum_{i=1}^M \sum_{j=1}^N n(w_i,d_j) P(z_k|w_i,d_j) \tag{3}$$
Similarly, summing Eq. (2) over $k=1,2,\cdots,K$ and using the constraint $\sum_{k=1}^K P(z_k|d_j)=1$ gives
$$\rho_j = \frac{\sum_{k=1}^K \sum_{i=1}^M n(w_i,d_j) P(z_k|w_i,d_j)}{\sum_{k=1}^K P(z_k|d_j)} = \sum_{i=1}^M n(w_i,d_j) \sum_{k=1}^K P(z_k|w_i,d_j) = \sum_{i=1}^M n(w_i,d_j) = n(d_j) \tag{4}$$

where the second-to-last step uses $\sum_{k=1}^K P(z_k|w_i,d_j) = 1$.
Substituting Eq. (3) into Eq. (1) gives
$$P(w_i|z_k) = \frac{\sum_{j=1}^N n(w_i,d_j) P(z_k|w_i,d_j)}{\sum_{m=1}^M \sum_{j=1}^N n(w_m,d_j) P(z_k|w_m,d_j)} \tag{5}$$

This is Eq. (18.12) in the book.
Substituting Eq. (4) into Eq. (2) gives
$$P(z_k|d_j) = \frac{\sum_{i=1}^M n(w_i,d_j) P(z_k|w_i,d_j)}{n(d_j)} \tag{6}$$

This is Eq. (18.13) in the book.
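Updates (5) and (6) should produce properly normalized distributions: each topic's word distribution and each document's topic distribution must sum to 1. The sketch below (random hypothetical counts and posteriors, vectorized with `np.einsum`) applies both updates once and checks these constraints:

```python
import numpy as np

# hypothetical sizes and inputs (not from the book's example)
rng = np.random.default_rng(0)
M, N, K = 4, 3, 2
n = rng.integers(1, 5, size=(M, N)).astype(float)  # counts n(w_i, d_j), all positive
post = rng.random((M, N, K))                       # P(z_k|w_i,d_j), normalized over k below
post /= post.sum(axis=2, keepdims=True)

# Eq. (5) / (18.12): P(w_i|z_k), normalized over words i
num_w = np.einsum('ij,ijk->ik', n, post)           # sum_j n(w_i,d_j) P(z_k|w_i,d_j)
P_w_z = num_w / num_w.sum(axis=0, keepdims=True)

# Eq. (6) / (18.13): P(z_k|d_j), divided by n(d_j)
n_d = n.sum(axis=0)                                # n(d_j)
P_z_d = np.einsum('ij,ijk->jk', n, post) / n_d[:, None]  # shape (N, K)

assert np.allclose(P_w_z.sum(axis=0), 1.0)  # each topic's word distribution sums to 1
assert np.allclose(P_z_d.sum(axis=1), 1.0)  # each document's topic distribution sums to 1
```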
EM algorithm for pLSA parameter estimation (plain Python implementation)
```python
import numpy as np


def em_for_plsa(X, K, max_iter=100, random_state=0):
    """EM algorithm for pLSA parameter estimation

    :param X: word-document co-occurrence matrix
    :param K: number of topics
    :param max_iter: maximum number of iterations
    :param random_state: random seed
    :return: P(w_i|z_k) and P(z_k|d_j)
    """
    n_features, n_samples = X.shape
    # compute n(d_j)
    N = [np.sum(X[:, j]) for j in range(n_samples)]
    # initialize the parameters P(w_i|z_k) and P(z_k|d_j)
    np.random.seed(random_state)
    P1 = np.random.random((n_features, K))  # P(w_i|z_k)
    P2 = np.random.random((K, n_samples))   # P(z_k|d_j)
    for _ in range(max_iter):
        # E-step: posterior P(z_k|w_i,d_j) proportional to P(w_i|z_k) P(z_k|d_j)
        P = np.zeros((n_features, n_samples, K))
        for i in range(n_features):
            for j in range(n_samples):
                for k in range(K):
                    P[i][j][k] = P1[i][k] * P2[k][j]
                P[i][j] /= np.sum(P[i][j])
        # M-step: update P(w_i|z_k) by Eq. (18.12)
        for k in range(K):
            for i in range(n_features):
                P1[i][k] = np.sum([X[i][j] * P[i][j][k] for j in range(n_samples)])
            P1[:, k] /= np.sum(P1[:, k])
        # M-step: update P(z_k|d_j) by Eq. (18.13)
        for k in range(K):
            for j in range(n_samples):
                P2[k][j] = np.sum([X[i][j] * P[i][j][k] for i in range(n_features)]) / N[j]
    return P1, P2
```
[Test] Exercise 18.3 (Example 17.2.2)
```python
if __name__ == "__main__":
    X = np.array([[0, 0, 1, 1, 0, 0, 0, 0, 0],
                  [0, 0, 0, 0, 0, 1, 0, 0, 1],
                  [0, 1, 0, 0, 0, 0, 0, 1, 0],
                  [0, 0, 0, 0, 0, 0, 1, 0, 1],
                  [1, 0, 0, 0, 0, 1, 0, 0, 0],
                  [1, 1, 1, 1, 1, 1, 1, 1, 1],
                  [1, 0, 1, 0, 0, 0, 0, 0, 0],
                  [0, 0, 0, 0, 0, 0, 1, 0, 1],
                  [0, 0, 0, 0, 0, 2, 0, 0, 1],
                  [1, 0, 1, 0, 0, 0, 0, 1, 0],
                  [0, 0, 0, 1, 1, 0, 0, 0, 0]])
    np.set_printoptions(precision=2, suppress=True)
    R1, R2 = em_for_plsa(X, 3)
    print(R1)
    # [[0.   0.15 0.  ]
    #  [0.15 0.   0.  ]
    #  [0.   0.   0.4 ]
    #  [0.15 0.   0.  ]
    #  [0.08 0.08 0.  ]
    #  [0.23 0.31 0.4 ]
    #  [0.   0.15 0.  ]
    #  [0.15 0.   0.  ]
    #  [0.23 0.   0.  ]
    #  [0.   0.15 0.2 ]
    #  [0.   0.15 0.  ]]
    print(R2)
    # [[0. 0. 0. 0. 0. 1. 1. 0. 1.]
    #  [1. 0. 1. 1. 1. 0. 0. 0. 0.]
    #  [0. 1. 0. 0. 0. 0. 0. 1. 0.]]
```