The previous two posts walked through the fundamentals of the LDA topic model and the detailed derivation of its parameter estimation with Gibbs sampling, both purely at the theory level. In this post we turn to the code implementation.
A Review of How LDA Works
The LDA Generative Process
LDA assumes that every word of a document is generated by the process "pick a topic with some probability, then pick a word from that topic with some probability". The document-to-topic distribution is multinomial, and the topic-to-word distribution is multinomial: each document is a probability distribution over topics, and each topic is in turn a probability distribution over words.
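To make the generative story concrete, here is a minimal simulation of it with a hypothetical two-topic setup; the vocabulary and the values in theta and phi are all made up for illustration:

import random

vocab = ["apple", "banana", "goal", "match"]   # toy vocabulary (hypothetical)
theta = [0.7, 0.3]                             # document-topic distribution, K = 2
phi = [[0.5, 0.5, 0.0, 0.0],                   # topic 0 favours fruit words
       [0.0, 0.0, 0.6, 0.4]]                   # topic 1 favours sports words

def sample(dist):
    # draw one index from a discrete probability distribution
    u, acc = random.random(), 0.0
    for i, p in enumerate(dist):
        acc += p
        if u < acc:
            return i
    return len(dist) - 1

# each word: first draw a topic z from theta, then draw a word from phi[z]
doc = [vocab[sample(phi[sample(theta)])] for _ in range(10)]
print(doc)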
Learning LDA with Gibbs Sampling
Gibbs sampling is a special case of Markov-chain Monte Carlo. At each step the algorithm picks one dimension of the probability vector and samples that dimension's value conditioned on the current values of all the other dimensions, iterating until convergence, at which point the parameters to be estimated are read off. Concretely for LDA: first assign every word in every document a random topic; then tally how many times each term is assigned to each topic (the topic–term counts n_k^(t)) and how many words of each document m are assigned to each topic (the document–topic counts n_m^(k)). In each round, for each word, we exclude its current topic assignment and estimate, from the assignments of all the other words, the probability of the current word belonging to each topic; given that distribution, we sample a new topic for the word. The same update is applied to the next word, and so on, until the topic distribution of every document and the word distribution of every topic converge. The algorithm then stops and outputs the estimated parameters θ and φ; the topic of every individual word falls out at the same time.
In practice a maximum number of iterations is set. The formula evaluated at every step is called the Gibbs updating rule; from the previous post it is

$$p(z_i = k \mid \mathbf{z}_{\neg i}, \mathbf{w}) \propto \frac{n_{k,\neg i}^{(t)} + \beta}{n_{k,\neg i} + V\beta} \cdot \frac{n_{m,\neg i}^{(k)} + \alpha}{n_{m,\neg i} + K\alpha}$$

where n_{k,¬i}^{(t)} counts how often term t is assigned to topic k, n_{m,¬i}^{(k)} counts the words of document m assigned to topic k (both excluding the current word i), and V is the vocabulary size.
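As a sanity check, the rule is easy to evaluate by hand. The sketch below plugs made-up counts (K = 2 topics, V = 3 terms; every number is invented for illustration) into the rule and prints the normalised distribution from which the new topic would be drawn:

alpha, beta, K, V = 0.1, 0.1, 2, 3
nw_t = [3, 1]    # n_k^(t): occurrences of the current term t in each topic, word excluded
nwsum = [4, 4]   # n_k: total words assigned to each topic, word excluded
nd_m = [2, 3]    # n_m^(k): words of the current document in each topic, word excluded
ndsum = 5        # n_m: length of the current document, word excluded

p = [(nw_t[k] + beta) / (nwsum[k] + V * beta) *
     (nd_m[k] + alpha) / (ndsum + K * alpha) for k in range(K)]
print([x / sum(p) for x in p])   # normalised full conditional over the K topics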
The pseudocode for learning the LDA parameters via Gibbs sampling is as follows:
Input: word vectors {w}, hyperparameters α, β, topic number K
Global data: count statistics n_m^(k), n_k^(t) and their sums n_m, n_k; memory for the full conditional array p(z_i = k | ·)
Output: topic associations {z}, multinomial parameters Θ and Φ, hyperparameter estimates α, β
// initialisation
zero all count variables: n_m^(k), n_m, n_k^(t), n_k
for all documents m ∈ [1, M] do:
    for all words n ∈ [1, N_m] in document m do:
        sample topic index z_{m,n} = k ~ Mult(1/K)
        increment document–topic count: n_m^(k) += 1
        increment document–topic sum: n_m += 1
        increment topic–term count: n_k^(t) += 1
        increment topic–term sum: n_k += 1
// Gibbs sampling over burn-in period and sampling period
while not finished do:
    for all documents m ∈ [1, M] do:
        for all words n ∈ [1, N_m] in document m do:
            // for the current assignment of k to the term t of word w_{m,n}
            decrement counts and sums: n_m^(k) -= 1; n_m -= 1; n_k^(t) -= 1; n_k -= 1
            // multinomial sampling according to the Gibbs updating rule above
            sample topic index k̃ ~ p(z_i | z_¬i, w)
            // for the new assignment of k̃ to the term t of word w_{m,n}
            increment counts and sums: n_m^(k̃) += 1; n_m += 1; n_k̃^(t) += 1; n_k̃ += 1
Implementing LDA in Python
import os
import random
alpha = 0.1                           # document-topic Dirichlet prior
beta = 0.1                            # topic-word Dirichlet prior
K = 10                                # number of topics
iter_num = 50                         # number of Gibbs sampling iterations
top_words = 20                        # number of top words to write out per topic
wordmapfile = './model/wordmap.txt'   # where the word-id mapping is saved
trnfile = "./model/test.dat"          # training corpus, one document per line
modelfile_suffix = "./model/final"    # prefix for all output model files
class Document(object):
def __init__(self):
self.words = []
self.length = 0
class Dataset(object):
def __init__(self):
self.M = 0
self.V = 0
self.docs = []
        self.word2id = {}  # word (string) -> id (int)
        self.id2word = {}  # id (int) -> word (string)
def writewordmap(self):
with open(wordmapfile, 'w') as f:
for k,v in self.word2id.items():
f.write(k + '\t' + str(v) + '\n')
class Model(object):
def __init__(self, dset):
self.dset = dset
self.K = K
self.alpha = alpha
self.beta = beta
self.iter_num = iter_num
self.top_words = top_words
self.wordmapfile = wordmapfile
self.trnfile = trnfile
self.modelfile_suffix = modelfile_suffix
        self.p = []      # temporary buffer for the full conditional over the K topics
        self.Z = []      # M x doc.length: current topic assignment of every word
        self.nw = []     # V x K: number of times term i is assigned to topic j
        self.nwsum = []  # K: total number of words assigned to topic i
        self.nd = []     # M x K: number of words in document i assigned to topic j
        self.ndsum = []  # M: total number of words in document i
        self.theta = []  # document-topic distributions
        self.phi = []    # topic-word distributions
def init_est(self):
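        # allocate the count statistics and give every word a random initial topic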
self.p = [0.0 for x in range(self.K)]
self.nw = [ [0 for y in range(self.K)] for x in range(self.dset.V) ]
self.nwsum = [ 0 for x in range(self.K)]
self.nd = [ [ 0 for y in range(self.K)] for x in range(self.dset.M)]
self.ndsum = [ 0 for x in range(self.dset.M)]
self.Z = [ [] for x in range(self.dset.M)]
for x in range(self.dset.M):
self.Z[x] = [0 for y in range(self.dset.docs[x].length)]
self.ndsum[x] = self.dset.docs[x].length
for y in range(self.dset.docs[x].length):
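                # draw a uniformly random initial topic and update all four count tables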
topic = random.randint(0, self.K-1)
self.Z[x][y] = topic
self.nw[self.dset.docs[x].words[y]][topic] += 1
self.nd[x][topic] += 1
self.nwsum[topic] += 1
self.theta = [ [0.0 for y in range(self.K)] for x in range(self.dset.M) ]
self.phi = [ [ 0.0 for y in range(self.dset.V) ] for x in range(self.K)]
def estimate(self):
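        # run iter_num full Gibbs sweeps over every word, then derive theta and phi and save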
print ('Sampling %d iterations!' % self.iter_num)
for x in range(self.iter_num):
print( 'Iteration %d ...' % (x+1))
for i in range(len(self.dset.docs)):
for j in range(self.dset.docs[i].length):
topic = self.sampling(i, j)
self.Z[i][j] = topic
print ('End sampling.')
print ('Compute theta...')
self.compute_theta()
print ('Compute phi...')
self.compute_phi()
print ('Saving model...')
self.save_model()
    def sampling(self, i, j):
        # remove the current word's assignment from all count statistics
        topic = self.Z[i][j]
        wid = self.dset.docs[i].words[j]
        self.nw[wid][topic] -= 1
        self.nd[i][topic] -= 1
        self.nwsum[topic] -= 1
        self.ndsum[i] -= 1
        Vbeta = self.dset.V * self.beta
        Kalpha = self.K * self.alpha
        # full conditional p(z = k | rest) for every topic, up to a constant
        for k in range(self.K):
            self.p[k] = (self.nw[wid][k] + self.beta)/(self.nwsum[k] + Vbeta) * \
                        (self.nd[i][k] + self.alpha)/(self.ndsum[i] + Kalpha)
        # cumulate the unnormalised probabilities and draw a new topic from them
        for k in range(1, self.K):
            self.p[k] += self.p[k-1]
        u = random.uniform(0, self.p[self.K-1])
        for topic in range(self.K):
            if self.p[topic] > u:
                break
self.nw[wid][topic] += 1
self.nwsum[topic] += 1
self.nd[i][topic] += 1
self.ndsum[i] += 1
return topic
def compute_theta(self):
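        # smoothed point estimate of each document's topic distribution from the counts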
for x in range(self.dset.M):
for y in range(self.K):
self.theta[x][y] = (self.nd[x][y] + self.alpha) \
/(self.ndsum[x] + self.K * self.alpha)
def compute_phi(self):
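        # smoothed point estimate of each topic's word distribution from the counts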
for x in range(self.K):
for y in range(self.dset.V):
self.phi[x][y] = (self.nw[y][x] + self.beta)\
/(self.nwsum[x] + self.dset.V * self.beta)
def save_model(self):
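        # write theta, phi, the top words per topic, the per-word topic assignments and the settings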
with open(self.modelfile_suffix+'.theta', 'w') as ftheta:
for x in range(self.dset.M):
for y in range(self.K):
ftheta.write(str(self.theta[x][y]) + ' ')
ftheta.write('\n')
with open(self.modelfile_suffix+'.phi', 'w') as fphi:
for x in range(self.K):
for y in range(self.dset.V):
fphi.write(str(self.phi[x][y]) + ' ')
fphi.write('\n')
with open(self.modelfile_suffix+'.twords','w') as ftwords:
if self.top_words > self.dset.V:
self.top_words = self.dset.V
for x in range(self.K):
                ftwords.write('Topic ' + str(x) + ':\n')
topic_words = []
for y in range(self.dset.V):
topic_words.append((y, self.phi[x][y]))
                # sort the topic's terms by probability in descending order
topic_words.sort(key=lambda x:x[1], reverse=True)
for y in range(self.top_words):
word = self.dset.id2word[topic_words[y][0]]
ftwords.write('\t'+word+'\t'+str(topic_words[y][1])+'\n')
with open(self.modelfile_suffix+'.tassign','w') as ftassign:
for x in range(self.dset.M):
for y in range(self.dset.docs[x].length):
ftassign.write(str(self.dset.docs[x].words[y])+':'+str(self.Z[x][y])+' ')
ftassign.write('\n')
with open(self.modelfile_suffix+'.others','w') as fothers:
fothers.write('alpha = '+str(self.alpha)+'\n')
fothers.write('beta = '+str(self.beta)+'\n')
fothers.write('ntopics = '+str(self.K)+'\n')
fothers.write('ndocs = '+str(self.dset.M)+'\n')
fothers.write('nwords = '+str(self.dset.V)+'\n')
fothers.write('liter = '+str(self.iter_num)+'\n')
def readtrnfile():
print ('Reading train data...')
with open(trnfile, 'r') as f:
docs = f.readlines()
dset = Dataset()
items_idx = 0
    for line in docs:
        tmp = line.strip().split()
        # readlines() keeps the trailing '\n', so test the stripped tokens to skip blank lines
        if not tmp:
            continue
        # build a Document object for this line
        doc = Document()
        for item in tmp:
            if item in dset.word2id:
                doc.words.append(dset.word2id[item])
            else:
                dset.word2id[item] = items_idx
                dset.id2word[items_idx] = item
                doc.words.append(items_idx)
                items_idx += 1
        doc.length = len(tmp)
        dset.docs.append(doc)
dset.M = len(dset.docs)
dset.V = len(dset.word2id)
print ('There are %d documents' % dset.M)
print ('There are %d items' % dset.V)
print ('Saving wordmap file...')
dset.writewordmap()
return dset
def lda():
    # all output files live under ./model/, so make sure the directory exists
    os.makedirs('./model', exist_ok=True)
    dset = readtrnfile()
    model = Model(dset)
    model.init_est()
    model.estimate()
if __name__=='__main__':
lda()
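To try the script: readtrnfile() expects the training file ./model/test.dat to contain one document per line, with tokens separated by whitespace and any preprocessing (tokenisation, stop-word removal) already done. A made-up three-document example:

apple banana apple fruit
goal match team goal
fruit banana juice

Running the script (for example as python lda.py, assuming that file name) then writes final.theta, final.phi, final.twords, final.tassign and final.others under ./model/.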
Finally, for guidance on choosing LDA's hyperparameters, see these links: 【1】【2】【3】. If you need the complete data files and code, leave a comment.