Visualizing word2vec word-vector results with t-SNE: word2vec word-vector representation

Reposted

mob6454cc7c0428 2024-08-13 13:44:28

Tags: List, python, word vectors. Category: machine learning, artificial intelligence

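The original word2vec (word vector) code is written in C. There is also a Python version, which ships as part of gensim, a seriously impressive framework.

I use these features in my own open-source semantic-network project, graph-mind (really just a toy I wrote for myself). You can use the wrappers I built on top of them to run some operations with almost no effort. Below I share how to call them, along with a few things I learned while writing the code.

1. Some class member variables: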

```python
import multiprocessing

# constructor of the wrapper class
def __init__(self, modelPath, _size=100, _window=5, _minCount=1,
             _workers=multiprocessing.cpu_count()):
    self.modelPath = modelPath
    self._size = _size
    self._window = _window
    self._minCount = _minCount
    self._workers = _workers
```

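modelPath is the file on disk where the trained word2vec model is stored (a model that only lives in memory never feels safe). _size is the dimensionality of the word vectors, and _window is the size of the context window scanned during training. _minCount is passed to gensim as min_count: words that occur fewer than that many times in the corpus are ignored. _workers is the number of worker threads used for training, defaulting to the number of processor cores on the current machine. (In gensim 4.0 and later the size argument is named vector_size.) For now it's enough to remember these parameters.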
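To make the window and minimum-count parameters concrete, here is a small pure-Python sketch. It does not call gensim; the function names and the toy corpus are made up for illustration, and real word2vec training additionally shrinks the window at random for each position, which this sketch leaves out.

```python
from collections import Counter

def filter_by_min_count(sentences, min_count):
    """Drop tokens whose total corpus frequency is below min_count."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return [[t for t in sentence if counts[t] >= min_count]
            for sentence in sentences]

def context_pairs(sentence, window):
    """List (center, context) pairs within `window` positions on each side."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs

# Toy corpus (invented): only "the" and "sat" occur at least twice.
toy_corpus = [["the", "cat", "sat", "on", "the", "mat"],
              ["the", "dog", "sat"]]
kept = filter_by_min_count(toy_corpus, min_count=2)
pairs = context_pairs(kept[0], window=1)
print(kept)
print(pairs)
```

A larger window produces more (center, context) pairs per sentence, and a larger minimum count shrinks the vocabulary before any pairs are formed.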

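2. Initializing and first-time training of the word2vec model

The core function for this is initTrainWord2VecModel. It takes two arguments, corpusFilePath and safe_model: the path of the training corpus, and whether to run the initial training in "safe mode". Safe mode is explained further down; first, the code: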

```python
import warnings

from gensim.models.word2vec import LineSentence, Word2Vec

# extraSegOpt and localFileOptUnit are helpers from the graph-mind project.

def initTrainWord2VecModel(self, corpusFilePath, safe_model=False):
    '''
    init and train a new w2v model
    (corpusFilePath can be a path of corpus file or directory or a file directly,
    in some time it can be sentences directly
    about safe_model:
        if safe_model is true, the process of training uses update way to refresh model,
    and this can keep the usage of os's memory safe but slowly.
        and if safe_model is false, the process of training uses the way that load all
    corpus lines into a sentences list and train them one time.)
    '''
    extraSegOpt().reLoadEncoding()

    fileType = localFileOptUnit.checkFileState(corpusFilePath)
    if fileType == u'error':
        warnings.warn('load file error!')
        return None
    else:
        model = None
        if fileType == u'opened':
            print('training model from singleFile!')
            model = Word2Vec(LineSentence(corpusFilePath), size=self._size, window=self._window, min_count=self._minCount, workers=self._workers)
        elif fileType == u'file':
            corpusFile = open(corpusFilePath, 'r')
            print('training model from singleFile!')
            model = Word2Vec(LineSentence(corpusFile), size=self._size, window=self._window, min_count=self._minCount, workers=self._workers)
        elif fileType == u'directory':
            corpusFiles = localFileOptUnit.listAllFileInDirectory(corpusFilePath)
            print('training model from listFiles of directory!')
            if safe_model:
                # safe mode: train on the first file, then update with the rest
                model = Word2Vec(LineSentence(corpusFiles[0]), size=self._size, window=self._window, min_count=self._minCount, workers=self._workers)
                for file in corpusFiles[1:]:
                    model = self.updateW2VModelUnit(model, file)
            else:
                sentences = self.loadSetencesFromFiles(corpusFiles)
                model = Word2Vec(sentences, size=self._size, window=self._window, min_count=self._minCount, workers=self._workers)
        elif fileType == u'other':
            # TODO add sentences list directly
            pass

        if model is None:
            # nothing was trained (the 'other' branch is still a TODO)
            return None
        model.save(self.modelPath)
        model.init_sims()
        print('producing word2vec model ... ok!')
        return model
```

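The first part is housekeeping: it checks what kind of thing the input path resolves to and handles each type differently. That should be easy to follow. Taking the case where corpusFilePath is an already-opened file object, the code that creates the word2vec model is: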

```python
model = Word2Vec(LineSentence(corpusFilePath), size=self._size, window=self._window, min_count=self._minCount, workers=self._workers)
```

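It really is that simple; the function above only got so long because I wanted it to be more robust. The problem comes when a directory holds a very large number of training documents: loading everything into memory at once may not be a good idea. (I haven't looked closely at whether gensim's Word2Vec constructor already deals with this; it's just habitual caution. LineSentence itself reads its file one line at a time, whereas the non-safe directory branch builds a complete in-memory sentence list through loadSetencesFromFiles.) So I added a parameter, safe_model, that decides whether the initial training runs in "safe mode". Safe mode means that only the first corpus file is loaded at the start, and the remaining initial-training documents are folded into the existing model afterwards through incremental training.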
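Another way to keep memory use flat is to hand the trainer an iterable that reads the files lazily. Here's a minimal sketch of that idea, written for Python 3 rather than py27. StreamedSentences is a hypothetical class of my own, not part of gensim or graph-mind, and the two temporary files exist only for the demonstration.

```python
import os
import shutil
import tempfile

class StreamedSentences(object):
    """Yield whitespace-tokenized lines from several files, one line at a time."""

    def __init__(self, paths):
        self.paths = list(paths)

    def __iter__(self):
        # A fresh pass starts on every iteration, so the object can be
        # consumed more than once (vocabulary scan, then training passes).
        for path in self.paths:
            with open(path, "r", encoding="utf-8") as handle:
                for line in handle:
                    tokens = line.split()
                    if tokens:  # skip blank lines
                        yield tokens

# Demonstration with two small temporary corpus files.
tmp_dir = tempfile.mkdtemp()
paths = []
for name, text in [("a.txt", "hello word vectors\n\n"),
                   ("b.txt", "gensim streams lines\n")]:
    path = os.path.join(tmp_dir, name)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    paths.append(path)

sentences = StreamedSentences(paths)
first_pass = list(sentences)
second_pass = list(sentences)
shutil.rmtree(tmp_dir)
print(first_pass)
```

Only one line is held in memory at a time, and because `__iter__` reopens the files, the same object can be iterated repeatedly, which a plain generator could not do.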

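In the code above, corpusFilePath can be an already-opened file object, the path of a single file, or the path of a directory; the function checkFileState works out which one it is. The other function, updateW2VModelUnit, incrementally trains and updates the w2v model and is covered below. loadSetencesFromFiles loads every sentence from all the corpus files in a directory. It's in the source code and is very simple, so I won't go into it.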
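The body of checkFileState isn't shown in this post. Purely as an illustration of that kind of type dispatch, here's a hypothetical stand-in (Python 3) that returns the same labels the code branches on: 'opened', 'file', 'directory', 'error' and 'other'. The rules for each label are my assumption, not the graph-mind implementation.

```python
import os
import tempfile

def check_file_state(target):
    """Classify `target` with the labels used by the training function."""
    if hasattr(target, "read"):
        return "opened"          # an already-opened file object
    if isinstance(target, str):
        if os.path.isfile(target):
            return "file"
        if os.path.isdir(target):
            return "directory"
        return "error"           # a path string that points nowhere
    return "other"               # e.g. an in-memory list of sentences

# Demonstration against a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "corpus.txt")
    with open(file_path, "w") as handle:
        handle.write("a b c\n")
    with open(file_path) as opened:
        state_opened = check_file_state(opened)
    state_file = check_file_state(file_path)
    state_dir = check_file_state(tmp_dir)
    state_missing = check_file_state(os.path.join(tmp_dir, "missing.txt"))
state_other = check_file_state([["a", "b"]])
print(state_opened, state_file, state_dir, state_missing, state_other)
```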

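3. Incrementally training and updating the word2vec model

One reason for training the w2v model incrementally was mentioned above: it avoids loading the whole training corpus into memory at once. The other reason is to cope with a corpus that keeps growing. gensim naturally provides a solution for this, called like so: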

```python
def updateW2VModelUnit(self, model, corpusSingleFilePath):
    '''
    (only can be a singleFile)
    '''
    fileType = localFileOptUnit.checkFileState(corpusSingleFilePath)
    if fileType == u'directory':
        warnings.warn('can not deal a directory!')
        return model

    if fileType == u'opened':
        trainedWordCount = model.train(LineSentence(corpusSingleFilePath))
        print('update model, update words num is: ' + str(trainedWordCount))
    elif fileType == u'file':
        corpusSingleFile = open(corpusSingleFilePath, 'r')
        trainedWordCount = model.train(LineSentence(corpusSingleFile))
        print('update model, update words num is: ' + str(trainedWordCount))
    else:
        # TODO add sentences list directly (same as last function)
        pass
    return model
```

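After a quick check of the file type, calling the model object's train method updates the model. The method takes the sentences of the new corpus. As far as I can tell, the value it returns is the number of words it trained on, not the number of words newly added to the model: train skips words that are not already in the model's vocabulary. Newer gensim releases also require the total_examples and epochs arguments when calling train, and extend the vocabulary through build_vocab(..., update=True). Once the function finishes, it returns the updated model. In the source code there is an enhanced method below this one that handles the same range of file arguments as in section 2; I won't cover it here.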
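To show what kind of number a trained-word count is, here's a pure-Python sketch that does not call gensim. It rests on my understanding that train only touches words already in the vocabulary; gensim's real count can also be affected by subsampling, which is ignored here. The function name and the toy data are invented.

```python
def count_trainable_tokens(vocabulary, sentences):
    """Return (number of tokens inside the vocabulary, set of unseen words)."""
    in_vocab = 0
    unseen = set()
    for sentence in sentences:
        for token in sentence:
            if token in vocabulary:
                in_vocab += 1
            else:
                unseen.add(token)
    return in_vocab, unseen

# Vocabulary learned from the first corpus file (toy values).
vocabulary = {"word", "vector", "model"}
# Sentences from a later file, containing two words the model has never seen.
new_sentences = [["word", "vector", "embedding"], ["model", "word", "tsne"]]
trainable, unseen = count_trainable_tokens(vocabulary, new_sentences)
print(trainable, sorted(unseen))
```

In this toy case four tokens would take part in the update, while "embedding" and "tsne" would be skipped unless the vocabulary is extended first. That is worth keeping in mind for safe mode, where the vocabulary comes from the first file only.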

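4. Basic queries

Once you're sure the model is fully trained and won't be updated again, you can lock it. Locking precomputes the L2-normalized word vectors that the similarity queries work on and, with replace=True, discards the original vectors to save memory. The price is that your model is read-only from then on.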

```python
def finishTrainModel(self, modelFilePath=None):
    '''
    warning: after this, the model is read-only (can't be update)
    '''
    if modelFilePath is None:
        modelFilePath = self.modelPath
    model = self.loadModelfromFile(modelFilePath)
    model.init_sims(replace=True)
```

As you can see, the so-called locking method is just init_sims with its replace parameter set to True.
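As a rough picture of what that normalization step amounts to, here's a pure-Python sketch with made-up two-dimensional vectors. It is not gensim's code; unit_normalize is a hypothetical helper.

```python
import math

def unit_normalize(vectors):
    """Return a new dict in which every vector is scaled to L2 norm 1."""
    normalized = {}
    for word, vec in vectors.items():
        norm = math.sqrt(sum(x * x for x in vec))
        # Leave an all-zero vector untouched instead of dividing by zero.
        normalized[word] = [x / norm for x in vec] if norm else list(vec)
    return normalized

raw_vectors = {"king": [3.0, 4.0], "queen": [0.0, 2.0]}
unit_vectors = unit_normalize(raw_vectors)

# The replace=True behaviour corresponds to keeping only the normalized
# copy: the raw vectors are gone, so training cannot continue from them.
raw_vectors = unit_vectors
print(unit_vectors)
```

With unit-length vectors, the cosine similarity of two words is just their dot product, which is why similarity queries benefit from doing this once up front.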

Next, some query methods for the word2vec model:

```python
def getWordVec(self, model, wordStr):
    '''
    get the word's vector as arrayList type from w2v model
    '''
    # py27: decode the byte string, then index the model by word
    return model[wordStr.decode('utf-8')]
```

```python
def queryMostSimilarWordVec(self, model, wordStr, topN=20):
    '''
    MSimilar words basic query function
    return 2-dim List [0] is word [1] is double-prob
    '''
    similarPairList = model.most_similar(wordStr.decode('utf-8'), topn=topN)
    return similarPairList
```

```python
def culSimBtwWordVecs(self, model, wordStr1, wordStr2):
    '''
    two words similar basic query function
    return double-prob
    '''
    similarValue = model.similarity(wordStr1.decode('utf-8'), wordStr2.decode('utf-8'))
    return similarValue
```

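These methods are all simple, essentially one line each. In the source code each of them is again followed by a companion version that takes a model file. getWordVec returns the word2vec vector of the query word itself; printed, it is an array of plain numbers. queryMostSimilarWordVec returns the N words most closely related to the query word together with their similarity scores, as a two-dimensional list (the docstring spells this out). culSimBtwWordVecs returns the similarity between two given words as a single double value. The scores are cosine similarities rather than probabilities, despite the "double-prob" wording in the docstrings.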
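To show what the two similarity queries compute, here's a pure-Python sketch using cosine similarity over a tiny hand-made vector table. It is an illustration only: the vectors are invented, and gensim's own implementation is vectorized with NumPy.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(vectors, word, topn=20):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Invented 2-d vectors: "kitten" points almost the same way as "cat".
toy_vectors = {
    "cat": [1.0, 0.0],
    "kitten": [0.9, 0.1],
    "car": [0.0, 1.0],
}
neighbours = most_similar(toy_vectors, "cat", topn=2)
print(neighbours)
```

The result has the same shape as the list described above: (word, score) pairs sorted from most to least similar.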

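5. Word2Vec word-vector arithmetic

Anyone who has studied the theory behind w2v will know that word vectors can be added and subtracted. gensim builds on this property and provides a matching method, called like so: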

```python
def queryMSimilarVecswithPosNeg(self, model, posWordStrList, negWordStrList, topN=20):
    '''
    pos-neg MSimilar words basic query function
    return 2-dim List [0] is word [1] is double-prob
    '''
    # py27: decode the incoming byte strings before querying
    posWordList = []
    negWordList = []
    for wordStr in posWordStrList:
        posWordList.append(wordStr.decode('utf-8'))
    for wordStr in negWordStrList:
        negWordList.append(wordStr.decode('utf-8'))
    pnSimilarPairList = model.most_similar(positive=posWordList, negative=negWordList, topn=topN)
    return pnSimilarPairList
```

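Because I'm on py27, the incoming word lists are decoded first. posWordList can be thought of as the set of words that contribute positively to the result, and negWordList as the set that contributes negatively. Both are passed to most_similar together with the topN for the returned answers. The result has the same form as that of queryMostSimilarWordVec in section 4. You can think of the operation mathematically like this: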

[Figure: formula for the positive/negative word-vector query; image not preserved]
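The formula image that belonged at this point is not available, so the following is a stand-in based on my reading of how most_similar scores candidates, not a copy of the original figure. Here $\hat{v}_w$ is the unit-normalized vector of word $w$, $P$ is the positive word list and $N$ the negative word list; the symbol names are my own.

```latex
\[
q = \frac{1}{|P| + |N|}
    \left( \sum_{p \in P} \hat{v}_p \;-\; \sum_{n \in N} \hat{v}_n \right),
\qquad
\mathrm{score}(w) = \cos\left(q, \hat{v}_w\right)
                  = \frac{q \cdot \hat{v}_w}{\lVert q \rVert}
\]
```

The topN words with the highest score are returned, leaving out the words that were given as input.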

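The next operation is my own invention. Suppose I want to express the relationship between two words, or two groups of words, in the same topN "word, relatedness" form as above. This is how I do it: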

```python
def copeMSimilarVecsbtwWordLists(self, model, wordStrList1, wordStrList2, topN_rev=20, topN=20):
    '''
    range word vec res for two wordList from source to target
    use wordVector to express the relationship between src-wordList and tag-wordList
    first, use the tag-wordList as neg-wordList to get the rev-wordList,
    then use the src-wordList and the rev-wordList as the new src-tag-wordList
    topN_rev is topN of rev-wordList and topN is the final topN of relationship vec
    '''
    srcWordList = []
    tagWordList = []
    srcWordList.extend(wordStr.decode('utf-8') for wordStr in wordStrList1)
    tagWordList.extend(wordStr.decode('utf-8') for wordStr in wordStrList2)

    # step 1: use the target words as negatives to get the relay ("reverse") words
    revSimilarPairList = self.queryMSimilarVecswithPosNeg(model, [], tagWordList, topN_rev)
    revWordList = []
    revWordList.extend(pair[0].decode('utf-8') for pair in revSimilarPairList)
    # step 2: source words as positives, relay words as negatives
    stSimilarPairList = self.queryMSimilarVecswithPosNeg(model, srcWordList, revWordList, topN)
    return stSimilarPairList
```

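The idea is this. First, one of the two word groups is passed as negWordList to queryMSimilarVecswithPosNeg above, which yields a topN group of relay words. Then these relay words and the other original group go through queryMSimilarVecswithPosNeg again, with the relay words once more on the negative side. It's easy to see why: the first step produces the "reverse" of one group, because that group was used as negWordList, and combining this reverse result with the other group gives a "two negatives make a positive" effect. A single topN list of "word, relatedness" pairs can then represent the relationship between the two groups.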
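The two-step relay can be tried out without gensim. The following is a pure-Python sketch with invented two-dimensional vectors; pos_neg_query imitates a positive/negative most-similar query under the assumption that candidates are ranked by cosine similarity to the signed sum of unit vectors, and all names here are hypothetical.

```python
import math

def _unit(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def pos_neg_query(vectors, positive, negative, topn):
    """Rank words by cosine similarity to sum(+positive, -negative) unit vectors."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    signed = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    for word, sign in signed:
        for i, x in enumerate(_unit(vectors[word])):
            total[i] += sign * x
    query = _unit(total)
    used = set(positive) | set(negative)  # input words are not returned
    scored = [(w, sum(a * b for a, b in zip(query, _unit(v))))
              for w, v in vectors.items() if w not in used]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

def relay_query(vectors, src_words, tag_words, topn_rev, topn):
    """Step 1: target words as negatives give relay words.
    Step 2: source words as positives, relay words as negatives."""
    rev_pairs = pos_neg_query(vectors, [], tag_words, topn_rev)
    rev_words = [word for word, _ in rev_pairs]
    return pos_neg_query(vectors, src_words, rev_words, topn)

# Invented toy vectors; "berlin" is placed roughly opposite the others.
toy_vectors = {
    "paris": [1.0, 0.2],
    "france": [0.9, 0.4],
    "tokyo": [0.2, 1.0],
    "japan": [0.4, 0.9],
    "berlin": [-1.0, -0.3],
}
relation = relay_query(toy_vectors, ["paris"], ["japan"], topn_rev=1, topn=2)
print(relation)
```

With these toy values the relay word is the one pointing away from the target group, and the final ranking favours words that lie between the source and the original target. The sketch only illustrates the mechanics; how meaningful the result is on a real model is something to check experimentally.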

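More details are in the source code of my open-source project graph-mind. There are bound to be problems in it, so I'd welcome more discussion; replying directly is best. WeChat: superhy199148, email: superhy199148@hotmail.com.

If I get any deeper results, I'll blog about them promptly.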

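This article is a repost; we respect the original author's copyright in it. If there are content errors or infringement concerns, the original author is welcome to contact us to have the content corrected or the article removed.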
