R data structures and dictionaries: jiebaR for R

Reposted

架构魔法师 2023-10-22 10:04:07


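Introduction

jiebaR is the R version of the "jieba" Chinese word segmenter. It supports four segmentation modes: Maximum Probability, Hidden Markov Model, the index model (QuerySegment), and the mixed model (MixSegment). It also provides part-of-speech tagging, keyword extraction, and Simhash-based text similarity comparison. The project is built with Rcpp and CppJieba.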

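Features

Supports Windows and Linux (Mac has not been tested yet).
Uses Rcpp Modules so that several segmentation engines can be loaded at the same time, each with its own segmentation mode and dictionary.
Supports multiple segmentation modes, Chinese name recognition, keyword extraction, part-of-speech tagging, and Simhash text similarity comparison.
Supports loading a custom user dictionary with word frequencies and part-of-speech tags.
Supports both Simplified and Traditional Chinese.
Supports automatic detection of file encoding.
Faster than the original "jieba" segmenter, and 5 to 20 times as fast as other R segmentation packages.
Simple to install, with no complicated setup.
Can be called from other languages through Rpy2, jvmr, and similar bridges.
Released under the MIT license.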

Installation

The package has been published on CRAN and can also be installed from GitHub.
*Note: this article uses an Ubuntu environment.

install.packages("jiebaR")
library(jiebaR)

# Or install from GitHub
install.packages("devtools")
library(devtools)
install_github("qinwf/jiebaR")
library(jiebaR)

Usage

jiebaR provides four segmentation modes. Initialize a segmentation engine with worker() and segment text with segment().

library(jiebaR)
# Accept the default arguments and create a segmentation engine
mixseg = worker()
# Equivalent to:
# worker( type = "mix", dict = "inst/dict/jieba.dict.utf8",
#         hmm  = "inst/dict/hmm_model.utf8",    # HMM model data
#         user = "inst/dict/user.dict.utf8")    # user-defined dictionary
# From the help page: worker() initializes jiebaR workers. You can initialize
# different kinds of workers including mix, mp, hmm, query, tag, simhash, and keywords.

mixseg <= "广东省深圳市联通"    # <= is the segmentation operator
# Equivalent to calling segment(), which arguably reads more clearly
segment(code = "广东省深圳市联通", jiebar = mixseg)
# code    A Chinese sentence or the path of a text file.
# jiebar  jiebaR Worker

# Segmentation result
# [1] "广东省" "深圳市" "联通" 
mixseg <= "你知道我不知道"
# [1] "你"   "知道" "我"   "不"   "知道"
mixseg <= "我昨天参加了同学婚礼"
# [1] "我"   "昨天" "参加" "了"   "同学" "婚礼"
The segmentation results are reasonably good.

Segmenting a file is also supported

mixseg <= "/cwj/thunder/jieba.txt"
# Path of the generated output file
[1] "/cwj/thunder/jieba.segment1424152152.15737.txt"

Now look at the generated result in that directory on Linux

root@r-test:/cwj/thunder# ls -l
total 550636
-rw-r--r-- 1 root    root    267025364 Feb 14 21:44 a
-rw-r--r-- 1 root    root    267025364 Feb 15 02:19 a1
drwxr-xr-x 3 root    root        65536 Feb 15 08:51 aliOSS
-rw-r--r-- 1 root    root      2144493 Feb 15 03:14 b
-rw-r--r-- 1 root    root           38 Feb 15 01:04 file1
-rw-r--r-- 1 root    root           25 Feb 14 22:12 file2
-rw-r--r-- 1 root    root            0 Feb 15 22:54 file3
-rw-r--r-- 1 rstudio rstudio  12027150 Feb 15 02:55 hbase_meta_cp_TV_movie
-rw-r--r-- 1 root    root     12027150 Feb 15 02:55 hbase_meta_cp_TV_movie1
-rw-r--r-- 1 root    root      3499750 Feb 15 03:55 hbase_meta_cp_TV_movie.gz
-rw-r--r-- 1 rstudio rstudio       135 Feb 17 00:49 jieba.segment1424152152.15737.txt
-rw-r--r-- 1 root    root          125 Feb 17 00:47 jieba.txt
---x--x--x 1 root    root          406 Feb 15 05:55 mergeData.sh

root@r-test:/cwj/thunder# cat jieba.txt 
今天是一个伟大的日子，我们的人民是幸福的，我们的日子是美好的，我们的生活是甜蜜的。
root@r-test:/cwj/thunder# cat jieba.segment1424152152.15737.txt
今天 是 一个 伟大 的 日子 我们 的 人民 是 幸福 的 我们 的 日子 是 美好 的 我们 的 生活 是 甜蜜 的

Note:
When loading a segmentation engine you can set custom dictionary paths, and you can start different engines:

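Maximum Probability (MPSegment) builds a directed acyclic graph (DAG) from a Trie and runs dynamic programming over it. It is the core of the segmentation algorithm.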
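
The idea behind the maximum probability method can be sketched in base R. This is a minimal illustration and not the CppJieba implementation: the dictionary, its frequencies, the `max_len` limit, and the fallback frequency for unknown single characters are all made-up assumptions, and a plain named vector stands in for the Trie.

```r
# Toy dictionary: word -> frequency. These counts are invented for illustration
# and do not come from jieba.dict.utf8.
dict <- c("广东" = 50, "广东省" = 30, "深圳" = 40, "深圳市" = 25, "联通" = 15)

mp_segment <- function(sentence, dict, max_len = 4) {
  chars <- strsplit(sentence, "")[[1]]
  n <- length(chars)
  log_total <- log(sum(dict))
  best <- c(rep(-Inf, n), 0)   # best[i]: best log-probability for chars[i..n]
  cut_at <- integer(n)         # cut_at[i]: end index of the word chosen at i
  # Dynamic programming from the end of the sentence to the start
  for (i in n:1) {
    for (j in i:min(n, i + max_len - 1)) {
      word <- paste(chars[i:j], collapse = "")
      if (word %in% names(dict)) {
        freq <- dict[[word]]
      } else if (j == i) {
        freq <- 0.5            # assumed fallback so every character has an edge
      } else {
        next                   # not a dictionary word: no edge i -> j in the DAG
      }
      score <- log(freq) - log_total + best[j + 1]
      if (score > best[i]) {
        best[i] <- score
        cut_at[i] <- j
      }
    }
  }
  # Follow the recorded cuts from left to right
  words <- character(0)
  i <- 1
  while (i <= n) {
    words <- c(words, paste(chars[i:cut_at[i]], collapse = ""))
    i <- cut_at[i] + 1
  }
  words
}

result <- mp_segment("广东省深圳市联通", dict)
print(result)
```

With this toy dictionary the path through "广东省", "深圳市", and "联通" scores higher than any path that splits off "广东" or "深圳" and leaves a single unknown character behind.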

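The Hidden Markov Model (HMMSegment) segments text with an HMM built from corpora such as the People's Daily. The main idea is to describe the hidden state of each character with one of four states: B (begin of a word), M (middle), E (end), and S (single-character word). The HMM is provided by dict/hmm_model.utf8, and decoding uses the Viterbi algorithm.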
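
A small Viterbi decoder over the four states shows how tagging turns into segmentation. It is only a sketch: every probability below is invented for illustration (the real parameters live in hmm_model.utf8), and the emission table covers just the four characters of the example sentence, so other input would fail.

```r
states <- c("B", "M", "E", "S")

# Hypothetical parameters for illustration only.
start_p <- c(B = 0.6, M = 0, E = 0, S = 0.4)
trans_p <- matrix(0, 4, 4, dimnames = list(states, states))  # trans_p[from, to]
trans_p["B", c("M", "E")] <- c(0.2, 0.8)
trans_p["M", c("M", "E")] <- c(0.3, 0.7)
trans_p["E", c("B", "S")] <- c(0.5, 0.5)
trans_p["S", c("B", "S")] <- c(0.5, 0.5)
# emit_p[state, character], filled column by column
emit_p <- matrix(c(0.1, 0.1, 0.1, 0.7,    # 我
                   0.2, 0.1, 0.1, 0.6,    # 爱
                   0.7, 0.1, 0.1, 0.1,    # 北
                   0.1, 0.1, 0.7, 0.1),   # 京
                 nrow = 4, dimnames = list(states, c("我", "爱", "北", "京")))

viterbi_segment <- function(sentence) {
  chars <- strsplit(sentence, "")[[1]]
  n <- length(chars)
  score <- matrix(-Inf, 4, n, dimnames = list(states, NULL))
  back <- matrix(0L, 4, n, dimnames = list(states, NULL))
  score[, 1] <- log(start_p) + log(emit_p[, chars[1]])
  for (pos in seq_len(n)[-1]) {
    for (s in states) {
      cand <- score[, pos - 1] + log(trans_p[, s])
      back[s, pos] <- which.max(cand)       # best previous state
      score[s, pos] <- max(cand) + log(emit_p[s, chars[pos]])
    }
  }
  # A sentence can only end on E or S
  path <- character(n)
  path[n] <- if (score["E", n] >= score["S", n]) "E" else "S"
  for (pos in rev(seq_len(n)[-1])) {
    path[pos - 1] <- states[back[path[pos], pos]]
  }
  # Start a new word after every E or S tag
  word_id <- cumsum(c(TRUE, path[-n] %in% c("E", "S")))
  unname(vapply(split(chars, word_id), paste, character(1), collapse = ""))
}

hmm_result <- viterbi_segment("我爱北京")
print(hmm_result)
```

Under these toy numbers the best tag sequence is S, S, B, E, which yields the words "我", "爱", and "北京".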

The mixed model (MixSegment) gives the better results among the four engines. It combines the maximum probability method with the Hidden Markov Model.

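The index model (QuerySegment) first cuts the sentence with the mixed model. Then, for the longer words that come out, it enumerates all possible sub-words and keeps those that exist in the dictionary.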
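
The second step of the index model can be sketched as follows. This is an assumption-laden illustration rather than the library's code: the word list, the `min_len` threshold for what counts as a "longer" word, and the function name `query_expand` are all hypothetical, and the input is assumed to be the output of a mixed-model cut.

```r
# Hypothetical dictionary entries used only for this example
dict_words <- c("中国", "科学", "学院", "科学院", "中国科学院")

query_expand <- function(words, dict_words, min_len = 3) {
  out <- character(0)
  for (w in words) {
    chars <- strsplit(w, "")[[1]]
    n <- length(chars)
    if (n >= min_len) {
      # Enumerate every proper substring of at least two characters
      for (i in seq_len(n - 1)) {
        for (j in (i + 1):n) {
          if (i == 1 && j == n) next        # the full word is appended below
          piece <- paste(chars[i:j], collapse = "")
          if (piece %in% dict_words) out <- c(out, piece)
        }
      }
    }
    out <- c(out, w)
  }
  out
}

query_result <- query_expand(c("我", "在", "中国科学院"), dict_words)
print(query_result)
```

Short words pass through unchanged, while a long word contributes its dictionary sub-words before the word itself, which is what makes this mode useful for building search indexes.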

mixseg2 = worker(type  = "mix", 
                 dict = "/home/rstudio/R/x86_64-pc-linux-gnu-library/3.1/jiebaRD/dict/jieba.dict.utf8", 
                 hmm   = "/home/rstudio/R/x86_64-pc-linux-gnu-library/3.1/jiebaRD/dict/hmm_model.utf8", 
                 user  = "/home/rstudio/R/x86_64-pc-linux-gnu-library/3.1/jiebaRD/dict/user.dict.utf8", 
                 detect = TRUE, symbol = FALSE,
                 lines = 1e+05, output = NULL
                 )
# detect: automatically detect the file encoding; lines: number of lines read from a file at a time

# Print the worker settings
mixseg2
# The output is:
Worker Type:  Mix Segment

Detect Encoding :  TRUE
Default Encoding:  UTF-8
Keep Symbols    :  FALSE
Output Path     :  
Write File      :  TRUE
By Lines        :  FALSE
Max Read Lines  :  1e+05

Fixed Model Components:  

$dict
[1] "/home/rstudio/R/x86_64-pc-linux-gnu-library/3.1/jiebaRD/dict/jieba.dict.utf8"

$hmm
[1] "/home/rstudio/R/x86_64-pc-linux-gnu-library/3.1/jiebaRD/dict/hmm_model.utf8"

$user
[1] "/home/rstudio/R/x86_64-pc-linux-gnu-library/3.1/jiebaRD/dict/user.dict.utf8"

$timestamp
[1] 1424155543

$detect $encoding $symbol $output $write $lines $bylines can be reset.

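A custom user dictionary can be defined

ShowDictPath()  # show the dictionary path
EditDict()      # edit the user dictionary
?EditDict()     # open the help page

Usage
edit_dict(name = "user") # described as outdated for the version used in this article
EditDict(name = "user") 
Arguments
name    the name of the dictionary: one of user, system, stop_word.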

Part-of-speech tagging

cutter = worker(type = "tag")
cutter_words <- cutter <= "我爱北京天安门"
cutter_words
       r        v       ns       ns 
     "我"     "爱"     "北京"     "天安门" 
# "我": pronoun (r); "爱": verb (v); "北京" and "天安门": place names (ns)

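Keyword extraction

Keyword extraction relies on inverse document frequency (IDF). The IDF corpus can be switched to the path of a custom corpus, and usage is similar to segmentation. The topn parameter sets the number of keywords returned.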

cutter = worker(type = "keywords", topn = 2)
cutter_words <- cutter <= "我爱北京天安门"
cutter_words
  8.9954   4.6674 
  "天安门"   "北京"
# Under the IDF weighting, "我" and "爱" score too low, so with topn = 2 they are filtered out
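
The ranking behind this result can be sketched in base R as term frequency times IDF. It is a simplified illustration, not the package's code: the IDF values are invented, the function name `top_keywords` is hypothetical, and falling back to the median IDF for unknown words is an assumption of this sketch.

```r
# Hypothetical IDF values; common words get low values, rare words high ones
idf <- c("我" = 1.0, "爱" = 2.0, "北京" = 4.7, "天安门" = 9.0)

top_keywords <- function(words, idf, topn = 2) {
  tf <- table(words)                           # term frequency per distinct word
  word_idf <- idf[names(tf)]
  word_idf[is.na(word_idf)] <- median(idf)     # assumed fallback for unknown words
  score <- setNames(as.numeric(tf) * unname(word_idf), names(tf))
  head(sort(score, decreasing = TRUE), topn)   # keep the topn highest scores
}

kw <- top_keywords(c("我", "爱", "北京", "天安门"), idf)
print(kw)
```

Each word appears once here, so the score equals the IDF value and the two rarest words are kept.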

Simhash computation

cutter = worker(type = "simhash", topn = 2)
cutter_words <- cutter <= "我爱北京天安门"
cutter_words
$simhash
[1] "4352745221754575559"

$keyword
  8.9954   4.6674 
"天安门"   "北京"

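
How a simhash fingerprint is formed from weighted keywords, and how two fingerprints are compared, can be sketched as follows. This is a toy version under stated assumptions: `toy_hash_bits` is a made-up 16-bit hash written only for this example (a real simhash uses a 64-bit hash), so the fingerprints will not match the `$simhash` value printed by the package.

```r
# Toy hash: maps a word to nbits bits (least significant bit first).
# Invented for illustration; not the hash used by any real library.
toy_hash_bits <- function(word, nbits = 16) {
  h <- 0
  for (cp in utf8ToInt(word)) h <- (h * 31 + cp) %% 2^nbits
  as.integer((h %/% 2^(0:(nbits - 1))) %% 2)
}

# Each keyword votes +weight on bits set in its hash and -weight on the others;
# the fingerprint keeps the sign of the total vote per bit.
simhash_bits <- function(keywords, weights, nbits = 16) {
  votes <- numeric(nbits)
  for (k in seq_along(keywords)) {
    bits <- toy_hash_bits(keywords[k], nbits)
    votes <- votes + weights[k] * ifelse(bits == 1, 1, -1)
  }
  as.integer(votes > 0)
}

# Hamming distance: number of differing bits; smaller means more similar text
hamming <- function(a, b) sum(a != b)

fp1 <- simhash_bits(c("天安门", "北京"), c(9.0, 4.7))
fp2 <- simhash_bits(c("天安门", "广场"), c(9.0, 6.5))
print(hamming(fp1, fp2))
```

Because heavier keywords dominate the vote, texts that share their top keywords tend to produce fingerprints with a small Hamming distance.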
