IK analyzer

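Source repository: https://github.com/medcl/elasticsearch-analysis-ik

Pre-built packages are available on the releases page: https://github.com/medcl/elasticsearch-analysis-ik/releases
A release package can be used directly after download, so downloading from this page is recommended.

Download the analyzer version that matches your Elasticsearch version.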


Open the corresponding page and download the package.


Locate the downloaded file, right-click it, and extract it to the current folder.


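If you downloaded the source code rather than a pre-built release, open the extracted folder, start a command prompt (cmd) there, and build the package with Maven.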


Run the packaging command. Maven must already be installed.


Command:

mvn package

When the build finishes, a target folder appears in the current directory. Open it.


Open the releases folder inside it.


Right-click the zip file there and extract it to the current folder.


Open the extracted folder and copy all of its files.


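Go to the Elasticsearch installation directory, create a folder named ik under plugins (any name works; pick one that is easy to remember), and paste the copied files into it.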
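The copy step can also be scripted. The sketch below is only an illustration: install_plugin is a made-up helper name, the paths in the usage comment are placeholders, and it assumes the plugin files sit at the top level of the release zip (if your archive wraps them in a folder, move them up one level afterwards).

```python
import zipfile
from pathlib import Path


def install_plugin(release_zip, es_home, name="ik"):
    """Extract a plugin release zip into <es_home>/plugins/<name>.

    Hypothetical helper: assumes the zip holds the plugin files directly,
    not inside an extra top-level folder.
    """
    target = Path(es_home) / "plugins" / name
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(release_zip) as archive:
        archive.extractall(target)
    return target


# Example call (placeholder paths, adjust to your machine):
# install_plugin("target/releases/<plugin>.zip", "/path/to/elasticsearch")
```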

Pinyin analyzer

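Source repository: https://github.com/medcl/elasticsearch-analysis-pinyin

Pre-built packages are available on the releases page: https://github.com/medcl/elasticsearch-analysis-pinyin/releases

The download and installation steps are exactly the same as for the IK analyzer; follow the steps above.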

Final result: the plugins folder now contains one folder for each analyzer. Restart Elasticsearch so that it loads the new plugins.

Testing the analyzers

Built-in Elasticsearch analyzer

GET http://localhost:9200/_analyze?pretty=true
{
  "analyzer": "standard",
  "text": "我是一名java程序员"
}

The result is:

{
  "tokens": [
    {"token": "我", "start_offset": 0, "end_offset": 1, "type": "<IDEOGRAPHIC>", "position": 0},
    {"token": "是", "start_offset": 1, "end_offset": 2, "type": "<IDEOGRAPHIC>", "position": 1},
    {"token": "一", "start_offset": 2, "end_offset": 3, "type": "<IDEOGRAPHIC>", "position": 2},
    {"token": "名", "start_offset": 3, "end_offset": 4, "type": "<IDEOGRAPHIC>", "position": 3},
    {"token": "java", "start_offset": 4, "end_offset": 8, "type": "<ALPHANUM>", "position": 4},
    {"token": "程", "start_offset": 8, "end_offset": 9, "type": "<IDEOGRAPHIC>", "position": 5},
    {"token": "序", "start_offset": 9, "end_offset": 10, "type": "<IDEOGRAPHIC>", "position": 6},
    {"token": "员", "start_offset": 10, "end_offset": 11, "type": "<IDEOGRAPHIC>", "position": 7}
  ]
}
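The same _analyze call can also be issued from a script, which is handy when comparing several analyzers. This is a minimal sketch using only Python's standard library. The helper names build_analyze_request, token_texts and analyze are made up for illustration, and the URL assumes the local node used in this post.

```python
import json
import urllib.request


def build_analyze_request(analyzer, text, base_url="http://localhost:9200"):
    """Build (but do not send) a request for the _analyze API."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/_analyze?pretty=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",  # _analyze accepts POST as well as GET
    )


def token_texts(response):
    """Pull just the token strings out of a parsed _analyze response."""
    return [item["token"] for item in response["tokens"]]


def analyze(analyzer, text):
    """Send the request; needs a running node, otherwise urlopen raises."""
    request = build_analyze_request(analyzer, text)
    with urllib.request.urlopen(request) as reply:
        return token_texts(json.load(reply))


# Example call against a local node:
# analyze("standard", "我是一名java程序员")
```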

Using the ik_max_word analyzer

ik_max_word: splits the text at the finest granularity, producing as many words as possible.

GET http://localhost:9200/_analyze?pretty=true
{
  "analyzer": "ik_max_word",
  "text": "我是一名java程序员"
}

The result is:

{
  "tokens": [
    {"token": "我", "start_offset": 0, "end_offset": 1, "type": "CN_CHAR", "position": 0},
    {"token": "是", "start_offset": 1, "end_offset": 2, "type": "CN_CHAR", "position": 1},
    {"token": "一名", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 2},
    {"token": "一", "start_offset": 2, "end_offset": 3, "type": "TYPE_CNUM", "position": 3},
    {"token": "名", "start_offset": 3, "end_offset": 4, "type": "COUNT", "position": 4},
    {"token": "java", "start_offset": 4, "end_offset": 8, "type": "ENGLISH", "position": 5},
    {"token": "程序员", "start_offset": 8, "end_offset": 11, "type": "CN_WORD", "position": 6},
    {"token": "程序", "start_offset": 8, "end_offset": 10, "type": "CN_WORD", "position": 7},
    {"token": "员", "start_offset": 10, "end_offset": 11, "type": "CN_CHAR", "position": 8}
  ]
}

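Using the ik_smart analyzer

ik_smart: splits the text at the coarsest granularity; characters already assigned to one word are not reused in another word.

GET http://localhost:9200/_analyze?pretty=true
{
  "analyzer": "ik_smart",
  "text": "我是一名java程序员"
}

The result is: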

{
  "tokens": [
    {"token": "我", "start_offset": 0, "end_offset": 1, "type": "CN_CHAR", "position": 0},
    {"token": "是", "start_offset": 1, "end_offset": 2, "type": "CN_CHAR", "position": 1},
    {"token": "一名", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 2},
    {"token": "java", "start_offset": 4, "end_offset": 8, "type": "ENGLISH", "position": 3},
    {"token": "程序员", "start_offset": 8, "end_offset": 11, "type": "CN_WORD", "position": 4}
  ]
}
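The difference between the two IK modes shows up in the offsets: ik_max_word emits shorter tokens that sit inside longer ones, while ik_smart emits none. The sketch below checks this on the token offsets listed in this post. nested_tokens is a made-up helper for illustration, not part of the plugin.

```python
def nested_tokens(tokens):
    """Return the tokens whose span lies inside a longer token's span.

    Each token is a (text, start_offset, end_offset) tuple.
    """
    nested = []
    for text, start, end in tokens:
        for _, other_start, other_end in tokens:
            contains = other_start <= start and end <= other_end
            longer = (other_end - other_start) > (end - start)
            if contains and longer:
                nested.append(text)
                break
    return nested


# Offsets copied from the two responses in this post.
ik_max_word = [
    ("我", 0, 1), ("是", 1, 2), ("一名", 2, 4), ("一", 2, 3), ("名", 3, 4),
    ("java", 4, 8), ("程序员", 8, 11), ("程序", 8, 10), ("员", 10, 11),
]
ik_smart = [
    ("我", 0, 1), ("是", 1, 2), ("一名", 2, 4), ("java", 4, 8), ("程序员", 8, 11),
]

print(nested_tokens(ik_max_word))  # ['一', '名', '程序', '员']
print(nested_tokens(ik_smart))     # []
```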

Using the pinyin analyzer

GET http://localhost:9200/_analyze?pretty=true
{
  "analyzer": "pinyin",
  "text": "我是一名java程序员"
}

The result is:

{
  "tokens": [
    {"token": "wo", "start_offset": 0, "end_offset": 1, "type": "word", "position": 0},
    {"token": "wsymjavacxy", "start_offset": 0, "end_offset": 11, "type": "word", "position": 0},
    {"token": "shi", "start_offset": 1, "end_offset": 2, "type": "word", "position": 1},
    {"token": "yi", "start_offset": 2, "end_offset": 3, "type": "word", "position": 2},
    {"token": "ming", "start_offset": 3, "end_offset": 4, "type": "word", "position": 3},
    {"token": "ja", "start_offset": 4, "end_offset": 6, "type": "word", "position": 4},
    {"token": "v", "start_offset": 6, "end_offset": 7, "type": "word", "position": 5},
    {"token": "a", "start_offset": 7, "end_offset": 8, "type": "word", "position": 6},
    {"token": "cheng", "start_offset": 8, "end_offset": 9, "type": "word", "position": 7},
    {"token": "xu", "start_offset": 9, "end_offset": 10, "type": "word", "position": 8},
    {"token": "yuan", "start_offset": 10, "end_offset": 11, "type": "word", "position": 9}
  ]
}
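The long token wsymjavacxy in this output looks like a first-letter abbreviation of the whole sentence: one letter per Chinese character, with the Latin word java kept whole. The sketch below only reproduces that observed output for this one sentence. It is not the plugin's implementation, and first_letter_token is a made-up name.

```python
def first_letter_token(segments):
    """Join the first letter of each pinyin syllable; keep Latin text whole.

    Each segment is a (text, is_pinyin) pair.
    """
    return "".join(text[0] if is_pinyin else text for text, is_pinyin in segments)


# Syllables taken from the response above; "java" is the original Latin word.
segments = [
    ("wo", True), ("shi", True), ("yi", True), ("ming", True),
    ("java", False),
    ("cheng", True), ("xu", True), ("yuan", True),
]

print(first_letter_token(segments))  # wsymjavacxy
```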

Configuring IK together with pinyin

Create the index and type

In the request body below, ik_pinyin_analyzer is a user-chosen analyzer name, "type": "custom" declares a custom analyzer, tokenizer selects the segmentation strategy, and filter post-processes the tokens with the pinyin filter and the word_delimiter filter.

PUT http://localhost:9200/demo

{
  "settings": {
    "analysis": {
      "analyzer": {
        "ik_pinyin_analyzer": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": ["my_pinyin", "word_delimiter"]
        }
      },
      "filter": {
        "my_pinyin": {
          "type": "pinyin",
          "first_letter": "prefix",
          "padding_char": " "
        }
      }
    }
  },
  "mappings": {
    "article": {
      "properties": {
        "subject": {
          "type": "keyword",
          "fields": {
            "pinyin": {
              "type": "text",
              "store": false,
              "term_vector": "with_positions_offsets",
              "analyzer": "ik_pinyin_analyzer",
              "boost": 10
            }
          }
        }
      }
    }
  }
}

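A deeply nested body like this one is easy to get wrong by hand (an unclosed brace is enough for the request to be rejected). One option is to build it as a Python dict and let json.dumps handle the syntax. This is a sketch under assumptions: build_index_body is a made-up helper, and "store" is written as the boolean False on the assumption that the field should not be stored separately.

```python
import json


def build_index_body(analyzer_name="ik_pinyin_analyzer", tokenizer="ik_max_word"):
    """Return the settings/mappings body for the demo index as a dict."""
    pinyin_field = {
        "type": "text",
        "store": False,  # assumption: boolean form of "do not store"
        "term_vector": "with_positions_offsets",
        "analyzer": analyzer_name,
        "boost": 10,
    }
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    analyzer_name: {
                        "type": "custom",
                        "tokenizer": tokenizer,
                        "filter": ["my_pinyin", "word_delimiter"],
                    }
                },
                "filter": {
                    "my_pinyin": {
                        "type": "pinyin",
                        "first_letter": "prefix",
                        "padding_char": " ",
                    }
                },
            }
        },
        "mappings": {
            "article": {
                "properties": {
                    "subject": {
                        "type": "keyword",
                        "fields": {"pinyin": pinyin_field},
                    }
                }
            }
        },
    }


# Serialise to strict JSON, ready to use as the PUT request body.
print(json.dumps(build_index_body(), ensure_ascii=False, indent=2))
```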
Index a document

POST http://localhost:9200/demo/article

{
  "subject": "我是一名java程序员"
}

Searching in Chinese

POST http://localhost:9200/demo/article/_search

{
  "query": {
    "match": {
      "subject.pinyin": "程序员"
    }
  }
}

The result is:

{
  "took": 7,
  "timed_out": false,
  "_shards": {"total": 5, "successful": 5, "skipped": 0, "failed": 0},
  "hits": {
    "total": 1,
    "max_score": 14.584841,
    "hits": [
      {
        "_index": "demo",
        "_type": "article",
        "_id": "AWIeeeTJ2JGj7w9eQwEK",
        "_score": 14.584841,
        "_source": {"subject": "我是一名java程序员"}
      }
    ]
  }
}

Searching in pinyin

POST http://localhost:9200/demo/article/_search

{
  "query": {
    "match": {
      "subject.pinyin": "chengxuyuan"
    }
  }
}

The result is:

{
  "took": 6,
  "timed_out": false,
  "_shards": {"total": 5, "successful": 5, "skipped": 0, "failed": 0},
  "hits": {
    "total": 1,
    "max_score": 4.3648314,
    "hits": [
      {
        "_index": "demo",
        "_type": "article",
        "_id": "AWIeeeTJ2JGj7w9eQwEK",
        "_score": 4.3648314,
        "_source": {"subject": "我是一名java程序员"}
      }
    ]
  }
}

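Note: with this mapping, searches must target the subject.pinyin sub-field. A query such as 程序员 or chengxuyuan against the original subject field returns no results, because subject is mapped as keyword and therefore only matches the exact, complete value.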
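To keep the .pinyin suffix from being forgotten, the query body can be produced by a small helper that targets the sub-field by default. This is a sketch only: match_pinyin and subjects are made-up names, and the sample response is an abridged copy of the search result shown in this post.

```python
def match_pinyin(keyword, field="subject.pinyin"):
    """Build a match query against the pinyin sub-field."""
    return {"query": {"match": {field: keyword}}}


def subjects(response):
    """Collect the subject of every hit in a parsed search response."""
    return [hit["_source"]["subject"] for hit in response["hits"]["hits"]]


# Abridged copy of the search response from this post.
sample_response = {
    "hits": {
        "total": 1,
        "hits": [{"_index": "demo", "_source": {"subject": "我是一名java程序员"}}],
    }
}

print(match_pinyin("chengxuyuan"))
print(subjects(sample_response))  # ['我是一名java程序员']
```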
