Everyday operations: Elasticsearch tokenization and installing an analyzer plugin
Tokenization:
POST _analyze
{
  "analyzer": "standard",
  "text": "Today is what sunny."
}
{
"tokens" : [
{
"token" : "today",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "is",
"start_offset" : 6,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "what",
"start_offset" : 9,
"end_offset" : 13,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "sunny",
"start_offset" : 14,
"end_offset" : 19,
"type" : "<ALPHANUM>",
"position" : 3
}
]
}
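The requests in this article are written for the Kibana Dev Tools console. If you are not using Kibana, the same call can be sent with curl; the host and port below are assumptions, adjust them to your cluster:
curl -X POST "http://localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "analyzer": "standard",
  "text": "Today is what sunny."
}'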
The default standard analyzer is designed for English. Try it on Chinese text:
POST _analyze
{
  "analyzer": "standard",
  "text": "我是中国人."
}
The result shows that Chinese words are not segmented; every character becomes its own token:
{
"tokens" : [
{
"token" : "我",
"start_offset" : 0,
"end_offset" : 1,
"type" : "<IDEOGRAPHIC>",
"position" : 0
},
{
"token" : "是",
"start_offset" : 1,
"end_offset" : 2,
"type" : "<IDEOGRAPHIC>",
"position" : 1
},
{
"token" : "中",
"start_offset" : 2,
"end_offset" : 3,
"type" : "<IDEOGRAPHIC>",
"position" : 2
},
{
"token" : "国",
"start_offset" : 3,
"end_offset" : 4,
"type" : "<IDEOGRAPHIC>",
"position" : 3
},
{
"token" : "人",
"start_offset" : 4,
"end_offset" : 5,
"type" : "<IDEOGRAPHIC>",
"position" : 4
}
]
}
Using a plugin: the IK analyzer
Download the IK analyzer:
https://github.com/medcl/elasticsearch-analysis-ik/releases
Download the zip that matches your Elasticsearch version, unzip it, and place it in a folder under the plugins directory of the Elasticsearch installation.
Grant permissions on that folder:
chmod -R 777 ik
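A minimal sketch of these steps, assuming Elasticsearch 7.4.2 (the version shown in the plugin listing later in this article) and the host plugins directory /mydata/elasticsearch/plugins; the exact asset URL may differ, so check the releases page above:
cd /mydata/elasticsearch/plugins
mkdir ik && cd ik
# download the release that matches your Elasticsearch version
wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.4.2/elasticsearch-analysis-ik-7.4.2.zip
unzip elasticsearch-analysis-ik-7.4.2.zip && rm elasticsearch-analysis-ik-7.4.2.zip
# grant permissions as described above
chmod -R 777 .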
Enter the container (only needed if Elasticsearch runs in Docker; for a non-Docker install, skip straight to the next step), for example as shown below.
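A minimal sketch of entering the container, assuming it is named elasticsearch (the name used in the restart command later in this article):
docker exec -it elasticsearch /bin/bash
cd /usr/share/elasticsearch/bin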
The IK analyzer is a plugin, so let's check that it is installed correctly.
Locate elasticsearch-plugin; in the Docker image it is at:
/usr/share/elasticsearch/bin/elasticsearch-plugin
Run:
elasticsearch-plugin
which prints:
Option         Description
------         -----------
-h, --help     show help
-s, --silent   show minimal output
-v, --verbose  show verbose output
ERROR: Missing command
Check the help for the available commands:
[root@6a850788e223 bin]# elasticsearch-plugin -h
A tool for managing installed elasticsearch plugins

Commands
--------
list - Lists installed elasticsearch plugins
install - Install a plugin
remove - removes a plugin from Elasticsearch
List the installed plugins:
[root@6a850788e223 bin]# elasticsearch-plugin list
ik
The output shows ik, so the plugin is installed successfully.
After installation, restart the service so the plugin is loaded.
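For a Docker install, the restart could look like this (the container name elasticsearch matches the restart command shown later in this article):
docker restart elasticsearch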
Test the IK analyzer:
Smart segmentation (ik_smart):
POST _analyze
{
  "analyzer": "ik_smart",
  "text": "我是中国人."
}
Result:
{
"tokens" : [
{
"token" : "我",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_CHAR",
"position" : 0
},
{
"token" : "是",
"start_offset" : 1,
"end_offset" : 2,
"type" : "CN_CHAR",
"position" : 1
},
{
"token" : "中国人",
"start_offset" : 2,
"end_offset" : 5,
"type" : "CN_WORD",
"position" : 2
}
]
}
Maximum word combinations (ik_max_word):
POST _analyze
{
  "analyzer": "ik_max_word",
  "text": "我是中国人."
}
Result:
{
"tokens" : [
{
"token" : "我",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_CHAR",
"position" : 0
},
{
"token" : "是",
"start_offset" : 1,
"end_offset" : 2,
"type" : "CN_CHAR",
"position" : 1
},
{
"token" : "中国人",
"start_offset" : 2,
"end_offset" : 5,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "中国",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 3
},
{
"token" : "国人",
"start_offset" : 3,
"end_offset" : 5,
"type" : "CN_WORD",
"position" : 4
}
]
}
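A common pattern is to index with ik_max_word (fine-grained, better recall) and search with ik_smart (coarser, fewer tokens). A minimal sketch of such a mapping; the index name my_index and field name content are hypothetical:
PUT /my_index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}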
Creating custom tokens with an extension dictionary:
Appendix: set up an Nginx server (it will serve the remote dictionary file).
Start a throwaway container just to copy out the default configuration:
docker run -p 80:80 --name nginx -d nginx:1.10
mkdir -p /mydata/nginx1.10
docker container cp nginx:/etc/nginx /mydata/nginx1.10/conf
Remove the temporary container:
docker stop nginx
docker rm nginx
Run the real container with host-mounted html, logs, and conf directories:
docker run -p 80:80 --name nginx1.10 \
  -v /mydata/nginx1.10/html:/usr/share/nginx/html \
  -v /mydata/nginx1.10/logs:/var/log/nginx \
  -v /mydata/nginx1.10/conf:/etc/nginx \
  -d nginx:1.10
Place a sample dictionary file in the nginx web root, under:
/mydata/nginx1.10/html/es
[root@bogon es]# ls
fenci.txt
fenci.txt contains one custom word per line:
赵一
钱二
孙三
李四
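A minimal sketch of creating and checking the file; the directory, file name, and IP address are the ones used in this article, and the file should be saved as UTF-8:
mkdir -p /mydata/nginx1.10/html/es
cd /mydata/nginx1.10/html/es
# one custom word per line
cat > fenci.txt <<EOF
赵一
钱二
孙三
李四
EOF
# confirm nginx serves it over HTTP (this URL goes into the IK config below)
curl http://192.168.31.125/es/fenci.txt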
Configuring the IK analyzer:
The config directory sits inside the ik folder under the Elasticsearch plugins directory:
[root@bogon ik]# pwd
/mydata/elasticsearch/plugins/ik
[root@bogon ik]# ls
commons-codec-1.9.jar config httpclient-4.5.2.jar plugin-descriptor.properties
commons-logging-1.2.jar elasticsearch-analysis-ik-7.4.2.jar httpcore-4.4.4.jar plugin-security.policy
[root@bogon ik]# cd config/
[root@bogon config]# ls
extra_main.dic extra_single_word_full.dic extra_stopword.dic main.dic quantifier.dic suffix.dic
extra_single_word.dic extra_single_word_low_freq.dic IKAnalyzer.cfg.xml preposition.dic stopword.dic surname.dic
Edit the configuration file:
vim IKAnalyzer.cfg.xml
The original configuration file:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- users can configure their own extension dictionary here -->
    <entry key="ext_dict"></entry>
    <!-- users can configure their own extension stopword dictionary here -->
    <entry key="ext_stopwords"></entry>
    <!-- users can configure a remote extension dictionary here -->
    <!-- <entry key="remote_ext_dict">words_location</entry> -->
    <!-- users can configure a remote extension stopword dictionary here -->
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
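For reference, the ext_dict entry above can also point at a local dictionary file placed in this same config directory (one word per line, restart required); custom.dic is a hypothetical file name:
<entry key="ext_dict">custom.dic</entry>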
Modify it to enable the remote extension dictionary, pointing it at the file served by nginx (replace the IP with your nginx host):
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- users can configure their own extension dictionary here -->
    <entry key="ext_dict"></entry>
    <!-- users can configure their own extension stopword dictionary here -->
    <entry key="ext_stopwords"></entry>
    <!-- users can configure a remote extension dictionary here -->
    <entry key="remote_ext_dict">http://192.168.31.125/es/fenci.txt</entry>
    <!-- users can configure a remote extension stopword dictionary here -->
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
Save the change and restart Elasticsearch. (After this restart, IK reloads the remote dictionary periodically, so later edits to fenci.txt take effect without another restart.)
[root@bogon config]# docker restart elasticsearch
elasticsearch
Test the IK analyzer with the custom dictionary:
POST _analyze
{
  "analyzer": "ik_max_word",
  "text": "赵一钱二孙三李四."
}
Result: the custom words are now recognized as single tokens:
{
"tokens" : [
{
"token" : "赵一",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "一钱",
"start_offset" : 1,
"end_offset" : 3,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "一",
"start_offset" : 1,
"end_offset" : 2,
"type" : "TYPE_CNUM",
"position" : 2
},
{
"token" : "钱二",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 3
},
{
"token" : "钱",
"start_offset" : 2,
"end_offset" : 3,
"type" : "COUNT",
"position" : 4
},
{
"token" : "二",
"start_offset" : 3,
"end_offset" : 4,
"type" : "TYPE_CNUM",
"position" : 5
},
{
"token" : "孙三",
"start_offset" : 4,
"end_offset" : 6,
"type" : "CN_WORD",
"position" : 6
},
{
"token" : "三",
"start_offset" : 5,
"end_offset" : 6,
"type" : "TYPE_CNUM",
"position" : 7
},
{
"token" : "李四",
"start_offset" : 6,
"end_offset" : 8,
"type" : "CN_WORD",
"position" : 8
},
{
"token" : "四",
"start_offset" : 7,
"end_offset" : 8,
"type" : "TYPE_CNUM",
"position" : 9
}
]
}
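Once an index maps a field with the IK analyzer (as in the mapping sketch earlier), the same check can be run per field; my_index and content are the hypothetical names from that sketch:
GET /my_index/_analyze
{
  "field": "content",
  "text": "赵一钱二孙三李四"
}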