Elasticsearch
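Accuracy of aggregations
Approximate statistical algorithms in distributed systems
Execution flow of a Min aggregation
Return values of a Terms aggregation
Execution flow of a Terms aggregation
An example of inaccurate Terms results
Fixing inaccurate Terms results: increase the shard_size parameter
Enable show_term_doc_count_error
Setting shard_size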
- Increase shard_size to lower doc_count_error_upper_bound and improve accuracy
- Each shard returns more candidate terms, so the overall computation grows: accuracy improves, but responses become slower
- Default shard_size
- shard_size = size * 1.5 + 10
- https://www.elastic.co/guide/en/elasticsearch/reference/7.1/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-approximate-counts
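The effect of shard_size can be illustrated with a small simulation. The sketch below is not Elasticsearch code: `terms_agg`, `default_shard_size`, and the toy shard contents are hypothetical names and data made up for illustration. It assumes each shard returns only its local top shard_size terms, that the coordinating node sums what it receives, and that the error bound is the sum of the last returned term's count on every shard that truncated its list.

```python
from collections import Counter


def default_shard_size(size):
    """Default from the formula above: shard_size = size * 1.5 + 10."""
    return int(size * 1.5 + 10)


def terms_agg(shards, size, shard_size=None):
    """Simulate a terms aggregation over per-shard lists of term values.

    Returns (top buckets, doc_count_error_upper_bound).
    """
    if shard_size is None:
        shard_size = default_shard_size(size)
    merged = Counter()
    error_upper_bound = 0
    for docs in shards:
        # Each shard only reports its local top `shard_size` terms.
        top = Counter(docs).most_common(shard_size)
        merged.update(dict(top))
        # A term the shard left out can have at most as many docs
        # as the last term that shard did return.
        if len(set(docs)) > shard_size:
            error_upper_bound += top[-1][1]
    return merged.most_common(size), error_upper_bound


# Toy data: "B" is the true winner (8 docs) but is never first on any shard.
shards = [
    ["A"] * 5 + ["B"] * 4,
    ["C"] * 6 + ["B"] * 4,
]

print(terms_agg(shards, size=1, shard_size=1))  # inaccurate: B is missed
print(terms_agg(shards, size=1))                # default shard_size: exact
```

With shard_size=1 the merged result reports C with 6 documents and an error bound of 11, while the true top term is B with 8. With the default shard_size every term reaches the coordinating node and the bound drops to 0.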
Demo API
# Recreate the flights index with 20 shards so that per-shard truncation becomes visible
DELETE my_flights
PUT my_flights
{
"settings": {
"number_of_shards": 20
},
"mappings" : {
"properties" : {
"AvgTicketPrice" : {
"type" : "float"
},
"Cancelled" : {
"type" : "boolean"
},
"Carrier" : {
"type" : "keyword"
},
"Dest" : {
"type" : "keyword"
},
"DestAirportID" : {
"type" : "keyword"
},
"DestCityName" : {
"type" : "keyword"
},
"DestCountry" : {
"type" : "keyword"
},
"DestLocation" : {
"type" : "geo_point"
},
"DestRegion" : {
"type" : "keyword"
},
"DestWeather" : {
"type" : "keyword"
},
"DistanceKilometers" : {
"type" : "float"
},
"DistanceMiles" : {
"type" : "float"
},
"FlightDelay" : {
"type" : "boolean"
},
"FlightDelayMin" : {
"type" : "integer"
},
"FlightDelayType" : {
"type" : "keyword"
},
"FlightNum" : {
"type" : "keyword"
},
"FlightTimeHour" : {
"type" : "keyword"
},
"FlightTimeMin" : {
"type" : "float"
},
"Origin" : {
"type" : "keyword"
},
"OriginAirportID" : {
"type" : "keyword"
},
"OriginCityName" : {
"type" : "keyword"
},
"OriginCountry" : {
"type" : "keyword"
},
"OriginLocation" : {
"type" : "geo_point"
},
"OriginRegion" : {
"type" : "keyword"
},
"OriginWeather" : {
"type" : "keyword"
},
"dayOfWeek" : {
"type" : "integer"
},
"timestamp" : {
"type" : "date"
}
}
}
}
# Copy the Kibana sample flight data into the new index
POST _reindex
{
"source": {
"index": "kibana_sample_data_flights"
},
"dest": {
"index": "my_flights"
}
}
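# Both indices should hold the same number of documents
GET kibana_sample_data_flights/_count
GET my_flights/_count
# Inspect a few sample documents
GET kibana_sample_data_flights/_search
# The sample index typically has a single primary shard, so the reported error should be 0
GET kibana_sample_data_flights/_search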
{
"size": 0,
"aggs": {
"weather": {
"terms": {
"field":"OriginWeather",
"size":5,
"show_term_doc_count_error":true
}
}
}
}
# 20 shards with shard_size 1: expect a non-zero doc_count_error_upper_bound
GET my_flights/_search
{
"size": 0,
"aggs": {
"weather": {
"terms": {
"field":"OriginWeather",
"size":1,
"shard_size":1,
"show_term_doc_count_error":true
}
}
}
}
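Objects and Nested Objects
Relationships in data
- The real world is full of important relationships
- Blog / author / comments
- A bank account has many transaction records
- A customer has multiple bank accounts
- A directory contains multiple files and subdirectories
Normalized design in relational databases
Denormalization
- Denormalized design
- Data "flattening": instead of using relationships, store redundant copies of the data inside each document
- Pros: no joins to process, so reads are fast
- Elasticsearch compresses the _source field to reduce the disk-space overhead
- Cons: a poor fit when data is modified frequently
- Changing one piece of data (e.g. a user name) may force updates to many documents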
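The trade-off listed above can be made concrete with a tiny in-memory model. This is only a sketch: the `posts` list, `read_post`, and `rename_user` are hypothetical stand-ins for an index of denormalized documents, not an Elasticsearch client API.

```python
# Hypothetical in-memory "index": every blog post embeds a copy of its author.
posts = [
    {"id": 1, "title": "Intro", "user": {"userid": 1, "username": "Jack"}},
    {"id": 2, "title": "Tips", "user": {"userid": 1, "username": "Jack"}},
    {"id": 3, "title": "Notes", "user": {"userid": 2, "username": "Rose"}},
]


def read_post(post_id):
    """One lookup returns the post together with its author: no join needed."""
    return next(p for p in posts if p["id"] == post_id)


def rename_user(userid, new_name):
    """Renaming a user must rewrite every document that holds a copy of the name."""
    touched = 0
    for p in posts:
        if p["user"]["userid"] == userid:
            p["user"]["username"] = new_name
            touched += 1
    return touched


print(rename_user(1, "John"))  # number of documents that had to be rewritten
print(read_post(2))
```

Reading stays a single lookup, but one logical change fans out to every document that carries the redundant copy. In Elasticsearch such a bulk rewrite would typically go through something like the _update_by_query API.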
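Handling relationships in Elasticsearch
- Relational databases generally normalize data; in Elasticsearch, data is usually denormalized
- Benefits of denormalizing: faster reads / no table joins / no row locks
- Elasticsearch is not good at handling relationships. Four approaches are commonly used
- Object type
- Nested object
- Parent / child relationship
- Application-side joins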
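Of the four approaches, the application-side join is the one that lives entirely in client code. A minimal sketch, assuming users and blog posts are kept in two separate indices and linked by a `userid` field; the lists below and `blogs_by_username` are hypothetical stand-ins for two search requests, not a real client API.

```python
# Hypothetical stand-ins for two separate indices: users and blog posts.
users = [
    {"userid": 1, "username": "Jack", "city": "Shanghai"},
    {"userid": 2, "username": "Rose", "city": "Beijing"},
]
blogs = [
    {"id": 1, "content": "I like Elasticsearch", "userid": 1},
    {"id": 2, "content": "Nested objects", "userid": 2},
    {"id": 3, "content": "Aggregations", "userid": 1},
]


def blogs_by_username(username):
    # Request 1: resolve the user name to user ids.
    ids = {u["userid"] for u in users if u["username"] == username}
    # Request 2: fetch the posts that reference those ids.
    return [b for b in blogs if b["userid"] in ids]


print(blogs_by_username("Jack"))
```

The user name is stored once, so a rename touches a single document, at the price of two round trips per search.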
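Case 1: a blog post and its author
Case 2: a document containing an array of objects
Why does the search return unwanted results?
- At index time the boundaries between inner objects are not preserved; the JSON is flattened into key-value pairs (e.g. actors.first_name: [Keanu, Dennis] and actors.last_name: [Reeves, Hopper])
- A query on several fields can therefore match values that come from different objects, which produces unexpected results
- The Nested data type solves this problem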
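The flattening described above can be reproduced in a few lines. The sketch below is a simplified model, not Elasticsearch internals: `flatten`, `match_flat`, and `match_nested` are hypothetical helpers, and exact equality stands in for a match query on keyword fields.

```python
def flatten(doc, prefix=""):
    """Flatten a document into dotted key -> list of values, dropping object boundaries."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}{key}"
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                for k, v in flatten(item, path + ".").items():
                    flat.setdefault(k, []).extend(v)
            else:
                flat.setdefault(path, []).append(item)
    return flat


def match_flat(doc, conditions):
    """Object-type behaviour: each condition is checked against the flattened values."""
    flat = flatten(doc)
    return all(value in flat.get(field, []) for field, value in conditions.items())


def match_nested(doc, path, conditions):
    """Nested behaviour: all conditions must hold inside the same inner object.

    Assumes every condition key starts with `path + "."`.
    """
    prefix = path + "."
    return any(
        all(obj.get(field[len(prefix):]) == value for field, value in conditions.items())
        for obj in doc.get(path, [])
    )


movie = {
    "title": "Speed",
    "actors": [
        {"first_name": "Keanu", "last_name": "Reeves"},
        {"first_name": "Dennis", "last_name": "Hopper"},
    ],
}
wanted = {"actors.first_name": "Keanu", "actors.last_name": "Hopper"}

print(flatten(movie))
print(match_flat(movie, wanted))              # matches across two different actors
print(match_nested(movie, "actors", wanted))  # no single actor satisfies both
```

The flattened form matches "Keanu" and "Hopper" although they belong to different actors; checking each inner object separately rejects the combination.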
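What is the Nested data type
Nested query
Nested aggregation
Key points of this section
- In Elasticsearch, data is usually modeled by denormalizing it (using objects)
- Benefits: faster reads / no table joins / no row locks
- If documents are not updated frequently, objects can be embedded in the document
- When a field holds an array of objects
- Use nested objects to keep query results correct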
Demo API
DELETE blog
# Define the mapping for the blog index
PUT /blog
{
"mappings": {
"properties": {
"content": {
"type": "text"
},
"time": {
"type": "date"
},
"user": {
"properties": {
"city": {
"type": "text"
},
"userid": {
"type": "long"
},
"username": {
"type": "keyword"
}
}
}
}
}
}
# Index one blog document
PUT blog/_doc/1
{
"content":"I like Elasticsearch",
"time":"2019-01-01T00:00:00",
"user":{
"userid":1,
"username":"Jack",
"city":"Shanghai"
}
}
# Search blogs by content and by a field of the inner user object
POST blog/_search
{
"query": {
"bool": {
"must": [
{"match": {"content": "Elasticsearch"}},
{"match": {"user.username": "Jack"}}
]
}
}
}
DELETE my_movies
# Mapping for movies: actors is a plain object field
PUT my_movies
{
"mappings" : {
"properties" : {
"actors" : {
"properties" : {
"first_name" : {
"type" : "keyword"
},
"last_name" : {
"type" : "keyword"
}
}
},
"title" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
# Index one movie with two actors
POST my_movies/_doc/1
{
"title":"Speed",
"actors":[
{
"first_name":"Keanu",
"last_name":"Reeves"
},
{
"first_name":"Dennis",
"last_name":"Hopper"
}
]
}
# Search the movie: Keanu + Hopper matches even though no single actor has that name
POST my_movies/_search
{
"query": {
"bool": {
"must": [
{"match": {"actors.first_name": "Keanu"}},
{"match": {"actors.last_name": "Hopper"}}
]
}
}
}
DELETE my_movies
# Create the mapping again with actors as a nested field
PUT my_movies
{
"mappings" : {
"properties" : {
"actors" : {
"type": "nested",
"properties" : {
"first_name" : {"type" : "keyword"},
"last_name" : {"type" : "keyword"}
}},
"title" : {
"type" : "text",
"fields" : {"keyword":{"type":"keyword","ignore_above":256}}
}
}
}
}
# Index the same movie into the nested mapping
POST my_movies/_doc/1
{
"title":"Speed",
"actors":[
{
"first_name":"Keanu",
"last_name":"Reeves"
},
{
"first_name":"Dennis",
"last_name":"Hopper"
}
]
}
# Nested query: Keanu + Hopper no longer matches, because both conditions must hold within one actor
POST my_movies/_search
{
"query": {
"bool": {
"must": [
{"match": {"title": "Speed"}},
{
"nested": {
"path": "actors",
"query": {
"bool": {
"must": [
{"match": {
"actors.first_name": "Keanu"
}},
{"match": {
"actors.last_name": "Hopper"
}}
]
}
}
}
}
]
}
}
}
# Nested Aggregation
POST my_movies/_search
{
"size": 0,
"aggs": {
"actors": {
"nested": {
"path": "actors"
},
"aggs": {
"actor_name": {
"terms": {
"field": "actors.first_name",
"size": 10
}
}
}
}
}
}
# A plain (non-nested) terms aggregation does not work on nested fields
POST my_movies/_search
{
"size": 0,
"aggs": {
"NAME": {
"terms": {
"field": "actors.first_name",
"size": 10
}
}
}
}