ES terms aggregation by date (year/month/day): why ES aggregations are inaccurate

Reposted

云端小仙童 2024-03-26 10:00:33


Elasticsearch

Aggregation accuracy

Approximate statistical algorithms in distributed systems


Execution flow of a Min aggregation

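To make the Min aggregation flow concrete, here is a minimal Python sketch. It is a toy model, not Elasticsearch code: it assumes each shard computes its own local minimum and the coordinating node then takes the minimum of those partial results. The shard contents and the function names `shard_min` and `coordinator_min` are made up for the illustration.

```python
# Toy model: each inner list holds the values of one numeric field on one shard.
shards = [
    [420.0, 310.5, 875.2],
    [199.9, 640.0],
    [505.3, 230.1, 990.0],
]

def shard_min(values):
    """Data-node step: compute the local minimum of one shard."""
    return min(values)

def coordinator_min(shard_results):
    """Coordinating-node step: merge the per-shard minimums."""
    return min(shard_results)

partial = [shard_min(s) for s in shards]
result = coordinator_min(partial)
print(partial, result)
```

The minimum of the per-shard minimums is always the global minimum, which is why this kind of metric stays exact no matter how the documents are spread across shards.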

Return values of a Terms aggregation


Execution flow of a Terms aggregation


An example where Terms returns incorrect results

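The kind of error discussed here can be reproduced with a small Python simulation. This is only a sketch under simplifying assumptions: each shard returns its own top `shard_size` terms, and the coordinating node sums what it receives and keeps the top `size`. The per-shard counts and the helper names `top_n` and `terms_agg` are hypothetical.

```python
from collections import Counter

# Hypothetical term counts on two shards.
shards = [
    Counter({"A": 6, "B": 5, "C": 4, "D": 3}),
    Counter({"A": 6, "D": 3, "B": 2, "C": 1}),
]

def top_n(counts, n):
    """Return the n most frequent terms, ties broken alphabetically."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

def terms_agg(shards, size, shard_size):
    """Merge only the top shard_size terms of each shard, then keep the top size."""
    merged = Counter()
    for counts in shards:
        for term, n in top_n(counts, shard_size):
            merged[term] += n
    return top_n(merged, size)

approx = terms_agg(shards, size=3, shard_size=3)
exact = top_n(sum(shards, Counter()), 3)
print("approximate:", approx)
print("exact:      ", exact)
```

In this made-up data set, D has 6 documents in total but is cut from the first shard's top 3, and C is cut from the second shard's. After the merge, C appears in the result with an undercounted total, while D is missing altogether.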

How to fix inaccurate Terms results: increase the shard_size parameter


Enabling show_term_doc_count_error


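Setting shard_size

Increase shard_size to lower doc_count_error_upper_bound and improve accuracy

This increases the total amount of computation: accuracy improves, but response time gets longer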
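The effect of shard_size on the error bound can be sketched in Python. This is a toy model with hypothetical per-shard counts, not Elasticsearch code. It assumes the bound is the sum of the document counts of the last term returned by each shard, and that a shard returning fewer than `shard_size` terms contributes 0, since it has nothing left to omit.

```python
from collections import Counter

# Hypothetical term counts on two shards.
shards = [
    Counter({"A": 6, "B": 5, "C": 4, "D": 3}),
    Counter({"A": 6, "D": 3, "B": 2, "C": 1}),
]

def top_n(counts, n):
    """Return the n most frequent terms, ties broken alphabetically."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

def terms_agg(shards, size, shard_size):
    """Return (buckets, error_upper_bound) for a simulated terms aggregation."""
    merged = Counter()
    error_upper_bound = 0
    for counts in shards:
        top = top_n(counts, shard_size)
        for term, n in top:
            merged[term] += n
        # A shard that filled its quota may have omitted terms; the count of
        # its last returned term bounds how many documents each omitted term has.
        if len(top) >= shard_size:
            error_upper_bound += top[-1][1]
    return top_n(merged, size), error_upper_bound

for shard_size in (3, 5):
    buckets, bound = terms_agg(shards, size=3, shard_size=shard_size)
    print(shard_size, buckets, bound)
```

With `shard_size=3` the sketch reports a bound of 6 and a wrong third bucket. With `shard_size=5` every shard returns all of its terms, so the bound drops to 0 and the buckets are exact, at the cost of shipping and merging more terms per shard.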

Default shard_size: size * 1.5 + 10

Demo API

DELETE my_flights
PUT my_flights
{
  "settings": {
    "number_of_shards": 20
  },
  "mappings" : {
      "properties" : {
        "AvgTicketPrice" : {
          "type" : "float"
        },
        "Cancelled" : {
          "type" : "boolean"
        },
        "Carrier" : {
          "type" : "keyword"
        },
        "Dest" : {
          "type" : "keyword"
        },
        "DestAirportID" : {
          "type" : "keyword"
        },
        "DestCityName" : {
          "type" : "keyword"
        },
        "DestCountry" : {
          "type" : "keyword"
        },
        "DestLocation" : {
          "type" : "geo_point"
        },
        "DestRegion" : {
          "type" : "keyword"
        },
        "DestWeather" : {
          "type" : "keyword"
        },
        "DistanceKilometers" : {
          "type" : "float"
        },
        "DistanceMiles" : {
          "type" : "float"
        },
        "FlightDelay" : {
          "type" : "boolean"
        },
        "FlightDelayMin" : {
          "type" : "integer"
        },
        "FlightDelayType" : {
          "type" : "keyword"
        },
        "FlightNum" : {
          "type" : "keyword"
        },
        "FlightTimeHour" : {
          "type" : "keyword"
        },
        "FlightTimeMin" : {
          "type" : "float"
        },
        "Origin" : {
          "type" : "keyword"
        },
        "OriginAirportID" : {
          "type" : "keyword"
        },
        "OriginCityName" : {
          "type" : "keyword"
        },
        "OriginCountry" : {
          "type" : "keyword"
        },
        "OriginLocation" : {
          "type" : "geo_point"
        },
        "OriginRegion" : {
          "type" : "keyword"
        },
        "OriginWeather" : {
          "type" : "keyword"
        },
        "dayOfWeek" : {
          "type" : "integer"
        },
        "timestamp" : {
          "type" : "date"
        }
      }
    }
}


POST _reindex
{
  "source": {
    "index": "kibana_sample_data_flights"
  },
  "dest": {
    "index": "my_flights"
  }
}

GET kibana_sample_data_flights/_count
GET my_flights/_count

GET kibana_sample_data_flights/_search


GET kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "weather": {
      "terms": {
        "field":"OriginWeather",
        "size":5,
        "show_term_doc_count_error":true
      }
    }
  }
}


GET my_flights/_search
{
  "size": 0,
  "aggs": {
    "weather": {
      "terms": {
        "field":"OriginWeather",
        "size":1,
        "shard_size":1,
        "show_term_doc_count_error":true
      }
    }
  }
}

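Objects and Nested objects

Relationships between data

The real world is full of important relationships

Blogs / authors / comments
A bank account has many transaction records
A customer has multiple bank accounts
A directory contains multiple files and subdirectories

Normalized design in relational databases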


Denormalization

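Denormalized design

Data is "flattened": instead of using relationships, redundant copies of the data are stored in each document

Advantage: no joins to process, so read performance is good

Elasticsearch compresses the _source field to reduce the disk space overhead

Disadvantage: not suitable when the data changes frequently

A change to a single piece of data (such as a user name) may require updating many documents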
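The update cost of denormalized data can be illustrated with a few lines of Python. The documents below and the `rename_user` helper are hypothetical; they simply model blog documents that each embed a copy of the author's name.

```python
# Hypothetical denormalized blog documents: each one embeds a copy of its author.
blogs = [
    {"id": 1, "content": "I like Elasticsearch", "user": {"userid": 1, "username": "Jack"}},
    {"id": 2, "content": "Aggregations in practice", "user": {"userid": 1, "username": "Jack"}},
    {"id": 3, "content": "Nested objects", "user": {"userid": 2, "username": "Rose"}},
]

def rename_user(docs, userid, new_name):
    """Rewrite every document that embeds the user; return how many were touched."""
    updated = 0
    for doc in docs:
        if doc["user"]["userid"] == userid:
            doc["user"]["username"] = new_name
            updated += 1
    return updated

changed = rename_user(blogs, userid=1, new_name="Jack_2019")
print(changed)
```

In a normalized schema the same rename would touch one row in a users table. Here it touches every blog document written by that user.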

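Handling relationships in Elasticsearch

In a relational database you usually normalize the data; in Elasticsearch you usually denormalize it

Benefits of denormalizing: faster reads / no table joins / no row locks

Elasticsearch is not good at handling relationships. There are four common ways to handle them

Object type
Nested objects (Nested Object)
Parent/child relationships (Parent / Child)
Application-side joins

Case 1: a blog post and its author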


Case 2: a document containing an array of objects


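Why does the search return unwanted results?

When the document is stored, the boundaries of the inner objects are not taken into account; the JSON is flattened into key-value pairs
As a result, a query on multiple fields can produce unexpected hits
The Nested data type solves this problem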
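The flattening described above can be mimicked in Python. This is a simplified sketch of the idea, not of Elasticsearch internals: the `flatten` and `matches_all` helpers are made up, and the matching is a plain equality test on the flattened values.

```python
movie = {
    "title": "Speed",
    "actors": [
        {"first_name": "Keanu", "last_name": "Reeves"},
        {"first_name": "Dennis", "last_name": "Hopper"},
    ],
}

def flatten(doc, prefix=""):
    """Flatten inner objects into dotted keys, collecting all values into lists."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}{key}"
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                for k, v in flatten(item, path + ".").items():
                    flat.setdefault(k, []).extend(v)
            else:
                flat.setdefault(path, []).append(item)
    return flat

def matches_all(flat_doc, conditions):
    """True if every field contains the requested value somewhere in its list."""
    return all(value in flat_doc.get(field, []) for field, value in conditions.items())

flat = flatten(movie)
hit = matches_all(flat, {"actors.first_name": "Keanu", "actors.last_name": "Hopper"})
print(flat)
print(hit)
```

Once flattened, `actors.first_name` and `actors.last_name` are two independent lists of values, so nothing records that "Keanu" belongs with "Reeves". A query for first name Keanu and last name Hopper therefore matches the document.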

What is the Nested data type?


Nested queries

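A nested query evaluates its conditions against one nested object at a time, because each nested object is indexed as its own hidden document. The Python sketch below imitates that behaviour. The `nested_match` helper is hypothetical and uses plain equality instead of real query matching.

```python
movie = {
    "title": "Speed",
    "actors": [
        {"first_name": "Keanu", "last_name": "Reeves"},
        {"first_name": "Dennis", "last_name": "Hopper"},
    ],
}

def nested_match(doc, path, conditions):
    """True only if a single object under `path` satisfies every condition."""
    for obj in doc.get(path, []):
        if all(obj.get(field) == value for field, value in conditions.items()):
            return True
    return False

wrong_pair = nested_match(movie, "actors", {"first_name": "Keanu", "last_name": "Hopper"})
right_pair = nested_match(movie, "actors", {"first_name": "Keanu", "last_name": "Reeves"})
print(wrong_pair, right_pair)
```

Since the conditions are checked per object, the mixed-up pair Keanu / Hopper no longer matches, while Keanu / Reeves still does.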

Nested aggregations


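Key points of this section

In Elasticsearch, data is usually modeled by denormalizing it (using objects)

The benefits: faster reads / no table joins / no row locks

If documents are not updated frequently, you can use objects inside documents
When a field holds an array of objects, use nested objects (Nested Object) so that queries stay correct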

Demo API

DELETE blog
# Define the mapping for the blog index
PUT /blog
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text"
      },
      "time": {
        "type": "date"
      },
      "user": {
        "properties": {
          "city": {
            "type": "text"
          },
          "userid": {
            "type": "long"
          },
          "username": {
            "type": "keyword"
          }
        }
      }
    }
  }
}


# Index one blog document
PUT blog/_doc/1
{
  "content":"I like Elasticsearch",
  "time":"2019-01-01T00:00:00",
  "user":{
    "userid":1,
    "username":"Jack",
    "city":"Shanghai"
  }
}


# Search for the blog document
POST blog/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {"content": "Elasticsearch"}},
        {"match": {"user.username": "Jack"}}
      ]
    }
  }
}


DELETE my_movies

# Mapping for movies (actors is a plain object field)
PUT my_movies
{
    "mappings" : {
      "properties" : {
        "actors" : {
          "properties" : {
            "first_name" : {
              "type" : "keyword"
            },
            "last_name" : {
              "type" : "keyword"
            }
          }
        },
        "title" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
}


# Index one movie document
POST my_movies/_doc/1
{
  "title":"Speed",
  "actors":[
    {
      "first_name":"Keanu",
      "last_name":"Reeves"
    },

    {
      "first_name":"Dennis",
      "last_name":"Hopper"
    }

  ]
}

# Search for the movie: this matches even though no single actor is named Keanu Hopper
POST my_movies/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {"actors.first_name": "Keanu"}},
        {"match": {"actors.last_name": "Hopper"}}
      ]
    }
  }

}

DELETE my_movies
# Create the mapping with actors as a nested field
PUT my_movies
{
    "mappings" : {
      "properties" : {
        "actors" : {
          "type": "nested",
          "properties" : {
            "first_name" : {"type" : "keyword"},
            "last_name" : {"type" : "keyword"}
          }},
        "title" : {
          "type" : "text",
          "fields" : {"keyword":{"type":"keyword","ignore_above":256}}
        }
      }
    }
}


POST my_movies/_doc/1
{
  "title":"Speed",
  "actors":[
    {
      "first_name":"Keanu",
      "last_name":"Reeves"
    },

    {
      "first_name":"Dennis",
      "last_name":"Hopper"
    }

  ]
}

# Nested query: returns no hits, because no single actor is named Keanu Hopper
POST my_movies/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {"title": "Speed"}},
        {
          "nested": {
            "path": "actors",
            "query": {
              "bool": {
                "must": [
                  {"match": {
                    "actors.first_name": "Keanu"
                  }},

                  {"match": {
                    "actors.last_name": "Hopper"
                  }}
                ]
              }
            }
          }
        }
      ]
    }
  }
}


# Nested Aggregation
POST my_movies/_search
{
  "size": 0,
  "aggs": {
    "actors": {
      "nested": {
        "path": "actors"
      },
      "aggs": {
        "actor_name": {
          "terms": {
            "field": "actors.first_name",
            "size": 10
          }
        }
      }
    }
  }
}


# A plain terms aggregation does not work on a nested field (it returns no buckets)
POST my_movies/_search
{
  "size": 0,
  "aggs": {
    "NAME": {
      "terms": {
        "field": "actors.first_name",
        "size": 10
      }
    }
  }
}
