Querying Elasticsearch indices on Linux: using curl to query the data in an ES index

Reposted

mob64ca141834d3 2024-07-30 11:51:06

After completing this topic, you should be able to:

Configure and test analyzers
Manage documents
Understand the routing rules

Analyzers

Understanding analyzers

Analyzer

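In ES, an analyzer is composed of the following three kinds of components:

character filter: pre-processes the raw text as a stream of characters, for example stripping HTML tags, before handing it to the tokenizer. An analyzer may contain zero or more character filters; they are applied in the configured order.
tokenizer: splits the text into tokens. An analyzer must contain exactly one tokenizer.
token filter: post-processes the tokens emitted by the tokenizer, for example lowercasing, stop-word removal, or synonym handling. An analyzer may contain zero or more token filters, applied in the configured order.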
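The three stages can be pictured with a small Python sketch. This is a toy model under stated assumptions, not how Elasticsearch implements analysis: the function names (html_strip, whitespace_tokenizer, lowercase, stop, analyze) and the stop-word list are made up for illustration and only loosely mimic the built-in components they are named after.

```python
import re
from html import unescape


def html_strip(text):
    """Toy character filter: drop HTML tags, then decode entities."""
    return unescape(re.sub(r"<[^>]+>", "", text))


def whitespace_tokenizer(text):
    """Toy tokenizer: split on runs of whitespace."""
    return text.split()


def lowercase(tokens):
    """Toy token filter: lowercase every token."""
    return [t.lower() for t in tokens]


def stop(tokens, stopwords=frozenset({"the", "a", "an"})):
    """Toy token filter: drop stop words (hypothetical word list)."""
    return [t for t in tokens if t not in stopwords]


def analyze(text, char_filters, tokenizer, token_filters):
    """Run the stages in order: char filters -> one tokenizer -> token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze(
    "<p>The QUICK brown fox</p>",
    char_filters=[html_strip],
    tokenizer=whitespace_tokenizer,
    token_filters=[lowercase, stop],
)
print(tokens)  # ['quick', 'brown', 'fox']
```

The order of the token filters matters in this sketch: running stop before lowercase would let "The" through, because the stop list only contains lowercase words.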

How to test an analyzer

POST _analyze
{
  "analyzer": "whitespace",
  "text":     "The quick brown fox."
}

POST _analyze
{
  "tokenizer": "standard",
  "filter":  [ "lowercase", "asciifolding" ],
  "text":      "Is this déja vu?"
}

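Understanding position and offset

position is the ordinal of the token in the token stream; start_offset and end_offset are the character offsets of the token in the original text.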

{
      "token": "The",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "quick",
      "start_offset": 4,
      "end_offset": 9,
      "type": "word",
      "position": 1
    }
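To make the two notions concrete, here is a minimal Python sketch that reproduces the shape of the _analyze output for a whitespace-style tokenizer. It is an illustration only: whitespace_tokens is a hypothetical helper, and the real tokenizers are considerably more involved.

```python
import re


def whitespace_tokens(text):
    """Split on whitespace and record, for every token, its character
    offsets in the original text and its ordinal position."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\S+", text)):
        tokens.append({
            "token": match.group(),
            "start_offset": match.start(),  # index of the first character
            "end_offset": match.end(),      # index just past the last character
            "type": "word",
            "position": position,           # 0-based ordinal in the token stream
        })
    return tokens


for token in whitespace_tokens("The quick brown fox."):
    print(token)
```

Offsets are what highlighting relies on, while positions are what phrase and proximity queries compare.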

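Built-in character filters

HTML Strip Character Filter
        html_strip: strips HTML tags and decodes HTML entities such as &amp;.
Mapping Character Filter
        mapping: replaces occurrences of the specified strings with the specified replacements.
Pattern Replace Character Filter
        pattern_replace: performs regular-expression replacement.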

HTML Strip Character Filter

Test:

POST _analyze
{
  "tokenizer":      "keyword", 
  "char_filter":  [ "html_strip" ],
  "text": "<p>I'm so <b>happy</b>!</p>"
}

Configuring it in an index:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter"]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip",
          "escaped_tags": ["b"]
        }
      }
    }
  }
}

Test:

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>I'm so <b>happy</b>!</p>"
}

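escaped_tags lists the tags that should be left in place. If no tags need to be exempted, no custom definition is required: reference html_strip directly in the char_filter of my_analyzer above.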
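The effect of escaped_tags can be approximated in a few lines of Python. This is a rough sketch, assuming a simple regular expression is good enough to recognise tags; the real html_strip filter also decodes entities and treats block-level tags differently, so its output will not match this toy function character for character.

```python
import re

# Matches an opening or closing tag and captures the tag name.
TAG_PATTERN = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)[^>]*>")


def html_strip(text, escaped_tags=()):
    """Remove every HTML tag except those whose name is in escaped_tags."""
    keep = {name.lower() for name in escaped_tags}

    def replace(match):
        # Keep the tag verbatim if it is exempted, otherwise drop it.
        return match.group(0) if match.group(1).lower() in keep else ""

    return TAG_PATTERN.sub(replace, text)


print(html_strip("<p>I'm so <b>happy</b>!</p>"))
print(html_strip("<p>I'm so <b>happy</b>!</p>", escaped_tags=["b"]))
```

The first call drops every tag; the second leaves the b element in place, which mirrors the difference between the plain html_strip filter and my_char_filter.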

Mapping character filter

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-mapping-charfilter.html

Pattern Replace Character Filter

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-replace-charfilter.html

Built-in tokenizers

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html

The integrated Chinese analysis plugin IK Analyzer provides two tokenizers: ik_smart and ik_max_word.

Testing a tokenizer

POST _analyze
{
  "tokenizer":      "standard", 
  "text": "张三说的确实在理"
}

POST _analyze
{
  "tokenizer":      "ik_smart", 
  "text": "张三说的确实在理"
}

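Built-in token filters

ES ships with many built-in token filters; for details see https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenfilters.html

Lowercase Token Filter: lowercase, converts tokens to lowercase
Stop Token Filter: stop, removes stop words
Synonym Token Filter: synonym, handles synonyms
Note: the Chinese analysis plugin IK Analyzer has stop-word filtering built in.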

Synonym Token Filter

PUT /test_index
{
    "settings": {
        "index" : {
            "analysis" : {
                "analyzer" : {
                    "my_ik_synonym" : {
                        "tokenizer" : "ik_smart",
                        "filter" : ["synonym"]
                    }
                },
                "filter" : {
                    "synonym" : {
                        "type" : "synonym",
                        "synonyms_path" : "analysis/synonym.txt"
                    }
                }
            }
        }
    }
}
synonyms_path: the synonym file, given as a path relative to the config directory.

Synonym definition format

ES supports two synonym formats: Solr and WordNet.

Define the synonyms below in analysis/synonym.txt using the Solr format. The file must be UTF-8 encoded.

张三,李四
电饭煲,电饭锅 => 电饭煲   
电脑 => 计算机,computer

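One rule per line. A comma-separated list declares the words equivalent; => maps the words on its left to the normalized form(s) on its right.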
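A short Python sketch can make the two rule shapes explicit. It is a simplified reading of the Solr format, assuming that an equivalence line expands every word to the whole group and that an explicit => line replaces the left-hand words with the right-hand ones; parse_solr_synonyms is a hypothetical helper, not part of any Elasticsearch client.

```python
def parse_solr_synonyms(lines):
    """Turn Solr-style synonym rules into a {word: [replacement, ...]} map."""
    mapping = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):  # skip blanks and comments
            continue
        if "=>" in line:
            # Explicit mapping: left-hand words are replaced by the right-hand ones.
            left, right = line.split("=>", 1)
            sources = [w.strip() for w in left.split(",") if w.strip()]
            targets = [w.strip() for w in right.split(",") if w.strip()]
        else:
            # Equivalence: every word maps to the whole group, itself included.
            sources = targets = [w.strip() for w in line.split(",") if w.strip()]
        for source in sources:
            replacements = mapping.setdefault(source, [])
            for target in targets:
                if target not in replacements:
                    replacements.append(target)
    return mapping


rules = [
    "张三,李四",
    "电饭煲,电饭锅 => 电饭煲",
    "电脑 => 计算机,computer",
]
for word, replacements in parse_solr_synonyms(rules).items():
    print(word, "->", replacements)
```

Under these assumptions 电饭锅 is normalized to 电饭煲, 电脑 becomes both 计算机 and computer, and 张三 and 李四 each expand to the pair.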

Test:

POST test_index/_analyze
{
  "analyzer": "my_ik_synonym",
  "text": "张三说的确实在理"
}

POST test_index/_analyze
{
  "analyzer": "my_ik_synonym",
  "text": "我想买个电饭锅和一个电脑"
}

Use the results of these examples to see how synonyms are handled.

Built-in analyzers

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html

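The integrated Chinese analysis plugin IK Analyzer provides two analyzers: ik_smart and ik_max_word.

Built-in and plugin-provided analyzers can be used directly. If they do not meet our needs, we can combine character filters, a tokenizer, and token filters to define a custom analyzer.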

Custom analyzers

A custom analyzer combines:

zero or more character filters
a tokenizer
zero or more token filters

Configuration:

PUT my_index8
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_ik_analyzer": {
          "type": "custom",
          "tokenizer": "ik_smart",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
             "synonym"
          ]
        }
      },
      "filter": {
        "synonym": {
          "type": "synonym",
          "synonyms_path": "analysis/synonym.txt"
        }
      }
    }
  }
}

Specifying an analyzer for a field

PUT my_index8/_mapping/_doc
{
  "properties": {
    "title": {
        "type": "text",
        "analyzer": "my_ik_analyzer"
    }
  }
}

PUT my_index8/_mapping/_doc
{
  "properties": {
    "title": {
        "type": "text",
        "analyzer": "my_ik_analyzer",
        "search_analyzer": "other_analyzer" 
    }
  }
}

PUT my_index8/_doc/1
{
  "title": "张三说的确实在理"
}

GET /my_index8/_search
{
  "query": {
    "term": {
      "title": "张三"
    }
  }
}

Defining a default analyzer for an index

PUT /my_index10
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "ik_smart",
          "filter": [
            "synonym"
          ]
        }
      },
      "filter": {
        "synonym": {
          "type": "synonym",
          "synonyms_path": "analysis/synonym.txt"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title": {
          "type": "text"
        }
      }
    }
  }
}

PUT my_index10/_doc/1
{
  "title": "张三说的确实在理"
}

GET /my_index10/_search
{
  "query": {
    "term": {
      "title": "张三"
    }
  }
}

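Order in which analyzers are chosen

An analyzer can be specified per query, per field, and per index.

At index time, ES picks the analyzer in the following order:

The analyzer specified in the field mapping.
If the field mapping specifies none, the analyzer named default in the index settings.
If the index settings define no default analyzer, the standard analyzer.

At search time, ES picks the analyzer in the following order: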

The analyzer defined in a full-text query.
The search_analyzer defined in the field mapping.
The analyzer defined in the field mapping.
An analyzer named default_search in the index settings.
An analyzer named default in the index settings.
The standard analyzer.
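The two lookup chains above amount to "take the first candidate that is defined". A minimal Python sketch follows, assuming the field mapping is a plain dict and the index settings are reduced to a set of analyzer names; both function names are hypothetical.

```python
def pick_index_analyzer(field_mapping=None, index_analyzers=()):
    """Analyzer used when a field value is indexed."""
    field_mapping = field_mapping or {}
    candidates = [
        field_mapping.get("analyzer"),                       # 1. field mapping
        "default" if "default" in index_analyzers else None,  # 2. index default
        "standard",                                          # 3. fallback
    ]
    return next(c for c in candidates if c)


def pick_search_analyzer(query_analyzer=None, field_mapping=None, index_analyzers=()):
    """Analyzer used when a full-text query against the field is analyzed."""
    field_mapping = field_mapping or {}
    candidates = [
        query_analyzer,                                       # 1. given in the query
        field_mapping.get("search_analyzer"),                 # 2. field search_analyzer
        field_mapping.get("analyzer"),                        # 3. field analyzer
        "default_search" if "default_search" in index_analyzers else None,
        "default" if "default" in index_analyzers else None,
        "standard",                                           # 6. fallback
    ]
    return next(c for c in candidates if c)


title = {"type": "text", "analyzer": "my_ik_analyzer", "search_analyzer": "other_analyzer"}
print(pick_index_analyzer(title))                              # my_ik_analyzer
print(pick_search_analyzer(field_mapping=title))               # other_analyzer
print(pick_search_analyzer(field_mapping={"type": "text"},
                           index_analyzers={"default"}))       # default
```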

Document management

Creating documents

# Specify the document id: creates the document, or overwrites it if it already exists
PUT twitter/_doc/1
{
    "id": 1,
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}

# Create a document with an automatically generated id
POST twitter/_doc/
{
    "id": 1,
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}

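{
  "_index": "twitter",              // index the document belongs to
  "_type": "_doc",                  // mapping type the document belongs to
  "_id": "p-D3ymMBl4RK_V6aWu_V",    // document id
  "_version": 1,                    // document version
  "result": "created",
  "_shards": {                      // how the write went on the shard copies
    "total": 3,                     // shard copies (primary + replicas) the write should be applied to
    "successful": 1,                // copies on which the write succeeded
    "failed": 0                     // copies on which the write failed
  },
  "_seq_no": 0,                     // sequence number assigned to this operation on the shard
  "_primary_term": 3                // primary term: incremented each time a new primary is promoted for the shard
}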

Getting a single document

# Check whether the document exists
HEAD twitter/_doc/11

GET twitter/_doc/1

GET twitter/_doc/1?_source=false

GET twitter/_doc/1/_source

{
  "_index": "twitter",
  "_type": "_doc",
  "_id": "1",
  "_version": 2,
  "found": true,
  "_source": {
    "id": 1,
    "user": "kimchy",
    "post_date": "2009-11-15T14:12:12",
    "message": "trying out Elasticsearch"
  }
}

# Retrieving stored fields: first create an index whose mapping marks a field as stored
PUT twitter11
{
   "mappings": {
      "_doc": {
         "properties": {
            "counter": {
               "type": "integer",
               "store": false
            },
            "tags": {
               "type": "keyword",
               "store": true
            }
         }
      }
   }
}

PUT twitter11/_doc/1
{
    "counter" : 1,
    "tags" : ["red"]
}

GET twitter11/_doc/1?stored_fields=tags,counter

Getting multiple documents: _mget

GET /_mget
{
    "docs" : [
        {
            "_index" : "twitter",
            "_type" : "_doc",
            "_id" : "1"
        },
        {
            "_index" : "twitter",
            "_type" : "_doc",
            "_id" : "2",
            "stored_fields" : ["field3", "field4"]
        }
    ]
}

GET /twitter/_mget
{
    "docs" : [
        {
            "_type" : "_doc",
            "_id" : "1"
        },
        {
            "_type" : "_doc",
            "_id" : "2"
        }
    ]
}

GET /twitter/_doc/_mget
{
    "docs" : [
        {
            "_id" : "1"
        },
        {
            "_id" : "2"
        }
    ]
}

GET /twitter/_doc/_mget
{
    "ids" : ["1", "2"]
}

The _source and stored_fields request parameters can be given either in the URL or in the JSON request body.

Deleting documents

# Delete the document with the given id
DELETE twitter/_doc/1

# Use the version to control the delete
DELETE twitter/_doc/1?version=1

{
    "_shards" : {
        "total" : 2,
        "failed" : 0,
        "successful" : 2
    },
    "_index" : "twitter",
    "_type" : "_doc",
    "_id" : "1",
    "_version" : 2,
    "_primary_term": 1,
    "_seq_no": 5,
    "result": "deleted"
}

Delete by query

POST twitter/_delete_by_query
{
  "query": { 
    "match": {
      "message": "some message"
    }
  }
}

POST twitter/_doc/_delete_by_query?conflicts=proceed
{
  "query": {
    "match_all": {}
  }
}
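When some documents hit a version conflict, the delete is not aborted: the conflicts are counted and the remaining documents that match the query are still deleted.

Use the task API to inspect delete-by-query tasks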

GET _tasks?detailed=true&actions=*/delete/byquery

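# Check the status of a specific task (a task id has the form node_id:task_number)
GET /_tasks/r1A2WoRbTwKZ516z6NEs5A:36619

# Cancel the task
POST _tasks/r1A2WoRbTwKZ516z6NEs5A:36619/_cancel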

{
  "nodes" : {
    "r1A2WoRbTwKZ516z6NEs5A" : {
      "name" : "r1A2WoR",
      "transport_address" : "127.0.0.1:9300",
      "host" : "127.0.0.1",
      "ip" : "127.0.0.1:9300",
      "attributes" : {
        "testattr" : "test",
        "portsfile" : "true"
      },
      "tasks" : {
        "r1A2WoRbTwKZ516z6NEs5A:36619" : {
          "node" : "r1A2WoRbTwKZ516z6NEs5A",
          "id" : 36619,
          "type" : "transport",
          "action" : "indices:data/write/delete/byquery",
          "status" : {    
            "total" : 6154,
            "updated" : 0,
            "created" : 0,
            "deleted" : 3500,
            "batches" : 36,
            "version_conflicts" : 0,
            "noops" : 0,
            "retries": 0,
            "throttled_millis": 0
          },
          "description" : ""
        }
      }
    }
  }
}

Updating documents

# Update (overwrite) the document with the given id
PUT twitter/_doc/1
{
    "id": 1,
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}

# Optimistic concurrency control for concurrent updates
PUT twitter/_doc/1?version=1
{
    "id": 1,
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}

{
  "_index": "twitter",
  "_type": "_doc",
  "_id": "1",
  "_version": 3,
  "result": "updated",
  "_shards": {
    "total": 3,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 2,
  "_primary_term": 3
}

Scripted update: updating a document with a script

# 1. Prepare a document
PUT uptest/_doc/1
{
    "counter" : 1,
    "tags" : ["red"]
}

# 2. Add 4 to the counter of document 1
POST uptest/_doc/1/_update
{
    "script" : {
        "source": "ctx._source.counter += params.count",
        "lang": "painless",
        "params" : {
            "count" : 4
        }
    }
}

# 3. Append an element to the array
POST uptest/_doc/1/_update
{
    "script" : {
        "source": "ctx._source.tags.add(params.tag)",
        "lang": "painless",
        "params" : {
            "tag" : "blue"
        }
    }
}

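About the script: painless is a scripting language built into ES; ctx is the execution context object (through it you can also access _index, _type, _id, _version, _routing and _now, the current timestamp); params is the map of parameters.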

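Note: scripted updates require the _source field to be enabled on the index. An update runs as follows:
1. Fetch the original document.
2. Run the script against the original data in the _source field.
3. Delete the original indexed document.
4. Index the modified document.
Compared with doing the get and the index yourself, this only saves some network round trips and reduces the chance of a version conflict between the get and the index.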

POST uptest/_doc/1/_update
{
    "script" : "ctx._source.new_field = 'value_of_new_field'"
}
4. Add a field

POST uptest/_doc/1/_update
{
    "script" : "ctx._source.remove('new_field')"
}
5. Remove a field

POST uptest/_doc/1/_update
{
    "script" : {
        "source": "if (ctx._source.tags.contains(params.tag)) { ctx.op = 'delete' } else { ctx.op = 'none' }",
        "lang": "painless",
        "params" : {
            "tag" : "green"
        }
    }
}
6. Delete the document or do nothing, depending on a condition

POST uptest/_doc/1/_update
{
    "doc" : {
        "name" : "new_name"
    }
}
7. Merge the fields of the supplied partial document into the existing document

{
  "_index": "uptest",
  "_type": "_doc",
  "_id": "1",
  "_version": 4,
  "result": "noop",
  "_shards": {
    "total": 0,
    "successful": 0,
    "failed": 0
  }
}
8. Running step 7 again: the content is unchanged, so nothing is done (noop)

POST uptest/_doc/1/_update
{
    "doc" : {
        "name" : "new_name"
    },
    "detect_noop": false
}
9. Disable noop detection

POST uptest/_doc/1/_update
{
    "script" : {
        "source": "ctx._source.counter += params.count",
        "lang": "painless",
        "params" : {
            "count" : 4
        }
    },
    "upsert" : {
        "counter" : 1
    }
}
10. upsert: if the document to update exists, the script is run to update it; otherwise the content of upsert is indexed as a new document.
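The behaviours in steps 7 to 10 (partial-document merge, noop detection, upsert) can be modelled with an in-memory dict. This is a simplified sketch, not the Elasticsearch implementation: the store layout and the update function are hypothetical, a Python callable stands in for a painless script, and the merge is shallow whereas the real partial update merges nested objects recursively.

```python
import copy


def update(store, doc_id, doc=None, script=None, upsert=None, detect_noop=True):
    """Apply an update to store[doc_id] and return 'created', 'updated' or 'noop'."""
    current = store.get(doc_id)
    if current is None:
        if upsert is None:
            raise KeyError("document %s does not exist" % doc_id)
        # Document missing: the upsert body becomes the new document.
        store[doc_id] = {"_version": 1, "_source": copy.deepcopy(upsert)}
        return "created"

    source = copy.deepcopy(current["_source"])
    if script is not None:
        script(source)          # the "script" mutates a copy of _source
    if doc is not None:
        source.update(doc)      # shallow merge of the partial document

    # A doc merge that changes nothing is skipped when noop detection is on.
    if detect_noop and script is None and source == current["_source"]:
        return "noop"

    store[doc_id] = {"_version": current["_version"] + 1, "_source": source}
    return "updated"


def add_four(source):
    source["counter"] += 4


store = {}
print(update(store, "1", script=add_four, upsert={"counter": 1}))    # created
print(update(store, "1", script=add_four, upsert={"counter": 1}))    # updated
print(update(store, "1", doc={"name": "new_name"}))                  # updated
print(update(store, "1", doc={"name": "new_name"}))                  # noop
print(update(store, "1", doc={"name": "new_name"}, detect_noop=False))  # updated
print(store["1"])
```

In this model the version only advances when something is actually written, which is why the noop call leaves it untouched.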

Update by query

POST twitter/_update_by_query
{
  "script": {
    "source": "ctx._source.likes++",
    "lang": "painless"
  },
  "query": {
    "term": {
      "user": "kimchy"
    }
  }
}
Updates the documents that match a query.

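Bulk operations

The bulk API, /_bulk, lets us perform multiple index, create, delete, and update operations in a single call, which can greatly speed up indexing. The request body must be newline-delimited JSON in the following layout: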

action_and_meta_data\n
optional_source\n
action_and_meta_data\n
optional_source\n
....
action_and_meta_data\n
optional_source\n

POST _bulk
{ "index" : { "_index" : "test", "_type" : "_doc", "_id" : "1" } }
{ "field1" : "value1" }
{ "delete" : { "_index" : "test", "_type" : "_doc", "_id" : "2" } }
{ "create" : { "_index" : "test", "_type" : "_doc", "_id" : "3" } }
{ "field1" : "value3" }
{ "update" : {"_id" : "1", "_type" : "_doc", "_index" : "test"} }
{ "doc" : {"field2" : "value2"} }

action_and_meta_data: the action can be index, create, delete, or update; the metadata consists of _index, _type, and _id. A delete action has no source line.

The request endpoint can be /_bulk, /{index}/_bulk, or /{index}/{type}/_bulk.
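Because the body is newline-delimited JSON rather than a single JSON document, it is easy to get wrong when assembled by hand. The Python sketch below builds such a body with the standard json module only; build_bulk_body and the tuple layout of operations are assumptions made for this example and are not part of any client library.

```python
import json


def build_bulk_body(operations):
    """Serialize (action, metadata, source) tuples into a _bulk request body.

    Every action line is followed by a source line, except for delete,
    and the body must end with a newline.
    """
    lines = []
    for action, metadata, source in operations:
        lines.append(json.dumps({action: metadata}))
        if action == "delete":
            continue  # delete carries no source line
        if source is None:
            raise ValueError("action %r requires a source line" % action)
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"


operations = [
    ("index", {"_index": "test", "_type": "_doc", "_id": "1"}, {"field1": "value1"}),
    ("delete", {"_index": "test", "_type": "_doc", "_id": "2"}, None),
    ("create", {"_index": "test", "_type": "_doc", "_id": "3"}, {"field1": "value3"}),
    ("update", {"_index": "test", "_type": "_doc", "_id": "1"}, {"doc": {"field2": "value2"}}),
]
print(build_bulk_body(operations), end="")
```

The resulting string can be saved to a file and sent with curl --data-binary, which, unlike -d, preserves the newlines.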

Bulk indexing multiple documents with curl and a JSON file

curl -H "Content-Type: application/json" -XPOST "localhost:9200/bank/_doc/_bulk?pretty&refresh" --data-binary "@accounts.json"

curl "localhost:9200/_cat/indices?v"

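Reindex

The reindex API, /_reindex, copies the documents of one index into another index. It requires _source to be enabled on the source index. The settings and mappings of the destination index are independent of the source index: they are not copied over.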

POST _reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter"
  }
}

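One question to consider when reindexing: if the destination index already contains documents from the source index, how are their versions handled?
1. If version_type is not specified, or is set to internal, the versions in the destination index are used, and every document is simply created or overwritten.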

POST _reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter",
    "version_type": "internal"
  }
}

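2. To use the versions from the source index for version control, set version_type to external. Reindexing then creates the documents that are missing from the destination and updates those whose version in the destination is older than in the source.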

POST _reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter",
    "version_type": "external"
  }
}

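If you only want to copy documents that do not yet exist in the destination index, set op_type to create. Documents that already exist then cause a version conflict, which by default aborts the operation; set "conflicts": "proceed" to skip them and continue.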

POST _reindex
{
  "conflicts": "proceed",
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter",
    "op_type": "create"
  }
}
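The three modes discussed above (internal versioning, external versioning, and op_type create) differ only in when an existing destination document counts as a conflict. The Python sketch below is a toy model of those rules using dicts of id -> (version, document); the reindex function and its return value are hypothetical and ignore batching, scripts, and everything else the real API does.

```python
def reindex(source, dest, version_type="internal", op_type="index", conflicts="abort"):
    """Copy documents from source into dest and return simple counters."""
    stats = {"created": 0, "updated": 0, "version_conflicts": 0}
    for doc_id, (src_version, doc) in source.items():
        existing = dest.get(doc_id)
        if existing is None:
            # Missing in the destination: always created.
            new_version = src_version if version_type == "external" else 1
            dest[doc_id] = (new_version, doc)
            stats["created"] += 1
            continue

        conflict = (
            op_type == "create"  # create never overwrites
            or (version_type == "external" and src_version <= existing[0])
        )
        if conflict:
            stats["version_conflicts"] += 1
            if conflicts != "proceed":
                raise RuntimeError("version conflict on document %s" % doc_id)
            continue  # conflicts=proceed: count it and move on

        new_version = src_version if version_type == "external" else existing[0] + 1
        dest[doc_id] = (new_version, doc)
        stats["updated"] += 1
    return stats


source = {"1": (5, {"user": "kimchy"}), "2": (1, {"user": "other"})}
dest = {"1": (7, {"user": "old"})}
print(reindex(source, dest, op_type="create", conflicts="proceed"))
print(dest)
```

With op_type create and conflicts proceed, document 1 is left untouched and only document 2 is copied; without conflicts proceed the same call would stop at the first conflict.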

You can also reindex only part of the source index, selecting the data by type or by query.

POST _reindex
{
  "source": {
    "index": "twitter",
    "type": "_doc",
    "query": {
      "term": {
        "user": "kimchy"
      }
    }
  },
  "dest": {
    "index": "new_twitter"
  }
}

Data can be read from multiple sources

POST _reindex
{
  "source": {
    "index": ["twitter", "blog"],
    "type": ["_doc", "post"]
  },
  "dest": {
    "index": "all_together"
  }
}

POST _reindex
{
  "size": 10000,         // limits the number of documents copied
  "source": {
    "index": "twitter",
    "sort": { "date": "desc" }
  },
  "dest": {
    "index": "new_twitter"
  }
}

POST _reindex
{                       // selects which fields of the source documents to copy
  "source": {
    "index": "twitter",
    "_source": ["user", "_doc"]
  },
  "dest": {
    "index": "new_twitter"
  }
}

POST _reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter",
    "version_type": "external"
  },
  "script": {               // a script can modify the documents
    "source": "if (ctx._source.foo == 'bar') {ctx._version++; ctx._source.remove('foo')}",
    "lang": "painless"
  }
}

POST _reindex
{
  "source": {
    "index": "source",
    "query": {
      "match": {
        "company": "cat"
      }
    }
  },
  "dest": {               // a routing value can be specified
    "index": "dest",
    "routing": "=cat"
  }
}

POST _reindex
{
  "source": {            // copy from a remote cluster
    "remote": {
      "host": "http://otherhost:9200",
      "username": "user",
      "password": "pass"
    },
    "index": "source",
    "query": {
      "match": {
        "test": "data"
      }
    }
  },
  "dest": {
    "index": "dest"
  }
}

Checking execution status through the _tasks API

GET _tasks?detailed=true&actions=*reindex

?refresh

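For index, update, and delete operations, add the refresh parameter if the change must be visible to search as soon as the operation completes.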

PUT /test/_doc/1?refresh
{"test": "test"}
PUT /test/_doc/2?refresh=true
{"test": "test"}

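Values of refresh

Empty or true: the affected shards are refreshed immediately after the operation.
false: the same as omitting the parameter; the change becomes visible at the next periodic refresh.
wait_for: the request waits until a refresh has made the change visible before responding; it does not force a refresh. If index.max_refresh_listeners requests (defaults to 1000) are already waiting for a refresh on the shard, the request behaves as refresh=true and forces one.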
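The difference between refresh=true and refresh=false comes down to when a write becomes visible to search. The toy class below models only that idea, assuming a write first lands in a buffer and is exposed to search by a refresh; ToyShard and its methods are invented for this sketch and say nothing about how Elasticsearch stores data internally.

```python
class ToyShard:
    """Writes go to a buffer and become searchable only after a refresh."""

    def __init__(self):
        self.buffer = {}      # indexed but not yet visible to search
        self.searchable = {}  # visible to search

    def index(self, doc_id, doc, refresh=False):
        self.buffer[doc_id] = doc
        if refresh:           # refresh=true: refresh right after the write
            self.refresh()

    def refresh(self):
        """Also called periodically in the real system; here it is manual."""
        self.searchable.update(self.buffer)
        self.buffer.clear()

    def search_ids(self):
        return sorted(self.searchable)


shard = ToyShard()
shard.index("1", {"test": "test"})                 # refresh=false
print(shard.search_ids())                          # []
shard.index("2", {"test": "test"}, refresh=True)   # refresh=true
print(shard.search_ids())                          # ['1', '2']
```

A refresh exposes everything buffered so far, which is why document 1 also becomes visible once document 2 is written with refresh=true.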

Routing in detail

Cluster composition

Index creation flow

Node failure

Indexing a document

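How documents are routed

Which shard should a document be stored on?
Document routing is the process of deciding which shard a document goes to. ES computes the shard for each document as follows: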

shard = hash(routing) % number_of_primary_shards

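routing is the value fed to the hash; by default it is the document id. A different routing value can be supplied through the routing parameter when indexing a document.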

POST twitter/_doc?routing=kimchy
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}

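The routing parameter can be used with index, delete, update, and search requests to target specific shards; a search may pass several comma-separated values.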
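The routing formula is simple enough to try out directly. In the Python sketch below, zlib.crc32 is only a stand-in for the hash function Elasticsearch actually uses, so the shard numbers it prints will not match a real cluster; pick_shard is a hypothetical helper.

```python
import zlib


def pick_shard(routing, number_of_primary_shards):
    """shard = hash(routing) % number_of_primary_shards.

    zlib.crc32 is an assumption made for this example; it merely gives a
    deterministic, non-negative hash of the routing value.
    """
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


# Without an explicit routing value the document id is hashed.
for doc_id in ["1", "2", "3", "4"]:
    print("doc", doc_id, "-> shard", pick_shard(doc_id, 3))

# With routing=kimchy every document lands on the same shard,
# whatever its id is.
print("routing kimchy -> shard", pick_shard("kimchy", 3))

# The same routing value may map to a different shard if the number of
# primary shards changes.
print("routing kimchy with 5 shards -> shard", pick_shard("kimchy", 5))
```

Since the shard depends on the number of primary shards, a document written under one shard count could not be located by the same formula under another.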

PUT my_index2
{
  "mappings": {
    "_doc": {
      "_routing": {
        "required": true 
      }
    }
  }
}
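This makes a routing value mandatory.

Question: relational databases have partitioned tables; selecting a partition reduces the amount of data an operation touches and improves efficiency. Can the same be done with an ES index?

Yes: by specifying a routing value, the data of one partition is kept together on one shard. For example, to store data by department, use the department as the routing value.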

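Search

Search steps, taking a search on index s1 as the example:
1. node2 parses the query.
2. node2 forwards the query to the nodes holding a copy (primary or replica: R1, R2, R0) of each shard of index s1.
3. Each of those nodes executes the query and sends its results to node2.
4. node2 merges the results and responds.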
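The steps above follow a scatter-gather pattern, which can be sketched in Python. This is an illustration under strong assumptions: each shard is a plain list of documents with a precomputed score field, matching is a whitespace term lookup, and search_shard and coordinate are hypothetical names.

```python
import heapq


def search_shard(shard_docs, term, size):
    """Step 3: one shard copy runs the query and returns its own top hits."""
    hits = [
        (doc["score"], doc["_id"])
        for doc in shard_docs
        if term in doc["text"].split()
    ]
    return heapq.nlargest(size, hits)


def coordinate(shards, term, size=3):
    """Steps 2 and 4: fan the query out, then merge the partial results."""
    partial_results = [search_shard(shard, term, size) for shard in shards]
    merged = heapq.nlargest(size, (hit for hits in partial_results for hit in hits))
    return [doc_id for _score, doc_id in merged]


shards = [
    [{"_id": "a", "text": "es routing", "score": 1.2},
     {"_id": "b", "text": "es search", "score": 0.4}],
    [{"_id": "c", "text": "es shards", "score": 2.0},
     {"_id": "d", "text": "other topic", "score": 9.9}],
    [{"_id": "e", "text": "es replicas", "score": 0.9}],
]
print(coordinate(shards, "es", size=3))  # ['c', 'a', 'e']
```

Each shard only needs to return its own top size hits, because no document outside a shard's top size can appear in the global top size.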
