Elasticsearch Search: Core Concepts and Practice, Day 02

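I. Inverted Index

1. Search Engines

  • Forward index - maps a document ID to the document's content and terms
  • Inverted index - maps a term to the IDs of the documents that contain it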

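2. Core Components of an Inverted Index

An inverted index has two parts:
  • Term Dictionary - records every term that appears in the documents, and maps each term to its posting list

The term dictionary is usually large. It can be implemented with a B+ tree or a hash table with separate chaining to support fast inserts and lookups.

  • Posting List - records the set of documents that contain a term; it is made up of postings
  • Posting - one entry in a posting list, containing:
    1. Document ID
    2. Term frequency (TF) - the number of times the term occurs in the document, used for relevance scoring
    3. Position - the position of the term among the document's tokens, used for phrase search (phrase query)
    4. Offset - the start and end character offsets of the term, used for highlighting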
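
The structure above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Lucene stores its index: the function name build_inverted_index, the sample documents, and the \w+ tokenization are all assumptions made for the example.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Build a toy inverted index from {doc_id: text}.

    Returns {term: [posting, ...]}, where each posting holds the
    document ID, term frequency, token positions, and character offsets.
    """
    index = defaultdict(list)  # term dictionary: term -> posting list
    for doc_id, text in docs.items():
        occurrences = defaultdict(list)  # term -> [(position, start, end)]
        for position, match in enumerate(re.finditer(r"\w+", text)):
            term = match.group().lower()
            occurrences[term].append((position, match.start(), match.end()))
        for term, hits in occurrences.items():
            index[term].append({
                "doc_id": doc_id,
                "tf": len(hits),
                "positions": [pos for pos, _, _ in hits],
                "offsets": [(start, end) for _, start, end in hits],
            })
    return dict(index)


# Hypothetical sample documents
docs = {
    1: "Mastering Elasticsearch",
    2: "Elasticsearch Server",
    3: "Elasticsearch Essentials and Elasticsearch Basics",
}
index = build_inverted_index(docs)
for posting in index["elasticsearch"]:
    print(posting)
```

Looking up a term is a single dictionary access, which is why the inverted index suits full-text search: the cost depends on the length of the posting list rather than on the number of documents scanned.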

II. Tokenizing Text with an Analyzer

GET _analyze
{
  // Swap in simple, whitespace, stop, keyword, or pattern to compare the analyzers below
  "analyzer": "standard",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

1. standard analyzer

  • The default analyzer
  • Splits text on word boundaries
  • Lowercases tokens
    Result:
{
  "tokens" : [
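    {
      "token" : "2", // the token text
      "start_offset" : 0, // character offset where the token starts
      "end_offset" : 1, // character offset where the token ends
      "type" : "<NUM>", // the token type
      "position" : 0 // the token's position in the token stream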
    },
    {
      "token" : "running",
      "start_offset" : 2,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "quick",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "brown",
      "start_offset" : 16,
      "end_offset" : 21,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "foxes",
      "start_offset" : 22,
      "end_offset" : 27,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "leap",
      "start_offset" : 28,
      "end_offset" : 32,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "over",
      "start_offset" : 33,
      "end_offset" : 37,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "lazy",
      "start_offset" : 38,
      "end_offset" : 42,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "dogs",
      "start_offset" : 43,
      "end_offset" : 47,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "in",
      "start_offset" : 48,
      "end_offset" : 50,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "the",
      "start_offset" : 51,
      "end_offset" : 54,
      "type" : "<ALPHANUM>",
      "position" : 10
    },
    {
      "token" : "summer",
      "start_offset" : 55,
      "end_offset" : 61,
      "type" : "<ALPHANUM>",
      "position" : 11
    },
    {
      "token" : "evening",
      "start_offset" : 62,
      "end_offset" : 69,
      "type" : "<ALPHANUM>",
      "position" : 12
    }
  ]
}

2. simple analyzer

  • Splits text on non-letter characters
  • Discards everything that is not a letter
  • Lowercases tokens
    Result:
{
  "tokens" : [
    {
      "token" : "running",
      "start_offset" : 2,
      "end_offset" : 9,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "quick",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "brown",
      "start_offset" : 16,
      "end_offset" : 21,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "foxes",
      "start_offset" : 22,
      "end_offset" : 27,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "leap",
      "start_offset" : 28,
      "end_offset" : 32,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "over",
      "start_offset" : 33,
      "end_offset" : 37,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "lazy",
      "start_offset" : 38,
      "end_offset" : 42,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "dogs",
      "start_offset" : 43,
      "end_offset" : 47,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "in",
      "start_offset" : 48,
      "end_offset" : 50,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "the",
      "start_offset" : 51,
      "end_offset" : 54,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "summer",
      "start_offset" : 55,
      "end_offset" : 61,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "evening",
      "start_offset" : 62,
      "end_offset" : 69,
      "type" : "word",
      "position" : 11
    }
  ]
}

3. whitespace analyzer

  • Splits text on whitespace
    Result:
{
  "tokens" : [
    {
      "token" : "2",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "running",
      "start_offset" : 2,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "Quick",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "brown-foxes",
      "start_offset" : 16,
      "end_offset" : 27,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "leap",
      "start_offset" : 28,
      "end_offset" : 32,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "over",
      "start_offset" : 33,
      "end_offset" : 37,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "lazy",
      "start_offset" : 38,
      "end_offset" : 42,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "dogs",
      "start_offset" : 43,
      "end_offset" : 47,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "in",
      "start_offset" : 48,
      "end_offset" : 50,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "the",
      "start_offset" : 51,
      "end_offset" : 54,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "summer",
      "start_offset" : 55,
      "end_offset" : 61,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "evening.",
      "start_offset" : 62,
      "end_offset" : 70,
      "type" : "word",
      "position" : 11
    }
  ]
}

4. stop analyzer

  • Same as the Simple Analyzer, with an added stop filter
  • Removes stop words such as the, a, and is
    Result:
{
  "tokens" : [
    {
      "token" : "running",
      "start_offset" : 2,
      "end_offset" : 9,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "quick",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "brown",
      "start_offset" : 16,
      "end_offset" : 21,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "foxes",
      "start_offset" : 22,
      "end_offset" : 27,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "leap",
      "start_offset" : 28,
      "end_offset" : 32,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "over",
      "start_offset" : 33,
      "end_offset" : 37,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "lazy",
      "start_offset" : 38,
      "end_offset" : 42,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "dogs",
      "start_offset" : 43,
      "end_offset" : 47,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "summer",
      "start_offset" : 55,
      "end_offset" : 61,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "evening",
      "start_offset" : 62,
      "end_offset" : 69,
      "type" : "word",
      "position" : 11
    }
  ]
}

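5. keyword analyzer
Does not tokenize; emits the entire input as a single term

6. pattern analyzer

  • Tokenizes with a regular expression
  • The default pattern is \W+, which splits on runs of non-word characters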
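
To see what splitting on \W+ does to the sample sentence, here is a rough Python imitation. It is a sketch under assumptions: pattern_analyze is a hypothetical helper, lowercasing is included because the pattern analyzer lowercases by default, and Python's \W is not guaranteed to match Java's regex semantics for non-ASCII text.

```python
import re


def pattern_analyze(text, pattern=r"\W+"):
    """Split text on the pattern, lowercase the pieces, drop empty strings."""
    return [piece.lower() for piece in re.split(pattern, text) if piece]


tokens = pattern_analyze(
    "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
)
print(tokens)
```

Note that brown-foxes becomes two tokens and the trailing period disappears, because both the hyphen and the period are non-word characters.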

7. english analyzer

III. Search API and URI Search in Detail

1. URI Search

  • Pass query parameters in the URI

2. Request Body Search

  • Uses the more complete, JSON-based Query Domain Specific Language (DSL) provided by Elasticsearch

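3. Search Relevance

  • Search is a conversation between the user and the search engine
  • What users care about is the relevance of the results
  • Can all relevant content be found?
  • How much irrelevant content is returned?
  • Are the documents scored reasonably?
  • Balance the result ranking against business requirements

PageRank algorithm

  • It is not only about the content
  • More important is the credibility of the content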

4. Measuring Relevance

  • Information Retrieval
  • Precision - return as few irrelevant documents as possible
  • Recall - return as many relevant documents as possible
  • Ranking - can the results be sorted by relevance?
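
Precision and recall can be computed directly from the set of returned documents and the set of truly relevant documents. The sketch below uses the standard Information Retrieval definitions; the function name and the document IDs are made up for illustration.

```python
def precision_recall(returned_ids, relevant_ids):
    """Return (precision, recall) for one query.

    precision = relevant documents returned / all documents returned
    recall    = relevant documents returned / all relevant documents
    """
    returned, relevant = set(returned_ids), set(relevant_ids)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents returned, 5 documents actually relevant
precision, recall = precision_recall(
    returned_ids=[1, 2, 3, 4],
    relevant_ids=[2, 4, 5, 6, 7],
)
print(precision, recall)
```

Here two of the four returned documents are relevant, and two of the five relevant documents were found. Returning more documents tends to raise recall and lower precision, which is the trade-off the bullets above describe.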

5. URI Search

a. Specifying a field

Find documents whose title field matches 2012 (df sets the default field)

GET /movies/_search?q=2012&df=title
{
  "profile": "true"
}
b. Searching all fields

Find documents in which any field matches 2012

GET /movies/_search?q=2012
{
	"profile": "true"
}
c. Term and Phrase
  • Beautiful Mind is equivalent to Beautiful OR Mind
  • "Beautiful Mind" is equivalent to Beautiful AND Mind; as a phrase query, it also requires the terms to appear in that order
// Quoted terms: phrase query
GET /movies/_search?q=title:"Beautiful Mind"
{
   "profile": "true"
}
d. Grouping
// Grouping with parentheses: Bool query
GET /movies/_search?q=title:(Beautiful Mind)
{
   "profile": "true"
}

Must contain both Beautiful and Mind

// Find "A Beautiful Mind" with the AND operator
GET /movies/_search?q=title:(Beautiful AND Mind)
{
   "profile": "true"
}
// Using + instead: %2B is the URL-encoded +, which marks Mind as required
GET /movies/_search?q=title:(Beautiful %2BMind)
{
   "profile": "true"
}
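
The %2B in the request above is the percent-encoded form of +. A literal + inside a URL query string is commonly decoded as a space, so it has to be encoded to reach the query parser intact. A small sketch with Python's standard urllib.parse shows the encoding; the choice of safe characters here is only to keep the output readable.

```python
from urllib.parse import quote, unquote

query = "title:(Beautiful +Mind)"

# Keep ':' and parentheses readable; encode everything else that is unsafe.
encoded = quote(query, safe=":()")
print(encoded)

# Decoding restores the original query text.
print(unquote(encoded))
```

Tools such as the Kibana Console accept the space as typed, which is why only the + is encoded in the example above.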

Must contain Beautiful and must not contain Mind

// NOT operator: exclude titles that contain Mind
GET /movies/_search?q=title:(Beautiful NOT Mind)
{
   "profile": "true"
}
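e. Range queries

Year greater than or equal to 1980

// Range query; interval notation and math notation are both supported (math notation shown)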
GET /movies/_search?q=year:>=1980
{
   "profile": "true"
}
f. Wildcard queries
  • ? matches exactly one character; * matches zero or more characters

title:mi?d

title:be*
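
The two wildcard rules can be tried out locally with Python's fnmatch module, which uses the same meaning for ? and *. This only illustrates the matching rules against a made-up list of terms; it says nothing about how Elasticsearch executes wildcard queries.

```python
from fnmatch import fnmatchcase

# Hypothetical lowercase terms, as they might appear in a term dictionary
terms = ["mind", "mild", "mid", "be", "beautiful", "big"]

# ? must match exactly one character, so "mid" does not match "mi?d"
single_char = [term for term in terms if fnmatchcase(term, "mi?d")]

# * matches zero or more characters, so "be" itself matches "be*"
prefix = [term for term in terms if fnmatchcase(term, "be*")]

print(single_char)
print(prefix)
```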

IV. Request Body Search, Query DSL, and Query String & Simple Query String Queries

1. Request Body Search

  • Send the query to Elasticsearch in the HTTP request body
  • Query DSL

2. Query Expression: Match

POST /movies/_search
{
  "query": {
    "match": {
      "title": "Last Christmas"
    }
  }
}

POST /movies/_search
{
  "query": {
    "match": {
      "title":{
        "query": "Last Christmas",
        "operator": "and"
      }
    }
  }
}

3. Phrase Search: Match Phrase

POST /movies/_search
{
  "query": {
    "match_phrase": {
      "title":{
     // The terms must appear in this order
        "query": "one love",
     // slop: how many other terms may appear in between
        "slop": 1
      }
    }
  }
}
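
The effect of slop can be approximated with token positions. The sketch below is a deliberate simplification: it only accepts terms in their original order and counts the extra positions between them, whereas the real slop is a position-move budget that can also allow reordering at higher values. The function phrase_match and the sample titles are hypothetical.

```python
def phrase_match(doc_tokens, phrase_tokens, slop=0):
    """Return True if phrase_tokens occur in order in doc_tokens
    with at most `slop` extra positions between them in total."""
    positions = {}
    for pos, token in enumerate(doc_tokens):
        positions.setdefault(token, []).append(pos)

    def search(term_idx, prev_pos, budget):
        if term_idx == len(phrase_tokens):
            return True
        for pos in positions.get(phrase_tokens[term_idx], []):
            if term_idx == 0:
                gap = 0  # the first term may start anywhere
            elif pos <= prev_pos:
                continue  # terms must keep their order
            else:
                gap = pos - prev_pos - 1
            if gap <= budget and search(term_idx + 1, pos, budget - gap):
                return True
        return False

    return search(0, -1, slop)


phrase = ["one", "love"]
print(phrase_match("one love".split(), phrase))                  # adjacent
print(phrase_match("one true love".split(), phrase, slop=0))     # one word between
print(phrase_match("one true love".split(), phrase, slop=1))
print(phrase_match("one big true love".split(), phrase, slop=1)) # two words between
```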

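4. Simple Query String Query

  • Similar to Query String, but it ignores invalid syntax and supports only a subset of the query syntax
  • Does not support AND, OR, or NOT; they are treated as plain strings
  • The default relationship between terms is OR; a different operator can be specified
  • Supports a subset of logical operators

+ replaces AND

| replaces OR

- replaces NOT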
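
As a sketch of how these operators are used, the Python snippet below builds a request body for a simple_query_string query and prints it as JSON, ready to send with POST /movies/_search. The helper name simple_query_string_body is hypothetical; query, fields, and default_operator are parameters of the simple_query_string query, and the movies index and title field are taken from the examples above.

```python
import json


def simple_query_string_body(query, fields, default_operator="OR"):
    """Build a request body for a simple_query_string search."""
    return {
        "query": {
            "simple_query_string": {
                "query": query,
                "fields": fields,
                "default_operator": default_operator,
            }
        }
    }


# "+" stands in for AND, so both terms are required in the title
body = simple_query_string_body("Beautiful +Mind", ["title"])
print(json.dumps(body, indent=2))
```

Passing default_operator="AND" instead would require every term even without the + sign.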