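In real-world location search, we often need to boost the results that fall inside a particular region, so that they score higher and appear at the top of the returned hits. Some typical scenarios:

  • In geo-location search, boost the results in a given region so that more attention is drawn to the population of that region. In Elasticsearch, we can search by administrative region. You can see how this is done in the article. More about EMS (Elastic Maps Service) can be found at the link.
  • In practice, regions are rarely divided along administrative boundaries. Take a specialized industry such as express delivery: certain couriers may be assigned to one area, but once every parcel in that area has been delivered, they can help deliver in adjacent areas. In this case we can boost the area each courier is responsible for, so that parcels in their own area rank first in the search results and parcels in adjacent areas come next.

For both scenarios, we may need to define a custom region. We can draw the region we want as a Polygon and boost the search results inside it.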


Approach

We can use the compound query provided by Elasticsearch:

{
  "query": {
    "bool": {
      "must": [
        <query for the search region>
      ],
      "should": [
        <query that boosts the region intersecting the search region>
      ]
    }
  }
}

If you are not familiar with compound queries, please refer to my earlier article “开始使用Elasticsearch (2)” (Getting started with Elasticsearch (2)).
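The template above can also be assembled programmatically. The following is only a minimal sketch: the function name `region_boost_query` and the `match_all` placeholder clauses are made up for illustration and are not part of the original article. It shows that `must` decides which documents are returned, while `should` only contributes extra score.

```python
def region_boost_query(search_region_clause, boosted_region_clause):
    """Build a bool query body.

    `must` restricts the hits to the search region;
    `should` is optional and only adds score to documents that match it.
    """
    return {
        "query": {
            "bool": {
                "must": [search_region_clause],
                "should": [boosted_region_clause],
            }
        }
    }


# Placeholder clauses, used only to show the shape of the resulting body.
body = region_boost_query({"match_all": {}}, {"match_all": {}})
print(body)
```

In a real request the two placeholders would be replaced by geo queries such as the ones used later in this article.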

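Preparing the data

Before doing this exercise, you may want to read my earlier article “Elasticsearch:如何制作 GeoJSON 文件并进行地理位置搜索” (Elasticsearch: how to create a GeoJSON file and run geo-location searches). There I describe in detail how to import the data and how to use GeoJSON to draw a boundary. For today's exercise, we use the following data: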

POST my_locations/_bulk
{ "index" : { "_id" : "3" } }
{ "location" : [ -104.06876, 39.77462 ], "name": "C" }
{ "index" : { "_id" : "4" } }
{ "location" : [ -103.59538, 38.5718 ], "name": "D" }
{ "index" : { "_id" : "5" } }
{ "location" : [ -104.94538, 38.16629 ], "name": "E" }
{ "index" : { "_id" : "1" } }
{ "location" : [ -105.38369, 40.11067 ], "name": "A" }
{ "index" : { "_id" : "6" } }
{ "location" : [ -107.99602, 39.17918 ], "name": "F" }
{ "index" : { "_id" : "2" } }
{ "location" : [ -104.34051, 40.03688 ], "name": "B" }
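The geo queries used later only work if `location` is indexed as a geo field. The article does not show the mapping, so the sketch below assumes that the index was created beforehand with `location` mapped as `geo_point` (presumably as in the earlier article). It also shows how the same `_bulk` body could be generated from a list of documents. The names `mapping`, `docs` and `bulk_body` are illustrative only, and nothing here talks to a cluster.

```python
import json

# Assumption: my_locations is created with this mapping before the bulk request.
mapping = {"mappings": {"properties": {"location": {"type": "geo_point"}}}}

# (id, [lon, lat], name), in the same order as the bulk request above.
docs = [
    ("3", [-104.06876, 39.77462], "C"),
    ("4", [-103.59538, 38.5718], "D"),
    ("5", [-104.94538, 38.16629], "E"),
    ("1", [-105.38369, 40.11067], "A"),
    ("6", [-107.99602, 39.17918], "F"),
    ("2", [-104.34051, 40.03688], "B"),
]


def bulk_body(documents):
    """Return the newline-delimited JSON body expected by the _bulk API."""
    lines = []
    for doc_id, location, name in documents:
        lines.append(json.dumps({"index": {"_id": doc_id}}))
        lines.append(json.dumps({"location": location, "name": name}))
    # The _bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


print(json.dumps(mapping))
print(bulk_body(docs))
```

Note that the coordinate arrays are in [longitude, latitude] order, which is the order `geo_point` expects for the array form.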

Run the _bulk request above and create the corresponding index pattern. As in the earlier article, we also created a GeoJSON file for display purposes:

simple.json

{
    "type": "FeatureCollection",
    "features": [
    {
        "type": "Feature",
        "properties": {},
        "geometry": {
            "type": "Polygon",
            "coordinates": [
                [
                    [-106.10465, 40.16875],
                    [-106.0736, 39.33315],
                    [-105.142, 39.16482],
                    [-103.85329, 39.18889],
                    [-103.52723, 39.77609],
                    [-104.17935, 40.27545],
                    [-105.17305, 40.33465],
                    [-106.10465, 40.16875]
                ]
            ]
        }
    },
    {
        "type": "Feature",
        "properties": {},
        "geometry": {
            "type": "Polygon",
            "coordinates": [
                [
                    [-109.07025, 41.00014],
                    [-109.07025, 36.99584],
                    [-102.02114, 36.99584],
                    [-102.02114, 41.00014],
                    [-109.07025, 41.00014]
                ]
            ]
        }
    }
]
}

We can draw the corresponding boundaries as described in the article “Elasticsearch:如何制作 GeoJSON 文件并进行地理位置搜索”:

[Figure: map showing the rectangle and the Polygon from simple.json together with documents A to F]

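As the figure above shows, documents A, B and C lie inside the defined Polygon, while D, E and F lie outside it. Our requirements are:

  1. Find all documents located inside the rectangle
  2. Boost all documents located inside the Polygon so that they score higher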
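Without the map at hand, the statement about which documents lie inside the Polygon can be checked with a few lines of code. The sketch below uses a plain planar ray-casting test on the raw [lon, lat] values. It is only a sanity check on the sample data and is not how Elasticsearch evaluates geo queries internally. The helper name `point_in_ring` is made up for this example.

```python
def point_in_ring(point, ring):
    """Planar ray-casting test; `ring` is a closed list of [lon, lat] pairs."""
    px, py = point
    inside = False
    for (xi, yi), (xj, yj) in zip(ring, ring[1:]):
        # Only edges that straddle the point's latitude can cross the ray.
        if (yi > py) != (yj > py):
            # Longitude at which the edge crosses the horizontal ray.
            x_cross = xi + (py - yi) * (xj - xi) / (yj - yi)
            if px < x_cross:
                inside = not inside
    return inside


polygon = [
    [-106.10465, 40.16875], [-106.0736, 39.33315], [-105.142, 39.16482],
    [-103.85329, 39.18889], [-103.52723, 39.77609], [-104.17935, 40.27545],
    [-105.17305, 40.33465], [-106.10465, 40.16875],
]
rectangle = [
    [-109.07025, 41.00014], [-109.07025, 36.99584], [-102.02114, 36.99584],
    [-102.02114, 41.00014], [-109.07025, 41.00014],
]
docs = {
    "A": [-105.38369, 40.11067], "B": [-104.34051, 40.03688],
    "C": [-104.06876, 39.77462], "D": [-103.59538, 38.5718],
    "E": [-104.94538, 38.16629], "F": [-107.99602, 39.17918],
}

inside_polygon = sorted(n for n, p in docs.items() if point_in_ring(p, polygon))
inside_rectangle = sorted(n for n, p in docs.items() if point_in_ring(p, rectangle))
print(inside_polygon)    # ['A', 'B', 'C']
print(inside_rectangle)  # ['A', 'B', 'C', 'D', 'E', 'F']
```

All six documents satisfy requirement 1, and only A, B and C qualify for the boost in requirement 2.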

Search results

Based on the requirements above, we can run the following search:

GET my_locations/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "geo_shape": {
            "location": {
              "shape": {
                "type": "polygon",
                "coordinates": [
                  [
                    [-109.07025, 41.00014],
                    [-109.07025, 36.99584],
                    [-102.02114, 36.99584],
                    [-102.02114, 41.00014],
                    [-109.07025, 41.00014]
                  ]
                ]
              }
            }
          }
        }
      ],
      "should": [
        {
          "geo_polygon": {
            "location": {
              "points": [
                [-106.10465, 40.16875],
                [-106.0736, 39.33315],
                [-105.142, 39.16482],
                [-103.85329, 39.18889],
                [-103.52723, 39.77609],
                [-104.17935, 40.27545],
                [-105.17305, 40.33465],
                [-106.10465, 40.16875]
              ]
            }
          }
        }
      ]
    }
  }
}

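Note that in must we use the coordinates of the rectangle from the GeoJSON file simple.json, while in should we use the coordinates of the polygon. Since a rectangle can be regarded as a special case of a polygon, the must clause searches it with geo_shape using the polygon type. For the rectangle you could of course use geo_bounding_box instead.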
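As a rough illustration of the geo_bounding_box alternative mentioned above, the sketch below derives the `top_left` and `bottom_right` corners from the rectangle's ring and builds the clause that could replace the geo_shape entry in must. The helper name `bounding_box_clause` is hypothetical, and the clause has not been run against a cluster here.

```python
rectangle = [
    [-109.07025, 41.00014], [-109.07025, 36.99584], [-102.02114, 36.99584],
    [-102.02114, 41.00014], [-109.07025, 41.00014],
]


def bounding_box_clause(field, ring):
    """Build a geo_bounding_box clause from a rectangular [lon, lat] ring."""
    lons = [lon for lon, _ in ring]
    lats = [lat for _, lat in ring]
    return {
        "geo_bounding_box": {
            field: {
                # top_left is the north-west corner, bottom_right the south-east one.
                "top_left": {"lat": max(lats), "lon": min(lons)},
                "bottom_right": {"lat": min(lats), "lon": max(lons)},
            }
        }
    }


must_clause = bounding_box_clause("location", rectangle)
print(must_clause)
```

The corner objects use explicit `lat` and `lon` keys, which avoids any confusion with the [lon, lat] order of the array form.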

The bool query returns the following:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 6,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my_locations",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "location" : [
            -104.06876,
            39.77462
          ],
          "name" : "C"
        }
      },
      {
        "_index" : "my_locations",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "location" : [
            -105.38369,
            40.11067
          ],
          "name" : "A"
        }
      },
      {
        "_index" : "my_locations",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "location" : [
            -104.34051,
            40.03688
          ],
          "name" : "B"
        }
      },
      {
        "_index" : "my_locations",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 0.0,
        "_source" : {
          "location" : [
            -103.59538,
            38.5718
          ],
          "name" : "D"
        }
      },
      {
        "_index" : "my_locations",
        "_type" : "_doc",
        "_id" : "5",
        "_score" : 0.0,
        "_source" : {
          "location" : [
            -104.94538,
            38.16629
          ],
          "name" : "E"
        }
      },
      {
        "_index" : "my_locations",
        "_type" : "_doc",
        "_id" : "6",
        "_score" : 0.0,
        "_source" : {
          "location" : [
            -107.99602,
            39.17918
          ],
          "name" : "F"
        }
      }
    ]
  }
}

The returned results show that documents A, B and C score higher and are ranked first.

If we do not apply the boost:

GET my_locations/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "geo_shape": {
            "location": {
              "shape": {
                "type": "polygon",
                "coordinates": [
                  [
                    [-109.07025, 41.00014],
                    [-109.07025, 36.99584],
                    [-102.02114, 36.99584],
                    [-102.02114, 41.00014],
                    [-109.07025, 41.00014]
                  ]
                ]
              }
            }
          }
        }
      ]
    }
  }
}

The search returns:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 6,
      "relation" : "eq"
    },
    "max_score" : 0.0,
    "hits" : [
      {
        "_index" : "my_locations",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.0,
        "_source" : {
          "location" : [
            -104.06876,
            39.77462
          ],
          "name" : "C"
        }
      },
      {
        "_index" : "my_locations",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 0.0,
        "_source" : {
          "location" : [
            -103.59538,
            38.5718
          ],
          "name" : "D"
        }
      },
      {
        "_index" : "my_locations",
        "_type" : "_doc",
        "_id" : "5",
        "_score" : 0.0,
        "_source" : {
          "location" : [
            -104.94538,
            38.16629
          ],
          "name" : "E"
        }
      },
      {
        "_index" : "my_locations",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.0,
        "_source" : {
          "location" : [
            -105.38369,
            40.11067
          ],
          "name" : "A"
        }
      },
      {
        "_index" : "my_locations",
        "_type" : "_doc",
        "_id" : "6",
        "_score" : 0.0,
        "_source" : {
          "location" : [
            -107.99602,
            39.17918
          ],
          "name" : "F"
        }
      },
      {
        "_index" : "my_locations",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.0,
        "_source" : {
          "location" : [
            -104.34051,
            40.03688
          ],
          "name" : "B"
        }
      }
    ]
  }
}

Here we can see that A, B and C do not necessarily come first.
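The relationship between the two responses can be reproduced offline. The sketch below takes the order of the unboosted response and the scores reported in the boosted response, then sorts by score while keeping ties in their original order. That ties keep their original order is an assumption that happens to match the two responses shown in this article, not a documented guarantee of Elasticsearch.

```python
# Order returned by the query without the should clause (every score is 0.0).
unscored_order = ["C", "D", "E", "A", "F", "B"]

# Scores reported in the boosted response: 1.0 inside the Polygon, 0.0 outside.
should_score = {"A": 1.0, "B": 1.0, "C": 1.0, "D": 0.0, "E": 0.0, "F": 0.0}

# Python's sort is stable, so documents with equal scores keep their relative order.
ranked = sorted(unscored_order, key=lambda name: should_score[name], reverse=True)
print(ranked)  # ['C', 'A', 'B', 'D', 'E', 'F']
```

The result matches the hit order of the boosted response: the should clause only lifts A, B and C above the rest and does not change which documents are returned.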

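One point to note: in recent versions, geo_shape is the recommended replacement for geo_polygon. In practice, however, I found that a geo_shape search does not produce any score: the score is 0. geo_bounding_box and geo_polygon, on the other hand, do produce a score, so in this kind of scenario I recommend using them for scoring.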
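If geo_shape has to be used in the should clause, one option that may be worth trying is to wrap it in a constant_score query, which gives every matching document a fixed score equal to its `boost`. This is not from the original article and has not been run against a cluster here, so treat it as a sketch. The helper name `weighted_region` and the boost value 2.0 are arbitrary choices for illustration.

```python
polygon_ring = [
    [-106.10465, 40.16875], [-106.0736, 39.33315], [-105.142, 39.16482],
    [-103.85329, 39.18889], [-103.52723, 39.77609], [-104.17935, 40.27545],
    [-105.17305, 40.33465], [-106.10465, 40.16875],
]


def weighted_region(field, ring, boost):
    """Wrap a geo_shape polygon filter so that every match scores `boost`."""
    return {
        "constant_score": {
            "filter": {
                "geo_shape": {
                    field: {
                        "shape": {"type": "polygon", "coordinates": [ring]}
                    }
                }
            },
            "boost": boost,
        }
    }


should_clause = weighted_region("location", polygon_ring, 2.0)
print(should_clause)
```

With several regions, one such clause per region with different boost values would rank a courier's own area above the adjacent ones, as in the delivery scenario from the introduction.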