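Contents
- Search basics
- Analyzers
- Requests
- SearchSourceBuilder
- Queries
- QueryBuilders
- QueryStringQuery
- Sorting
- Scroll cursors
- Aggregations
- AggregationBuilders
- Nested aggregations
- Sorting
- Combining queries and aggregations
- Deduplication with collapse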
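The previous article, "Elastic中index与document基本操作" (basic index and document operations in Elastic), covered the basics of Elastic together with index and document operations; this one covers the query and aggregation operations most commonly used in Elasticsearch.
Search basics
Elasticsearch analyzes (tokenizes) the text of a document and builds an inverted index from the resulting tokens. To match the complete, unanalyzed value of a field, use its keyword sub-field ({field}.keyword), which the default dynamic mapping creates for string fields.
All examples assume documents with the following structure: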
{
"number": 1802,
"name": "Name 36zou",
"age": 28,
"courses": [
{
"name": "maths",
"hours": 160,
"teacher": "mike"
},
{
"name": "english",
"hours": 120,
"teacher": "tom"
}
]
}
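Analyzers
An analyzer takes a string as input, splits it into individual terms or tokens (possibly discarding characters such as punctuation), and outputs a token stream.
Built-in analyzers:
Analyzer | Description |
standard | Default analyzer: splits text on word boundaries, drops most punctuation (underscores are kept) and lowercases the terms; stop-word removal is available but disabled by default |
simple | Splits text at every non-letter character and lowercases the terms; digits are dropped |
whitespace | Splits on whitespace only, with no normalization and no lowercasing; not suitable for Chinese, which is not whitespace-delimited |
pattern | Splits with a regular expression, by default \W+ (runs of non-word characters) |
keyword | No tokenization: the whole input is emitted as a single token (for whole-value queries, use {field}.keyword) |
language | Language-specific analyzers |
custom | User-defined analyzer |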
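To make the differences in the table concrete, here is a rough plain-Java imitation of three of the analyzers applied to the sample name. It is only a sketch: the class AnalyzerSketch and its methods are hypothetical, regex splitting merely approximates what the real tokenizers do, and no Elasticsearch classes are involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration; these methods only approximate the built-in analyzers.
class AnalyzerSketch {

    // whitespace-like: split on whitespace, keep case, digits and punctuation
    static List<String> whitespace(String text) {
        return split(text, "\\s+", false);
    }

    // simple-like: split at every non-letter character, then lowercase
    static List<String> simple(String text) {
        return split(text, "[^\\p{L}]+", true);
    }

    // keyword-like: the whole input becomes a single token
    static List<String> keyword(String text) {
        return List.of(text);
    }

    private static List<String> split(String text, String regex, boolean lower) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split(regex)) {
            if (!part.isEmpty()) { // a leading delimiter produces an empty first part
                tokens.add(lower ? part.toLowerCase(Locale.ROOT) : part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String name = "Name 36zou";
        System.out.println(whitespace(name)); // [Name, 36zou]
        System.out.println(simple(name));     // [name, zou]
        System.out.println(keyword(name));    // [Name 36zou]
    }
}
```

The simple-like variant shows why a term query for "Name" would miss a field analyzed this way: only the lowercase token name is indexed.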
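Requests
ES builds a query with SearchRequest; calling search returns a SearchResponse, which contains the matching documents and the total number of matches. A SearchRequest is constructed with the index name(s):
- several indices may be given, in which case all of them are searched;
- * acts as a wildcard: test* matches every index whose name starts with test.
Each SearchHit is one returned document, and SearchHits is the container the response returns them in:
- getSourceAsString: returns the document source as a JSON string;
- getSourceAsMap: returns the source as a Map for convenient access; numeric values arrive as boxed Number objects whose concrete type (Integer, Long, Double) depends on the value, so convert them through Number or toString rather than a fixed cast;
- getHits: returns the array of hits; getTotalHits().value on the SearchHits object is the total number of matching documents in ES (in ES 7.x this count is exact only up to 10,000 unless trackTotalHits(true) is set on the SearchSourceBuilder), while getHits().length is the number of documents returned in this response.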
public void searchQuery(String index, SearchSourceBuilder sourceBuilder) {
    // ESClient.getClient() is the article's own helper that creates the client
    try (RestHighLevelClient rhlClient = ESClient.getClient()) {
        SearchRequest reqSearch = new SearchRequest(index);
        reqSearch.source(sourceBuilder);
        SearchResponse respSearch = rhlClient.search(reqSearch, RequestOptions.DEFAULT);
        SearchHits gotHits = respSearch.getHits();
        System.out.printf("get size: %d, total size: %d\n", gotHits.getHits().length, gotHits.getTotalHits().value);
        for (SearchHit hit : gotHits) {
            // System.out.println(hit.getSourceAsString());
            Map<String, Object> mapHit = hit.getSourceAsMap();
            String name = mapHit.get("name").toString();
            // going through toString works whichever numeric type the map holds
            Integer age = Integer.valueOf(mapHit.get("age").toString());
            System.out.printf("name: %s, age: %d\n", name, age);
        }
    } catch (Exception ex) {
        System.out.println(index + " query fail: " + ex);
    }
}
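Because getSourceAsMap returns a Map<String, Object>, numeric fields have to be converted by hand. The sketch below uses only the JDK, with a hand-built map standing in for a hit's source; SourceMapSketch and intField are hypothetical names. Going through Number avoids a ClassCastException whichever boxed type the map happens to hold.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; the map below only stands in for hit.getSourceAsMap().
class SourceMapSketch {

    // Reads an integer field, tolerating Integer, Long, Double or a numeric String.
    static int intField(Map<String, Object> source, String field, int fallback) {
        Object value = source.get(field);
        if (value instanceof Number) {
            return ((Number) value).intValue();
        }
        if (value instanceof String) {
            try {
                return Integer.parseInt(((String) value).trim());
            } catch (NumberFormatException ignored) {
                return fallback;
            }
        }
        return fallback; // field missing or of an unexpected type
    }

    public static void main(String[] args) {
        Map<String, Object> source = new HashMap<>();
        source.put("name", "Name 36zou");
        source.put("age", 28L);     // pretend this value arrived as a Long
        source.put("number", 1802); // ... and this one as an Integer
        System.out.println(intField(source, "age", -1));     // 28
        System.out.println(intField(source, "number", -1));  // 1802
        System.out.println(intField(source, "missing", -1)); // -1
    }
}
```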
SearchSourceBuilder
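SearchSourceBuilder is the source of a SearchRequest; it determines the query conditions, the number of hits, the sort order, the returned fields, and so on:
- from/size: paging (return size hits starting at offset from); the defaults are from 0 and size 10; by default from + size may not exceed 10,000 (the index.max_result_window setting), and results beyond that have to be fetched with scroll;
- sort: sets the sort order;
- query(QueryBuilder): sets the query;
- aggregation(AggregationBuilder): sets an aggregation;
- collapse(CollapseBuilder): collapses (deduplicates) hits on a field;
- suggest(SuggestBuilder): sets suggestions (input hints based on matches);
- postFilter(QueryBuilder): sets a post filter, which is applied after the aggregations are computed and therefore filters the hits without affecting the aggregations;
- fetchSource: selects which source fields are returned. includes should name fields that actually exist in the documents, while excludes may also name fields that do not exist. Passing null or an empty array for includes returns every field; fetchSource(false) returns only the metadata. Metadata such as _index, _id and _score is always returned and cannot be excluded;
- timeout: sets the query timeout;
- terminateAfter(int n): stops the search early once n documents have been collected (the limit applies per shard);
- highlighter(HighlightBuilder): sets highlighting.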
private SearchSourceBuilder buildTermQuerySource(String field, String word) {
    SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
    BoolQueryBuilder boolQuery = QueryBuilders.boolQuery();
    boolQuery.must(QueryBuilders.termQuery(field, word));
    sourceBuilder.query(boolQuery);
    sourceBuilder.from(0);
    sourceBuilder.size(5);
    sourceBuilder.timeout(new TimeValue(60, TimeUnit.SECONDS));
    sourceBuilder.sort("name.keyword");
    String[] excludeFields = new String[]{"@time", "@version"};
    sourceBuilder.fetchSource(null, excludeFields);
    return sourceBuilder;
}
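The from/size window is usually derived from a page number. Below is a minimal sketch, assuming the default limit of 10,000 for from + size; PageWindow, MAX_RESULT_WINDOW and fromFor are hypothetical names, and the returned offset would be passed to sourceBuilder.from(...) alongside sourceBuilder.size(...).

```java
// Hypothetical helper that turns a page number into a "from" offset.
class PageWindow {

    // Assumed default upper bound for from + size.
    static final int MAX_RESULT_WINDOW = 10_000;

    // Offset of the first hit on a zero-based page; rejects windows past the limit.
    static int fromFor(int page, int size) {
        if (page < 0 || size <= 0) {
            throw new IllegalArgumentException("page must be >= 0 and size > 0");
        }
        long from = (long) page * size; // long arithmetic avoids int overflow
        if (from + size > MAX_RESULT_WINDOW) {
            throw new IllegalArgumentException(
                    "from + size = " + (from + size) + " exceeds " + MAX_RESULT_WINDOW + "; use scroll instead");
        }
        return (int) from;
    }

    public static void main(String[] args) {
        System.out.println(fromFor(0, 5));    // 0
        System.out.println(fromFor(1999, 5)); // 9995, the last page that still fits
    }
}
```

Checking the window on the client side gives a clearer error than letting the request fail on the server, and marks the point where paging should switch to scroll.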
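Queries
By default Elastic stores analyzed tokens in lowercase (field.keyword stores the complete value unchanged). A query that matches tokens directly must therefore supply lowercase terms; query types that analyze their input lowercase it automatically during analysis.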
QueryBuilders
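QueryBuilders makes it easy to construct query conditions:
- matchAllQuery: matches everything;
- termQuery: exact match, case-sensitive; termsQuery matches several values at once;
- matchPhraseQuery: analyzes the input and requires the terms to appear in the same order; handy for exact matching of Chinese text;
- queryStringQuery: matches several terms and supports AND/OR;
- fuzzyQuery: fuzzy match (the allowed number of differing characters can be set);
- prefixQuery: prefix match;
- wildcardQuery: wildcard match (* matches zero or more characters, ? matches exactly one); to match a literal backslash (\\ in a Java string literal), escape it (\\\\);
- rangeQuery: range match:
  - from/to: set the start and end of the range, .from("fieldValue1").to("fieldValue2").includeUpper(false).includeLower(false);
  - gt/gte/lt/lte: greater-than and less-than comparisons (for equality, use termQuery);
- boolQuery: combines conditions:
  - must: equivalent to AND;
  - mustNot: equivalent to NOT;
  - should: equivalent to OR;
  - filter: returned documents must satisfy the filter clause, but unlike must it does not contribute to the score.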
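The wildcard rules in the list (*, ? and backslash escaping) can be illustrated by translating a pattern into a regular expression. This is a JDK-only sketch of the matching semantics, not how Elasticsearch implements wildcardQuery; WildcardSketch, toRegex and matches are hypothetical names.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of wildcard semantics using java.util.regex.
class WildcardSketch {

    // Translates a wildcard pattern into a regex: * -> .*, ? -> ., \x -> literal x.
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < wildcard.length(); i++) {
            char c = wildcard.charAt(i);
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else if (c == '\\' && i + 1 < wildcard.length()) {
                // escaped character: take the next one literally
                regex.append(Pattern.quote(String.valueOf(wildcard.charAt(++i))));
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    // True when the whole value matches the wildcard pattern.
    static boolean matches(String wildcard, String value) {
        return toRegex(wildcard).matcher(value).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("ma*", "maths"));    // true
        System.out.println(matches("t?m", "tom"));      // true
        System.out.println(matches("t?m", "team"));     // false: ? is exactly one character
        System.out.println(matches("a\\\\b", "a\\b"));  // true: escaped backslash matches a literal one
    }
}
```

The last call shows the double escaping: the Java literal "a\\\\b" is the pattern a\\b, whose escaped backslash matches the single backslash in the value a\b.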
QueryStringQuery
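QueryStringQuery can search several fields at once through fields (when none is specified, all fields are searched). Multiple terms in the query string are combined with OR by default; default_operator changes the operator used.
By combining several fields with Boolean operators, QueryStringQuery can select documents precisely. Within the query string:
- AND/OR/NOT perform Boolean operations, e.g. big AND fat (the operators need a space on each side);
- +(must) and -(must not) are supported, e.g. +dog -cat (has dog, has no cat);
- : restricts a term to a field, e.g.
  - age:18 finds records whose age is 18;
  - name:?* finds records whose name is not empty (? matches any single character, * matches zero or more characters);
  - age:18 AND NOT name:?* finds records whose age is 18 and whose name is empty.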
private SearchSourceBuilder buildQueryStringSource(String field, String word) {
    SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
    BoolQueryBuilder boolQuery = QueryBuilders.boolQuery();
    // add the query-string query itself as the must clause
    // (the original built it, discarded it, and added a termQuery instead)
    boolQuery.must(QueryBuilders.queryStringQuery(word)
            .field(field)
            .analyzeWildcard(true)
            .defaultOperator(Operator.AND));
    sourceBuilder.query(boolQuery);
    sourceBuilder.from(0);
    sourceBuilder.size(5);
    sourceBuilder.timeout(new TimeValue(60, TimeUnit.SECONDS));
    return sourceBuilder;
}
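SimpleQueryStringQuery is a simplified version of QueryStringQuery. It does not support the Boolean keywords AND, OR and NOT, which it treats as ordinary terms; it uses the symbols +, | and - instead.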
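Sorting
ES sorts by _score by default. sourceBuilder.sort(field, SortOrder.DESC) sets a custom order; several fields may be added, and the field added first has the highest priority. SortBuilder has four concrete implementations:
- FieldSortBuilder: sorts by a specific field; to sort on a text field, **use field.keyword** as the sort field name;
- ScoreSortBuilder: sorts by score;
- GeoDistanceSortBuilder: sorts by geographic distance;
- ScriptSortBuilder: sorts by a custom script.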
private SearchSourceBuilder buildSortSource(String... field) {
    SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
    sourceBuilder.from(0);
    sourceBuilder.size(20);
    for (String f : field) {
        sourceBuilder.sort(f, SortOrder.DESC);
    }
    return sourceBuilder;
}
buildSortSource("age", "name.keyword");
// sort by age first; records with the same age are then sorted by name
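The priority rule (first sort field wins, later fields only break ties) is the same one a chained Comparator follows. A JDK-only sketch with made-up data; SortSketch, Student and sortByAgeThenName are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical in-memory analogue of sort("age", DESC) followed by sort("name.keyword", DESC).
class SortSketch {

    static final class Student {
        final String name;
        final int age;

        Student(String name, int age) {
            this.name = name;
            this.age = age;
        }

        @Override
        public String toString() {
            return name + "(" + age + ")";
        }
    }

    // age descending decides first; name descending only breaks ties
    static List<Student> sortByAgeThenName(List<Student> students) {
        List<Student> sorted = new ArrayList<>(students);
        sorted.sort(Comparator.comparingInt((Student s) -> s.age).reversed()
                .thenComparing((Student s) -> s.name, Comparator.reverseOrder()));
        return sorted;
    }

    public static void main(String[] args) {
        List<Student> students = Arrays.asList(
                new Student("tom", 28), new Student("amy", 30), new Student("bob", 28));
        System.out.println(sortByAgeThenName(students)); // [amy(30), tom(28), bob(28)]
    }
}
```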
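Scroll cursors
An ordinary ES search can reach at most the first 10,000 matching records (from + size); to read beyond that, use a Scroll query. Once a scroll is no longer needed, clear it so that its resources are released and later queries are not affected.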
public void scrollSearch(String index) {
    // keep-alive time of the scroll context, renewed by every scroll request
    final Scroll scroll = new Scroll(TimeValue.timeValueSeconds(60));
    SearchRequest searchRequest = new SearchRequest(index);
    searchRequest.scroll(scroll);
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    searchSourceBuilder.size(5);
    BoolQueryBuilder boolQuery = QueryBuilders.boolQuery();
    boolQuery.must(QueryBuilders.rangeQuery("age").gt(0));
    searchSourceBuilder.query(boolQuery);
    searchRequest.source(searchSourceBuilder);
    try (RestHighLevelClient rhlClient = ESClient.getClient()) {
        SearchResponse response = rhlClient.search(searchRequest, RequestOptions.DEFAULT);
        SearchHits gotHits = response.getHits();
        while (gotHits.getHits().length > 0) {
            System.out.printf("ScrollId: %s, size: %d, total: %d\n", response.getScrollId(),
                    gotHits.getHits().length, gotHits.getTotalHits().value);
            for (SearchHit hit : gotHits.getHits()) {
                System.out.println(hit.getSourceAsString());
            }
            // fetch the next batch with the current scroll id
            SearchScrollRequest scrollRequest = new SearchScrollRequest(response.getScrollId());
            scrollRequest.scroll(scroll);
            response = rhlClient.scroll(scrollRequest, RequestOptions.DEFAULT);
            gotHits = response.getHits();
            System.out.println();
        }
        // clear the scroll when done, so it does not hold resources or affect other queries
        ClearScrollRequest clearRequest = new ClearScrollRequest();
        clearRequest.addScrollId(response.getScrollId());
        ClearScrollResponse clearResponse = rhlClient.clearScroll(clearRequest, RequestOptions.DEFAULT);
        System.out.println("clear scroll: " + clearResponse.isSucceeded());
    } catch (Exception ex) {
        System.out.println(index + " query fail: " + ex);
    }
}
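Aggregations
With RestHighLevelClient, the grouping criteria are built with AggregationBuilder. Two concepts are involved:
- Buckets: a set of documents that satisfy some condition; getDocCount() returns the number of documents in a bucket;
- Metrics: statistics computed over the documents in a bucket.
An aggregation is a combination of buckets and metrics. It may consist of a single bucket, a single metric, or one of each; buckets can even contain further nested buckets.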
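The split between buckets and metrics can be imitated in memory with Java streams: the grouping step forms the buckets, and the downstream collector computes the metrics for each bucket. This is a JDK-only sketch with made-up data; BucketSketch, Doc and statsByAge are hypothetical names.

```java
import java.util.Arrays;
import java.util.IntSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical in-memory analogue of a terms bucket on age with count/sum/avg metrics on number.
class BucketSketch {

    static final class Doc {
        final int age;
        final int number;

        Doc(int age, int number) {
            this.age = age;
            this.number = number;
        }
    }

    // Bucket by age (sorted by key), then summarize number inside each bucket.
    static Map<Integer, IntSummaryStatistics> statsByAge(List<Doc> docs) {
        return docs.stream().collect(Collectors.groupingBy(
                (Doc d) -> d.age,
                TreeMap::new,
                Collectors.summarizingInt((Doc d) -> d.number)));
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(new Doc(28, 1802), new Doc(28, 1600), new Doc(30, 1500));
        statsByAge(docs).forEach((age, stats) -> System.out.println(
                "bucket " + age + ": count=" + stats.getCount()
                        + ", sum=" + stats.getSum() + ", avg=" + stats.getAverage()));
    }
}
```

Here getCount() plays the role of getDocCount() on a bucket, while the sum and average are metrics that never split the bucket further.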
AggregationBuilders
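AggregationBuilders constructs aggregations. The argument is a name that identifies the aggregation (the same name is needed later to fetch its result); the field is set with .field(f), and sub-aggregations are attached with subAggregation:
- count(name): number of values;
- avg(name): average;
- max(name): maximum;
- min(name): minimum;
- sum(name): sum;
- stats(name): summary statistics (count, min, max, avg, sum); extendedStats(name) adds variance and standard deviation;
- filter(name, QueryBuilder): a filter condition; for several conditions, use filters;
- range(name): buckets by range; addUnboundedTo/addRange/addUnboundedFrom add a range with only an upper bound, a bounded range, and a range with only a lower bound, respectively; ranges are half-open, [from, to);
- missing(name): bucket of the documents that lack the field;
- terms(name): buckets by the values of the given field;
- topHits(name): returns the documents inside an aggregation (bucket);
- histogram(name): histogram aggregation; the bucket width is set with interval;
- dateHistogram(name): date histogram aggregation; the field must be a date type; .dateHistogramInterval sets the granularity (hour, minute, second and so on; deprecated from 7.2 in favour of calendarInterval/fixedInterval), and format sets the date format (returned as the string key_as_string);
- dateRange(name): aggregates by date range; format sets the date format, and the ranges are set as for range;
- ipRange(name): aggregates by IP address range;
- geoDistance(name, GeoPoint): aggregates by distance from the given point;
- nested(name, path): aggregates over nested sub-objects.
Common parameters:
- field: the field to aggregate on; for text this most likely has to be field.keyword;
- size: number of buckets returned, 10 by default;
- min_doc_count: minimum document count; buckets with fewer documents are not returned;
- order: bucket ordering;
- missing: default value for documents that lack the field;
- ranges: array of ranges, e.g. [{from:0}, {from:50, to:100}, {to:200}];
- subAggregation: adds a sub-aggregation.
Metrics (count/avg/max/min/stats) only compute statistics over the documents of the group they belong to; they do not group anything themselves (no new sub-buckets are created):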
private static void AggregateSumAndAvgByAge(String index) {
    try (RestHighLevelClient rhlClient = ESClient.getClient()) {
        SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
        TermsAggregationBuilder termAggregation = AggregationBuilders.terms("ageTerm").field("age")
                .subAggregation(AggregationBuilders.sum("sum").field("number"))
                .subAggregation(AggregationBuilders.avg("avg").field("number"))
                .subAggregation(AggregationBuilders.topHits("details").size(2));
        sourceBuilder.aggregation(termAggregation);
        SearchRequest searchRequest = new SearchRequest(index);
        searchRequest.source(sourceBuilder);
        SearchResponse response = rhlClient.search(searchRequest, RequestOptions.DEFAULT);
        // getAggregations returns the aggregation results, looked up by name
        Aggregations aggAge = response.getAggregations();
        Terms ageTerms = aggAge.get("ageTerm");
        for (Terms.Bucket bucket : ageTerms.getBuckets()) {
            System.out.println("bucket of " + bucket.getKeyAsString() + ", count: " + bucket.getDocCount());
            Aggregations aggNumber = bucket.getAggregations();
            ParsedSum sumTerm = aggNumber.get("sum");
            ParsedAvg avgTerm = aggNumber.get("avg");
            System.out.println("\tsum: " + sumTerm.getValue() + ", avg: " + avgTerm.getValue());
            ParsedTopHits topHits = aggNumber.get("details");
            for (SearchHit detail : topHits.getHits()) {
                System.out.println("\t" + detail.getSourceAsString());
            }
        }
    } catch (Exception ex) {
        System.out.println(index + " query fail: " + ex);
    }
}
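Nested aggregations
A nested field holds sub-documents inside a document (much like a child table). Queries and aggregations on nested documents perform well; update performance is mediocre. In the sample document, courses holds such sub-documents (this assumes courses is mapped with type nested), so working with its content requires nested queries and aggregations.
AggregationBuilder aggregation = AggregationBuilders.nested("course", "courses")
        .subAggregation(AggregationBuilders.terms("hour").field("courses.hours"));
The path is set in nested; the field names used in its sub-aggregations must include that path (courses.hours, not hours).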
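What the nested aggregation counts can be imitated in memory: stepping into the courses path turns every nested object into its own unit, and the terms step then buckets those units by hours. This is a JDK-only sketch with made-up data; NestedSketch, Course and countCoursesByHours are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory analogue of nested("course", "courses") + terms on courses.hours.
class NestedSketch {

    static final class Course {
        final String name;
        final int hours;

        Course(String name, int hours) {
            this.name = name;
            this.hours = hours;
        }
    }

    // Each inner list is one student's courses; every course is counted separately.
    static Map<Integer, Long> countCoursesByHours(List<List<Course>> students) {
        Map<Integer, Long> buckets = new TreeMap<>();
        for (List<Course> courses : students) {   // step into the nested path
            for (Course course : courses) {       // one nested object = one counted unit
                buckets.merge(course.hours, 1L, Long::sum);
            }
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<List<Course>> students = Arrays.asList(
                Arrays.asList(new Course("maths", 160), new Course("english", 120)),
                Arrays.asList(new Course("maths", 160)));
        System.out.println(countCoursesByHours(students)); // {120=1, 160=2}
    }
}
```

The counts are per course, not per student: two students contribute three courses in total.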
Sorting
Bucket ordering in an aggregation is set with order, whereas TopHits uses sort (as in a query):
private static void AggregateByAge(String index) {
    try (RestHighLevelClient rhlClient = ESClient.getClient()) {
        SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
        TermsAggregationBuilder termAggregation = AggregationBuilders
                .terms("ageTerm")
                .field("age")
                .order(BucketOrder.key(true))
                .subAggregation(
                        AggregationBuilders
                                .topHits("details")
                                .sort("name.keyword", SortOrder.ASC)
                                .size(10)
                );
        sourceBuilder.aggregation(termAggregation);
        SearchRequest searchRequest = new SearchRequest(index);
        searchRequest.source(sourceBuilder);
        SearchResponse response = rhlClient.search(searchRequest, RequestOptions.DEFAULT);
        Aggregations aggAge = response.getAggregations();
        Terms ageTerms = aggAge.get("ageTerm");
        for (Terms.Bucket bucket : ageTerms.getBuckets()) {
            System.out.println("bucket of " + bucket.getKeyAsString() + ", count: " + bucket.getDocCount());
            Aggregations aggDetail = bucket.getAggregations();
            ParsedTopHits topHits = aggDetail.get("details");
            for (SearchHit detail : topHits.getHits()) {
                System.out.println("\t" + detail.getSourceAsString());
            }
        }
    } catch (Exception ex) {
        System.out.println(index + " query fail: " + ex);
    }
}
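Combining queries and aggregations
A query and an aggregation can be used together, so that only the records matching the query are aggregated: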
private static void FilterAndAggregate(String index) {
    try (RestHighLevelClient rhlClient = ESClient.getClient()) {
        SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
        sourceBuilder.size(0); // no hits are needed, only the aggregation
        BoolQueryBuilder boolQuery = QueryBuilders.boolQuery();
        boolQuery.must(QueryBuilders.rangeQuery("number").gt(1500));
        sourceBuilder.query(boolQuery);
        TermsAggregationBuilder termAggregation = AggregationBuilders
                .terms("ageTerm")
                .field("age")
                .order(BucketOrder.key(true));
        sourceBuilder.aggregation(termAggregation);
        SearchRequest searchRequest = new SearchRequest(index);
        searchRequest.source(sourceBuilder);
        SearchResponse response = rhlClient.search(searchRequest, RequestOptions.DEFAULT);
        // getAggregations returns the aggregation results
        Aggregations aggAge = response.getAggregations();
        Terms ageTerms = aggAge.get("ageTerm");
        for (Terms.Bucket bucket : ageTerms.getBuckets()) {
            System.out.println("bucket of " + bucket.getKeyAsString() + ", count: " + bucket.getDocCount());
        }
    } catch (Exception ex) {
        System.out.println(index + " query fail: " + ex);
    }
}
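Deduplication with collapse
Deduplicating with an aggregation returns counts by default, whereas collapse returns one document from each group of documents that share the same field value. collapse also works together with from/size for paging:
- getHits().length: the number of results returned by this query (after deduplication);
- getTotalHits().value: the number of all matching records; it is not the total after deduplication.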
public void collapseSearch(String index, String field) {
    try (RestHighLevelClient rhlClient = ESClient.getClient()) {
        SearchRequest searchRequest = new SearchRequest(index);
        SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
        sourceBuilder.collapse(new CollapseBuilder(field));
        sourceBuilder.from(0);
        sourceBuilder.size(20);
        searchRequest.source(sourceBuilder);
        SearchResponse response = rhlClient.search(searchRequest, RequestOptions.DEFAULT);
        SearchHits gotHits = response.getHits();
        System.out.printf("get size: %d, total size: %d\n", gotHits.getHits().length, gotHits.getTotalHits().value);
        for (SearchHit hit : gotHits) {
            System.out.println(hit.getSourceAsString());
        }
    } catch (Exception ex) {
        System.out.println(index + " query fail: " + ex);
    }
}
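The difference between the two counts can be seen in a small in-memory imitation of collapsing: one hit is kept per key, paging applies to the collapsed list, and the total still counts every matching hit. This is a JDK-only sketch with made-up data; CollapseSketch, collapse and page are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory analogue of collapsing hits on one field.
class CollapseSketch {

    // Keeps the first hit for each key, preserving the order of the (already sorted) hits.
    static List<String[]> collapse(List<String[]> hits) {
        Map<String, String[]> firstPerKey = new LinkedHashMap<>();
        for (String[] hit : hits) {
            firstPerKey.putIfAbsent(hit[0], hit); // hit[0] is the collapse key
        }
        return new ArrayList<>(firstPerKey.values());
    }

    // from/size paging applied to the collapsed list
    static List<String[]> page(List<String[]> collapsed, int from, int size) {
        int start = Math.min(from, collapsed.size());
        int end = Math.min(start + size, collapsed.size());
        return collapsed.subList(start, end);
    }

    public static void main(String[] args) {
        List<String[]> hits = Arrays.asList(
                new String[]{"28", "tom"}, new String[]{"28", "bob"}, new String[]{"30", "amy"});
        List<String[]> collapsed = collapse(hits);
        // the total counts every matching hit, the collapsed size counts distinct keys
        System.out.println("total hits: " + hits.size() + ", collapsed: " + collapsed.size());
        for (String[] hit : page(collapsed, 0, 20)) {
            System.out.println(hit[0] + " -> " + hit[1]);
        }
    }
}
```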