ElasticSearch

Author: CodingGorit

Date: October 22, 2020

Note: These are study notes from the Bilibili (B站) course by 狂神说: ElasticSearch 学习

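1. Study Outline
  1. Installation
  2. Ecosystem
  3. Tokenizer (IK)
  4. Operating ES through the RESTful API
  5. CRUD
  6. Integrating ElasticSearch with SpringBoot (analyzed from the underlying principles)
  7. Crawling data from JD (京东)
  8. Hands-on project: simulating full-text search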

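Use ES for search-related features, especially with large data volumes.

Lucene is an information retrieval toolkit (a JAR library; it is not a complete search engine system). Solr is built on it as well.

It provides: index structures, tools for reading and writing indexes, sorting, search rules, and other utility classes.

Relationship between Lucene and ElasticSearch:

ElasticSearch wraps Lucene and extends it.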

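2. ElasticSearch Overview

Abbreviated as es.

  • An open-source, highly scalable, distributed full-text search engine
  • Stores and retrieves data in near real time
  • es is written in Java and uses Lucene at its core for all indexing and search functionality
  • Its goal is to hide Lucene's complexity behind a simple RESTful API, making full-text search easy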
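3. Installing ElasticSearch
  • Prerequisite: JDK 1.8
  1. Download and extract

  2. Get familiar with the directory layout:

bin: startup scripts
config: configuration files
	log4j2.properties: logging configuration
	jvm.options: JVM-related settings
	elasticsearch.yml: the elasticsearch configuration file
lib: required JAR files
logs: log files
modules: functional modules
plugins: plugins (e.g. ik)

  3. Start ES; it listens on port 9200
  4. Test access: localhost:9200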

Installing the visualization plugin: es head

  1. Download: https://github.com/mobz/elasticsearch-head/
  2. Start it
npm install
npm run start

Enable cross-origin access (CORS) in elasticsearch.yml:

http.cors.enabled: true
http.cors.allow-origin: "*"

Installing kibana

  1. Download and extract
  2. Localization

Open kibana.yml under config and change the last line to i18n.locale: "zh-CN"

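4. ES Core Concepts
  1. Indices
  2. Field types (mapping)
  3. Documents

What are clusters, nodes, indices, types, documents, shards, and mappings?

ElasticSearch is document-oriented. Below is an objective comparison between a relational database and ElasticSearch. Everything is JSON.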

{

}

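Terminology mapping

ElasticSearch    Relational DB
indices          databases
types            tables
documents        rows
fields           columns

An ElasticSearch cluster can contain multiple indices (databases); each index can contain multiple types (tables); each type holds multiple documents (rows); and each document has multiple fields (columns). Note that types are deprecated in ES 7.x, as the deprecation warnings later in these notes show.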

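Physical design

A single running ElasticSearch instance is already a cluster (of one node).

Documents

Individual records, one document each.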

user
	zs: 15
	ls: 22

Types

Field types are detected automatically, e.g. string.

Indices

The equivalent of a database.

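5. IK Tokenizer Plugin

Download the plugin and add it to the plugins directory.

Skipped here (episode 8 of the course).

  • The elasticsearch-plugin command (elasticsearch-plugin list) shows which plugins have been loaded

  • ik_smart (coarsest split, fewest tokens) and ik_max_word (finest-grained split)

  • Test in kibana

  • Custom dictionary words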
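To compare the two analyzers in kibana, the same text can be sent to the _analyze endpoint twice, once per analyzer. The following is only a sketch of how the request body could be built in Java; the class name IkAnalyzeBody and the sample sentence are assumptions made for illustration, and the IK plugin must be installed for the request itself to succeed.

```java
// Hypothetical helper: builds the JSON body for "POST /_analyze".
final class IkAnalyzeBody {

    // Escape backslashes and double quotes so the value is safe inside a JSON string.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // analyzer: "ik_smart" or "ik_max_word"; text: the sentence to tokenize.
    static String body(String analyzer, String text) {
        return "{\"analyzer\":\"" + escape(analyzer) + "\",\"text\":\"" + escape(text) + "\"}";
    }

    public static void main(String[] args) {
        String sample = "全文检索引擎"; // sample text, chosen for illustration only
        System.out.println("POST /_analyze");
        System.out.println(body("ik_smart", sample));
        System.out.println("POST /_analyze");
        System.out.println(body("ik_max_word", sample));
    }
}
```

Pasting the two printed requests into the kibana console should show fewer, longer tokens for ik_smart and more, overlapping tokens for ik_max_word.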

6. REST Style

Basic REST commands

method  URL                                         description
PUT     localhost:9200/{index}/{type}/{id}          create or update a document with the given id; resubmitting with the same id overwrites the previous data
POST    localhost:9200/{index}/{type}               create a document with a random id
POST    localhost:9200/{index}/{type}/{id}/_update  update a document
DELETE  localhost:9200/{index}/{type}/{id}          delete a document
GET     localhost:9200/{index}/{type}/{id}          get a document by id
POST    localhost:9200/{index}/{type}/_search       query all data
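Since ES 7.x reports the typed URLs as deprecated, the same operations can be expressed with the typeless endpoints named in the deprecation message (/{index}/_doc/{id}, /{index}/_doc). A small sketch of how the paths are composed; the class EsPaths and its method names are hypothetical and exist only for this illustration.

```java
// Hypothetical helper that composes the typeless ES 7.x paths for the table above.
final class EsPaths {

    // PUT / GET / DELETE a document with a known id
    static String doc(String index, String id) {
        return "/" + index + "/_doc/" + id;
    }

    // POST a document and let ES generate the id
    static String autoId(String index) {
        return "/" + index + "/_doc";
    }

    // POST a partial update
    static String update(String index, String id) {
        return "/" + index + "/_update/" + id;
    }

    // GET or POST a search over the whole index
    static String search(String index) {
        return "/" + index + "/_search";
    }

    public static void main(String[] args) {
        System.out.println("PUT    " + doc("test", "1"));
        System.out.println("POST   " + autoId("test"));
        System.out.println("POST   " + update("test", "1"));
        System.out.println("DELETE " + doc("test", "1"));
        System.out.println("GET    " + search("test"));
    }
}
```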

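Basic tests

6.1 Creating an index

  1. Create an index by indexing a document (pattern: PUT /{index}/{type}/{id}; the type segment is deprecated)
PUT /test/type1/1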
{
  "name":"Gorit",
  "age": 18,
  "gender": "male"
}

Response: the document was added successfully

#! Deprecation: [types removal] Specifying types in document index requests is deprecated, use the typeless endpoints instead (/{index}/_doc/{id}, /{index}/_doc, or /{index}/_create/{id}).
{
  "_index" : "test",
  "_type" : "type1",
  "_id" : "1",
  "_version" : 1, // version, incremented on every change
  "result" : "created", // result status
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}

  2. Create an index with explicit mapping rules
PUT /test1/
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      },
      "age": {
        "type": "long"
      },
      "birthday": {
        "type": "date"
      }
    }
  }
}

Response

{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "test1"
}

If no mapping is specified, es assigns field types by default.
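The default typing can be pictured as a simple lookup on the JSON value. The sketch below is a rough approximation of dynamic mapping written in plain Java, not ES code; the class DynamicTypeGuess is hypothetical, and it ignores details such as strings that look like dates being mapped as date.

```java
// Hypothetical approximation of how ES picks a field type when no mapping is given.
final class DynamicTypeGuess {

    static String guess(Object value) {
        if (value instanceof Boolean) {
            return "boolean";
        }
        if (value instanceof Integer || value instanceof Long) {
            return "long";   // whole numbers become long
        }
        if (value instanceof Float || value instanceof Double) {
            return "float";  // floating point numbers become float
        }
        if (value instanceof String) {
            return "text (with keyword sub-field)";
        }
        return "object";
    }

    public static void main(String[] args) {
        System.out.println("name   -> " + guess("Gorit"));
        System.out.println("age    -> " + guess(18));
        System.out.println("gender -> " + guess("male"));
    }
}
```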

6.2 Querying index information

GET test

# Result
{
  "test" : {
    "aliases" : { },
    "mappings" : {
      "properties" : {
        "age" : {
          "type" : "long"
        },
        "gender" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "name" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    },
    "settings" : {
      "index" : {
        "creation_date" : "1603203146037",
        "number_of_shards" : "1",
        "number_of_replicas" : "1",
        "uuid" : "q47lWt_4ToOBo1rxQ1pPNw",
        "version" : {
          "created" : "7060299"
        },
        "provided_name" : "test"
      }
    }
  }
}


GET test1

{
  "test1" : {
    "aliases" : { },
    "mappings" : {
      "properties" : {
        "age" : {
          "type" : "long"
        },
        "birthday" : {
          "type" : "date"
        },
        "name" : {
          "type" : "text"
        }
      }
    },
    "settings" : {
      "index" : {
        "creation_date" : "1603203453667",
        "number_of_shards" : "1",
        "number_of_replicas" : "1",
        "uuid" : "a-upVXJwR7u7JZztTjyVGg",
        "version" : {
          "created" : "7060299"
        },
        "provided_name" : "test1"
      }
    }
  }
}

Extra: the _cat/ endpoints expose a lot of information about the current state of es

GET _cat/health

GET _cat/indices?v

6.3 Modifying

Submit a PUT again; it overwrites the existing document.

Modify the data

PUT /test/type1/1
{
  "name":"Gorit111",
  "age": 18,
  "gender": "male"
}

Result of the modification

#! Deprecation: [types removal] Specifying types in document index requests is deprecated, use the typeless endpoints instead (/{index}/_doc/{id}, /{index}/_doc, or /{index}/_create/{id}).
{
  "_index" : "test",
  "_type" : "type1",
  "_id" : "1",
  "_version" : 2,
  "result" : "updated",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 1,
  "_primary_term" : 1
}

The newer approach: update with a POST command

POST /test/_doc/1/_update
{
  "doc": {
      "name":"张三"
  }
}

// Result (the document after the update, as returned by a GET)
{
  "_index" : "test",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 3,
  "_seq_no" : 2,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "name" : "张三",
    "age" : 18,
    "gender" : "male"
  }
}

6.4 Deleting an index

Delete the index:

DELETE test

Deletion is done with the DELETE command; whether an index or a document is deleted depends on the request path.

7. Document Operations

7.1 Basic operations (review)

  1. Add data (several records)
PUT /gorit/user/1
{
  "name": "CodingGorit",
  "age": 23,
  "desc": "一个独立的个人开发者",
  "tags": ["Python","Java","JavaScript"]
}

PUT /gorit/user/2
{
  "name": "龙",
  "age": 20,
  "desc": "全栈工程师",
  "tags": ["Python","JavaScript"]
}

Result:

#! Deprecation: [types removal] Specifying types in document index requests is deprecated, use the typeless endpoints instead (/{index}/_doc/{id}, /{index}/_doc, or /{index}/_create/{id}).
{
  "_index" : "gorit",
  "_type" : "user",
  "_id" : "1",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}
  2. Get data
GET /gorit/user/_search   # query all documents

GET /gorit/user/1 # query a single document
  3. Update data with PUT
PUT /gorit/user/3
{
  "name": "李四222",
  "age": 20,
  "desc": "Java开发工程师",
  "tags": ["Python","Java"]
}

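# With PUT the whole document is replaced: any field missing from the request body is lost
  4. POST _update (the recommended way)
# Without _update, POST behaves like PUT: the document is replaced and the other fields are lost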
POST /gorit/user/1
{
  "doc": {
    "name": "coco"
  }
}

# With _update only the given fields change and the other fields are kept
POST /gorit/user/1/_update
{
  "doc": {
    "name": "coco"
  }
}
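The difference between replacing a document and partially updating it can be modeled with two maps. This is a plain-Java sketch of the semantics only, not ES client code; the class PutVsUpdate is hypothetical, and the merge shown here is shallow while the real _update also merges nested objects.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of "PUT replaces" versus "_update merges".
final class PutVsUpdate {

    // PUT: the stored document is discarded and the request body becomes the document.
    static Map<String, Object> put(Map<String, Object> stored, Map<String, Object> body) {
        return new LinkedHashMap<>(body);
    }

    // POST .../_update with {"doc": {...}}: the partial doc is merged into the stored one.
    static Map<String, Object> update(Map<String, Object> stored, Map<String, Object> partialDoc) {
        Map<String, Object> merged = new LinkedHashMap<>(stored);
        merged.putAll(partialDoc);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Object> stored = new LinkedHashMap<>();
        stored.put("name", "CodingGorit");
        stored.put("age", 23);

        Map<String, Object> change = new LinkedHashMap<>();
        change.put("name", "coco");

        System.out.println("after PUT:     " + put(stored, change));    // age is gone
        System.out.println("after _update: " + update(stored, change)); // age is kept
    }
}
```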

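Simple searches

# Query a single record
GET /gorit/user/1

# Query everything
GET /gorit/user/_search

# Conditional query. Without an explicit mapping, a string field is mapped by default as text with a keyword sub-field (see the GET test output in 6.2). The keyword sub-field only matches the whole value; the text field is analyzed, so a match on a single token works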
GET /gorit/user/_search?q=name:coco

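7.2 Complex search queries: select (sorting, pagination, highlighting, fuzzy queries, exact queries)

  1. Query with filtering of the returned fields (_source)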
GET /gorit/user/_search
{
  "query": {
    "match": {
      "name": "李四"
    }
  },
  "_source": ["name","desc"]
}

7.3 Sorting

GET /gorit/user/_search
{
  "query": {
    "match": {
      "name": "gorit"
    }
  },
  "sort": [
    {
     "age": {
       "order": "desc"
     }
    }
  ]
}

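7.4 Pagination

Use the from and size fields to paginate; they work like offset and pageSize in SQL's LIMIT.

  1. from: offset of the first hit to return (an offset, not a page number)
  2. size: how many hits to return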
GET /gorit/user/_search
{
  "query": {
    "match": {
      "name": "李四"
    }
  },
  "sort": [
    {
     "age": {
       "order": "desc"
     }
    }
  ],
  "from": 0,
  "size": 1
}
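Because from is an offset rather than a page number, a 1-based page number has to be converted before it is put into the request. A minimal sketch, assuming a hypothetical helper class Paging that is not part of any ES client:

```java
// Hypothetical helper: converts a 1-based page number into the "from" offset.
final class Paging {

    static int from(int pageNo, int pageSize) {
        int page = Math.max(pageNo, 1); // treat 0 or negative pages as the first page
        return (page - 1) * pageSize;
    }

    public static void main(String[] args) {
        int pageSize = 10;
        for (int pageNo = 1; pageNo <= 3; pageNo++) {
            System.out.println("page " + pageNo + " -> from=" + from(pageNo, pageSize) + ", size=" + pageSize);
        }
    }
}
```

With from 0 and size 1, as in the request above, only the first hit of the sorted result is returned.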

7.5 Range queries with filter

# Query by age range
GET /gorit/user/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name": "gorit"
          }
        }
      ],
      "filter": [
        {
          "range": {
            "age": {
              "gte": 1,
              "lte": 25
            }
          }
        }
      ]
    }
  }
}
  • gt: greater than
  • gte: greater than or equal to
  • lt: less than
  • lte: less than or equal to
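The four operators differ only in whether the bound itself is included. The following sketch mirrors a range clause in plain Java; the class RangeFilter is hypothetical, and a null argument stands for a bound that is not present in the JSON.

```java
// Hypothetical model of a range clause; null means "bound not set".
final class RangeFilter {

    static boolean matches(int value, Integer gt, Integer gte, Integer lt, Integer lte) {
        if (gt != null && !(value > gt)) return false;    // exclusive lower bound
        if (gte != null && !(value >= gte)) return false; // inclusive lower bound
        if (lt != null && !(value < lt)) return false;    // exclusive upper bound
        if (lte != null && !(value <= lte)) return false; // inclusive upper bound
        return true;
    }

    public static void main(String[] args) {
        // Same bounds as the request above: "gte": 1, "lte": 25
        System.out.println(matches(16, null, 1, null, 25)); // inside the range
        System.out.println(matches(25, null, 1, null, 25)); // on the inclusive bound
        System.out.println(matches(26, null, 1, null, 25)); // outside the range
    }
}
```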

7.6 Bool queries

must (and): all conditions have to match, like where id = 1 and name = xxx

# Bool query
GET /gorit/user/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name": "gorit"
          }
        },{
          "match": {
            "age": "16"
          }
        }
      ]
    }
  }
}

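7.7 Matching multiple conditions

Put several terms into a single match query.

# Separate the terms with spaces; a document matching any one of them is returned, and the score shows how well it matches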
GET /gorit/user/_search
{
  "query": {
    "match": {
      "tags": "Java Python"
    }
  }
}
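The "any term matches, score tells how well" behavior can be imitated by counting matched terms. This is a deliberately naive sketch: the class TagMatch is hypothetical, and counting terms is a stand-in for illustration, not the relevance formula ES actually uses.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical scorer: number of query terms found among a document's tags.
final class TagMatch {

    static int score(List<String> tags, String query) {
        Set<String> have = new HashSet<>();
        for (String tag : tags) {
            have.add(tag.toLowerCase(Locale.ROOT));
        }
        int score = 0;
        for (String term : query.trim().split("\\s+")) {
            if (have.contains(term.toLowerCase(Locale.ROOT))) {
                score++;
            }
        }
        return score; // 0 means the document would not be returned at all
    }

    public static void main(String[] args) {
        String query = "Java Python";
        System.out.println(score(Arrays.asList("Python", "Java", "JavaScript"), query)); // both terms
        System.out.println(score(Arrays.asList("Python", "JavaScript"), query));         // one term
        System.out.println(score(Arrays.asList("Linux"), query));                        // no term
    }
}
```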

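7.7 Exact queries

A term query looks up the given term directly in the inverted index, as an exact match.

About tokenization

  • term: exact lookup; the query text is not analyzed

  • match: the query text is first parsed by the analyzer, and the resulting tokens are then used for the search

Two field types: text and keyword

Conclusion:

  • text is split into tokens by the analyzer
  • keyword is not split; the whole value is indexed as one term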
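The interplay of text, keyword, term, and match becomes clearer with a toy inverted index. The sketch below is an assumption-laden model, not how ES is implemented: TinyInvertedIndex is a hypothetical class, and its analyzer (lowercase, split on whitespace) only stands in for a real analyzer.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy inverted index: term -> ids of the documents containing it.
final class TinyInvertedIndex {

    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Stand-in analyzer: lowercase, then split on whitespace.
    static List<String> analyze(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }

    // text field: every token is indexed separately.
    void indexText(int docId, String value) {
        for (String token : analyze(value)) {
            add(token, docId);
        }
    }

    // keyword field: the whole value is indexed as a single term.
    void indexKeyword(int docId, String value) {
        add(value, docId);
    }

    private void add(String term, int docId) {
        postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
    }

    // term query: the query string is looked up as is, without analysis.
    Set<Integer> term(String value) {
        return postings.getOrDefault(value, Collections.emptySet());
    }

    // match query: the query string is analyzed; any matching token is a hit.
    Set<Integer> match(String query) {
        Set<Integer> hits = new TreeSet<>();
        for (String token : analyze(query)) {
            hits.addAll(term(token));
        }
        return hits;
    }

    public static void main(String[] args) {
        TinyInvertedIndex text = new TinyInvertedIndex();
        text.indexText(1, "Coding Gorit");
        System.out.println(text.term("Coding Gorit")); // no such single term in a text field
        System.out.println(text.match("Gorit Java"));  // the token "gorit" matches

        TinyInvertedIndex keyword = new TinyInvertedIndex();
        keyword.indexKeyword(1, "Coding Gorit");
        System.out.println(keyword.term("Coding Gorit")); // the whole value matches
        System.out.println(keyword.term("gorit"));        // a part of the value does not
    }
}
```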

7.8 Highlighting

# Highlight query: matches in the search results are highlighted, and custom highlight tags can be configured
GET /gorit/user/_search
{
  "query": {
    "match": {
      "name": "Gorit"
    }
  },
  "highlight": {
    "pre_tags": "<h3 class='key' style='color:red'>", 
    "post_tags": "</h3>", 
    "fields": {
      "name": {}
    }
  }
}

# Response
#! Deprecation: [types removal] Specifying types in search requests is deprecated.
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.6375021,
    "hits" : [
      {
        "_index" : "gorit",
        "_type" : "user",
        "_id" : "6",
        "_score" : 1.6375021,
        "_source" : {
          "name" : "Gorit",
          "age" : 16,
          "desc" : "运维工程师",
          "tags" : [
            "Linux",
            "c++",
            "python"
          ]
        },
        "highlight" : {
          "name" : [
            "<h3 class='key' style='color:red'>Gorit</h3>"
          ]
        }
      }
    ]
  }
}
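What pre_tags and post_tags do to the returned fragment can be reproduced with a regular expression. This is only a sketch of the effect: the class Highlighter below is hypothetical and wraps any case-insensitive substring match, whereas ES highlights whole analyzed tokens.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical imitation of highlight pre_tags / post_tags on a single field value.
final class Highlighter {

    static String highlight(String text, String term, String preTag, String postTag) {
        Pattern pattern = Pattern.compile(Pattern.quote(term), Pattern.CASE_INSENSITIVE);
        // $0 re-inserts the matched text, so the original casing is preserved.
        String replacement = Matcher.quoteReplacement(preTag) + "$0" + Matcher.quoteReplacement(postTag);
        return pattern.matcher(text).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String pre = "<h3 class='key' style='color:red'>";
        String post = "</h3>";
        System.out.println(highlight("Gorit", "gorit", pre, post));
    }
}
```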

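MySQL can do all of these too, only less efficiently:

  • matching
  • conditional matching
  • exact matching
  • range matching
  • filtering the returned fields
  • multi-condition queries
  • highlighting
  • inverted index
8. SpringBoot Integration

Consult the official documentation.

Tests covered:

  1. Create an index
  2. Check whether an index exists
  3. Delete an index
  4. Create a document
  5. Operate on documents
// Maven dependency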
		<dependency>
			<groupId>org.springframework.boot</groupId>
			<artifactId>spring-boot-starter-data-elasticsearch</artifactId>
		</dependency>

// Core code
package cn.gorit;

import cn.gorit.pojo.User;
import com.alibaba.fastjson.JSON;
import org.elasticsearch.action.admin.indices.delete.DeleteIndexRequest;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.delete.DeleteRequest;
import org.elasticsearch.action.delete.DeleteResponse;
import org.elasticsearch.action.get.GetRequest;
import org.elasticsearch.action.get.GetResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.support.master.AcknowledgedResponse;
import org.elasticsearch.action.update.UpdateRequest;
import org.elasticsearch.action.update.UpdateResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.CreateIndexRequest;
import org.elasticsearch.client.indices.CreateIndexResponse;
import org.elasticsearch.client.indices.GetIndexRequest;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.common.xcontent.XContentType;
import org.elasticsearch.index.query.MatchAllQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.index.query.TermQueryBuilder;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.fetch.subphase.FetchSourceContext;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.boot.test.context.SpringBootTest;

import java.io.IOException;
import java.util.ArrayList;
import java.util.concurrent.TimeUnit;

/**
 * es 7.6.2 API tests
 */
@SpringBootTest
class DemoApplicationTests {

	// Injected by bean name
	@Autowired
	@Qualifier("restHighLevelClient")
	private RestHighLevelClient client;

	@Test
	void contextLoads() {

	}
	// Create an index
	@Test
	void testCreateIndex() throws IOException {
		// 1. Build the create-index request, equivalent to PUT /gorit_index
		CreateIndexRequest request = new CreateIndexRequest("gorit_index");
		// 2. Execute it through the IndicesClient and read the response
		CreateIndexResponse response = client.indices().create(request, RequestOptions.DEFAULT);
		System.out.println(response);
	}

	// Check whether the index exists
	@Test
	void testGetIndexExist() throws IOException {
		GetIndexRequest request = new GetIndexRequest("gorit_index");
		boolean exist = client.indices().exists(request,RequestOptions.DEFAULT);
		System.out.println(exist);
	}

	// Delete the index
	@Test
	void testDeleteIndex() throws IOException {
		DeleteIndexRequest request = new DeleteIndexRequest("gorit_index");
		// Execute the deletion
		AcknowledgedResponse delete	= client.indices().delete(request,RequestOptions.DEFAULT);
		System.out.println(delete.isAcknowledged());
	}

	// Add a document
	@Test
	void testAddDocument() throws IOException {
		// Create the object
		User u = new User("Gorit",3);
		// Create the request
		IndexRequest request = new IndexRequest("gorit_index");

		// Equivalent to PUT /gorit_index/_doc/1
		request.id("1");
		// Two ways to set the timeout; the later call wins, so the effective timeout is 1s
		request.timeout(TimeValue.timeValueSeconds(3));
		request.timeout("1s");

		// Put the data into the request as JSON
		request.source(JSON.toJSONString(u), XContentType.JSON);
		// Send the request through the client
		IndexResponse response = client.index(request, RequestOptions.DEFAULT);

		System.out.println(response.toString());
		System.out.println(response.status());// status of the operation: CREATED
	}

	// Check whether a document exists, equivalent to GET /index/_doc/1
	@Test
	void testIsExists() throws IOException {
		GetRequest getRequest = new GetRequest("gorit_index", "1");

		// Do not fetch the _source, only existence matters here
		getRequest.fetchSourceContext(new FetchSourceContext(false));
		getRequest.storedFields("_none_");

		boolean exists = client.exists(getRequest, RequestOptions.DEFAULT);
		System.out.println(exists);
	}

	// Get a document
	@Test
	void testGetDocument() throws IOException {
		GetRequest getRequest = new GetRequest("gorit_index", "1");
		GetResponse getResponse = client.get(getRequest, RequestOptions.DEFAULT);
		// Print the content of the document
		System.out.println(getResponse.getSourceAsString());
		System.out.println(getResponse); // the full response, same as what the REST command returns
	}

	// Update a document
	@Test
	void testUpdateDocument() throws IOException {
		UpdateRequest updateRequest = new UpdateRequest("gorit_index", "1");
		updateRequest.timeout("1s");

		User user = new User("CodingGorit", 18);
		updateRequest.doc(JSON.toJSONString(user),XContentType.JSON);

		UpdateResponse updateResponse = client.update(updateRequest, RequestOptions.DEFAULT);
		// Print the status of the update
		System.out.println(updateResponse.status());
		System.out.println(updateResponse); // the full response, same as what the REST command returns
	}

	// Delete a document
	@Test
	void testDeleteDocument() throws IOException {
		DeleteRequest deleteRequest = new DeleteRequest("gorit_index", "1");
		deleteRequest.timeout("1s");

		DeleteResponse deleteResponse = client.delete(deleteRequest, RequestOptions.DEFAULT);
		// Print the status of the deletion
		System.out.println(deleteResponse.status());
		System.out.println(deleteResponse); // the full response, same as what the REST command returns
	}

	// What real projects usually need: inserting data in bulk

	@Test
	void testBulkRequest() throws IOException {
		BulkRequest bulkRequest = new BulkRequest();
		bulkRequest.timeout("10s");

		ArrayList<User> userList = new ArrayList<>();
		userList.add(new User("张三1",1));
		userList.add(new User("张三2",2));
		userList.add(new User("张三3",3));
		userList.add(new User("张三4",4));
		userList.add(new User("张三5",5));
		userList.add(new User("张三6",6));
		userList.add(new User("张三7",7));

		// Build the bulk request
		for (int i=0;i<userList.size();i++) {
			// For bulk updates or bulk deletes, add the corresponding request type here instead
			bulkRequest.add(new IndexRequest("gorit_index")
			.id(""+(i+1))
			.source(JSON.toJSONString(userList.get(i)),XContentType.JSON));
		}

		BulkResponse bulkItemResponses = client.bulk(bulkRequest, RequestOptions.DEFAULT);
		System.out.println(bulkItemResponses.hasFailures()); // true if any item failed
		System.out.println(bulkItemResponses.status());

	}

	// Search
	// SearchRequest: the search request
	// SearchSourceBuilder: builds the search conditions
	// HighlightBuilder: builds the highlighting
	// TermQueryBuilder: exact query
	// MatchAllQueryBuilder: matches everything
	// xxxQueryBuilder: one builder per query type
	@Test
	void testSearch() throws IOException {
		SearchRequest searchRequest = new SearchRequest("gorit_index");
		// Build the search conditions
		SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
		/**
		 *   Query conditions are created with the QueryBuilders utility class
		 * 	 QueryBuilders.termQuery: exact match
		 * 	 QueryBuilders.matchAllQuery(): match everything
		 */

		TermQueryBuilder termQueryBuilder = QueryBuilders.termQuery("name", "gorit1");// exact query
//		MatchAllQueryBuilder matchAllQueryBuilder = QueryBuilders.matchAllQuery();

		sourceBuilder.query(termQueryBuilder);
		// Pagination: from is the offset of the first hit, size the number of hits
		sourceBuilder.from(0);
		sourceBuilder.size(10);
		// Highlighting would be configured by passing a HighlightBuilder to sourceBuilder.highlighter(...)
		sourceBuilder.timeout(new TimeValue(60, TimeUnit.SECONDS));

		// Attach the conditions to the request
		searchRequest.source(sourceBuilder);

		SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);
		System.out.println(JSON.toJSONString(searchResponse.getHits()));
		System.out.println("==========================================");
		for (SearchHit documentFields: searchResponse.getHits().getHits()) {
			System.out.println(documentFields.getSourceAsMap());
		}
	}

}

9. Hands-on Project
Project dependencies
        <!-- jsoup: HTML parsing -->
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.10.2</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.68</version>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-data-elasticsearch</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-thymeleaf</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-devtools</artifactId>
            <scope>runtime</scope>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-configuration-processor</artifactId>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
            <exclusions>
                <exclusion>
                    <groupId>org.junit.vintage</groupId>
                    <artifactId>junit-vintage-engine</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

Crawler

Configuration class

package cn.gorit.config;

import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

/**
 * Steps with Spring
 * 1. Find the object
 * 2. Put it into the Spring container and use it
 * 3. Read the source code
 *
 * @Classname ElasticSearchConfig
 * @Description TODO
 * @Date 2020/10/21 17:20
 * @Created by CodingGorit
 * @Version 1.0
 */
@Configuration // plays the role of an XML <bean> definition
public class ElasticSearchConfig {

    @Bean
    public RestHighLevelClient restHighLevelClient() {
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(
                        new HttpHost("localhost", 9200, "http")
                )
        );
        return client;
    }

}

Crawling JD search results

Utility class

package cn.gorit.util;

import cn.gorit.pojo.Content;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.stereotype.Component;

import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

/**
 * @Classname HtmlParseUtil
 * @Description TODO
 * @Date 2020/10/21 23:17
 * @Created by CodingGorit
 * @Version 1.0
 */
@Component
public class HtmlParseUtil {

//    public static void main(String[] args) throws Exception {
//        new HtmlParseUtil().parseJD("英语").forEach(System.out::println);
//    }

    public List<Content> parseJD(String keyword) throws Exception {
        // Request URL
        // Requires network access; content loaded via ajax cannot be fetched this way
        String url = "https://search.jd.com/Search?keyword=wd&enc=utf-8";
        // Fetch and parse the page (returns a Document object)
        Document document = Jsoup.parse(new URL(url.replace("wd",keyword)),30000);
        // Get the element that contains the product list
        Element element = document.getElementById("J_goodsList");
        // Get all li elements
        Elements elements = element.getElementsByTag("li");
        // Extract the content of each element
        List<Content> goodsList = new ArrayList<>();
        for (Element e: elements) {
            String img = e.getElementsByTag("img").eq(0).attr("data-lazy-img");
            String price = e.getElementsByClass("p-price").eq(0).text();
            String title = e.getElementsByClass("p-name").eq(0).text();

            goodsList.add(new Content(title,img,price));
//            System.out.println(img);
//            System.out.println(price);
//            System.out.println(title);
        }
        return goodsList;
    }
}

Service methods

package cn.gorit.service;

import cn.gorit.pojo.Content;
import cn.gorit.util.HtmlParseUtil;
import com.alibaba.fastjson.JSON;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.text.Text;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.common.xcontent.XContentType;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.index.query.TermQueryBuilder;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.fetch.subphase.highlight.HighlightBuilder;
import org.elasticsearch.search.fetch.subphase.highlight.HighlightField;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;

/**
 * @Classname ContentService
 * @Description TODO
 * @Date 2020/10/22 18:44
 * @Created by CodingGorit
 * @Version 1.0
 */
@Service
public class ContentService {

    @Autowired
    private RestHighLevelClient restHighLevelClient;

    // Cannot be run directly like this: it needs the Spring container to inject restHighLevelClient
    public static void main(String[] args) throws Exception {
        new ContentService().parseContent("java");
    }

    // 1. Parse the data and put it into an es index
    public Boolean parseContent (String keywords) throws Exception {
        // Fetch the list of crawled results
        List<Content> contents = new HtmlParseUtil().parseJD(keywords);
        // Put the results into es
        BulkRequest bulkRequest = new BulkRequest();
        bulkRequest.timeout("2m");

        for (int i=0;i < contents.size();++i) {
            bulkRequest.add(
                    new IndexRequest("jd_goods")
                    .source(JSON.toJSONString(contents.get(i)),XContentType.JSON));
        }
        BulkResponse bulkResponse = restHighLevelClient.bulk(bulkRequest, RequestOptions.DEFAULT);
        return !bulkResponse.hasFailures();
    }

    // 2. Search the indexed data, with pagination and highlighting
    public List<Map<String,Object>> searchPagehighLight(String keyword, int pageNo, int pageSize) throws IOException {
        if (pageNo <= 1)
            pageNo = 1;

        // Conditional search
        SearchRequest searchRequest = new SearchRequest("jd_goods");

        SearchSourceBuilder builder = new SearchSourceBuilder();

        // from is the offset of the first hit, so convert the 1-based page number
        builder.from((pageNo - 1) * pageSize);
        builder.size(pageSize);
        // Exact match
        TermQueryBuilder termQueryBuilder = QueryBuilders.termQuery("title",keyword);
        builder.query(termQueryBuilder);
        builder.timeout(new TimeValue(60, TimeUnit.SECONDS));

        // Highlighting
        HighlightBuilder highlightBuilder = new HighlightBuilder();
        highlightBuilder.field("title");
        highlightBuilder.requireFieldMatch(false);
        highlightBuilder.preTags("<span style='color:red'>");
        highlightBuilder.postTags("</span>");
        builder.highlighter(highlightBuilder);

        // Execute the search
        searchRequest.source(builder);
        SearchResponse searchResponse = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);

        // Parse the result
        ArrayList<Map<String,Object>> list= new ArrayList<>();
        for (SearchHit hit: searchResponse.getHits().getHits()) {
            // Read the highlighted field
            Map<String, HighlightField> highlightFields = hit.getHighlightFields();
            HighlightField title = highlightFields.get("title");
            Map<String,Object> sourceAsMap = hit.getSourceAsMap();// the original result
            // Replace the original title with the highlighted fragments
            if (title != null) {
                Text[] fragments = title.fragments();
                StringBuilder nTitle = new StringBuilder();
                for (Text text:fragments) {
                    nTitle.append(text);
                }
                sourceAsMap.put("title",nTitle.toString());
            }
            list.add(sourceAsMap); // the title now holds the highlighted text
        }
        return list;
    }

    // 3. Search the indexed data with pagination, without highlighting
    public List<Map<String,Object>> searchPage (String keyword, int pageNo,int pageSize) throws IOException {
        if (pageNo <= 1)
            pageNo = 1;

        // Conditional search
        SearchRequest searchRequest = new SearchRequest("jd_goods");

        SearchSourceBuilder builder = new SearchSourceBuilder();

        // from is the offset of the first hit, so convert the 1-based page number
        builder.from((pageNo - 1) * pageSize);
        builder.size(pageSize);
        // Exact match
        TermQueryBuilder termQueryBuilder = QueryBuilders.termQuery("title",keyword);
        builder.query(termQueryBuilder);
        builder.timeout(new TimeValue(60, TimeUnit.SECONDS));


        // Execute the search
        searchRequest.source(builder);
        SearchResponse searchResponse = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);

        // Parse the result
        ArrayList<Map<String,Object>> list= new ArrayList<>();
        for (SearchHit hit: searchResponse.getHits().getHits()) {

            list.add(hit.getSourceAsMap()); // no highlighting: return the source as is
        }
        return list;
    }
}

Controller

package cn.gorit.controller;

import cn.gorit.pojo.Content;
import cn.gorit.service.ContentService;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.bind.annotation.RestControllerAdvice;

import java.io.IOException;
import java.util.List;
import java.util.Map;

/**
 * @Classname ContentController
 * @Description TODO
 * @Date 2020/10/22 18:45
 * @Created by CodingGorit
 * @Version 1.0
 */
@RestController
public class ContentController {

    @Autowired
    private ContentService service;

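    /**
     * Crawl the data for a keyword and add it to ES
     * @param keyword the search keyword
     * @return true if every document was indexed without failure
     * @throws Exception if crawling or indexing fails
     */
    @GetMapping("/parse/{keyword}")
    public Boolean parse(@PathVariable("keyword")  String keyword) throws Exception {
        return service.parseContent(keyword);
    }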

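    /**
     * Query the data stored in ES
     * @param keyword the search keyword
     * @param pageNo the page number
     * @param pageSize the number of results per page
     * @return the matching documents
     * @throws IOException if the request to ES fails
     */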
    @GetMapping("/search/{keyword}/{pageNo}/{pageSize}")
    public List<Map<String,Object>> search(@PathVariable("keyword") String keyword,@PathVariable("pageNo") int pageNo, @PathVariable("pageSize") int pageSize) throws IOException {
        if (pageNo == 0) {
            pageNo = 1;
        }
        return service.searchPage(keyword, pageNo, pageSize);
    }
}

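Front end and back end are separated.

Test with Postman.

Search highlighting

One project, usable from multiple clients.

10. Summary
  1. Basic usage of ElasticSearch
  2. Integrating ES with SpringBoot
  3. Hands-on search project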