Spring Boot anti-crawler: writing a crawler with Spring Boot

Reposted

ctaxnews 2024-06-05 13:17:18

Tags: Spring Boot anti-crawler, java, spring, System, data. Category: Architecture, Backend development

Contents

Preface
I. Importing dependencies
II. Steps

1. Adding the libraries
2. Reading the data

Summary

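Preface

This post shares a few crawling tricks: how to scrape data, store it in Elasticsearch (ES), run a full-text match query against it, and display the results on a page.

Note: the body of the article starts below; the examples are for reference.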

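I. Importing dependencies

First, add the dependencies. When creating the project in IDEA, tick Lombok, the Elasticsearch integration for Spring Boot, and Spring Web.

A small tip: whether you are learning a framework or building a web project, remember three steps: 1. add the dependencies, 2. write the configuration, 3. enable it with annotations.

This project is a deliberately simple example. Everything here is written as a demo, not as production code.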

II. Steps

1. Adding the libraries

<dependency>
    <groupId>org.projectlombok</groupId>
    <artifactId>lombok</artifactId>
    <optional>true</optional>
</dependency>
<!-- https://mvnrepository.com/artifact/com.alibaba/fastjson -->
<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>fastjson</artifactId>
    <version>1.2.62</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.11.3</version>
</dependency>
<dependency>
    <groupId>net.sourceforge.nekohtml</groupId>
    <artifactId>nekohtml</artifactId>
    <version>1.9.22</version>
</dependency>
<!-- HTML parsing / template package -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-thymeleaf</artifactId>
</dependency>

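Note: the jsoup package is what does the scraping. Spring Boot 2.3.3 manages Elasticsearch 7.6.2 by default, while my local Elasticsearch is version 6.1; the integration worked without problems for me. It is still better to keep the versions aligned. If you want the Elasticsearch version pulled in by Spring Boot to match your local installation, add the following inside the properties tag of the pom file:

<elasticsearch.version>your local version</elasticsearch.version>

Because of a problem in my local environment, the Thymeleaf integration threw errors, so I added the nekohtml package and the Spring Boot configuration below. If Thymeleaf templates work for you out of the box, this configuration is not needed.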

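# Disable the Thymeleaf cache during development; otherwise template changes do not show up immediately
spring.thymeleaf.cache=false
# Check that the template location exists before rendering
spring.thymeleaf.check-template-location=true
# Content-Type value
spring.thymeleaf.content-type=text/html
# Enable Thymeleaf view resolution for MVC
spring.thymeleaf.enabled=true
# Comma-separated list of view names to exclude from resolution
#spring.thymeleaf.excluded-view-names=
# Template mode (LEGACYHTML5 is the lenient mode that relies on nekohtml)
spring.thymeleaf.mode=LEGACYHTML5
# Prefix prepended to the view name when building the URL
spring.thymeleaf.prefix=classpath:/templates/
# Suffix appended to the view name when building the URL
spring.thymeleaf.suffix=.html
# Order of the template resolver in the chain
#spring.thymeleaf.template-resolver-order= o
# Comma-separated list of view names that can be resolved
#spring.thymeleaf.view-names=
# thymeleaf end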

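2. Reading the data

1. First we fetch the data. Open JD (jd.com) and inspect the search results page: the product data sits inside a div tag, so we start by getting that div.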


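public static List<Goods> getAllGoods(String keyword) throws IOException {
    // Percent-encode the keyword (java.net.URLEncoder) so Chinese text or spaces do not break the URL
    String url = "https://search.jd.com/Search?keyword=" + URLEncoder.encode(keyword, "UTF-8");
    // Fetch and parse the page; the Document API resembles dom4j and front-end DOM scripting
    Document document = Jsoup.parse(new URL(url), 30000);
    // Get the element that holds the product list
    Element element = document.getElementById("J_goodsList");
    /*Elements elements=element.getElementsByClass("gl-warp clearfix");
    System.out.println("elements====:"+elements);*/
    List<Goods> goods = new ArrayList<>();
    if (element == null) {
        // The list container is missing (layout changed or the request was blocked)
        return goods;
    }
    // Get the li tags under the div
    Elements elements = element.getElementsByTag("li");
    System.out.println(elements);
    // Iterate over the li tags, collect the scraped data into the list, and save each image locally.
    // One catch: the image path may come back empty because JD lazy-loads its images,
    // so the URL is not always in the src attribute.
    // Another: the div class holding the shop name varies, hence the fallback below.
    for (Element el : elements) {
        String price = el.getElementsByClass("p-price").eq(0).text();
        String title = el.getElementsByClass("p-name").eq(0).text();
        //System.out.println("img:"+el.getElementsByTag("img").eq(0).attr("src"));
        String img = el.getElementsByTag("img").eq(0).attr("src");
        if ("".equals(img)) {
            // Assumption: lazy-loaded images keep their URL in data-lazy-img; verify against the page source
            img = el.getElementsByTag("img").eq(0).attr("data-lazy-img");
        }
        String shopName = el.getElementsByClass("p-shopnum").eq(0).text();
        if ("".equals(shopName)) {
            shopName = el.getElementsByClass("p-shop").eq(0).text();
        }

        if (!"".equals(price) && !"".equals(img)) {
            goods.add(new Goods(price, title, img, shopName));
            // Save the image file locally
            String path = "D://jdphoto//" + System.currentTimeMillis() + ".jpg";
            System.out.println(uploadQianURL(img, path));
        }
    }
    return goods;
}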

This is just one method from the project. I also wrote a POJO class and a file-saving method; the full source is available via Baidu Netdisk.
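The file-saving method is not shown above. A minimal sketch of what such a helper might look like follows. `ImageSaver`, `normalize`, and `save` are hypothetical names, not the author's `uploadQianURL`, and the sketch assumes that scraped image URLs may be protocol-relative (starting with `//`).

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical stand-in for the author's file-saving helper.
final class ImageSaver {
    private ImageSaver() {}

    // Turns a protocol-relative URL ("//host/a.jpg") into an absolute https URL.
    // Assumption: scraped image URLs may omit the scheme.
    static String normalize(String src) {
        return src.startsWith("//") ? "https:" + src : src;
    }

    // Downloads the resource at imageUrl into target and returns the number of bytes written.
    static long save(String imageUrl, Path target) throws IOException {
        Path parent = target.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent); // make sure the output folder exists
        }
        try (InputStream in = new URL(normalize(imageUrl)).openStream()) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

A call such as `ImageSaver.save(img, Paths.get("D:/jdphoto", System.currentTimeMillis() + ".jpg"))` would mirror the path used in the scraping method.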

2. Next we save the data fetched from the JD page into the local Elasticsearch instance.

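public boolean insertGoods(String keyword) throws IOException {
    // BulkRequest is the Elasticsearch class for batch operations. The data scraped from JD
    // is saved into Elasticsearch; the jd_goods index has to be created beforehand.
    BulkRequest bulkRequest = new BulkRequest();
    bulkRequest.timeout("2s");
    List<Goods> goodsList = PraseUtil.getAllGoods(keyword);
    System.out.println("Results for keyword=======" + goodsList);
    if (goodsList.isEmpty()) {
        // An empty bulk request fails validation, so there is nothing to insert
        return false;
    }
    for (Goods goods : goodsList) {
        // Add each document to the bulkRequest object
        bulkRequest.add(new IndexRequest("jd_goods")
                .source(JSON.toJSONString(goods), XContentType.JSON));
    }
    // Run the batch insert
    BulkResponse bulk = restHighLevelClient.bulk(bulkRequest, RequestOptions.DEFAULT);
    // Report whether every item was inserted successfully
    return !bulk.hasFailures();
}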
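To make the batch step less opaque, here is a rough sketch of the newline-delimited JSON body that the Elasticsearch `_bulk` REST endpoint accepts for index operations: one action line followed by one document line per item. `BulkBodySketch` and `build` are hypothetical names used only for illustration; in the project the client library assembles the request for you.

```java
import java.util.List;

// Illustration only: builds the NDJSON text of a bulk "index" request by hand.
final class BulkBodySketch {
    private BulkBodySketch() {}

    // index:    name of the target index (assumed to contain no characters that need JSON escaping)
    // jsonDocs: documents already serialized to single-line JSON strings
    static String build(String index, List<String> jsonDocs) {
        StringBuilder body = new StringBuilder();
        for (String doc : jsonDocs) {
            // Action line: tells Elasticsearch to index the next line into the given index
            body.append("{\"index\":{\"_index\":\"").append(index).append("\"}}\n");
            // Source line: the document itself; every line must end with a newline
            body.append(doc).append('\n');
        }
        return body.toString();
    }
}
```

Each document costs two lines in the body, which is why a bulk call with zero documents has nothing to send.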

3. Now we run the query.

/**
 * Full-text search on the title field with highlighted matches.
 *
 * @param keyword   search keyword
 * @param pageIndex 1-based page number
 * @param pageSize  number of hits per page
 * @return the source of each hit, with title replaced by its highlighted version
 * @throws IOException if the request to Elasticsearch fails
 */
public List<Map<String,Object>> findHightLightAll(String keyword, int pageIndex, int pageSize) throws IOException {
    System.out.println("keyword:"+keyword+",pageIndex:"+pageIndex+",pageSize:"+pageSize);
    if (pageIndex <= 1) {
        pageIndex = 1;
    }
    // Build the search request
    SearchRequest searchRequest = new SearchRequest("jd_goods");
    SearchSourceBuilder builder = new SearchSourceBuilder();
    // Paging: from is a zero-based offset, so convert the page number
    builder.from((pageIndex - 1) * pageSize);
    builder.size(pageSize);
    // Set the query timeout
    builder.timeout(new TimeValue(60, TimeUnit.SECONDS));
    // Highlight the title field: hits containing the keyword get it wrapped in the tags below.
    // The color is up to you; red is used here.
    HighlightBuilder highlightBuilder = new HighlightBuilder();
    highlightBuilder.field("title");
    highlightBuilder.requireFieldMatch(false);
    // Set the prefix and suffix tags
    highlightBuilder.preTags("<span style='color:red;'>");
    highlightBuilder.postTags("</span>");
    // Attach the highlighter to the builder
    builder.highlighter(highlightBuilder);
    // Exact (term) matching has a catch: Chinese text may return no hits because of the
    // field type. I left the mapping unchanged; a keyword-typed field works with a term query.
    //TermQueryBuilder termQueryBuilder=new TermQueryBuilder("title",keyword);
    // Use a full-text match query instead
    builder.query(QueryBuilders.matchQuery("title", keyword));
    searchRequest.source(builder);
    SearchResponse searchResponse = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
    // Get the hits. Elasticsearch exchanges data as JSON (try Kibana to see the raw requests);
    // the Spring Boot integration wraps that JSON, including the CRUD operations, in objects.
    SearchHits hits = searchResponse.getHits();
    System.out.println("hits=====:"+hits.getHits().length);
    List<Map<String,Object>> list = new ArrayList<>();
    for (SearchHit hit : hits.getHits()) {
        // Get the highlighted field first; it is null when the hit has no highlight
        Map<String, HighlightField> highlightFields = hit.getHighlightFields();
        HighlightField title = highlightFields.get("title");
        System.out.println("title=====:"+title);

        // Replace the original field with the highlighted one
        Map<String, Object> sourceAsMap = hit.getSourceAsMap();
        if (title != null) {
            StringBuilder highlighted = new StringBuilder();
            for (Text fragment : title.fragments()) {
                highlighted.append(fragment);
            }
            sourceAsMap.put("title", highlighted.toString());
        }

        list.add(sourceAsMap);
    }
    System.out.println("list=====:"+list);
    return list;
}
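Two details of the search method are easy to get wrong: Elasticsearch's `from` is a zero-based offset rather than a page number, and a hit may come back without any highlight fragments. The sketch below isolates both steps using only the standard library. It assumes `pageIndex` is meant as a 1-based page number; `SearchSketch`, `offset`, and `applyHighlight` are hypothetical names, not part of the project.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: paging arithmetic and highlight substitution without the ES client.
final class SearchSketch {
    private SearchSketch() {}

    // Converts a 1-based page number into the zero-based offset expected by "from".
    // Page numbers below 1 are treated as page 1.
    static int offset(int pageIndex, int pageSize) {
        int page = Math.max(pageIndex, 1);
        return (page - 1) * pageSize;
    }

    // Returns a copy of source where field is replaced by the joined highlight fragments.
    // When there are no fragments, the original value is kept.
    static Map<String, Object> applyHighlight(Map<String, Object> source,
                                              String field,
                                              List<String> fragments) {
        Map<String, Object> copy = new HashMap<>(source);
        if (fragments != null && !fragments.isEmpty()) {
            copy.put(field, String.join("", fragments));
        }
        return copy;
    }
}
```

With a page size of 10, page 3 starts at offset 20, and page 1 starts at offset 0 so the first document is not skipped.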

4. Since this is a Spring Boot web project, a controller layer is a must. I kept it simple: calling /add/手机 adds the data, which is then queried from the home page.

@Autowired
private GoodsService service;

// Home page: renders the index template
@RequestMapping({"/","/index"})
public String index(){
    return "index";
}

// Scrapes JD for the keyword and stores the results in Elasticsearch
@RequestMapping("/add/{keyword}")
@ResponseBody
public Boolean add(@PathVariable("keyword")String keyword) throws IOException {
    System.out.println("keyword==========："+keyword);
    return service.insertGoods(keyword);
}

// Searches the stored goods by keyword with paging
@RequestMapping("/find/{keyword}/{pageIndex}/{pageSize}")
@ResponseBody
public List<Map<String,Object>> find(@PathVariable("keyword")String keyword,
                                     @PathVariable("pageIndex")int pageIndex,
                                     @PathVariable("pageSize")int pageSize
                   ) throws IOException {
    return service.findHightLightAll(keyword,pageIndex,pageSize);
}
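A browser percent-encodes a path such as /add/手机 automatically, but a client that calls these endpoints from code has to encode the keyword itself. The sketch below shows one way to do that with the standard library. `PathEncoding`, `encodeSegment`, and `addUrl` are hypothetical names, the base URL in the usage note is an assumption, and `URLEncoder` targets form encoding, so this is an approximation that is adequate for plain keywords rather than a full path-segment encoder.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Illustration only: building request URLs for the /add/{keyword} endpoint.
final class PathEncoding {
    private PathEncoding() {}

    // Percent-encodes a keyword as UTF-8. URLEncoder emits "+" for spaces,
    // which is only valid in query strings, so it is replaced by "%20".
    static String encodeSegment(String segment) {
        try {
            return URLEncoder.encode(segment, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            // Every Java runtime is required to support UTF-8
            throw new IllegalStateException(e);
        }
    }

    // Builds the URL of the add endpoint for the given keyword.
    static String addUrl(String baseUrl, String keyword) {
        return baseUrl + "/add/" + encodeSegment(keyword);
    }
}
```

For a locally running application, `PathEncoding.addUrl("http://localhost:8080", "手机")` yields `http://localhost:8080/add/%E6%89%8B%E6%9C%BA`.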

5. Result

[Image: screenshot of the result]