Java Crawler: A First Look

Reposted

mb607022e25a607 2021-04-28 22:52:34

Tags: Java crawler. Category: Java, backend development

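I wanted some pictures to use as desktop wallpapers but didn't want to download them one by one, and that got me thinking about a crawler.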

I had never actually used a crawler before, so after a round of searching online I wrote a small demo.

The basic idea of a crawler is:

1. Request the URL and fetch the page content

2. Parse the page content

3. Save the data

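At first I used regular expressions to pull the src address out of img tags, but that turned out to be inconvenient in many ways (mostly because I'm not great at regex). Then I found jsoup, which is a real gem. jsoup is a Java HTML parser that can parse a URL or a piece of HTML text directly. It offers a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.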
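To make the regex approach concrete, here is a minimal sketch of what matching the src of img tags can look like using only the JDK. The class name `ImgSrcRegex` and the pattern itself are illustrative assumptions, not the code from the original demo. The pattern only handles quoted `src` attributes and returns the values exactly as written in the page (relative paths stay relative), which hints at why a real HTML parser is more comfortable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the original CrawlerService.
public class ImgSrcRegex {

    // Matches <img ... src="..."> or <img ... src='...'>, ignoring case.
    // Group 1 is the quote character, group 2 is the attribute value.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\ssrc\\s*=\\s*(['\"])(.*?)\\1",
            Pattern.CASE_INSENSITIVE);

    // Returns the src value of every img tag found in the HTML text.
    public static List<String> extract(String html) {
        List<String> result = new ArrayList<>();
        Matcher matcher = IMG_SRC.matcher(html);
        while (matcher.find()) {
            result.add(matcher.group(2));
        }
        return result;
    }
}
```

For example, `ImgSrcRegex.extract("<p><img class=\"a\" src=\"/img/1.jpg\"><IMG SRC='2.png'></p>")` returns `[/img/1.jpg, 2.png]`. Unquoted attributes, or a `>` inside an attribute value, already defeat this pattern. With jsoup the same job becomes a selector query such as `select("img[src]")` on the parsed document, and `absUrl("src")` on each element can resolve relative addresses.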

The following uses image crawling as an example:

import com.crawler.domain.PictureInfo;
import org.bson.types.ObjectId;
import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.gridfs.GridFsTemplate;
import org.springframework.stereotype.Service;
import org.apache.commons.io.FileUtils;
import org.apache.http.HttpEntity;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.util.DigestUtils;
import org.springframework.util.StringUtils;
import javax.annotation.Resource;
import java.io.*;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Crawler implementation
 *
 * @program: crawler
 * @description
 * @author: wl
 * @create: 2021-01-12 17:56
 **/
@Service
public class CrawlerService {

    /**
     * @param url      URL of the page to fetch
     * @param encoding character encoding of the page to fetch
     * @return the page source as a string
     */
    public String getHtmlResourceByUrl(String url, String encoding) {
        URL urlObj = null;
        HttpURLConnection uc = null;
        InputStreamReader isr = null;
        BufferedReader reader = null;
        StringBuffer buffer = new StringBuffer();
        try {
            // Build the URL object
            urlObj = new URL(url);
            // Open the network connection
            uc = (HttpURLConnection) urlObj.openConnection();
            // Imitate a browser request
            uc.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)");
            // Wrap the response stream in a reader that uses the given encoding
            isr = new InputStreamReader(uc.getInputStream(), encoding);
            // Buffer the reader and download the page source line by line
            reader = new BufferedReader(isr);
            String temp = null;
            while ((temp = reader.readLine()) != null) {
                buffer.append(temp + "\n");
            }
            System.out.println("Crawl finished: " + buffer.toString());
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Close the stream
            if (isr != null) {
                try {
                    isr.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        return buffer.toString();
    }

    /**
     * Downloads images
     *
     * @param listImgSrc
     */
    public void Download(List

Those are the main methods. As long as the fetched page contains img tags, the corresponding images can be pulled down.
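Since the download method is cut off above, here is a minimal sketch of the "save the data" step using only the JDK. It is not the original `Download` method: the class `ImageDownloader`, the methods `fileNameOf` and `downloadAll`, and the target-directory parameter are all assumptions made for illustration. It expects absolute image URLs and simply skips any address that cannot be read.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

// Hypothetical helper for illustration; not the truncated Download method.
public class ImageDownloader {

    // Derives a local file name from the last path segment of the image URL.
    public static String fileNameOf(String src) {
        String path = src.split("[?#]", 2)[0]; // drop query string and fragment
        String name = path.substring(path.lastIndexOf('/') + 1);
        return name.isEmpty() ? "image" : name;
    }

    // Saves every URL in the list into targetDir and returns how many succeeded.
    public static int downloadAll(List<String> imgSrcList, Path targetDir) {
        try {
            Files.createDirectories(targetDir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        int saved = 0;
        for (String src : imgSrcList) {
            Path target = targetDir.resolve(fileNameOf(src));
            try (InputStream in = URI.create(src).toURL().openStream()) {
                // Images that share a file name overwrite each other
                Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
                saved++;
            } catch (IOException | IllegalArgumentException e) {
                // Malformed or unreachable address: report it and move on
                System.err.println("Skipped " + src + ": " + e.getMessage());
            }
        }
        return saved;
    }
}
```

A call such as `ImageDownloader.downloadAll(srcList, Paths.get("wallpapers"))` would then write each image into a `wallpapers` folder. A real crawler would probably also set a User-Agent header as in `getHtmlResourceByUrl` and generate unique file names.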
