Preface

A few days ago I had an interview that came with a take-home problem:

Programming problem: scrape the group-buy data of merchants in the Dianping app, including merchant name, location, group-buy vouchers (price, purchase count), and group-buy set meals (price, purchase count), limited to the top 100 most popular food merchants in Pudong, Shanghai, as shown in the figure below. Python/Java/RPA or similar approaches are all acceptable. When finished, submit the source code and the resulting Excel file to xxx@xxx.com, naming the file "Name - Summer Internship Written Test".

I had never touched web scraping and don't write Python, and there aren't many Java crawler write-ups online, so I decided to document the process of implementing a crawler in Java.

Preparation

The stack is HttpClient + Jsoup. Create a Maven project and add the dependencies:

<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.14</version>
</dependency>
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.16.1</version>
</dependency>

Fetching the Page

This page matches the problem's requirements: food merchants in Pudong New Area ranked by popularity. Press F12 and inspect the page's request information in the browser's developer tools.

[Screenshot: request URL and headers in the DevTools Network panel]

Note the request information in the screenshot, in particular the request URL and the request headers.
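A quick note on that URL. As far as I can infer from clicking around the site (this is an inference, not a documented API), the path segments encode the filters: ch10 is the food channel, r5 the Pudong New Area filter, o2 the sort-by-popularity flag, and appending pN selects result page N, which gives us an easy way to paginate later:

// Inferred URL pattern (an assumption based on the page's behavior, not an official API):
// ch10 = food channel, r5 = Pudong New Area, o2 = sort by popularity, pN = page N
int pageNo = 1;
String listUrl = "https://www.dianping.com/shanghai/ch10/r5o2p" + pageNo;

With that out of the way, the fetch itself: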

// Create the HttpClient
CloseableHttpClient httpClient = HttpClients.createDefault();
HttpGet httpGet = new HttpGet("https://www.dianping.com/shanghai/ch10/r5o2");
// Set request headers to mimic a real browser and get past the anti-crawling checks;
// before running, replace these values with the ones from your own browser
httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36");
httpGet.setHeader("Host", "www.dianping.com");
httpGet.setHeader("Referer", "https://www.dianping.com/shop/H9kou5hqlEsRYWAs");
httpGet.setHeader("Cookie", "your cookie string goes here; omitted because it is very long");
// Execute the request and read the response
CloseableHttpResponse response = httpClient.execute(httpGet);
String html = "";

if (response.getStatusLine().getStatusCode() == 200) {
    html = EntityUtils.toString(response.getEntity(), "UTF-8");
    // Print the fetched HTML source
    System.out.println(html);
}

Run the code and the fetched page source is printed to the console.

[Screenshot: the fetched HTML source printed to the console]
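As an aside, CloseableHttpClient and CloseableHttpResponse both implement Closeable, so if you would rather not manage close() calls by hand, the same fetch can be written with try-with-resources. A minimal sketch, with the headers abbreviated:

String html = null;
try (CloseableHttpClient client = HttpClients.createDefault()) {
    HttpGet get = new HttpGet("https://www.dianping.com/shanghai/ch10/r5o2");
    // set the same User-Agent / Host / Referer / Cookie headers as above
    get.setHeader("User-Agent", "Mozilla/5.0 ...");
    try (CloseableHttpResponse resp = client.execute(get)) {
        if (resp.getStatusLine().getStatusCode() == 200) {
            html = EntityUtils.toString(resp.getEntity(), "UTF-8");
        }
    }
}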

Using Jsoup

Calling Jsoup's parse(String html) method parses the raw HTML page into a Document, after which we can grab elements from the page with getElementById(String id), getElementsByClass(String className), select(String cssQuery), and so on.
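Before turning to the real page, here is a tiny self-contained sketch of how those selectors behave; the HTML below is made up for illustration, not Dianping's actual markup:

String demo = "<ul id=\"shop-all-list\">"
        + "<li><a data-click-name=\"shop_title_click\" title=\"Shop A\">Shop A</a></li>"
        + "<li><a data-click-name=\"shop_title_click\" title=\"Shop B\">Shop B</a></li>"
        + "</ul>";
Document doc = Jsoup.parse(demo);
// select() takes a CSS-style query; this one matches every <li> under #shop-all-list
for (Element li : doc.select("#shop-all-list li")) {
    System.out.println(li.select("a[data-click-name='shop_title_click']").attr("title"));
}
// prints "Shop A" then "Shop B"

Applied to the listing page we just fetched, the parsing code looks like this: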

Document document = Jsoup.parse(html);
// #shop-all-list is the container element that wraps the whole merchant list
Element shopList = document.getElementById("shop-all-list");
Elements shopItems = shopList.select("li");
int num = 0;
for (Element element : shopItems) {
    Elements name = element.select(".tit a[data-click-name='shop_title_click']");
    Elements location = element.select(".tag-addr a[data-click-name='shop_tag_region_click'] span");
    Elements deals = element.select(".svr-info a[data-click-name='shop_info_groupdeal_click']");
    Business business = new Business();
    business.setName(name.attr("title"));
    business.setLocation(location.text());

    System.out.println(num++ + ":---" + name.attr("title") + "---" + location.text() + "---" + deals.attr("title"));
}

Operations like getElementById and getElementsByClass work much like selectors in jQuery.
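For example, these two queries express the same thing (the jQuery line is shown only for comparison; document is the Document from the snippet above):

// jQuery: $("#shop-all-list li .tit a")
Elements titles = document.select("#shop-all-list li .tit a");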

Results

[Screenshot: console output of merchant names, locations, and deal info]

Writing to an Excel File

We have now scraped the merchant name, address, and coupon information; next we write it all to an Excel file.

First add the EasyExcel dependency:

<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>easyexcel</artifactId>
    <version>3.3.1</version>
</dependency>

Create the entity class with getter and setter methods:

public class Business {
    @ExcelProperty("Merchant Name")
    private String name;
    @ExcelProperty("Merchant Location")
    private String location;
    @ExcelProperty("Coupon Info")
    private String coupon;

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public String getLocation() {
        return location;
    }

    public void setLocation(String location) {
        this.location = location;
    }

    public String getCoupon() {
        return coupon;
    }

    public void setCoupon(String coupon) {
        this.coupon = coupon;
    }
}

Add the Excel-writing code to the scraping class:

private static List<Business> list = new ArrayList<>();

Business business = new Business();

/*
 ... the scraping loop from above ...
*/

business.setName(name.attr("title"));
business.setLocation(location.text());
business.setCoupon(coupon);
list.add(business);
String fileName = "D://top100.xls";
EasyExcel.write(fileName, Business.class).sheet("Merchant Info").doWrite(list);

At this point, everything we scraped has been written to the Excel file.
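One detail worth knowing: as far as I can tell, EasyExcel infers the workbook format from the file extension, so top100.xls comes out in the legacy XLS format; switching the suffix is enough to produce a modern XLSX workbook instead:

// Same write call; the .xlsx suffix makes EasyExcel emit an XLSX workbook
String fileName = "D://top100.xlsx";
EasyExcel.write(fileName, Business.class).sheet("Merchant Info").doWrite(list);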

[Screenshot: the resulting Excel file]

Complete Source Code

Business.java

import com.alibaba.excel.annotation.ExcelProperty;

public class Business {
    @ExcelProperty("Merchant Name")
    private String name;
    @ExcelProperty("Merchant Location")
    private String location;
    @ExcelProperty("Coupon Info")
    private String coupon;

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public String getLocation() {
        return location;
    }

    public void setLocation(String location) {
        this.location = location;
    }

    public String getCoupon() {
        return coupon;
    }

    public void setCoupon(String coupon) {
        this.coupon = coupon;
    }
}

Test.java

import com.alibaba.excel.EasyExcel;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class Test {
    private static List<Business> list = new ArrayList<>();
    // Total number of shops collected so far; used to stop at exactly 100
    private static int count = 0;

    public static void main(String[] args) throws Exception {
        testLinked();
        // Output path for the Excel file
        String fileName = "D://Top100.xls";
        EasyExcel.write(fileName, Business.class).sheet("Merchant Info").doWrite(list);
    }

    public static void testLinked() throws Exception {
        // Create the HttpClient
        CloseableHttpClient httpClient = HttpClients.createDefault();

        int num = 0;
        for (int p = 1; p < 8; p++) {
            // Build the GET request for page p
            HttpGet httpGet = new HttpGet("https://www.dianping.com/shanghai/ch10/r5o2p" + p);
            // Mimic a real browser to get past the anti-crawling checks;
            // replace these header values with the ones from your own browser before running
            httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36");
            httpGet.setHeader("Host", "www.dianping.com");
            httpGet.setHeader("Referer", "https://www.dianping.com/shop/H9kou5hqlEsRYWAs");
            httpGet.setHeader("Cookie", "_lxsdk_cuid=18850b52f64c8-051d5de180043f-26031a51-240480-18850b52f64c8; _lxsdk=18850b52f64c8-051d5de180043f-26031a51-240480-18850b52f64c8; _hc.v=32f6773c-3ca7-1952-2a67-6ba970c37282.1684981232; WEBDFPID=65y26693w6875x9410693z51y90uz61481149y4664597958z8yuw155-2000341232274-1684981231560ASMGYWEfd79fef3d01d5e9aadc18ccd4d0c95077598; qruuid=5209fa86-ead0-4f40-98d4-ac24e0167aca; dplet=eb502693c5046057657b0d1fc2285159; dper=96f5a5cc53d3c52b7a9008b5efa9090a1c5b6635be65875c7650abe104d5769e201cbe84273909fff28c3e9bfd4e77a197042624479e76d7f055f2eff4511e31; ll=7fd06e815b796be3df069dec7836c3df; ua=dpuser_5549070263; ctu=1faf3c6dfaf46f8a55956faeb0ccf7918ee5917c67e9551e20bb9042a61a5d1e; fspop=test; Hm_lvt_602b80cf8079ae6591966cc70a3940e7=1684981338; Hm_lvt_185e211f4a5af52aaffe6d9c1a2737f4=1684985896; Hm_lpvt_185e211f4a5af52aaffe6d9c1a2737f4=1684985940; _lx_utm=utm_source%3DBaidu%26utm_medium%3Dorganic; s_ViewType=10; Hm_lpvt_602b80cf8079ae6591966cc70a3940e7=1685071709; JSESSIONID=4E6E91FA6F600262FA3C4598513B150A; cy=2; cye=beijing; _lxsdk_s=18856cf7b3b-37a-0ca-311%7C1607827381%7C1");
            // Execute the request
            CloseableHttpResponse response = httpClient.execute(httpGet);
            if (response.getStatusLine().getStatusCode() != 200) {
                // Skip this page if the request failed or was blocked
                response.close();
                continue;
            }
            String html = EntityUtils.toString(response.getEntity(), "UTF-8");
            System.out.println(html);

            Document document = Jsoup.parse(html);
            Element shopList = document.getElementById("shop-all-list");
            Elements shopItems = shopList.select("li");
            for (Element element : shopItems) {
                Elements name = element.select(".tit a[data-click-name='shop_title_click']");
                Elements location = element.select(".tag-addr a[data-click-name='shop_tag_region_click'] span");
                Elements deals = element.select(".svr-info a[data-click-name='shop_info_groupdeal_click']");
                Business business = new Business();
                business.setName(name.attr("title"));
                business.setLocation(location.text());
                String coupon = "";

                System.out.println(num++ + ":---" + name.attr("title") + "---" + location.text() + "---" + deals.attr("title"));
                for (Element t : deals) {
                    // Concatenate every deal's title into one coupon string
                    coupon += t.attr("title");
                }
                business.setCoupon(coupon);
                list.add(business);
                // Each page holds 15 records and we scrape 7 pages (105 in total),
                // so stop once the first 100 have been collected
                if (++count >= 100) break;
                System.out.println("-----");
            }

            response.close();
        }
        httpClient.close();
    }
}
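Finally, since the headers above are only a thin disguise, it is prudent not to hammer the site; pausing between page requests makes the traffic look less bot-like. A minimal sketch of a hypothetical helper (the 2-5 second bounds are my own arbitrary choice, not anything Dianping specifies), to be called once per iteration of the page loop in testLinked():

import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper: sleep a random 2-5 seconds between page fetches
private static void politePause() throws InterruptedException {
    long millis = 2_000 + ThreadLocalRandom.current().nextLong(3_000);
    Thread.sleep(millis);
}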