Preface
A few days ago I got a take-home problem from an interview:
Programming task: scrape group-buy data for merchants in the Dianping (大众点评) app, including merchant name, location, group-buy vouchers (price, purchase count), and group-buy set menus (price, purchase count), limited to the top 100 most popular food merchants in Shanghai's Pudong district, as shown in the figure below. Python/Java/RPA solutions are all acceptable. When finished, submit the source code and the resulting Excel file to xxx@xxx.com, naming the file "YourName - Summer Internship Written Test".
I had never touched web scraping and can't write Python, and there aren't many Java crawler walkthroughs online, so I decided to write up how I implemented the crawler in Java.
Setup
We'll use HttpClient + Jsoup. Create a Maven project and add the dependencies:
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.14</version>
</dependency>
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.16.1</version>
</dependency>
Fetching the page
This listing page matches the task: food merchants in Pudong New Area, ranked by popularity. Press F12 and inspect the page's request in the developer console.
Note the request details shown in the screenshot: the request URL, request headers, and so on.
// Create an HttpClient instance
CloseableHttpClient httpClient = HttpClients.createDefault();
HttpGet httpGet = new HttpGet("https://www.dianping.com/shanghai/ch10/r5o2");
// Set request headers to mimic a real browser (to get past anti-crawling checks);
// replace these values with the ones from your own browser before running
httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36");
httpGet.setHeader("Host", "www.dianping.com");
httpGet.setHeader("Referer", "https://www.dianping.com/shop/H9kou5hqlEsRYWAs");
httpGet.setHeader("Cookie", "your cookie string here -- too long to show; copy it from your browser");
// Execute the request and read the response
CloseableHttpResponse response = httpClient.execute(httpGet);
String html = "";
if (response.getStatusLine().getStatusCode() == 200) {
    html = EntityUtils.toString(response.getEntity(), "UTF-8");
    // Print the raw HTML of the fetched page
    System.out.println(html);
}
Run the code and the fetched page source is printed to the console.
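One extra precaution worth considering (my own suggestion, not part of the original task): pause briefly between page requests so the crawler looks less like a bot. A minimal sketch, with arbitrary delay bounds:

```java
import java.util.concurrent.ThreadLocalRandom;

public class PoliteDelay {
    // Pick a random pause between minMillis (inclusive) and maxMillis (exclusive)
    static long randomDelayMillis(long minMillis, long maxMillis) {
        return ThreadLocalRandom.current().nextLong(minMillis, maxMillis);
    }

    public static void main(String[] args) throws InterruptedException {
        // Sleep 1-3 seconds; call this between successive page fetches
        long delay = randomDelayMillis(1000, 3000);
        System.out.println("sleeping " + delay + " ms");
        Thread.sleep(delay);
    }
}
```

A call to Thread.sleep(randomDelayMillis(1000, 3000)) at the top of each loop iteration would be enough.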
Parsing with Jsoup
Calling Jsoup's parse(String html) method turns the raw HTML into a Document; from there we can grab elements on the page via getElementById(String id), getElementsByClass(String className), select(String cssQuery), and so on.
Document document = Jsoup.parse(html);
// The shop list sits in <div id="shop-all-list">; each shop is one <li>
Element shopList = document.getElementById("shop-all-list");
Elements shops = shopList.select("li");
int num = 0;
for (Element element : shops) {
    Elements name = element.select(".tit a[data-click-name='shop_title_click']");
    Elements location = element.select(".tag-addr a[data-click-name='shop_tag_region_click'] span");
    Elements deals = element.select(".svr-info a[data-click-name='shop_info_groupdeal_click']");
    System.out.println(num++ + ":---" + name.attr("title") + "---" + location.text() + "---" + deals.attr("title"));
}
getElementById, getElementsByClass, select, and friends work much like jQuery selectors.
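To make the selector analogy concrete, here is a small self-contained sketch run against a made-up HTML fragment shaped like Dianping's shop list (the merchant data is invented for illustration):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorDemo {
    // Extract "name / region" from HTML shaped like Dianping's shop list
    static String extractFirst(String html) {
        Document doc = Jsoup.parse(html);
        Element li = doc.getElementById("shop-all-list").select("li").first();
        String name = li.select(".tit a[data-click-name='shop_title_click']").attr("title");
        String region = li.select(".tag-addr a[data-click-name='shop_tag_region_click'] span").text();
        return name + " / " + region;
    }

    public static void main(String[] args) {
        // Made-up fragment mimicking the real page structure
        String html = "<div id='shop-all-list'><ul><li>"
            + "<div class='tit'><a data-click-name='shop_title_click' title='示例商家'></a></div>"
            + "<div class='tag-addr'><a data-click-name='shop_tag_region_click'><span>浦东新区</span></a></div>"
            + "</li></ul></div>";
        System.out.println(extractFirst(html)); // 示例商家 / 浦东新区
    }
}
```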
Output
Writing to an Excel file
We can now scrape the merchant name, address, and coupon info; next, let's write it all to an Excel file.
First add the dependency:
<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>easyexcel</artifactId>
    <version>3.3.1</version>
</dependency>
Create the entity class, with getters and setters:
public class Business {
    @ExcelProperty("商家名称")
    private String name;
    @ExcelProperty("商家地点")
    private String location;
    @ExcelProperty("优惠券信息")
    private String coupon;

    public String getName() {
        return name;
    }
    public void setName(String name) {
        this.name = name;
    }
    public String getLocation() {
        return location;
    }
    public void setLocation(String location) {
        this.location = location;
    }
    public String getCoupon() {
        return coupon;
    }
    public void setCoupon(String coupon) {
        this.coupon = coupon;
    }
}
Then add the Excel-writing code to the crawling class:
private static List<Business> list = new ArrayList<>();

Business business = new Business();
/*
 ... scraping logic from the previous section ...
*/
business.setName(name.attr("title"));
business.setLocation(location.text());
business.setCoupon(coupon);
list.add(business);

String fileName = "D://top100.xls";
EasyExcel.write(fileName, Business.class).sheet("商家信息").doWrite(list);
At this point, everything we scraped has been written to the Excel file.
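As a quick sanity check (my own addition, not required by the task), EasyExcel can also read the file back to confirm the rows landed. A minimal self-contained round-trip sketch, using a hypothetical file name and a stripped-down entity:

```java
import com.alibaba.excel.EasyExcel;
import com.alibaba.excel.annotation.ExcelProperty;
import com.alibaba.excel.read.listener.PageReadListener;
import java.util.ArrayList;
import java.util.List;

public class ExcelRoundTrip {
    public static class Row {
        @ExcelProperty("商家名称")
        private String name;
        public Row() {}
        public Row(String name) { this.name = name; }
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
    }

    public static void main(String[] args) {
        String file = "roundtrip.xlsx";  // hypothetical path for this demo
        List<Row> out = new ArrayList<>();
        out.add(new Row("示例商家"));
        // Write one row, then read the sheet back
        EasyExcel.write(file, Row.class).sheet("商家信息").doWrite(out);
        List<Row> in = new ArrayList<>();
        EasyExcel.read(file, Row.class, new PageReadListener<Row>(in::addAll))
                 .sheet().doRead();
        System.out.println("rows read back: " + in.size());
    }
}
```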
Full source code
Business.java
public class Business {
    @ExcelProperty("商家名称")
    private String name;
    @ExcelProperty("商家地点")
    private String location;
    @ExcelProperty("优惠券信息")
    private String coupon;

    public String getName() {
        return name;
    }
    public void setName(String name) {
        this.name = name;
    }
    public String getLocation() {
        return location;
    }
    public void setLocation(String location) {
        this.location = location;
    }
    public String getCoupon() {
        return coupon;
    }
    public void setCoupon(String coupon) {
        this.coupon = coupon;
    }
}
Test.java
public class Test {
    private static List<Business> list = new ArrayList<>();
    private static int i = 0;

    public static void main(String[] args) throws Exception {
        testLinked();
        // Output path for the Excel file
        String fileName = "D://Top100.xls";
        EasyExcel.write(fileName, Business.class).sheet("商家信息").doWrite(list);
    }

    public static void testLinked() throws Exception {
        // Create an HttpClient instance
        CloseableHttpClient httpClient = HttpClients.createDefault();
        int num = 0;
        // Each page lists 15 shops, so pages 1-7 cover the top 105;
        // the counter below cuts the result off at 100
        for (int p = 1; p < 8; p++) {
            // Build the GET request for page p
            HttpGet httpGet = new HttpGet("https://www.dianping.com/shanghai/ch10/r5o2p" + p);
            // Set request headers to mimic a real browser (to get past anti-crawling checks);
            // replace these values with the ones from your own browser before running
            httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36");
            httpGet.setHeader("Host", "www.dianping.com");
            httpGet.setHeader("Referer", "https://www.dianping.com/shop/H9kou5hqlEsRYWAs");
            httpGet.setHeader("Cookie", "your cookie string here -- copy it from your browser's dev tools");
            // Execute the request and read the response
            CloseableHttpResponse response = httpClient.execute(httpGet);
            String html = "";
            if (response.getStatusLine().getStatusCode() == 200) {
                html = EntityUtils.toString(response.getEntity(), "UTF-8");
            }
            Document document = Jsoup.parse(html);
            // The shop list sits in <div id="shop-all-list">; each shop is one <li>
            Element shopList = document.getElementById("shop-all-list");
            Elements shops = shopList.select("li");
            for (Element element : shops) {
                Elements name = element.select(".tit a[data-click-name='shop_title_click']");
                Elements location = element.select(".tag-addr a[data-click-name='shop_tag_region_click'] span");
                Elements deals = element.select(".svr-info a[data-click-name='shop_info_groupdeal_click']");
                Business business = new Business();
                business.setName(name.attr("title"));
                business.setLocation(location.text());
                System.out.println(num++ + ":---" + name.attr("title") + "---" + location.text() + "---" + deals.attr("title"));
                StringBuilder coupon = new StringBuilder();
                for (Element t : deals) {
                    // Concatenate every coupon title into one string
                    coupon.append(t.attr("title"));
                }
                business.setCoupon(coupon.toString());
                list.add(business);
                // Stop once we have collected the first 100 records
                // (>= so the cap still holds if the loop is re-entered on a later page)
                if (++i >= 100) break;
                System.out.println("-----");
            }
            response.close();
            // Also exit the page loop once the cap is reached
            if (i >= 100) break;
        }
        httpClient.close();
    }
}
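The trickiest detail in the crawl loop is stopping at exactly 100 records across page boundaries: a break that only exits the inner loop can let later pages keep adding records. That cut-off logic is easy to get right (and to test) when isolated on its own; here is a stdlib sketch of "take the first N items across fixed-size pages", with placeholder page counts standing in for the real crawl:

```java
import java.util.ArrayList;
import java.util.List;

public class LimitDemo {
    // Collect at most `limit` item indices across `pages` pages of `pageSize` items each
    static List<Integer> firstN(int pages, int pageSize, int limit) {
        List<Integer> out = new ArrayList<>();
        for (int p = 0; p < pages; p++) {
            for (int i = 0; i < pageSize; i++) {
                if (out.size() >= limit) return out; // exits both loops at once
                out.add(p * pageSize + i);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // 7 pages of 15 shops, capped at 100
        System.out.println(firstN(7, 15, 100).size()); // 100
    }
}
```

Returning (or checking the counter in both loop conditions) avoids relying on an exact `==` match, which fails silently once the counter overshoots.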