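If the site does not require login, you can fetch its pages directly. If it does, log in first and then fetch the pages with the same client, so the session carries over.
The code is as follows: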
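/**
 * Helper that fetches one detail page and returns its HTML as a string.
 *
 * @param httpClient the client used to send the request (already logged in, if the site requires login)
 * @param pageNumber the id of the page to fetch
 * @return the HTML of the page
 * @throws Exception if the request fails or the response cannot be read
 */
private String grabPage(CloseableHttpClient httpClient, int pageNumber) throws Exception {
    HttpGet httpGet = new HttpGet(DETAIL_PAGE_PREFIX + "?id=" + pageNumber);
    httpGet.setHeader("User-Agent",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36");
    // Execute the request; try-with-resources closes the response even if reading it throws
    try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
        // Read the response body as UTF-8 text
        HttpEntity entity = response.getEntity();
        return EntityUtils.toString(entity, "UTF-8");
    }
}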
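The CloseableHttpClient passed to the code above is one that has already logged in. If the site does not require login, you can simply create a client yourself, for example: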
CloseableHttpClient httpClient = HttpClients.createDefault();
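For readers who want to try the same flow without adding a dependency, here is a minimal sketch that swaps in the JDK's built-in java.net.http.HttpClient (Java 11+) in place of Apache HttpClient. It is not the code used in this article. The class name GrabPageSketch, the helpers detailUrl and newClient, and the shortened User-Agent value are all hypothetical, and the page prefix is taken from the command line because DETAIL_PAGE_PREFIX is not shown here.

```java
import java.net.CookieManager;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Sketch only: uses the JDK's java.net.http.HttpClient, not Apache HttpClient.
class GrabPageSketch {

    // Builds the page URL the same way as above: prefix + "?id=" + pageNumber.
    static String detailUrl(String detailPagePrefix, int pageNumber) {
        return detailPagePrefix + "?id=" + pageNumber;
    }

    // A client with a CookieManager keeps cookies between requests, so a session
    // cookie set by an earlier login request would be sent with later fetches.
    static HttpClient newClient() {
        return HttpClient.newBuilder()
                .cookieHandler(new CookieManager())
                .build();
    }

    // Fetches one page and returns the response body decoded as UTF-8.
    static String grabPage(HttpClient client, String detailPagePrefix, int pageNumber)
            throws Exception {
        HttpRequest request = HttpRequest
                .newBuilder(URI.create(detailUrl(detailPagePrefix, pageNumber)))
                .header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)") // assumed value
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString(StandardCharsets.UTF_8));
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java GrabPageSketch <detail-page-prefix>");
            return;
        }
        HttpClient client = newClient();
        // Fetch the first three pages and report how much HTML came back.
        for (int page = 1; page <= 3; page++) {
            String html = grabPage(client, args[0], page);
            System.out.println("page " + page + ": " + html.length() + " characters");
        }
    }
}
```

The design mirrors the article's approach: one client is created once and reused for every page, so whatever cookies it has collected (for example from a login request) travel with each fetch.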