Java web crawlers: what is a Java crawler for?

Reposted

IT剑客风云 2023-10-19 19:53:33


1. Web crawler overview

1.1. What is a crawler

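Put simply, a web crawler is a program that imitates a person browsing web pages and collects and organizes the data it sees. Functionally, a crawler usually works in three steps: fetching, processing, and storing. It starts from the URLs of one or more seed pages and retrieves the raw page data; it then analyzes the page content and filters out the useful data; finally it organizes that data and persists it.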
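
The three steps can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: the "web" is a hypothetical in-memory map so the example runs without network access, "processing" is reduced to pulling out the page title with a regular expression, and "storing" just collects results in a list. The class and method names are made up for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch of the fetch -> process -> store loop (hypothetical names throughout).
class CrawlerPipelineSketch {

    // Matches the text inside the first <title> element of a page.
    private static final Pattern TITLE = Pattern.compile("<title>(.*?)</title>", Pattern.DOTALL);

    // Step 1 (fetch): look the URL up in the fake web; a real crawler would send an HTTP request.
    static String fetch(Map<String, String> fakeWeb, String url) {
        return fakeWeb.get(url);
    }

    // Step 2 (process): keep only the page title, or null if the page has none.
    static String extractTitle(String html) {
        if (html == null) {
            return null;
        }
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    // Step 3 (store): collect the results; a real crawler would write to a file or database.
    static List<String> crawl(Map<String, String> fakeWeb, List<String> seedUrls) {
        List<String> stored = new ArrayList<>();
        for (String url : seedUrls) {
            String title = extractTitle(fetch(fakeWeb, url));
            if (title != null) {
                stored.add(url + " -> " + title);
            }
        }
        return stored;
    }

    public static void main(String[] args) {
        Map<String, String> fakeWeb = new LinkedHashMap<>();
        fakeWeb.put("http://example.com/a", "<html><head><title>Page A</title></head></html>");
        fakeWeb.put("http://example.com/b", "<html><head><title>Page B</title></head></html>");
        crawl(fakeWeb, Arrays.asList("http://example.com/a", "http://example.com/b"))
                .forEach(System.out::println);
    }
}
```

Keeping the three steps in separate methods means the fetch step can later be swapped for a real HTTP client without touching the processing or storage code.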

1.2. What crawlers are used for

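Search engines: a crawler automatically collects information from the internet and stores or processes it. When you later need to look something up, you only have to search the collected data, which amounts to a private search engine. Other technologies are needed as well, of course; the crawler only solves the problem of obtaining the raw data.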

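Data comparison: many products are sold on several major e-commerce platforms, and the retail price may differ on each one, so a crawler can collect the price of a product from every site. There are many similar scenarios, such as gathering job postings or tracking films and series that are about to be removed from audio and video sites.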

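A final note: this article is concerned only with the technical side. Any use of a crawler must comply with the relevant laws and regulations.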

2. Getting started

2.1. How crawling works

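Normally you use a browser to get page data; a crawler imitates the process of a person opening a browser and requesting a page from the server. Once the program has the data, it analyzes the page, parses it, and stores the result.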
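
To make "imitating a browser" concrete, the sketch below builds the kind of GET request a browser would send, including a User-Agent header. It uses the JDK's built-in `java.net.http` API (Java 11 or later) rather than a third-party client, purely so it has no dependencies. The URL and the User-Agent string are placeholder assumptions, and the request is only constructed, not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

// Sketch: describe a browser-like GET request without sending it (hypothetical names).
class RequestSketch {

    // Builds a GET request for the given URL that identifies itself with the given User-Agent.
    static HttpRequest buildGet(String url, String userAgent) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", userAgent)   // many sites inspect this header
                .timeout(Duration.ofSeconds(10))   // give up if the server is too slow
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = buildGet("https://example.com/", "demo-crawler/0.1");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(request.headers().firstValue("User-Agent").orElse("(none)"));
        // Sending it would look like:
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```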


2.2. Fetching page data

For fetching data, HttpClient is recommended here. It implements all HTTP methods (GET, POST, PUT, HEAD, and so on) and supports automatic redirects, HTTPS, proxy servers, and cookies.

Dependencies

<!-- httpcomponents dependency, includes HttpClient -->
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.3</version>
</dependency>
<!-- logging -->
<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-log4j12</artifactId>
    <version>1.7.25</version>
</dependency>
<!-- utility classes -->
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
    <!-- version number is truncated in the source -->
    <version>3.</version>
</dependency>
<dependency>
    <groupId>commons-io</groupId>
    <artifactId>commons-io</artifactId>
    <version>2.6</version>
</dependency>
<!-- jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <!-- version number is truncated in the source -->
    <version>0.3</version>
</dependency>

Fetching the data:

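import java.io.File;

import org.apache.commons.io.FileUtils;
import org.apache.commons.lang3.StringUtils;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class CrawlerDemo { // class name is a placeholder
    public static void main(String[] args) throws Exception {
        // Declare the HttpGet request (the target URL is missing in the source)
        HttpGet httpGet = new HttpGet("");
        // Set a User-Agent request header (the value is missing in the source)
        httpGet.setHeader("User-Agent", "");
        // Create the HttpClient and send the request; both are closed automatically
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {
            // Continue only if the status code is 200 and the response has a body
            if (response.getStatusLine().getStatusCode() == 200 && response.getEntity() != null) {
                // The request succeeded: read the returned data as a UTF-8 string
                String html = EntityUtils.toString(response.getEntity(), "UTF-8");
                // Write the result to a file (the file name is missing in the source).
                // Arguments: the output file, the content, and the encoding
                FileUtils.writeStringToFile(new File("C:/Users/tree/Desktop/"), html, "UTF-8");
                // Print the markup between the opening job-sec div and the next </div>
                System.out.println(
                        StringUtils.substringBetween(html, "<div class=\"job-sec\">", "</div>"));
            }
        }
    }
}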
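
The last step of the listing above cuts out the text between `<div class="job-sec">` and the next `</div>`. A plain-Java sketch of that kind of marker-based extraction is shown below; the class and method names are hypothetical, and it is not meant as a replacement for a library routine, only as a way to see what the step does and where it falls short.

```java
// Sketch of marker-based extraction using only String.indexOf (hypothetical names).
class SubstringBetweenSketch {

    // Returns the text between the first `open` marker and the next `close` marker
    // after it, or null if either marker is missing.
    static String between(String text, String open, String close) {
        if (text == null || open == null || close == null) {
            return null;
        }
        int start = text.indexOf(open);
        if (start < 0) {
            return null;
        }
        start += open.length();
        int end = text.indexOf(close, start);
        if (end < 0) {
            return null;
        }
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        String html = "<body><div class=\"job-sec\"><p>Java developer</p></div></body>";
        System.out.println(between(html, "<div class=\"job-sec\">", "</div>"));
    }
}
```

Because the search stops at the first closing marker, a nested `<div>` inside the target block cuts the result short. That limitation is the main reason to hand the page to a real HTML parser instead.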

3. Parsing with Jsoup

3.1. What is Jsoup

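jsoup is a Java HTML parser that can parse a URL or a string of HTML directly. It offers a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.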

3.2. Commonly used APIs

3.2.1. Finding elements

· getElementById(String id)
· getElementsByTag(String tag)
· getElementsByClass(String className)
· getElementsByAttribute(String key) (and related methods)
· Element siblings: siblingElements(), firstElementSibling(), lastElementSibling(), nextElementSibling(), previousElementSibling()
· Graph: parent(), children(), child(int index)

3.2.2. Element data

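· attr(String key) gets an attribute; attr(String key, String value) sets an attribute
· attributes() gets all attributes
· id(), className() and classNames()
· text() gets the text content; text(String value) sets the text content
· html() gets the element's inner HTML; html(String value) sets the element's inner HTML
· outerHtml() gets the element's outer HTML
· data() gets the data content (for example, of script and style tags)
· tag() and tagName()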
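
The idea behind these lookup and accessor methods (find an element, then read its tag name, attributes, and text) can be tried without any extra dependency. The sketch below is a stand-in, not jsoup: it uses the JDK's built-in W3C DOM parser, whose method names differ from jsoup's and which only accepts well-formed XML, so the sample markup is deliberately tidy. The helper names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// DOM lookup sketch using the JDK's XML parser as a stand-in for jsoup (hypothetical names).
class DomLookupSketch {

    // Parses well-formed markup into a DOM tree.
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("markup is not well-formed XML", e);
        }
    }

    // Returns the first element with the given tag whose class attribute equals className,
    // or null if there is none. (Roughly what a lookup by class does for a single class name.)
    static Element firstByTagAndClass(Document doc, String tag, String className) {
        NodeList nodes = doc.getElementsByTagName(tag);
        for (int i = 0; i < nodes.getLength(); i++) {
            Element e = (Element) nodes.item(i);
            if (className.equals(e.getAttribute("class"))) {
                return e;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<html><body><div class=\"job-sec\"><p>Java developer</p></div></body></html>");
        Element div = firstByTagAndClass(doc, "div", "job-sec");
        System.out.println(div.getTagName());           // tag name of the match
        System.out.println(div.getAttribute("class"));  // attribute value
        System.out.println(div.getTextContent());       // text of the element and its children
    }
}
```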

3.2.3. Manipulating HTML and text

· append(String html), prepend(String html)
· appendText(String text), prependText(String text)
· appendElement(String tagName), prependElement(String tagName)
· html(String value)
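
Appending and prepending child elements works the same way in any DOM library. As a dependency-free illustration, the sketch below again uses the JDK's W3C DOM classes as a stand-in; the method names mirror the list above but are hypothetical helpers written for this example, not jsoup's own implementation.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// DOM editing sketch using the JDK's W3C DOM as a stand-in for jsoup (hypothetical names).
class DomEditSketch {

    // Creates an empty document containing a single root element with the given tag.
    static Element newRoot(String tag) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement(tag);
            doc.appendChild(root);
            return root;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Adds a new element with the given text as the LAST child of parent.
    static Element appendElement(Element parent, String tag, String text) {
        Element child = parent.getOwnerDocument().createElement(tag);
        child.setTextContent(text);
        parent.appendChild(child);
        return child;
    }

    // Adds a new element with the given text as the FIRST child of parent.
    static Element prependElement(Element parent, String tag, String text) {
        Element child = parent.getOwnerDocument().createElement(tag);
        child.setTextContent(text);
        // insertBefore with a null reference node behaves like appendChild
        parent.insertBefore(child, parent.getFirstChild());
        return child;
    }

    public static void main(String[] args) {
        Element list = newRoot("ul");
        appendElement(list, "li", "second");
        prependElement(list, "li", "first");
        System.out.println(list.getTextContent()); // text of both items, in document order
    }
}
```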

3.3. Parsing the data


Parse the data from the file that was just downloaded:

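import java.io.File;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupDemo { // class name is a placeholder
    public static void main(String[] args) throws Exception {
        // Load the saved page (the file name is missing in the source)
        Document doc = Jsoup.parse(new File("C:/Users/tree/Desktop/"), "UTF-8");
        // Get the data the DOM way: getElementsByClass returns a list of matches,
        // so take the first match before asking for its first child
        Element element = doc.getElementsByClass("job-sec").first().child(0);
        // Print the text of that child element
        System.out.println(element.text());
    }
}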
