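jsoup is a Java HTML parser that can parse a URL or a string of HTML directly. It provides a very convenient API for extracting and manipulating data using DOM methods, CSS selectors, and jQuery-like operations. See: http://jsoup.org/
The main features of jsoup are:
parse HTML from a URL, a file, or a string;
find and extract data using DOM traversal or CSS selectors;
manipulate HTML elements, attributes, and text.
jsoup is released under the MIT license, so it can safely be used in commercial projects.
Download and installation:
Installing with Maven:
Add the following inside the <dependencies> element of pom.xml: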
<dependency>
<!-- jsoup HTML parser library @ http://jsoup.org/ -->
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.5.2</version>
</dependency>
Parsing HTML with jsoup works as follows:
Parsing HTML from a URL:
Document doc = Jsoup.connect("http://example.com")
    .data("query", "Java")
    .userAgent("Mozilla")
    .cookie("auth", "token")
    .timeout(3000)
    .post();
Parsing from a file:
File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Similar to the DOM API in JavaScript, jsoup provides the following methods for finding elements:
- getElementById(String id)
- getElementsByTag(String tag)
- getElementsByClass(String className)
- getElementsByAttribute(String key)
It also provides the following methods for getting sibling elements:
siblingElements(), firstElementSibling(), lastElementSibling(), nextElementSibling(), previousElementSibling()
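As a rough illustration of the lookup and sibling methods listed above, the sketch below parses a small, made-up HTML fragment from a string. The class name DomNavigationSketch, the helpers itemCount() and neighbours(), and the markup itself are assumptions introduced only for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class DomNavigationSketch {
    // Made-up markup used only for this illustration.
    static final String HTML =
        "<ul id=\"menu\">"
        + "<li class=\"item\">One</li>"
        + "<li class=\"item selected\">Two</li>"
        + "<li class=\"item\">Three</li>"
        + "</ul>";

    // Number of <li> elements found below the element with id="menu".
    static int itemCount() {
        Document doc = Jsoup.parse(HTML);
        Element menu = doc.getElementById("menu");
        Elements items = menu.getElementsByTag("li");
        return items.size();
    }

    // Text of the elements immediately before and after the "selected" item.
    static String neighbours() {
        Document doc = Jsoup.parse(HTML);
        Element selected = doc.getElementsByClass("selected").first();
        return selected.previousElementSibling().text()
            + "|" + selected.nextElementSibling().text();
    }

    public static void main(String[] args) {
        System.out.println(itemCount());   // 3
        System.out.println(neighbours());  // One|Three
    }
}
```

Each getElementsBy... call returns an Elements collection, so first() is used here to pick a single Element before asking for its siblings.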
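Use the following methods to read or change an element's data:
- attr(String key) gets an attribute value; attr(String key, String value) sets one
- attributes() gets all attributes
- id(), className(), classNames()
- text() gets the text content; text(String value) sets it
- html() gets the inner HTML; html(String value) sets it
- outerHtml() gets the outer HTML
- data() gets data content (for example, the contents of script and style tags)
- tag() gets the Tag; tagName() gets the tag name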
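The accessors above could be exercised roughly as follows. This is only a sketch: the class name ElementDataSketch, the helper sampleLink(), and the HTML snippet are invented for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ElementDataSketch {
    // Parses a made-up fragment and returns its single <a> element.
    static Element sampleLink() {
        Document doc = Jsoup.parse(
            "<p><a id=\"home\" class=\"nav main\" href=\"/index.html\">Home <b>page</b></a></p>");
        return doc.getElementById("home");
    }

    public static void main(String[] args) {
        Element a = sampleLink();

        // Reading data.
        System.out.println(a.attr("href"));   // /index.html
        System.out.println(a.id());           // home
        System.out.println(a.className());    // nav main
        System.out.println(a.classNames());   // the class names as a set
        System.out.println(a.tagName());      // a
        System.out.println(a.text());         // Home page
        System.out.println(a.html());         // inner HTML, without the <a> tag itself
        System.out.println(a.outerHtml());    // outer HTML, including the <a> tag

        // Changing data.
        a.attr("title", "Go home");           // add or overwrite an attribute
        a.text("Start");                      // replace the element's content with plain text
        System.out.println(a.attr("title"));  // Go home
        System.out.println(a.text());         // Start
    }
}
```

Note that text(String) replaces all child nodes, so the nested b element is gone after the call.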
The following methods are provided for manipulating HTML:
- append(String html), prepend(String html)
- appendText(String text), prependText(String text)
- appendElement(String tagName), prependElement(String tagName)
- html(String value)
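A minimal sketch of these manipulation methods, assuming a made-up div with id="box". The class name ManipulationSketch and the helper build() are hypothetical and exist only for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ManipulationSketch {
    // Builds a small document and modifies the element with id="box".
    static Element build() {
        Document doc = Jsoup.parse("<div id=\"box\"><p>middle</p></div>");
        Element box = doc.getElementById("box");

        box.prepend("<h2>Title</h2>");           // parse HTML and insert it as the first child
        box.append("<p>end</p>");                // parse HTML and add it as the last child
        box.appendElement("span").text("note");  // create a new <span>, then set its text
        box.prependText("lead ");                // text nodes do not count as child elements
        box.appendText(" trailing text");
        return box;
    }

    public static void main(String[] args) {
        Element box = build();
        System.out.println(box.children().size());    // 4 (h2, p, p, span)
        System.out.println(box.child(0).tagName());   // h2
        System.out.println(box.html());               // the resulting inner HTML

        box.html("<em>replaced</em>");                // replace the whole inner HTML
        System.out.println(box.children().size());    // 1
    }
}
```

appendElement and prependElement return the newly created element, which is why text("note") above applies to the span and not to the div.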
Working with HTML through jQuery-like selector methods:
File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

Elements links = doc.select("a[href]"); // a with href
Elements pngs = doc.select("img[src$=.png]"); // img with src ending .png
Element masthead = doc.select("div.masthead").first(); // div with class=masthead
Elements resultLinks = doc.select("h3.r > a"); // direct a after h3
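The supported selectors include:
- tagname: find elements by tag name
- ns|tag: find elements by tag in a namespace
- #id: find elements by ID
- .class: find elements by class name
- [attribute]: elements that have the attribute
- [^attr]: elements with an attribute whose name starts with attr
- [attr=value]: elements whose attribute value equals value
- [attr^=value], [attr$=value], [attr*=value]: elements whose attribute value starts with, ends with, or contains value
- [attr~=regex]: elements whose attribute value matches the regular expression
- *: all elements
Selector combinations:
- el#id: elements with the given tag and ID
- el.class: elements with the given tag and class
- el[attr]: elements with the given tag and attribute
- ancestor child: child elements that descend from ancestor
and so on.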
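To make the selector syntax above more concrete, the sketch below counts the matches for a few queries against a made-up HTML fragment. The class name SelectorSketch, the helper count(), and the markup are assumptions for illustration only.

```java
import org.jsoup.Jsoup;

public class SelectorSketch {
    // Made-up markup used only for this illustration.
    static final String HTML =
        "<div class=\"masthead\"><a href=\"/a.html\">A</a><a name=\"top\">no href</a></div>"
        + "<img src=\"logo.png\"><img src=\"photo.jpg\">"
        + "<h3 class=\"r\"><a href=\"http://example.com/r\">Result</a></h3>"
        + "<p data-id=\"7\">text</p>";

    // Number of elements in the sample document that match the CSS query.
    static int count(String cssQuery) {
        return Jsoup.parse(HTML).select(cssQuery).size();
    }

    public static void main(String[] args) {
        System.out.println(count("a[href]"));          // 2: the anchor without href is skipped
        System.out.println(count("img[src$=.png]"));   // 1: only logo.png
        System.out.println(count("div.masthead"));     // 1: el.class
        System.out.println(count("div.masthead a"));   // 2: ancestor child
        System.out.println(count("h3.r > a"));         // 1: direct child
        System.out.println(count("a[href^=http]"));    // 1: value starts with http
        System.out.println(count("p[data-id=7]"));     // 1: exact attribute value
        System.out.println(count("[^data-]"));         // 1: attribute name prefix
    }
}
```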
An example that fetches a page's title, description, and the images it contains:
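import java.io.IOException;
import java.net.MalformedURLException;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// log, imgSrcs and IMAGE_SUFFIXES are expected to be fields of the enclosing class.
// The method name, the exception types and IMAGE_SUFFIXES did not survive in the
// original listing; the names used here are placeholders.
public void parsePage(String urlStr) {
    // Initialize the result.
    Document doc = null;
    try {
        doc = Jsoup
            .connect(urlStr)
            .userAgent("Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.2.15)") // set the User-Agent
            .timeout(5000) // set the connection timeout (milliseconds)
            .get();
    } catch (MalformedURLException e) {
        log.error(e);
        return;
    } catch (IOException e) {
        if (e instanceof SocketTimeoutException) {
            log.error(e);
            return;
        }
        if (e instanceof UnknownHostException) {
            log.error(e);
            return;
        }
        log.error(e);
        return;
    }
    System.out.println(doc.title());
    Element head = doc.head();
    Elements metas = head.select("meta");
    for (Element meta : metas) {
        String content = meta.attr("content");
        // Skip pages whose declared content type is not HTML (condition partly reconstructed).
        if ("content-type".equalsIgnoreCase(meta.attr("http-equiv"))
                && !content.startsWith("text/html")) {
            log.debug(urlStr);
            return;
        }
        if ("description".equalsIgnoreCase(meta.attr("name"))) {
            System.out.println(meta.attr("content"));
        }
    }
    Element body = doc.body();
    for (Element img : body.getElementsByTag("img")) {
        String imageUrl = img.attr("abs:src"); // get the absolute URL
        for (String suffix : IMAGE_SUFFIXES) {
            if (imageUrl.indexOf("?") > 0) {
                // Drop the query string before checking the extension.
                imageUrl = imageUrl.substring(0, imageUrl.indexOf("?"));
            }
            if (imageUrl.toLowerCase().endsWith(suffix)) {
                imgSrcs.add(imageUrl);
                break;
            }
        }
    }
}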
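The key point here is how to get the absolute URL of an image or link:
As shown above, the absolute URL is obtained with String imageUrl = img.attr("abs:src"); prefixing the attribute name with abs: makes jsoup return the absolute URL.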
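The effect of the abs: prefix can be seen in a small sketch like the one below. It assumes a made-up HTML fragment and a base URI of http://example.com/base/; the class name AbsUrlSketch and the helper attrOf() are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class AbsUrlSketch {
    // Made-up markup with one root-relative and one path-relative URL.
    static final String HTML =
        "<a href=\"/docs/page.html\">Docs</a><img src=\"img/logo.png\">";

    // Parses HTML against baseUri and reads one attribute of the first match.
    static String attrOf(String cssQuery, String key, String baseUri) {
        Document doc = Jsoup.parse(HTML, baseUri);
        return doc.select(cssQuery).first().attr(key);
    }

    public static void main(String[] args) {
        String base = "http://example.com/base/";
        System.out.println(attrOf("a", "href", base));       // /docs/page.html
        System.out.println(attrOf("a", "abs:href", base));   // http://example.com/docs/page.html
        System.out.println(attrOf("img", "abs:src", base));  // http://example.com/base/img/logo.png
        // Without a base URI there is nothing to resolve against, so the result is empty.
        System.out.println(attrOf("a", "abs:href", "").isEmpty());  // true
    }
}
```

The base URI matters: when a document is fetched with Jsoup.connect(...) it is set from the page's URL, but when parsing a string or file it has to be passed in explicitly.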
To see why, let's look at the source code:
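// This method first looks in the attribute map for the key and returns its value if present;
// if not, it checks whether the key starts with "abs:".
public String attr(String attributeKey) {
    Validate.notNull(attributeKey);

    if (hasAttr(attributeKey))
        return attributes.get(attributeKey);
    else if (attributeKey.toLowerCase().startsWith("abs:"))
        return absUrl(attributeKey.substring("abs:".length()));
    else return "";
}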
Next, look at the absUrl method:
/**
 * Get an absolute URL from a URL attribute that may be relative (i.e. an <code><a href></code> or
 * <code><img src></code>).
 * <p/>
 * E.g.: <code>String absUrl = linkEl.absUrl("href");</code>
 * <p/>
 * If the attribute value is already absolute (i.e. it starts with a protocol, like
 * <code>http://</code> or <code>https://</code> etc), and it successfully parses as a URL, the attribute is
 * returned directly. Otherwise, it is treated as a URL relative to the element's {@link #baseUri}, and made
 * absolute using that.
 * <p/>
 * As an alternate, you can use the {@link #attr} method with the <code>abs:</code> prefix, e.g.:
 * <code>String absUrl = linkEl.attr("abs:href");</code>
 *
 * @param attributeKey The attribute key
 * @return An absolute URL if one could be made, or an empty string (not null) if the attribute was missing or
 * could not be made successfully into a URL.
 * @see #attr
 * @see java.net.URL#URL(java.net.URL, String)
 */
// At this point it should be clear how the absolute URL is obtained.
public String absUrl(String attributeKey) {
    Validate.notEmpty(attributeKey);

    String relUrl = attr(attributeKey);
    if (!hasAttr(attributeKey)) {
        return ""; // nothing to make absolute with
    } else {
        URL base;
        try {
            try {
                base = new URL(baseUri);
            } catch (MalformedURLException e) {
                // the base is unsuitable, but the attribute may be abs on its own, so try that
                URL abs = new URL(relUrl);
                return abs.toExternalForm();
            }
            // workaround: java resolves '//path/file + ?foo' to '//path/?foo', not '//path/file?foo' as desired
            if (relUrl.startsWith("?"))
                relUrl = base.getPath() + relUrl;
            URL abs = new URL(base, relUrl);
            return abs.toExternalForm();
        } catch (MalformedURLException e) {
            return "";
        }
    }
}
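The resolution step inside absUrl relies only on java.net.URL, so it can be tried out without jsoup. The sketch below mirrors that logic in a standalone helper; the class name UrlResolutionSketch and the method resolve() are hypothetical and are not part of jsoup.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UrlResolutionSketch {
    // Resolves relUrl against baseUri, following the same steps as the absUrl listing:
    // fall back to relUrl alone if the base is unusable, and return "" on failure.
    static String resolve(String baseUri, String relUrl) {
        try {
            URL base;
            try {
                base = new URL(baseUri);
            } catch (MalformedURLException e) {
                // The base is unusable, but relUrl may already be absolute.
                return new URL(relUrl).toExternalForm();
            }
            // Keep the file part of the base when relUrl is only a query string.
            if (relUrl.startsWith("?")) {
                relUrl = base.getPath() + relUrl;
            }
            return new URL(base, relUrl).toExternalForm();
        } catch (MalformedURLException e) {
            return "";
        }
    }

    public static void main(String[] args) {
        String base = "http://example.com/a/b.html";
        System.out.println(resolve(base, "c.png"));        // http://example.com/a/c.png
        System.out.println(resolve(base, "/img/c.png"));   // http://example.com/img/c.png
        System.out.println(resolve(base, "?page=2"));      // http://example.com/a/b.html?page=2
        System.out.println(resolve("", "http://other.org/x"));  // http://other.org/x
        System.out.println(resolve("", "c.png").isEmpty());     // true
    }
}
```

The last two calls show why an element without a usable base URI still yields a result for attribute values that are already absolute, and an empty string otherwise.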