Basic usage:
Link to a small demo page for practicing with Python:
https://python123.io/ws/demo.html
Code:
import requests
from bs4 import BeautifulSoup
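# Fetch the demo page and keep its HTML source as text.
r = requests.get("https://python123.io/ws/demo.html")
print(r.text)
demo = r.text
# Parse the HTML with Python's built-in "html.parser".
soup = BeautifulSoup(demo, "html.parser")
print(soup)  # the parsed document, printed as compact HTML
print(soup.prettify())  # the same document, one tag or string per line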
Output:
D:\python_install\python.exe D:/pycharmworkspace/temp1/crawler_1.py
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>
Process finished with exit code 0
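The difference between print(soup) and print(soup.prettify()) can also be checked without a network request. The following is a minimal sketch, assuming a shortened inline copy of the demo page; the names DEMO_HTML, compact and pretty are made up for this illustration and are not part of the original notes.

```python
from bs4 import BeautifulSoup

# Shortened inline copy of the demo page (an assumption for this sketch),
# so the example runs without fetching anything.
DEMO_HTML = (
    '<html><head><title>This is a python demo page</title></head>\n'
    '<body>\n'
    '<p class="title"><b>The demo python introduces several python courses.</b></p>\n'
    '</body></html>'
)

soup = BeautifulSoup(DEMO_HTML, "html.parser")

# str(soup) keeps the layout of the source markup;
# prettify() puts every tag and every string on its own line.
compact = str(soup)
pretty = soup.prettify()

print(len(compact.splitlines()), "lines in the compact form")
print(len(pretty.splitlines()), "lines in the prettified form")
print(soup.title.string)  # text of the <title> tag
```

prettify() is mainly useful for reading the structure by eye. It adds whitespace around strings, so it is usually better to extract data from the soup object itself rather than from the prettified text.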
Getting the tag names of a tag's parent and grandparent:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://python123.io/ws/demo.html")
print("\n")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
tag_a = soup.a
print(soup.a.parent.name)  # tag name of the first <a>'s parent
print("\n")
print(soup.a.parent.parent.name)  # tag name of the parent's parent (the grandparent)
Output:
D:\python_install\python.exe D:/pycharmworkspace/temp1/crawler_1.py
p
body
Process finished with exit code 0
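Chaining .parent works for one or two levels. To walk all the way up the tree, Beautiful Soup also provides the .parents generator. The following is a small sketch on a shortened inline copy of the demo markup; DEMO_HTML and ancestor_names are names assumed for this illustration.

```python
from bs4 import BeautifulSoup

# Shortened inline copy of the demo page (an assumption for this sketch).
DEMO_HTML = (
    '<html><head><title>This is a python demo page</title></head>\n'
    '<body>\n'
    '<p class="course">Courses: '
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>'
    '</p>\n'
    '</body></html>'
)

soup = BeautifulSoup(DEMO_HTML, "html.parser")

# .parents yields every ancestor of the first <a> tag, nearest first.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)
```

The last ancestor is the BeautifulSoup object itself, whose name is "[document]", so the list ends one level above the html tag.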
Reading attribute values from a tag's attrs dictionary:
Code:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://python123.io/ws/demo.html")
print("\n")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
print(soup.a)  # soup.<tag> returns the first tag of that type; here, the first <a> element
tag_a = soup.a
print("\n")
print(tag_a.attrs)  # attrs: the tag's attributes, as a dictionary
print("\n")
print(tag_a.attrs['id'])  # value of the id attribute
print("\n")
print(tag_a.attrs['href'])  # value of the href attribute
print("\n")
Output:
D:\python_install\python.exe D:/pycharmworkspace/temp1/crawler_1.py
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
link1
http://www.icourse163.org/course/BIT-268001
Process finished with exit code 0
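Indexing tag_a.attrs with a key that does not exist raises KeyError. The following sketch shows some alternative ways to read attributes, using only the first a tag from the demo page as inline markup; A_TAG_HTML and the variable names are assumptions for this illustration.

```python
from bs4 import BeautifulSoup

# The first <a> tag from the demo page, inlined so the sketch runs offline.
A_TAG_HTML = (
    '<a href="http://www.icourse163.org/course/BIT-268001" '
    'class="py1" id="link1">Basic Python</a>'
)

soup = BeautifulSoup(A_TAG_HTML, "html.parser")
tag_a = soup.a

link_id = tag_a["id"]               # shorthand for tag_a.attrs["id"]
classes = tag_a["class"]            # class is multi-valued, so this is a list
target = tag_a.get("target")        # missing attribute: None instead of KeyError
has_href = tag_a.has_attr("href")   # membership test without reading the value

print(link_id, classes, target, has_href)
```

Using get() is convenient when looping over many tags, since not every tag is guaranteed to carry the attribute being read.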
Getting the string inside a tag, without the HTML markup:
Code:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://python123.io/ws/demo.html")
print("\n")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
print(soup.a)  # soup.<tag> returns the first tag of that type; here, the first <a> element
tag_a = soup.a
print("\n")
print(soup.a.string)
print("\n")
print(soup.p)
print("\n")
print(soup.p.string)
Output:
D:\python_install\python.exe D:/pycharmworkspace/temp1/crawler_1.py
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
Basic Python
<p class="title"><b>The demo python introduces several python courses.</b></p>
The demo python introduces several python courses.
Process finished with exit code 0
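soup.p.string works above because the first p tag contains a single b tag, which in turn contains a single string. When a tag mixes text with several child tags, .string returns None, and get_text() is the usual way to collect all the text. The following is a sketch on a shortened inline copy of the two paragraphs; the text "Learn from" stands in for the longer original sentence, and DEMO_HTML, title_p and course_p are names assumed for this illustration.

```python
from bs4 import BeautifulSoup

# Shortened inline copy of the two <p> tags from the demo page
# (the wording of the second paragraph is abbreviated for this sketch).
DEMO_HTML = (
    '<p class="title"><b>The demo python introduces several python courses.</b></p>\n'
    '<p class="course">Learn from '
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>'
    ' and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.'
    '</p>'
)

soup = BeautifulSoup(DEMO_HTML, "html.parser")
title_p, course_p = soup.find_all("p")

print(title_p.string)       # single nested string, so .string finds it
print(course_p.string)      # several children, so .string is None
print(course_p.get_text())  # all text inside the tag, joined together
```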