Preface:

  • Environment: 64-bit Windows, Python 3.4
  • Basics of the requests library:

1. Installation: pip install requests

2. Purpose: requests sends network requests, letting you issue the same kinds of HTTP requests a browser would and retrieve data from a website.

3. Common operations:

import requests  # import the requests module

r = requests.get("https://api.github.com/events")  # fetch a page

# Set a timeout: stop waiting for a response after the number of seconds given in timeout.
# (0.001 s is deliberately tiny; this call will almost certainly raise a Timeout exception.)
r2 = requests.get("https://api.github.com/events", timeout=0.001)

# Pass URL query parameters as a dict via the params argument.
payload = {'key1': 'value1', 'key2': 'value2'}
r1 = requests.get("http://httpbin.org/get", params=payload)
print(r1.url)  # e.g. http://httpbin.org/get?key1=value1&key2=value2

print(r.url)  # print the request URL

print(r.text)  # the response body as decoded text

print(r.encoding)  # the encoding currently used to decode r.text

print(r.content)  # the response body as raw bytes

print(r.status_code)  # HTTP status code of the response
print(r.status_code == requests.codes.ok)  # compare against the built-in status-code lookup object

print(r.headers)  # the server's response headers as a dict-like object
print(r.headers['content-type'])  # header lookup is case-insensitive, so any casing works

print(r.history)  # a list of the Response objects traversed to reach this one (redirect history)

print(type(r))  # the type of the response object, requests.models.Response
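
The timeout=0.001 call above is designed to fail, which makes it a good place to show error handling. A minimal sketch, wrapping the same example endpoint in the exception types requests raises:

import requests

try:
    r = requests.get("https://api.github.com/events", timeout=0.001)
    r.raise_for_status()  # raise an HTTPError for 4xx/5xx status codes
except requests.exceptions.Timeout:
    print("the request timed out")
except requests.exceptions.RequestException as e:  # base class for all requests errors
    print("the request failed:", e)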
  • Basics of the BeautifulSoup4 library:

1. Installation: pip install beautifulsoup4

2. Purpose: Beautiful Soup is a Python library for extracting data from HTML and XML files.

3. Common operations:

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

ss = BeautifulSoup(html_doc, "html.parser")
print(ss.prettify())  # pretty-print the parse tree with standard indentation
print(ss.title)  # <title>The Dormouse's story</title>
print(ss.title.name)  # title
print(ss.title.string)  # The Dormouse's story
print(ss.title.parent.name)  # head
print(ss.p)  # <p class="title"><b>The Dormouse's story</b></p>
print(ss.p['class'])  # ['title']
print(ss.a)  # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
print(ss.find_all("a"))  # a list of all <a> tags in the document
print(ss.find(id="link3"))  # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

for link in ss.find_all("a"):
    print(link.get("href"))  # print the href of every <a> tag in the document

print(ss.get_text())  # all text content in the document
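
find and find_all are not the only way to query the tree; BeautifulSoup also accepts CSS selectors through select(). A minimal sketch, reusing the ss object parsed above:

print(ss.select("p.story > a"))        # CSS selector: every <a> directly inside <p class="story">
print(ss.select("a#link2")[0].string)  # Lacie (select by id using CSS syntax)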
A second example, walking through find() and comment nodes:

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')  # create a BeautifulSoup object
find = soup.find('p')  # use find() to locate the first <p> tag
print("find's return type is ", type(find))  # type of the return value
print("find's content is", find)  # the tag find() located
print("find's Tag Name is ", find.name)  # name of the tag
print("find's Attribute(class) is ", find['class'])  # value of the tag's class attribute

print(find.string)  # the text content inside the tag

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup1 = BeautifulSoup(markup, "html.parser")
comment = soup1.b.string
print(type(comment))  # <class 'bs4.element.Comment'>: comments are parsed into a special string subclass
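
Since Comment is just a subclass of NavigableString, code that walks the tree can tell comments apart from ordinary text with an isinstance check. A minimal sketch, reusing soup1 from above:

from bs4 import Comment

for element in soup1.b.children:  # iterate over the direct children of the <b> tag
    if isinstance(element, Comment):
        print("found a comment:", element)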
  • A quick first try:

import requests
import io
import sys
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')  # change the default encoding of standard output

r = requests.get('https://unsplash.com')  # send a GET request to the target URL; returns a Response object

print(r.text)  # r.text is the HTML of the HTTP response
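
From here, the fetched HTML can be handed straight to BeautifulSoup. A minimal sketch combining the two libraries; which tags actually appear depends on unsplash.com's markup at the time of the request:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://unsplash.com')
soup = BeautifulSoup(r.text, 'html.parser')
print(soup.title.string if soup.title else None)  # page title, if the page has one
for a in soup.find_all('a', limit=5):  # the first five <a> tags on the page
    print(a.get('href'))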