span in Python crawlers: what the get function does in a Python crawler

Reposted

gjnet 2023-10-24 21:37:16

Tags: span in Python crawlers, html, data, parsing data. Category: Python, backend development

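Python Crawlers (Part 1)

A first look at crawlers

How a browser works
How a crawler works
Trying out a crawler

Parsing and extracting data with BeautifulSoup

Parsing data
Extracting data

find() and find_all()
The Tag object

How the object changes along the way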

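A first look at crawlers

A crawler is, in essence, a program that fetches data from the web that is valuable to us.

A crawler can do many things, from business analysis to everyday assistance. For example: what was the average transaction price of second-hand homes in Beijing over the past two years? What is the average salary of a Python engineer in Shenzhen? Which restaurant in Beijing serves the best Cantonese food?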

How a browser works

[Figure: how a browser works]

How a crawler works

The crawler does the work of extracting and storing the data for us.

[Figure: how a crawler works]

The four steps of a crawler

[Figure: the four steps of a crawler]

Trying out a crawler

requests.get()
How this function is used:

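import requests
# Import the requests library
res = requests.get('URL')
# requests.get calls the get() function of the requests library. It sends a request to the server;
# the argument in parentheses is the URL where the data you want lives, and the server then responds.
# The response returned by the server is assigned to the variable res.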
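
To make the description above concrete, the request can be built without being sent. This is a minimal sketch, assuming the requests library is installed; https://example.com/page is a placeholder URL and does not come from the article.

```python
import requests

# Placeholder URL, used only for illustration.
url = 'https://example.com/page'

# Build (but do not send) the same kind of request that requests.get(url) sends.
prepared = requests.Request('GET', url).prepare()

print(prepared.method)  # the HTTP method used by get()
print(prepared.url)     # the address the request would be sent to
```

In a real script, requests.get(url, timeout=10) sends the request, and calling res.raise_for_status() afterwards raises an exception if the server answered with an error status.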

import requests
# Import the requests library
res = requests.get('https://localprod.pandateacher.com/python-manuscript/crawler-html/sanguo.md')
# Send the request and assign the response to the variable res

Common attributes of the Response object
Four commonly used Response attributes

[Figure: table of the four commonly used Response attributes]
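
The table itself is not recoverable here, so the sketch below is an assumption about which four attributes it listed: status_code, content, text and encoding are four that Response objects do provide and that are commonly used. To stay runnable offline, the Response is built by hand; assigning to _content is a testing shortcut, not how a Response is normally obtained.

```python
import requests

# Build a Response by hand so the example runs without a network connection.
# In real code this object is what requests.get() returns.
res = requests.Response()
res.status_code = 200
res._content = '三国演义'.encode('utf-8')  # testing shortcut, assumption for this sketch
res.encoding = 'utf-8'

print(res.status_code)  # HTTP status code of the response
print(res.content)      # raw body as bytes
print(res.text)         # body decoded to str using res.encoding
print(res.encoding)     # encoding used to produce res.text

# With the wrong encoding, the same bytes decode to unreadable text.
res.encoding = 'ISO-8859-1'
garbled = res.text
print(garbled)
```

This is why the examples in this article set res.encoding = 'utf-8' before reading res.text.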

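Trying out a crawler: scraping an article

# Import the library
import requests
# Request the page
res = requests.get('https://localprod.pandateacher.com/python-manuscript/crawler-html/sanguo.md')
# Set the encoding of the response
res.encoding = 'utf-8'
# Get the content as text
content = res.text
# Write the content to a file; the with block closes the file automatically.
# Note that 'a+' appends, so running the script twice writes the text twice.
with open('三国演义.txt', 'a+', encoding='utf-8') as k:
    k.write(content)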

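Exercise: scrape the page that lists all the HTTP response status codes and save it to test.txt

import requests
# Request the page that lists the status codes
response = requests.get('https://localprod.pandateacher.com/python-manuscript/crawler-html/exercise/HTTP%E5%93%8D%E5%BA%94%E7%8A%B6%E6%80%81%E7%A0%81.md')
response.encoding = 'utf-8'
content = response.text
# Append the text to test.txt; the with block closes the file automatically
with open('test.txt', 'a+', encoding='utf-8') as k:
    k.write(content)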

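Parsing and extracting data with BeautifulSoup

Parsing data

How to parse data

[Figure: BeautifulSoup usage, BeautifulSoup(text to parse, parser)]

The first argument is the text to parse, passed as a string; the second argument is the parser.

Import the BeautifulSoup library

from bs4 import BeautifulSoup

Example: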

import requests
from bs4 import BeautifulSoup
res = requests.get('https://localprod.pandateacher.com/python-manuscript/crawler-html/spider-men5.0.html')
soup = BeautifulSoup(res.text, 'html.parser')
print(type(soup))  # Check the type of soup
print(soup)  # Print soup

'''
Output:
<class 'bs4.BeautifulSoup'>
(the HTML source follows; it is too long to reproduce here)
'''

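The type of soup is <class 'bs4.BeautifulSoup'>, which means soup is a BeautifulSoup object.
res.text and soup look identical when printed, but they belong to different classes: res.text is a plain string (str), while soup is the object that BeautifulSoup produces by parsing that string.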
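
The difference in type is what makes extraction possible. A minimal sketch, assuming bs4 is installed and using a small hand-written HTML string (not the article's page) in place of res.text:

```python
from bs4 import BeautifulSoup

# Small hand-written HTML, standing in for res.text.
html = '<html><body><h1>Books</h1><p class="info">A short note</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

print(type(html))  # <class 'str'>
print(type(soup))  # <class 'bs4.BeautifulSoup'>

# The string is only text; the soup can be searched by tag name.
heading = soup.find('h1')
print(heading.text)  # Books
```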

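Extracting data

find() and find_all()

find() and find_all() are two methods of the BeautifulSoup object. They match HTML tags and attributes and extract the matching data from the BeautifulSoup object.

The two are used in much the same way. The difference is that find() returns only the first match, while find_all() returns every match.

[Figure: usage of find() and find_all()]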
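
The difference between the two methods can be seen on a small example. This is a sketch on hand-written HTML (the class name books mirrors the one used in this article; the contents are made up), assuming bs4 is installed:

```python
from bs4 import BeautifulSoup

html = '''
<div class="books"><h2>Science</h2></div>
<div class="books"><h2>History</h2></div>
'''
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('div', class_='books')      # first match only, a single Tag
every = soup.find_all('div', class_='books')  # all matches, a list-like result

print(first.h2.text)
print(len(every))
print([div.h2.text for div in every])

# When nothing matches, find() returns None and find_all() returns an empty result.
missing_one = soup.find('table')
missing_all = soup.find_all('table')
print(missing_one, len(missing_all))
```

Because find() can return None, it is worth checking its result before reading .text from it.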

Example

# Import the libraries
import requests
from bs4 import BeautifulSoup
# Fetch the data
res = requests.get('https://localprod.pandateacher.com/python-manuscript/crawler-html/spider-men5.0.html')
# Get the page source as text
html = res.text
# Parse the page into a BeautifulSoup object
soup = BeautifulSoup(html, 'html.parser')
# Match the elements we want
items = soup.find_all(class_='books')

# Iterate over the matches with a for loop
for item in items:
    print(item)

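The output is text that still carries its HTML tags. Each item is a Tag object, which is covered next.

The Tag object

Common attributes and methods of Tag

[Figure: table of common Tag attributes and methods]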
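
Since the table is not reproduced, here is a hedged sketch of Tag members that are commonly used: find(), find_all(), .text, tag['attribute'] and tag.get('attribute'). Whether the original table listed exactly these is an assumption. The HTML and the example.com link are made up for illustration.

```python
from bs4 import BeautifulSoup

html = '<div class="books"><a class="title" href="https://example.com/book">A Book</a></div>'
soup = BeautifulSoup(html, 'html.parser')
item = soup.find(class_='books')  # a Tag

link = item.find('a')        # Tag.find() searches inside this tag only
links = item.find_all('a')   # Tag.find_all() does the same for all matches

print(link.text)             # the text inside the tag
print(link['href'])          # attribute value; raises KeyError if the attribute is absent
print(link.get('href'))      # same value; returns None if the attribute is absent
print(link.get('id'))        # this tag has no id attribute
```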

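How the object changes along the way

The figure below shows how the object changes at each stage, from requests through BeautifulSoup.

[Figure: how the object changes from requests to BeautifulSoup]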
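
The same progression can be traced by printing the type at each stage. This sketch skips the network step (the Response object) and starts from a hand-written string standing in for res.text; it assumes bs4 is installed.

```python
from bs4 import BeautifulSoup

html = '<div class="books"><h2>Science</h2></div>'  # str, as returned by res.text
soup = BeautifulSoup(html, 'html.parser')            # BeautifulSoup object
items = soup.find_all(class_='books')                # list-like collection of Tag objects
item = items[0]                                      # a single Tag
kind = item.find('h2')                               # another Tag, found inside item
text = kind.text                                     # back to a plain str

# Print the type name at every stage.
for name, value in [('html', html), ('soup', soup), ('items', items),
                    ('item', item), ('kind', kind), ('text', text)]:
    print(name, type(value).__name__)
```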

# Import the libraries
import requests
from bs4 import BeautifulSoup
# Fetch the data
res = requests.get('https://localprod.pandateacher.com/python-manuscript/crawler-html/spider-men5.0.html')
# Get the page source as text
html = res.text
# Parse the data: turn the page into a BeautifulSoup object
soup = BeautifulSoup(html, 'html.parser')
# Extract the data: match the elements we want
items = soup.find_all(class_='books')

# Iterate over the matches with a for loop
for item in items:
    kind = item.find('h2')              # category heading
    title = item.find(class_='title')   # book title; this element also carries the href
    brief = item.find(class_='info')    # short description
    print(kind.text, '\n', title.text, '\n', title['href'], '\n', brief.text)
    print(type(kind), type(title), type(brief))
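
The loop above assumes every item contains all three elements. If one is missing, find() returns None and reading .text from it raises AttributeError. A minimal sketch of a more defensive loop follows; text_of is a hypothetical helper introduced here, and the HTML is hand-written for illustration.

```python
from bs4 import BeautifulSoup

html = '''
<div class="books"><h2>Science</h2><a class="title" href="https://example.com/a">Book A</a></div>
<div class="books"><h2>History</h2></div>
'''
soup = BeautifulSoup(html, 'html.parser')


def text_of(tag):
    """Return the tag's stripped text, or an empty string if the tag is None."""
    return tag.text.strip() if tag is not None else ''


rows = []
for item in soup.find_all(class_='books'):
    title = item.find(class_='title')
    # Only read the href when the title element was actually found.
    href = title.get('href', '') if title is not None else ''
    rows.append((text_of(item.find('h2')), text_of(title), href))

print(rows)
```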

To be continued…
