Python: web scraping

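天子骄龙 2022-02-24 16:42:30 © Copyright

© Copyright belongs to the author: this is an original work by 天子骄龙, an author on the 51CTO blog. Please contact the author for permission to repost; otherwise legal liability will be pursued.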

import requests

response = requests.get('https://www.autohome.com.cn/news/')  # send the HTTP request
#<Response [200]>

response.encoding = 'gbk'  # decode the response body as GBK

# response.text is the returned content as a decoded string (the HTML text)
# res = response.content  # the returned content as raw bytes
# print(response.text)
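
The relationship between the raw bytes (`response.content`) and the decoded string (`response.text`) can be sketched without touching the network. The sample text below is an assumption chosen for illustration; the point is only that GBK bytes decoded with the wrong codec come out garbled, which is why `response.encoding` is set before `response.text` is read.

```python
# Stand-in for response.content: the bytes a GBK-encoded page would send.
# The sample text is made up for illustration.
raw = "进入北京车市".encode("gbk")

# Decoding with the wrong codec produces mojibake instead of an error,
# because every byte is a valid Latin-1 character.
wrong = raw.decode("latin-1")

# Decoding with the codec the page was written in restores the text.
right = raw.decode("gbk")

print(wrong)
print(right)
```

If the garbled form shows up in scraped output, the usual suspect is a mismatch between the page's real charset and the one used for decoding.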

from bs4 import BeautifulSoup
# bs4 is the package name of Beautiful Soup 4, one of the libraries commonly used for writing Python crawlers; it is mainly used to parse HTML
# Install: pip3 install beautifulsoup4

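soup = BeautifulSoup(response.text, "html.parser")  # parse the HTML
# Two arguments: the first is the HTML text to parse, the second names the parser to use. For HTML, "html.parser" is the parser from Python's standard library, so it needs no extra install.
# If an HTML or XML document is malformed, different parsers may return different results.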
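
To make the remark about malformed documents concrete, here is a minimal sketch that feeds a deliberately broken fragment to `html.parser`. The fragment is invented for illustration; other parsers such as lxml or html5lib, if installed, may build a different tree from the same input.

```python
from bs4 import BeautifulSoup

# A deliberately malformed fragment: <p> and <b> are never closed.
broken_html = "<div><p>first<b>bold</div>"

soup = BeautifulSoup(broken_html, "html.parser")

# Inspect the tree the parser actually built from the broken input.
print(soup.prettify())

paragraph = soup.find("p")
bold = soup.find("b")
print(paragraph.get_text())
print(bold.get_text())
```

Printing `prettify()` is a quick way to check how a given parser repaired the markup before writing selectors against it.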

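# x = soup.find(name='a', id='i1')  # find the <a> tag with id='i1'; returns the first match
# In valid HTML an id is supposed to be unique within the page

# x = soup.find(name='a')  # find an <a> tag; returns the first match, or None if there is none
# <a class="orangelink" href="//www.autohome.com.cn/beijing/cheshi/" target="_blank"><i class="topbar-icon topbar-icon16 topbar-icon16-building"></i>进入北京车市</a>
# The returned tag x can itself be searched for further tags

# x = soup.find_all(name='a')  # find all matching <a> tags
# Returns a list
# print('tags', x)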
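
The difference between `find` and `find_all` is easier to see on a small inline document. The markup and ids below are hypothetical, chosen only to exercise both calls.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = """
<div>
  <a id="i1" href="/first">first</a>
  <a href="/second">second</a>
  <span>not a link</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find(name="a")        # a single Tag: the first match
by_id = soup.find(name="a", id="i1")    # filter by tag name and id together
all_links = soup.find_all(name="a")     # a list-like ResultSet of every match
missing = soup.find(name="table")       # None when nothing matches

# A returned Tag can be searched again, just like the soup itself.
inner_span = soup.find(name="div").find(name="span")

hrefs = [a["href"] for a in all_links]
print(first_link, by_id, hrefs, missing, inner_span)
```

Because `find` returns `None` on a miss, it is worth checking the result before chaining another call onto it.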

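tag = soup.find(id='auto-channel-lazyload-article')  # find the element with id='auto-channel-lazyload-article'; returns the first match
# tag = soup.find(name='h3', attrs={'class': 'xxx', 'id': 'xxx'})  # find() with an attrs dictionary
# tag = soup.find(name='h3', class_='xxx')  # find() with a keyword filter
# class_ filters on the CSS class; the trailing underscore is needed because class is a reserved word in Python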
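
The two filter styles can be compared side by side on a made-up snippet. The class names and ids here are placeholders, not taken from the real page.

```python
from bs4 import BeautifulSoup

# Placeholder markup; class names and ids are invented for illustration.
html = """
<h3 class="title hot" id="a1">Headline A</h3>
<h3 class="title">Headline B</h3>
<h3 class="other">Headline C</h3>
"""
soup = BeautifulSoup(html, "html.parser")

# attrs takes a dictionary, so several attributes can be required at once.
by_attrs = soup.find(name="h3", attrs={"class": "title", "id": "a1"})

# class_ matches any tag that has "title" among its CSS classes.
by_class = soup.find_all(name="h3", class_="title")

print(by_attrs.get_text())
print([t.get_text() for t in by_class])
print(by_attrs["class"])  # class is multi-valued, so it comes back as a list
```

Note that a class filter matches a tag carrying several classes as long as one of them equals the given value.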

h3=tag.find_all(name='h3')

print(h3)
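
Printing the list shows whole `<h3>` tags rather than their text. As a minimal sketch of one possible next step, the function below pulls the text out of each heading and guards against the container being absent. The name `extract_headlines` and the sample markup are assumptions for illustration; only the container id comes from the script above.

```python
from bs4 import BeautifulSoup


def extract_headlines(html, container_id="auto-channel-lazyload-article"):
    """Return the text of every <h3> inside the element with the given id."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(id=container_id)
    if container is None:
        # find() returns None on a miss; calling find_all on it would raise.
        return []
    return [h3.get_text(strip=True) for h3 in container.find_all(name="h3")]


# Invented markup that imitates a container holding a list of headlines.
sample_html = """
<div id="auto-channel-lazyload-article">
  <ul>
    <li><a href="/news/1.html"><h3> Sample headline one </h3></a></li>
    <li><a href="/news/2.html"><h3>Sample headline two</h3></a></li>
  </ul>
</div>
<h3>Outside the container</h3>
"""

print(extract_headlines(sample_html))
print(extract_headlines("<p>no container here</p>"))
```

Scoping the search to the container keeps unrelated headings elsewhere on the page out of the result.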