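Scraping Multiple Web Pages with Regular Expressions in Python: A Regex Crawler Primer

Reposted

flyingsmiling 2023-10-22 09:08:13

Tags: regular expressions, web scraping, strings, html. Category: Python, back-end development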

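Introduction

Regular expressions are a powerful tool for working with strings, and they are especially useful for extracting data from the responses a crawler receives. This article introduces regular expressions through five functions in Python's re module and shows how each one applies to crawler output.

(1) re.match

The most basic match

re.match tries to match the pattern starting at the beginning of the string. The simplest approach is to describe the whole string from start to end, using ^ for the start of the string and $ for the end. The example below matches the string "Hello 123 456 rocketeerli this_is a regex demo".

Code: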

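import re
# The most basic match: spell out every part of the string
content = "Hello 123 456 rocketeerli this_is a regex demo"
# Raw string (r"...") so that \s, \d and \w reach the regex engine unchanged
result = re.match(r"^Hello\s\d\d\d\s\d{3}\s\w{11}\s\w{7}.*demo$", content)
print(result)
print(result.group())
print(result.span())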

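Notes:

group(): returns the text captured by a group, where a group is the part of the pattern wrapped in (). group(1) returns the first group; with no argument it is group(0), which returns the entire match. The pattern above has no groups; they are introduced later.
span(): returns (start(), end()), the start and end indices of the match within the string.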
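To make group() and span() concrete, here is a minimal sketch. The pattern with two capture groups is an illustrative assumption, not one of the article's own examples.

```python
import re

content = "Hello 123 456 rocketeerli this_is a regex demo"

# Two capture groups: the first and the second run of digits.
result = re.match(r"^Hello\s(\d+)\s(\d+)\s", content)

whole = result.group(0)       # identical to result.group()
first = result.group(1)       # text captured by the first ()
second = result.group(2)      # text captured by the second ()
both = result.groups()        # every group as a tuple
whole_span = result.span()    # (start, end) of the whole match
first_span = result.span(1)   # (start, end) of group 1 only

print(repr(whole), first, second, both, whole_span, first_span)
```

Note that span() also accepts a group number, which is handy when the position of a captured piece matters more than the position of the whole match.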

Output (from Python 3.6 or earlier; Python 3.7 and later print re.Match instead of _sre.SRE_Match):

<_sre.SRE_Match object; span=(0, 46), match='Hello 123 456 rocketeerli this_is a regex demo'>
Hello 123 456 rocketeerli this_is a regex demo
(0, 46)

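Generic matching

Spelling out every character class makes the pattern rigid. In practice we usually use .* as a wildcard, which matches any run of characters (except newlines by default). The match still covers the whole string, but the pattern is much shorter and the code is cleaner.

Code: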

import re
content = "Hello 123 456 rocketeerli this_is a regex demo"
result = re.match("^Hello.*demo$", content)
print(result)
print(result.group())
print(result.span())

Output:

<_sre.SRE_Match object; span=(0, 46), match='Hello 123 456 rocketeerli this_is a regex demo'>
Hello 123 456 rocketeerli this_is a regex demo
(0, 46)

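Capturing groups

Groups are used constantly. When only part of a string is needed, capturing it with a group is an efficient way to pull it out. The example below extracts the first number that has whitespace on both sides.

Usage: wrap the target in parentheses and specify what appears immediately to its left and right.

Code: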

import re
# Capture the target: wrap it in parentheses and anchor it on both sides
content = "Hello 123 456 rocketeerli this_is a regex demo"
result = re.match(r"^Hello\s(\d+)\s.*demo$", content)
print(result)
print(result.group())
print(result.group(1))
print(result.span())

Output:

<_sre.SRE_Match object; span=(0, 46), match='Hello 123 456 rocketeerli this_is a regex demo'>
Hello 123 456 rocketeerli this_is a regex demo
123
(0, 46)

Greedy matching

Without any qualifier, .* is greedy: as the name suggests, it matches as many characters as possible.

Example:

import re
content = "Hello 123456 rocketeerli this_is a regex demo"
result = re.match(r"^He.*(\d+).*demo$", content)
print(result.group())
print(result.group(1))
print(result.span())

Output:

Hello 123456 rocketeerli this_is a regex demo
6
(0, 45)

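As the output shows, although (\d+) is meant to match the digits, the preceding .* consumes as much as it can while still letting the overall match succeed, which leaves only a single digit for (\d+). The captured group therefore contains just one digit.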
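To see exactly what the greedy .* swallowed, one option is to wrap it in a group of its own. This is a small sketch; the extra pair of parentheses is an addition for illustration only.

```python
import re

content = "Hello 123456 rocketeerli this_is a regex demo"

# Same pattern as above, but the greedy .* is now captured as group 1.
result = re.match(r"^He(.*)(\d+).*demo$", content)

greedy_part = result.group(1)   # everything the greedy .* consumed
digits = result.group(2)        # what was left over for (\d+)

print(repr(greedy_part), repr(digits))
```

The greedy part ends with "12345", which confirms that .* gave back only the single character that (\d+) needed in order to match.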

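Non-greedy matching

Greediness can be controlled: changing .* to .*? makes the match non-greedy (lazy), so it matches as few characters as possible.

The same example with .* changed to .*?: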

import re
content = "Hello 123456 rocketeerli this_is a regex demo"
result = re.match(r"^He.*?(\d+).*demo$", content)
print(result.group())
print(result.group(1))
print(result.span())

Output:

Hello 123456 rocketeerli this_is a regex demo
123456
(0, 45)

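This time the group holds all of the digits instead of just one, because the lazy .*? stops as soon as (\d+) can start matching. That is non-greedy matching.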
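One caveat worth knowing, shown here as a short sketch with a shortened, made-up input string: a lazy group at the very end of a pattern is free to match nothing at all, because nothing after it forces it to grow.

```python
import re

content = "Hello 123456 rocketeerli"

# Lazy group at the end of the pattern: it matches as little as possible, i.e. nothing.
lazy_tail = re.match(r"^Hello (\d+) (.*?)", content).group(2)

# Greedy group at the end: it takes the rest of the line.
greedy_tail = re.match(r"^Hello (\d+) (.*)", content).group(2)

# Lazy group followed by $: the anchor forces it to extend to the end.
anchored_tail = re.match(r"^Hello (\d+) (.*?)$", content).group(2)

print(repr(lazy_tail), repr(greedy_tail), repr(anchored_tail))
```

So .*? is most useful in the middle of a pattern, where the text that follows it acts as a stopping point.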

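Matching flags

The re module provides several flags that modify how matching works; the one used most often in crawlers is re.S (also spelled re.DOTALL). By default . matches any character except a newline, so .* stops at line breaks. With re.S, . matches newlines as well.

Using the same string with a newline inserted, compare the result with and without re.S.

Without re.S:

Code: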

import re
content = "Hello 123456 rocketeerli this \n is a regex demo"
result = re.match(r"^He.*?(\d+).*?demo$", content)
print(result)
print(result.group())
print(result.group(1))
print(result.span())

Output:

None
Traceback (most recent call last):
  File ".\004_regex.py", line 48, in <module>
    print(result.group())
AttributeError: 'NoneType' object has no attribute 'group'
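The traceback appears because re.match returns None when nothing matches. A minimal sketch of guarding against that before calling group() follows; the variable names are illustrative.

```python
import re

content = "Hello 123456 rocketeerli this \n is a regex demo"

# Without re.S the dot cannot cross the newline, so there is no match.
result = re.match(r"^He.*?(\d+).*?demo$", content)

if result is None:
    digits = None
    print("no match")
else:
    digits = result.group(1)
    print(digits)
```

In a crawler this check matters, since a page whose layout has changed will simply stop matching rather than raise an error at the point of the search.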

With re.S:

Code:

import re
content = "Hello 123456 rocketeerli this \n is a regex demo"
result = re.match(r"^He.*?(\d+).*?demo$", content, re.S)
print(result)
print(result.group())
print(result.group(1))
print(result.span())

Output:

<_sre.SRE_Match object; span=(0, 47), match='Hello 123456 rocketeerli this \n is a regex demo'>
Hello 123456 rocketeerli this
is a regex demo
123456
(0, 47)

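With re.S, the dot matches the newline as well. This comes up constantly in crawlers, because HTML is usually spread over many lines.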
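Flags can also be combined with the | operator. The sketch below uses an upper-cased variant of the sample string (an assumption made for the demonstration) to show re.S together with re.I, plus re.M, which changes what ^ and $ mean.

```python
import re

content = "HELLO 123456 rocketeerli this \n is a regex DEMO"

# re.S lets . match the newline; re.I ignores case. Combine them with |.
result = re.match(r"^hello.*?(\d+).*demo$", content, re.S | re.I)
digits = result.group(1)

# re.M makes ^ match at the start of every line, not only at the start of the string.
lines = "first line\nsecond line"
without_m = re.findall(r"^\w+", lines)
with_m = re.findall(r"^\w+", lines, re.M)

print(digits, without_m, with_m)
```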

Escaping

As in many languages, special characters are escaped with a backslash (\).

For example, to match the literal characters $ and . :

import re
content = "price is $5.00"
# \$ and \. match a literal dollar sign and a literal dot
result = re.match(r"price is \$5\.00", content)
print(result)
print(result.group())

Output:

<_sre.SRE_Match object; span=(0, 14), match='price is $5.00'>
price is $5.00
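When the literal text comes from somewhere else (user input, another page), escaping by hand is error-prone. The standard library's re.escape can do it instead; the sketch below uses made-up sample strings.

```python
import re

literal = "$5.00 (approx.)"
text = "price is $5.00 (approx.) today"

# re.escape backslash-escapes every character that is special in a pattern.
found = re.search(re.escape(literal), text)
found_text = found.group()

# Without escaping, the dot matches any character, so "5.00" also matches "5x00".
loose = re.search("5.00", "price is 5x00")
strict = re.search(re.escape("5.00"), "price is 5x00")

print(found_text, loose.group(), strict)
```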

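(2) re.search

Often we do not need to match from the very start of the string and only want to find a pattern somewhere inside it. search is one of the most commonly used functions for that.

re.search: scans through the string and returns the first successful match.

Example: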

import re
content = "Hello 123456 rocketeerli this \n is a regex demo"
# Find the first run of digits that has whitespace on both sides
result = re.search(r"\s(\d+)\s", content, re.S)
print(result.group())
print(result.group(1))
print(result.span())

Output (the first line is the whole match, which includes the surrounding spaces):

 123456 
123456
(5, 13)
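The difference between match and search is easiest to see side by side. The sketch below also shows re.fullmatch, which is not covered in the article but belongs to the same family; the shortened input string is an assumption.

```python
import re

content = "Hello 123456 rocketeerli"

# match only succeeds if the pattern matches at position 0.
at_start = re.match(r"\d+", content)

# search scans forward and returns the first place where the pattern matches.
anywhere = re.search(r"\d+", content)

# fullmatch requires the pattern to cover the entire string.
whole_ok = re.fullmatch(r"\d+", "123456")
whole_fail = re.fullmatch(r"Hello \d+", content)

print(at_start, anywhere.group(), whole_ok.group(), whole_fail)
```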

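Matching practice

Here are some matching exercises of the kind that come up in crawlers.

First we need an HTML string. Every later snippet that refers to html uses this string: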

html = '''	<div id="songs-list"> 
		<h2 class="title">经典老歌</h2>
		<p class="introduction">
			经典老歌列表
		</p>
		<ul id="list" class="list-group">
			<li data-view="2">一路上有你</li>
			<li data-view="7">
				<a href="/2.mp3" singer="任贤齐">沧海一声笑</a>
			</li>
			<li data-view="4" class="active">
				<a href="/3.mp3" singer="齐秦">往事随风</a>
			</li>
			<li data-view="6">
				<a href="/4.mp3" singer="beyond">光辉岁月</a>
			</li>
			<li data-view="5">
				<a href="/5.mp3" singer="陈慧琳">记事本</a>
			</li>
			<li data-view="5">
				<a href="/6.mp3" singer="邓丽君"><i ></i>但愿人长久</a>
			</li>
		</ul>
	</div>'''

First, match with search():

result = re.search('<li.*?active.*?singer="(.*?)">(.*?)</a>', html, re.S)
if result:
    print(result.group(1), result.group(2))

Output:

齐秦 往事随风

search returns only the first match. If we want every song and singer, it is not enough.

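(3) re.findall

Another very commonly used function: it finds every non-overlapping match of the pattern and returns them as a list.

re.search, by contrast, returns only the first match.

Example: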

# Three groups per match: link, singer, song title
results = re.findall('<li.*?href="(.*?)".*?singer="(.*?)">(.*?)</a>', html, re.S)
print(results)
print(type(results))    # a list
for result in results:
    print(result)
    print(result[0], result[1], result[2])
# The <a> tags are optional here, so items without a link are matched too
results = re.findall(r'<li.*?>\s*?(<a.*?>)?(\w+)(</a>)?\s*?</li>', html, re.S)
print(results)

Output:

[('/2.mp3', '任贤齐', '沧海一声笑'), ('/3.mp3', '齐秦', '往事随风'), ('/4.mp3', 'beyond', '光辉岁月'), ('/5.mp3', '陈慧琳', '记事本'), ('/6.mp3', '邓丽君', '<i ></i>但愿人长久')]
<class 'list'>
('/2.mp3', '任贤齐', '沧海一声笑')
/2.mp3 任贤齐 沧海一声笑
('/3.mp3', '齐秦', '往事随风')
/3.mp3 齐秦 往事随风
('/4.mp3', 'beyond', '光辉岁月')
/4.mp3 beyond 光辉岁月
('/5.mp3', '陈慧琳', '记事本')
/5.mp3 陈慧琳 记事本
('/6.mp3', '邓丽君', '<i ></i>但愿人长久')
/6.mp3 邓丽君 <i ></i>但愿人长久
[('', '一路上有你', ''), ('<a href="/2.mp3" singer="任贤齐">', '沧海一声笑', '</a>'), ('<a href="/3.mp3" singer="齐秦">', '往事随风', '</a>'), ('<a href="/4.mp3" singer="beyond">', '光辉岁月', '</a>'), ('<a href="/5.mp3" singer="陈慧琳">', '记事本', '</a>'), ('<a href="/6.mp3" singer="邓丽君"><i ></i>', '但愿人长久', '</a>')]

Print every song title:

for result in results:
    print(result[1])

Output:

一路上有你
沧海一声笑
往事随风
光辉岁月
记事本
但愿人长久
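What findall returns depends on how many groups the pattern has, which is easy to trip over. The sketch below uses a small made-up HTML snippet, and also shows re.finditer with named groups as an alternative when tuple indexes become hard to read.

```python
import re

text = '<a href="/2.mp3">A</a> <a href="/3.mp3">B</a>'

# No group: a list of whole matches.
no_group = re.findall(r'href="/\d\.mp3"', text)

# One group: a list of that group's text.
one_group = re.findall(r'href="(.*?)"', text)

# Two or more groups: a list of tuples, one tuple per match.
two_groups = re.findall(r'href="(.*?)">(.*?)</a>', text)

# finditer yields match objects, so named groups can be read by name.
named = [
    m.groupdict()
    for m in re.finditer(r'href="(?P<link>.*?)">(?P<name>.*?)</a>', text)
]

print(no_group, one_group, two_groups, named, sep="\n")
```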

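(4) re.sub

sub() is also used very often: it replaces every match of the pattern in the string and returns the resulting string.

The example below removes the digits from a string: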

content = 'Extra strings Hello 1234567 World_This is a Regex Demo Extra strings'
content = re.sub(r'\d+', '', content)
print(content)

Output (two spaces remain where the number was):

Extra strings Hello  World_This is a Regex Demo Extra strings

The match can also be replaced with another string:

content = 'Extra strings Hello 1234567 World_This is a Regex Demo Extra strings'
content = re.sub(r'\d+', 'Replacement Data', content)
print(content)

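Output:

Extra strings Hello Replacement Data World_This is a Regex Demo Extra strings

This raises a question: what if the replacement needs to include the text that was matched?

We can refer to the first captured group as \1 and append extra characters after it: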

content = 'Extra strings Hello 1234567 World_This is a Regex Demo Extra strings'
content = re.sub(r'(\d+)', r'\1 89055', content)
print(content)

Here \1 stands for whatever the first pair of parentheses captured.

Output:

Extra strings Hello 1234567 89055 World_This is a Regex Demo Extra strings
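A few more things sub can do, sketched on the same sample string: the replacement can be a function, a count can limit how many matches are replaced, \g<1> avoids ambiguity when a digit follows the group reference, and re.subn also reports how many replacements were made. The helper name double is hypothetical.

```python
import re

content = "Extra strings Hello 1234567 World_This is a Regex Demo Extra strings"

def double(match):
    """Replacement function: receives the match object, returns the new text."""
    return str(int(match.group()) * 2)

doubled = re.sub(r"\d+", double, content)

# count=1 replaces only the first occurrence.
first_only = re.sub(r"Extra", "More", content, count=1)

# \g<1>0 means "group 1 followed by a literal 0"; \10 would be read as group 10.
with_zero = re.sub(r"(\d+)", r"\g<1>0", content)

# subn returns the new string together with the number of replacements.
replaced, how_many = re.subn(r"Extra", "More", content)

print(doubled, first_only, with_zero, how_many, sep="\n")
```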

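Using sub and findall together to extract the song titles

# Strip the <a> and <i> tags first, then capture whatever is left inside each <li>
html = re.sub('<a.*?>|</a>|<i.*?>|</i>', '', html)
results = re.findall('<li.*?>(.*?)</li>', html, re.S)
for result in results:
    print(result.strip())

Note the strip() call: it removes the given characters from both ends of a string, and by default it removes all leading and trailing whitespace (spaces, tabs, newlines). Here it removes the whitespace around each song title.

Output: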

一路上有你
沧海一声笑
往事随风
光辉岁月
记事本
但愿人长久
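For reference, a short sketch of how strip() behaves, using a captured value similar to what the findall call above returns for a multi-line list item (the exact whitespace is an assumption):

```python
# What a captured <li> body can look like before cleaning.
raw = "\n\t\t\t\t沧海一声笑\n\t\t\t"

cleaned = raw.strip()                 # no argument: trims whitespace at both ends
inner_kept = "  a b  ".strip()        # whitespace inside the string is untouched
dashes = "--title--".strip("-")       # with an argument: trims those characters instead
left_only = "  padded  ".lstrip()     # lstrip / rstrip trim one side only

print(repr(cleaned), repr(inner_kept), repr(dashes), repr(left_only))
```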

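(5) re.compile

So far every pattern has been written inline as a string argument, and it has to be written out again each time it is reused. compile() turns a pattern string into a reusable pattern object.

The first argument is the pattern and the second is the flags: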

import re
content = '''Hello 1234567 World 
This is a Regex Demo
'''
pattern = re.compile('Hello.*Demo', re.S)
result = re.match(pattern, content)
print(result)

Output:

<_sre.SRE_Match object; span=(0, 41), match='Hello 1234567 World \nThis is a Regex Demo'>

The result is the same as without compile. The main benefit is reuse: the pattern is defined once, and the code stays shorter and easier to follow.
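A compiled pattern also has its own methods (search, findall, sub and so on), so it does not have to be passed back into the module-level functions. The sketch below reuses one pattern on a two-line snippet taken from the songs example; the name song_pattern is an illustrative choice.

```python
import re

# Compile once, with the flags baked in.
song_pattern = re.compile(r'singer="(.*?)">(.*?)</a>', re.S)

snippet = (
    '<a href="/2.mp3" singer="任贤齐">沧海一声笑</a>\n'
    '<a href="/3.mp3" singer="齐秦">往事随风</a>'
)

# The same pattern object is reused for different operations.
first_song = song_pattern.search(snippet).groups()
all_songs = song_pattern.findall(snippet)

print(first_song)
print(all_songs)
```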

Hands-on practice

The following example scrapes the book list from the Douban Books home page.

Code:

import re
import requests

# Download the page; the site's markup may have changed since this was written
content = requests.get('https://book.douban.com/').text
# Group 1 is the book link, group 2 is the title
pattern = re.compile(r'<li.*?cover">\s+<a href="(.*?)" title="(.*?)">', re.S)
results = re.findall(pattern, content)
for result in results:
    link, name = result
    print(link + "\t" + name)

Output:

https://book.douban.com/subject/30376970/?icn=index-editionrecommend 追恐龙的男孩
https://book.douban.com/subject/30406658/?icn=index-editionrecommend 钱锺书交游考
https://book.douban.com/subject/30388816/?icn=index-editionrecommend 太阳全书
https://book.douban.com/subject/30408662/?icn=index-editionrecommend 拥有一个你说了算的人生. 活出自我篇
https://book.douban.com/subject/30418697/?icn=index-editionrecommend 变量
https://book.douban.com/subject/30380271/?icn=index-latestbook-subject 危险的维纳斯
https://book.douban.com/subject/30328192/?icn=index-latestbook-subject 吉尔·德勒兹
https://book.douban.com/subject/30370294/?icn=index-latestbook-subject 神秘
https://book.douban.com/subject/30389935/?icn=index-latestbook-subject 四个春天
https://book.douban.com/subject/30394589/?icn=index-latestbook-subject 月亮看见了
https://book.douban.com/subject/30331841/?icn=index-latestbook-subject 被诅咒的部分
https://book.douban.com/subject/30390651/?icn=index-latestbook-subject 时间的礼物
https://book.douban.com/subject/30409108/?icn=index-latestbook-subject 吃饭，流汗，玩耍
https://book.douban.com/subject/30358339/?icn=index-latestbook-subject 我如何成为一名畅销书作家
https://book.douban.com/subject/30331839/?icn=index-latestbook-subject T.S.艾略特传
https://book.douban.com/subject/30394212/?icn=index-latestbook-subject 暖气
https://book.douban.com/subject/30306723/?icn=index-latestbook-subject 埃及神话
https://book.douban.com/subject/30352058/?icn=index-latestbook-subject 世界诞生于午夜
https://book.douban.com/subject/30394606/?icn=index-latestbook-subject 太时髦了！
https://book.douban.com/subject/30394658/?icn=index-latestbook-subject 想我苦哈哈的一生
https://book.douban.com/subject/30358955/?icn=index-latestbook-subject 大师镜头昆汀篇
https://book.douban.com/subject/30401009/?icn=index-latestbook-subject 水妖
https://book.douban.com/subject/30394086/?icn=index-latestbook-subject 20世纪西方人类学主要著作指南
https://book.douban.com/subject/30324805/?icn=index-latestbook-subject 莎拉的钥匙
https://book.douban.com/subject/30283398/?icn=index-latestbook-subject 家族、土地与祖先
https://book.douban.com/subject/30409058/?icn=index-latestbook-subject 应物兄
https://book.douban.com/subject/30414743/?icn=index-latestbook-subject 显微镜下的大明
https://book.douban.com/subject/30328198/?icn=index-latestbook-subject 对岸的她
https://book.douban.com/subject/30401611/?icn=index-latestbook-subject 金叶 : 来自金枝的故事
https://book.douban.com/subject/30297230/?icn=index-latestbook-subject 丝
https://book.douban.com/subject/30384531/?icn=index-latestbook-subject 秩序
https://book.douban.com/subject/30374824/?icn=index-latestbook-subject 绘星者
https://book.douban.com/subject/30389970/?icn=index-latestbook-subject 写作的禅机
https://book.douban.com/subject/30400605/?icn=index-latestbook-subject 保龄球的意识流
https://book.douban.com/subject/30376593/?icn=index-latestbook-subject 当下的启蒙
https://book.douban.com/subject/30415084/?icn=index-latestbook-subject 长长的回家路
https://book.douban.com/subject/30247860/?icn=index-latestbook-subject 整个巴黎属于我
https://book.douban.com/subject/30397755/?icn=index-latestbook-subject 2018年中国悬疑小说精选
https://book.douban.com/subject/30383926/?icn=index-latestbook-subject 表象与本质
https://book.douban.com/subject/30376530/?icn=index-latestbook-subject 夜逝之时
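To apply the same pattern to several pages, the extraction can be wrapped in a function and called once per page. The sketch below runs offline: the example.com URLs and the HTML in sample_pages are hypothetical stand-ins for downloaded pages, shaped only to fit the pattern above, and extract_books is an illustrative name.

```python
import re

# Same pattern as in the example above, compiled once and reused for every page.
BOOK_PATTERN = re.compile(r'<li.*?cover">\s+<a href="(.*?)" title="(.*?)">', re.S)

def extract_books(page_html):
    """Return a list of (link, title) tuples found in one page of HTML."""
    return BOOK_PATTERN.findall(page_html)

# Hypothetical page contents keyed by hypothetical URLs; in a real crawler each
# value would be the text of a downloaded response.
sample_pages = {
    "https://example.com/page/1": '''
<li class="">
  <div class="cover">
    <a href="https://example.com/subject/1/" title="Book One">
      <img src="cover1.jpg">
    </a>
  </div>
</li>
<li class="">
  <div class="cover">
    <a href="https://example.com/subject/2/" title="Book Two">
''',
    "https://example.com/page/2": '''
<li><div class="cover">
    <a href="https://example.com/subject/3/" title="Book Three">
''',
}

all_books = []
for url, page_html in sample_pages.items():
    for link, title in extract_books(page_html):
        all_books.append((url, link, title))

for url, link, title in all_books:
    print(url + "\t" + link + "\t" + title)
```

Keeping the download step separate from the extraction step also makes the pattern easy to test against saved pages.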

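Summary

Regular expressions are very useful for parsing crawler output. Libraries such as BeautifulSoup can parse HTML, but regular expressions offer more freedom, and combined with an HTML parser they become more powerful and more convenient.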
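As one possible illustration of combining a parser with regular expressions, the sketch below uses the standard library's html.parser to walk the tags and a regex only to tidy the whitespace. The class name ListItemText and the shortened snippet are assumptions made for the example; BeautifulSoup would work along similar lines.

```python
import re
from html.parser import HTMLParser

class ListItemText(HTMLParser):
    """Collect the visible text of every <li> element."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._parts = None  # None while the parser is outside an <li>

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._parts = []

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "li" and self._parts is not None:
            # The regex collapses runs of whitespace left over from the markup.
            text = re.sub(r"\s+", " ", "".join(self._parts)).strip()
            self.items.append(text)
            self._parts = None

snippet = '''<ul>
  <li data-view="2">一路上有你</li>
  <li data-view="5">
    <a href="/6.mp3" singer="邓丽君"><i ></i>但愿人长久</a>
  </li>
</ul>'''

parser = ListItemText()
parser.feed(snippet)
parser.close()
song_titles = parser.items
print(song_titles)
```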

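These are notes I took while watching an introductory crawler video. I hope they help; if you spot a mistake, please point it out. Thank you.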
