Python crawlers: finding tags by class and searching with Beautiful Soup

Reposted

mob64ca14079fb3 2023-10-24 21:56:40


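Searching the document tree

1. Beautiful Soup defines many search methods; this article focuses on two of them: find() and find_all().

2. find_all() and the methods modeled on it locate the parts of a document you are looking for.

3. Any BeautifulSoup object or Tag object can call find() and find_all() to search the tags beneath it.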
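
As a minimal sketch of point 3, the snippet below calls find_all() first on the whole soup and then on a single Tag, so the second search only sees that tag's descendants. The HTML string is made up for illustration, and html.parser is used only so that the snippet needs no extra parser.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only.
html = '<div id="a"><p>one</p><p>two</p></div><div id="b"><p>three</p></div>'
soup = BeautifulSoup(html, "html.parser")

# Search from the root: every <p> in the document.
all_p = soup.find_all("p")

# Search from one Tag: only the <p> tags beneath <div id="b">.
div_b = soup.find("div", id="b")
p_in_b = [p.get_text() for p in div_b.find_all("p")]

print(len(all_p))  # 3
print(p_in_b)      # ['three']
```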

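Filters

1. Before looking at find_all() in detail, it helps to know the kinds of filters, because they run through the whole search API. A filter can be applied to a tag's name, to its attributes, to string content, or to a combination of these.

2. A filter is only ever used as an argument to a search method; "argument type" might be the more accurate term. Whatever you want to find, you pass it as an argument to find_all() or a similar method.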

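Strings

The simplest filter is a string (a tag name). Pass a string to a search method and Beautiful Soup matches the tags whose name equals that string exactly.

Example 1: find all <b> tags in the document

from bs4 import BeautifulSoup  # import the bs4 library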

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, "lxml")  # specify the parser

tag_p = soup.find_all("p")
print(tag_p)

tag_b = soup.find_all("b")  # the <b> tag is nested inside the first <p> tag
print(tag_b)

tag_a = soup.find_all("a")  # the <a> tags are nested inside the second <p> tag
print(tag_a)

"""
[<p class="title" name="dromouse"><b>The Dormouse's story</b></p>,
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>]

[<b>The Dormouse's story</b>]

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, 
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
"""

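Notes on the example above:
1. The result is a list of all matching tags. Each matching tag is one element of the list, and each element is a Tag object.

2. A matching tag is output whole, whatever it contains: the <b> tag is nested inside the first <p> tag, so searching for <p> outputs that <b> along with its parent.

3. The reverse does not hold: searching for <b> outputs only the matching <b> tag, not the <p> tag that encloses it.

4. If you pass in a byte string as the filter, Beautiful Soup assumes it is encoded as UTF-8. Passing in a Unicode string instead avoids encoding errors.

5. Iterating over the list yields Tag objects one by one, so every Tag method can be used on them.
Example 1_1: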

for i in tag_a:
    print(i, type(i))
    print(i.get("href"))  # read the attribute from the current tag, not from soup.a

"""
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a> <class 'bs4.element.Tag'>
http://example.com/elsie
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> <class 'bs4.element.Tag'>
http://example.com/lacie
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a> <class 'bs4.element.Tag'>
http://example.com/tillie
"""

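When reading attributes inside such a loop, it can help to know how the two access styles differ. The sketch below uses made-up HTML in which one <a> tag has no href; it assumes nothing beyond Tag.get() and dictionary-style access.

```python
from bs4 import BeautifulSoup

# Made-up HTML: the second <a> deliberately has no href attribute.
html = '<a href="http://example.com/lacie" id="link2">Lacie</a><a id="link4">No link</a>'
soup = BeautifulSoup(html, "html.parser")

hrefs = []
for a in soup.find_all("a"):
    # .get() returns None for a missing attribute; a["href"] would raise KeyError.
    hrefs.append(a.get("href"))

print(hrefs)  # ['http://example.com/lacie', None]
```
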
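Regular expressions

If you pass in a regular expression object, Beautiful Soup filters against it using its search() method.
Example 2: find all tags whose names start with "b"

from bs4 import BeautifulSoup  # import the bs4 library
import re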
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, "lxml")  # specify the parser
tag_b = soup.find_all(re.compile("^b"))  # this also returns a list

for i in tag_b:
    print(i,type(i))
    print(i.name)

"""
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body> <class 'bs4.element.Tag'>
body
<b>The Dormouse's story</b> <class 'bs4.element.Tag'>
b

"""

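Notes on the example above:
1. The filter passed to find_all() is a regular expression (tag names starting with "b"). Two tags in the HTML match it, <body> and <b>, so both are output.

2. Each result is a Tag object, so Tag methods can be used on it.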
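
Beautiful Soup tests the pattern with search(), so an unanchored pattern matches anywhere in a tag name; anchor it with ^ when you mean "starts with". A minimal sketch of the difference, using a made-up HTML string and html.parser to avoid an extra parser dependency:

```python
import re
from bs4 import BeautifulSoup

# Made-up HTML for illustration only.
html = "<html><head><title>T</title></head><body><p><b>bold</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Unanchored: any tag whose name contains "t".
contains_t = [tag.name for tag in soup.find_all(re.compile("t"))]

# Anchored: only tags whose name starts with "b".
starts_with_b = [tag.name for tag in soup.find_all(re.compile("^b"))]

print(contains_t)     # ['html', 'title']
print(starts_with_b)  # ['body', 'b']
```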

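Lists

If you pass in a list, Beautiful Soup returns anything that matches any item in the list.
Example 3: find all <a> tags and <b> tags in the document

from bs4 import BeautifulSoup  # import the bs4 library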

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, "lxml")  # specify the parser
tag_a_b = soup.find_all(["a", "b"])  # this also returns a list

print(tag_a_b,type(tag_a_b))

"""
[<b>The Dormouse's story</b>,
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>] 

<class 'bs4.element.ResultSet'>
"""

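Notes on the example above:
1. To search for several tags at once, put the names in a list and pass the list to find_all() as the filter.

2. The result is a list (a ResultSet) of all matching tags, and each element is again a Tag object.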

True

True matches any value. The code below finds every tag in the document but returns none of the text strings.
Example 4:

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, "lxml")  # specify the parser
tag = soup.find_all(True)
print(tag)

# This filter is not needed often; it is enough to know that it exists.
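
One place where True can be handy is listing which tag names a document contains. A small sketch on a made-up snippet (html.parser is used so that no <html> or <body> wrapper is added):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b>!</p>", "html.parser")

# True matches every tag, but never a text node.
tag_names = [tag.name for tag in soup.find_all(True)]
distinct_names = sorted(set(tag_names))

print(tag_names)       # ['p', 'b']
print(distinct_names)  # ['b', 'p']
```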

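Functions

1. If none of the other filters fits, you can define a function that takes a single element as its argument. It should return True if the element matches and False otherwise.

2. The element argument is a tag node in the HTML document; it cannot be a text node.

Example 5: a function that returns True if a tag has a class attribute but no id attribute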

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

soup = BeautifulSoup(html,"lxml")

tag = soup.find_all(has_class_but_no_id)  # pass the function itself to find_all()
print(tag)

"""
[<p class="title" name="dromouse"><b>The Dormouse's story</b></p>,

<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;and they lived at the bottom of a well.</p>,

<p class="story">...</p>]
"""

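Note:
The filter in the example above matches tags that have a class attribute but no id attribute, so only the three <p> tags match. The <a> tags do not match, because they have an id, but they are nested inside a <p> tag and therefore appear when that <p> is output.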
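
A filter function can test anything about a tag, not just which attributes exist. The sketch below is one more illustration; is_link_to_lacie is a hypothetical helper and the HTML is a shortened, made-up variant of the sample document.

```python
from bs4 import BeautifulSoup

# Shortened, made-up variant of the sample document; the third <a> has no href.
html = """
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a class="sister" id="link3">Tillie</a>
"""
soup = BeautifulSoup(html, "html.parser")

def is_link_to_lacie(tag):
    # Hypothetical helper: an <a> tag whose href ends with "lacie".
    # .get() falls back to "" when the attribute is missing.
    return tag.name == "a" and tag.get("href", "").endswith("lacie")

matches = soup.find_all(is_link_to_lacie)
match_ids = [a["id"] for a in matches]
print(match_ids)  # ['link2']
```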

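The find_all() method

Syntax:

find_all(name, attrs, recursive, text, limit, **kwargs)

Description:

1. find_all() looks through all tag descendants of the current tag and checks each one against the filters.

2. The filters described above apply here as well: they can be used on the tag name (the name argument) and on attribute values (keyword arguments).

The name argument

1. The name argument finds all tags whose name is name; text strings are ignored automatically.

2. The value of name can be any kind of filter: a string, a regular expression, a list, a function, or True.
Example 6: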

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""


soup = BeautifulSoup(html, "lxml")  # specify the parser

tag_title = soup.find_all("title")
print(tag_title)

tag_a = soup.find_all("a")
print(tag_a)

"""
[<title>The Dormouse's story</title>]

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
"""

Note:
As the output shows, this is the same as the filters described earlier: the value of the name argument can be any kind of filter.

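Keyword arguments

Any keyword argument that is not one of the built-in argument names is treated as a filter on the tag attribute of that name.
For example, if you pass a value for an argument called id, Beautiful Soup filters on each tag's "id" attribute.
Example 7: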

from bs4 import BeautifulSoup
import re

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""


soup = BeautifulSoup(html, "lxml")  # specify the parser

tag_link = soup.find_all(id="link2")  # filter on the id attribute
print(tag_link)

tag_href = soup.find_all(href=re.compile("example"))  # filter on the href attribute
print(tag_href)

tag_True = soup.find_all(id=True)  # True matches any tag that has an id
print(tag_True)

tag_all = soup.find_all(href=re.compile("example"), id='link1')  # several keyword arguments at once
print(tag_all)

tag_class = soup.find_all(class_="sister")  # filter on the class attribute
print(tag_class)

"""
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, 
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, 
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, 
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
"""

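Notes:
The example shows several ways to search with keyword arguments. The value for a named attribute can be a string, a regular expression, a list, or True, and the arguments can be combined.
1. id: with an argument called id, Beautiful Soup filters on each tag's "id" attribute.
2. href: with an argument called href, Beautiful Soup filters on each tag's "href" attribute.
3. True: finds every tag in the document tree that has an id attribute, whatever its value.
4. Several keywords together: passing more than one keyword argument filters on several attributes at once.
5. class: class is a reserved word in Python, so the keyword argument is spelled class_, with a trailing underscore.
6. Combining several kinds of filters narrows the search and makes the result more precise.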
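
Some attributes cannot be passed as keyword arguments: name collides with find_all()'s own first argument, and names such as data-foo are not valid Python identifiers. In both cases the attrs dictionary can be used instead. A minimal sketch, in which the <div data-foo> element is made up for illustration:

```python
from bs4 import BeautifulSoup

# The <p> comes from the sample document; the <div> is made up.
html = ('<p class="title" name="dromouse"><b>The Dormouse\'s story</b></p>'
        '<div data-foo="value">foo!</div>')
soup = BeautifulSoup(html, "html.parser")

# name= is find_all()'s own argument (the tag name), so this looks for <dromouse> tags.
by_tag_name = soup.find_all(name="dromouse")

# Put the attribute in the attrs dictionary instead.
by_name_attr = [t.name for t in soup.find_all(attrs={"name": "dromouse"})]

# data-foo is not a valid Python identifier, so it also goes through attrs.
by_data_attr = [t.name for t in soup.find_all(attrs={"data-foo": "value"})]

print(by_tag_name)   # []
print(by_name_attr)  # ['p']
print(by_data_attr)  # ['div']
```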

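Searching by CSS class

1. Searching for tags by CSS class is very useful, but class, the name of the CSS attribute, is a reserved word in Python, so using class as a keyword argument is a syntax error. As of Beautiful Soup 4.1.2 you can search by CSS class with the keyword argument class_ (also shown in the example above).

Example 8: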

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""


soup = BeautifulSoup(html, "lxml")  # specify the parser

tag_class_1 = soup.find_all(class_="sister", id="link3")  # combine class_ with id
print(tag_class_1)

"""
[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
"""

2. Like any keyword argument, class_ accepts different kinds of filters: a string, a regular expression, a function, or True.
Example 8_1:

from bs4 import BeautifulSoup
import re

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, "lxml")  # specify the parser

tag_class_1 = soup.find_all(class_=re.compile("itl"))
print(tag_class_1)

#[<p class="title" name="dromouse"><b>The Dormouse's story</b></p>]
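
A tag can carry several CSS classes at once, and class_ then matches against each class individually. The sketch below follows the behaviour described in the official documentation; the one-line document is made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up tag with two CSS classes.
css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")

one_class = css_soup.find_all("p", class_="strikeout")       # matches one of the classes
exact = css_soup.find_all("p", class_="body strikeout")      # matches the exact attribute string
reordered = css_soup.find_all("p", class_="strikeout body")  # different order: no match
by_selector = css_soup.select("p.strikeout.body")            # CSS selector: order does not matter

counts = (len(one_class), len(exact), len(reordered), len(by_selector))
print(counts)  # (1, 1, 0, 1)
```

When a tag must have two or more classes regardless of their order, a CSS selector is usually the more robust choice.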

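The text argument

1. The text argument searches the string content of the document instead of tags. Like name, it accepts a string, a regular expression, a list, or True. Since Beautiful Soup 4.4.0 the argument is called string; text is the older name.

Example 9: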

from bs4 import BeautifulSoup
import re

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

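soup = BeautifulSoup(html, "lxml")  # specify the parser

more = soup.find_all(text=["Tillie", "Elsie", "Lacie"])
print(more)  # "Elsie" only occurs inside an HTML comment, as " Elsie ", so it is not matched

story_strings = soup.find_all(text=re.compile("story"))
print(story_strings)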

"""
['Lacie', 'Tillie']
["The Dormouse's story", "The Dormouse's story"]
"""

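2. Although text is meant for finding strings, it can be combined with arguments that find tags. Beautiful Soup then returns the tags whose .string attribute matches the value of text.
Example 9_1: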

from bs4 import BeautifulSoup
import re

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, "lxml")  # specify the parser

tag_a = soup.find_all("a", text="Tillie")
print(tag_a)

#[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
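
Because the comparison is made against .string, a tag with more than one child never matches a text filter, since its .string is None. The sketch below illustrates this on a made-up fragment; it assumes Beautiful Soup 4.4.0 or later, where the argument is spelled string.

```python
import re
from bs4 import BeautifulSoup

# Made-up fragment: <p> contains a text node and an <a> tag.
html = '<p class="story">Once upon a time: <a id="link3">Tillie</a></p>'
soup = BeautifulSoup(html, "html.parser")

# <a> has a single text child, so its .string is "Tillie" and the filter matches.
a_ids = [a["id"] for a in soup.find_all("a", string="Tillie")]

# <p> has two children, so its .string is None and nothing matches.
p_tags = soup.find_all("p", string=re.compile("Once"))

print(a_ids)          # ['link3']
print(p_tags)         # []
print(soup.p.string)  # None
```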

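The limit argument

find_all() returns every result, which can take a while on a large document tree. If you do not need all the results, pass limit to cap their number. It works like the LIMIT keyword in SQL: once the number of results reaches the limit, the search stops and returns them.

Example 10: three tags in the document tree match, but only two are returned because the number of results is limited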

print(soup.find_all("a", limit=2))

"""
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
"""

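A self-contained sketch of the same idea, with a made-up three-link document, also shows that a limit larger than the number of matches simply returns all of them:

```python
from bs4 import BeautifulSoup

# Made-up document with three links.
html = '<a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a>'
soup = BeautifulSoup(html, "html.parser")

first_two = [a["id"] for a in soup.find_all("a", limit=2)]
up_to_ten = soup.find_all("a", limit=10)

print(first_two)       # ['link1', 'link2']
print(len(up_to_ten))  # 3
```
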
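The find() method

Syntax:

find(name, attrs, recursive, text, **kwargs)

Description
1. find_all() returns every tag that matches, but sometimes only one result is wanted. If the document has only one <body> tag, for example, scanning the whole document with find_all() is wasteful; rather than calling find_all() with limit=1, call find() directly.
Example 11: the two calls below are nearly equivalent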

import re
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, "lxml")  # specify the parser

print(soup.find_all('title', limit=1))  # returns a list


print(soup.find('title'))  # returns a Tag

"""
[<title>The Dormouse's story</title>]
<title>The Dormouse's story</title>
"""

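Notes:
1. The only difference between the two calls is that find_all() returns a list containing a single element (without limit, it would contain every matching tag), while find() returns the result itself.

2. When nothing matches, find_all() returns an empty list and find() returns None.

3. As the output shows, find_all() returns a list that has to be iterated to reach the Tag objects, whereas find() returns a Tag object directly.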
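
The None returned by find() is a common source of errors in crawlers, because any attribute access on it fails. A minimal sketch of guarding against it, using a made-up document that contains no links:

```python
from bs4 import BeautifulSoup

# Made-up document with no <a> tags.
soup = BeautifulSoup("<p>no links here</p>", "html.parser")

missing_all = soup.find_all("a")  # empty list
missing_one = soup.find("a")      # None

# Check for None first; missing_one["href"] would raise TypeError.
href = missing_one["href"] if missing_one is not None else None

print(missing_all, missing_one, href)  # [] None None
```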

Example:

from bs4 import BeautifulSoup  # import the bs4 library

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, "lxml")  # specify the parser and create the BeautifulSoup object


p_string = soup.p.string
print("String of the first <p> tag:", p_string)


p = soup.find_all("p")
print("Tags:", p)
for i in p:
    print("String of this tag:", i.string)
    print("Value of its class attribute:", i["class"])
    print(i.get("class"))

"""
String of the first <p> tag: The Dormouse's story
Tags: [<p class="title" name="dromouse"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>]
String of this tag: The Dormouse's story
Value of its class attribute: ['title']
['title']
String of this tag: None
Value of its class attribute: ['story']
['story']
String of this tag: ...
Value of its class attribute: ['story']
['story']
"""

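Notes:

1. find_all() returns a list of all the Tag objects that match; iterate over it to reach an individual Tag.

2. find() returns the first Tag object that matches, directly.

3. Ways to get the string inside a tag:

⑴ Use soup.tagname.string directly: this gives the string of the first tag with that name.

⑵ Find all matching Tag objects first, then read .string on each one: this gives the strings of all matching tags. Note that .string is None when a tag has more than one child, as the second <p> in the output above shows.

4. When parsing an XML document, specify "xml" as the parser rather than "lxml". "lxml" is the HTML parser, so it treats the document as HTML and can mis-parse it.

5. The main parts of an HTML or XML document are:

⑴ Tags, with their attributes and attribute values (key: value). Once you have a Tag object, you can use dictionary-style access on it to read the value of a particular attribute.

⑵ Strings: the text between an opening and a closing tag, retrieved as described in note 3.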
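
When .string is None because a tag has several children, get_text() is one way to collect all of the text beneath it. A short sketch on a made-up fragment, which also shows dictionary-style attribute access:

```python
from bs4 import BeautifulSoup

# Made-up fragment: <p> contains text, an <a> tag, and more text.
html = '<p class="story">Once upon a time: <a id="link3">Tillie</a>.</p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")

full_text = p.get_text()

print(p.string)    # None, because <p> has several children
print(full_text)   # Once upon a time: Tillie.
print(p["class"])  # ['story'] (class is multi-valued, so the value is a list)
print(p.a["id"])   # link3
```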

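CSS selectors

1. Beautiful Soup supports most CSS selectors. Pass a selector string to the .select() method of a Tag or BeautifulSoup object to find tags with CSS selector syntax.

2. CSS selectors are a separate search syntax; for a reference see http://www.w3school.com.cn/css/css_selector_type.asp

3. There are many kinds of CSS selectors. This section covers one very common approach; for the others, see

https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id87

Step 1: open the page in the browser, press F12 to open the developer tools, select the element you need, then right-click -> Copy -> Copy selector to copy the path of that tag.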


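Step 2: paste the path into any text editor (copy a few of them so that they can be compared), for example:

⑴ #mainBox > main > div.article-list > div:nth-child(4) > h4 > a
⑵ #mainBox > main > div.article-list > div:nth-child(5) > h4 > a

Step 3: comparing the paths from step 2, the only part that differs is ":nth-child(num)". Delete that part, colon included, to get a general path:

#mainBox > main > div.article-list > div > h4 > a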

Example 12:

import requests
from bs4 import BeautifulSoup

url = '********'
header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 UBrowser/6.1.2107.204 Safari/537.36'}

html = requests.get(url,headers = header)

# parse with the built-in html.parser: slower, but available everywhere
soup = BeautifulSoup(html.text,'html.parser')

tag = soup.select("#mainBox > main > div.article-list > div > h4 > a")
print(tag)
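
Since the URL above is elided, the selector can also be tried offline. The sketch below runs the same selector against made-up markup that only mimics the structure the selector expects; the hrefs and titles are placeholders.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the structure behind the selector; not the real page.
html = """
<div id="mainBox"><main><div class="article-list">
  <div><h4><a href="/post/1">First post</a></h4></div>
  <div><h4><a href="/post/2">Second post</a></h4></div>
</div></main></div>
"""
soup = BeautifulSoup(html, "html.parser")

# General path: every entry in the list.
links = soup.select("#mainBox > main > div.article-list > div > h4 > a")
hrefs = [a["href"] for a in links]

# With :nth-child() left in, only one entry matches.
second = soup.select_one("#mainBox > main > div.article-list > div:nth-child(2) > h4 > a")

print(hrefs)              # ['/post/1', '/post/2']
print(second.get_text())  # Second post
```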

Going further

Test HTML

<div class="postlist">
        <ul id="pins">
                  <li><a href="https://www.mzitu.com/198830" target="_blank"><img class="lazy" src="https://i.meizitu.net/thumbs/2019/08/198830_14a48_236.jpg" 
           <li><a href="https://www.mzitu.com/189169" target="_blank"><img class="lazy" src="https://i.meizitu.net/thumbs/2019/08/189169_11b11_236.jpg" 
         
           <li><a href="https://www.mzitu.com/190884" target="_blank"><img class="lazy" src="https://i.meizitu.net/thumbs/2019/08/190884_20c59_236.jpg" 
           <li><a href="https://www.mzitu.com/190416" target="_blank"><img class="lazy" src="https://i.meizitu.net/thumbs/2019/08/190416_18d11_236.jpg" 
           <li><a href="https://www.mzitu.com/190947" target="_blank"><img class="lazy" src="https://i.meizitu.net/thumbs/2019/08/190947_21a24_236.jpg" 
           <li><a href="https://www.mzitu.com/190259" target="_blank"><img class="lazy" src="https://i.meizitu.net/thumbs/2019/08/190259_18a38_236.jpg"
           <li><a href="https://www.mzitu.com/195585" target="_blank"><img class="lazy" src="https://i.meizitu.net/thumbs/2019/08/195585_16a41_236.jpg" 
           <li><a href="https://www.mzitu.com/190177" target="_blank"><img class="lazy" src="https://i.meizitu.net/pfiles/img/lazy.png" data-original="https://i.meizitu.net/thumbs/2019/08/190177_16e34_236.jpg"
           <li><a href="https://www.mzitu.com/191199" target="_blank"><img class="lazy" src="https://i.meizitu.net/pfiles/img/lazy.png" data-original="https://i.meizitu.net/thumbs/2019/08/191199_22c42_236.jpg" 
           <li><a href="https://www.mzitu.com/190636" target="_blank"><img class="lazy" src="https://i.meizitu.net/pfiles/img/lazy.png" data-original="https://i.meizitu.net/thumbs/2019/08/190636_19c12_236.jpg"
           <li><a href="https://www.mzitu.com/191054" target="_blank"><img class="lazy" src="https://i.meizitu.net/pfiles/img/lazy.png" data-original="https://i.meizitu.net/thumbs/2019/08/191054_21c15_236.jpg" 
           <li><a href="https://www.mzitu.com/190302" target="_blank"><img class="lazy" src="https://i.meizitu.net/pfiles/img/lazy.png" data-original="https://i.meizitu.net/thumbs/2019/08/190302_18b26_236.jpg" 
      
    <nav class="navigation pagination" role="navigation">
        
        <div class="nav-links"><span aria-current="page" class="page-numbers current">1</span>
<a class="page-numbers" href="https://www.mzitu.com/page/2/">2</a>
<a class="page-numbers" href="https://www.mzitu.com/page/3/">3</a>
<a class="page-numbers" href="https://www.mzitu.com/page/4/">4</a>
<span class="page-numbers dots">…</span>
<a class="page-numbers" href="https://www.mzitu.com/page/228/">228</a>
<a class="next page-numbers" href="https://www.mzitu.com/page/2/">下一页»</a></div>
    </nav>    </div>

Example 13:

import requests
from bs4 import BeautifulSoup

url = 'http://www.mzitu.com'
header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 UBrowser/6.1.2107.204 Safari/537.36'}

html = requests.get(url,headers = header)

# parse with the built-in html.parser: slower, but available everywhere
soup = BeautifulSoup(html.text,'html.parser')

# Method 1
"""
# The links we want are all the <a> tags inside the first div with class="postlist"
all_a = soup.find('div',class_='postlist').find_all('a',target='_blank')

for a in all_a:
    print(a["href"])
"""
# Method 2
all_a = soup.find_all('a',target="_blank")

for a in all_a:
    print(a["href"])

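Note: the example uses two ways to find tags matching ('a', target="_blank"), and the two produce different output.

1. An HTML page can contain tags that satisfy the search conditions but carry information we do not need.

2. Looking at the HTML, everything we need sits under the <div class="postlist"> tag. So we first use find() to get that Tag object and then call find_all() on it to find the tags we want. Tags outside <div class="postlist"> that also satisfy the conditions of method 2 are then not returned.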
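
The difference between the two methods can be reproduced without network access. The sketch below uses made-up markup with one unwanted link outside the list; the class names follow the test HTML, while the hrefs are placeholders.

```python
from bs4 import BeautifulSoup

# Made-up markup: one unwanted link outside the list, two wanted links inside it.
html = """
<div class="header"><a href="/ad" target="_blank">Ad</a></div>
<div class="postlist"><ul>
  <li><a href="/post/1" target="_blank">Post 1</a></li>
  <li><a href="/post/2" target="_blank">Post 2</a></li>
</ul></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Method 2: search the whole document.
everywhere = [a["href"] for a in soup.find_all("a", target="_blank")]

# Method 1: narrow down to the container first, then search inside it.
container = soup.find("div", class_="postlist")
scoped = []
if container is not None:  # find() returns None when the container is missing
    scoped = [a["href"] for a in container.find_all("a", target="_blank")]

print(everywhere)  # ['/ad', '/post/1', '/post/2']
print(scoped)      # ['/post/1', '/post/2']
```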

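Notes:

1. In my experience the form used most often is find_all(tag name, keyword arguments). find_all(tag name) alone also works, but adding keyword arguments makes the search more precise.

2. This article follows the official Beautiful Soup documentation. It is a record of my own learning, kept for later reference, and it certainly contains mistakes and omissions; if you happen to read it, please bear with them, or go straight to the official documentation: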

https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id87
