BeautifulSoup: A Beautiful Bowl of Soup with a Hidden Pitfall

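良思远行 2018-10-19 09:57:14

© Copyright belongs to the author. This is an original work by 51CTO blog author 良思远行; contact the author for permission before reposting, as unauthorized reposting may lead to legal action.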

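Python web scrapers commonly rely on four parsing helpers: re regular expressions, etree XPath, scrapy XPath, and BeautifulSoup. (etree XPath and scrapy XPath differ considerably in usage, so they are not counted as one.) This post describes a little-known pitfall in BeautifulSoup. See the examples.

Example 1 (the sample looks a little unusual; please excuse the Khmer text):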
    from bs4 import BeautifulSoup

    content = """
    <html>
         <body>
          <div class="td-post-content td-pb-padding-side">
           <p>
            <img alt="" class="alignnone size-full wp-image-122426" 
        data-recalc-dims="1" height="352" 
        src="https://i2.wp.com/img.postnews.com.kh/2017/01/Anal-Itching.jpg?resize=630%2C352&amp;ssl=1" 
        width="630"/>
           </p>

           <p>
            <img alt="" class="alignnone size-full wp-image-122427" 
        data-recalc-dims="1" height="473" 
        src="https://i1.wp.com/img.postnews.com.kh/2017/01/Anal-Itching1.jpg?resize=630%2C473&amp;ssl=1" 
        width="630"/>
           </p>
           <p>
            ចំណែកឯប្រេងដូងវិញ មានផ្ទុកអាស៊ីតខ្លាញ់អូមេហ្គា៣ 
        ដែលល្អបំផុតសម្រាប់បំផ្លាញ់មីក្រុបដែលមានវត្តមាននៅក្នុងតំបន់រន្ធគូថ 
        ហេតុនេះហើយទើបការឆ្លងមេរោគ និងរមាស់ត្រូវបានទប់ស្កាត់។
           </p>
           <p>

            <img alt="" class="alignnone size-full wp-image-122427" 
        data-recalc-dims="1" height="473" 
        src="https://i1.wp.com/img.postnews.com.kh/2017/01/Anal-Itching1.jpg?resize=630%2C473&amp;ssl=1" 
        width="630"/>
           </p>
          
           <p>
            <img alt="" class="alignnone size-full wp-image-122428" 
        data-recalc-dims="1" height="473" 
        src="https://i2.wp.com/img.postnews.com.kh/2017/01/Anal-Itching2.jpg?resize=630%2C473&amp;ssl=1" 
        width="630"/>
            <br/>
            <em>
             <br/>
             ចំណាំ៖
            </em>
            ប្រសិនបើអ្នករមាស់ខ្លាំង មានការឈឺចាប់ ហើយមានឈាមហូរទៀតនោះ 
        ត្រូវប្រញាប់ទៅជួបជាមួយគ្រូពេទ្យភ្លាម៕
           </p>
          </div>
         </body>
        </html>
""" 
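    # Name the parser explicitly so the result does not depend on what is installed.
    soup = BeautifulSoup(content, "html.parser")

    # First pass: put the "&amp;" entity back by hand before printing.
    inner_src_list = soup.find_all('img', src=True)
    for src in inner_src_list:
        url = src["src"].replace("&ssl", "&amp;ssl")
        print(url)

    # prettify() serializes the tree and escapes "&" as "&amp;" again.
    print(soup.prettify())
    # content = soup.prettify()    # printing src afterwards gives the same result

    # Second pass: print each src exactly as BeautifulSoup stores it.
    img_tags = soup.find_all('img')
    for img in img_tags:
        print(img['src'])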

The console output is as follows:
		![](http://i2.51cto.com/images/blog/201810/19/f709eed65fc5ebf49e98cc7cb67e6b91.png?x-oss-process=image/watermark,size_16,text_QDUxQ1RP5Y2a5a6i,color_FFFFFF,t_30,g_se,x_10,y_10,shadow_20,type_ZmFuZ3poZW5naGVpdGk=)
		![](http://i2.51cto.com/images/blog/201810/19/3bda9857b63335670b3dcac69903aa74.png?x-oss-process=image/watermark,size_16,text_QDUxQ1RP5Y2a5a6i,color_FFFFFF,t_30,g_se,x_10,y_10,shadow_20,type_ZmFuZ3poZW5naGVpdGk=)
		![](http://i2.51cto.com/images/blog/201810/19/9e41161d11fb22a9f01ec2868e870ead.png?x-oss-process=image/watermark,size_16,text_QDUxQ1RP5Y2a5a6i,color_FFFFFF,t_30,g_se,x_10,y_10,shadow_20,type_ZmFuZ3poZW5naGVpdGk=)
		
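How did that happen? Where did the 'amp;' in the source text go?
Explanation: '&amp;' is the HTML entity for '&'. The parser decodes it while building the tree, so img['src'] returns the URL with a plain '&'. The entity only reappears when the tree is serialized again, for example by prettify(). [What a parser hands back is not always what you see in the raw page source.] [This is not specific to bs; etree XPath and scrapy XPath behave the same way.]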
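
To show that the decoding comes from HTML parsing itself rather than from anything specific to BeautifulSoup, here is a minimal sketch that uses only Python's standard library. The SrcCollector class and the example.com URL are made up for illustration. html.escape is one possible way to re-encode the value when it has to go back into HTML, instead of a hand-written replace.

```python
from html import escape
from html.parser import HTMLParser


class SrcCollector(HTMLParser):
    """Hypothetical helper: collects the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; values arrive already decoded.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.srcs.append(value)


# Illustrative markup only; the URL is not taken from the article.
snippet = '<img src="https://example.com/a.jpg?resize=630%2C352&amp;ssl=1"/>'

collector = SrcCollector()
collector.feed(snippet)
collector.close()

decoded = collector.srcs[0]
print(decoded)          # the attribute value as the parser reports it
print(escape(decoded))  # re-escaped form, suitable for writing back into HTML
```

For fetching the image, the decoded URL is the one to use; the escaped form only matters when the value is embedded in HTML again.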
		
Example 2:
    # Same content string as in Example 1.
    soup = BeautifulSoup(content, "html.parser")

    inner_src_list = soup.find_all('img', src=True)              # compare with the call below
    for src in inner_src_list:
        url = src["src"].replace("&ssl", "&amp;ssl")
        print(url)

    inner_src_list = soup.find_all('img', attr={'src': True})    # note: "attr", not "attrs"
    for src in inner_src_list:
        url = src["src"].replace("&ssl", "&amp;ssl")
        print(url)
							
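The output is not reproduced here. The observed behavior is that the first loop prints normally and the second prints nothing. Why?
Explanation: the first find_all reads src=True as "img tags that have a src attribute". The second call misspells the parameter: the documented name is attrs, not attr. Since attr is not a parameter of find_all, it is taken as a keyword filter on an HTML attribute literally named attr, which no img tag here has, so nothing matches. Spelled correctly as attrs={'src': True}, the call selects the same tags as src=True.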
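
The three spellings can be compared side by side. This is a small sketch, assuming bs4 is installed and using a made-up two-tag snippet rather than the page from Example 1; the variable names are chosen for illustration only.

```python
from bs4 import BeautifulSoup

# Illustrative markup: one <img> with a src attribute and one without.
html = '<p><img src="a.jpg"/><img alt="no source"/></p>'
soup = BeautifulSoup(html, "html.parser")

# Keyword form: any <img> that has a src attribute.
by_keyword = soup.find_all("img", src=True)

# Dict form with the documented parameter name "attrs".
by_attrs = soup.find_all("img", attrs={"src": True})

# Misspelled form: "attr" is treated as the name of an HTML attribute to filter on.
by_typo = soup.find_all("img", attr={"src": True})

print(len(by_keyword), len(by_attrs), len(by_typo))
```

The first two calls should each return the one img that has a src, while the misspelled call should come back empty.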
			
If anything above is incorrect, corrections are welcome. Thank you!