Scraping Elements from a Website: Code for Scraping Website Data

Reposted

云端筑梦师 2024-05-13 07:35:06


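Fetching real-estate listing data from a website with a Python web crawler

I needed to scrape real-estate listing information from the web, so I looked into how to do this in Python. The steps are as follows:

Step 1: fetch the HTML source of the page that contains the listing data. The urllib library is used to fetch the page, with the following code: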

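from urllib import request

# First page of the listings; change page=1 to fetch other pages
url = 'https://cd.ixiangzhu.com/House/lists.html?page=1'
resp = request.urlopen(url)
html_data = resp.read().decode('utf-8')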

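Here url is the address of the page to open. After this runs, html_data is a string variable holding the HTML source of the page. The decode method is called to convert the bytes returned by read() into a string, interpreting them as UTF-8. The listing data is at https://cd.ixiangzhu.com/House/lists.html?page=1. There are too many listings for one page, so the site paginates them: the 1 in page=1 means the first page, page=2 is the second page, and so on.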
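
Because the page number is just a query parameter, the URL for any page can be built programmatically. The following is a minimal sketch under that assumption; the names BASE_URL, build_page_url, and fetch_html and the 10-second timeout are illustrative choices, not part of the original code.

```python
from urllib import request

# Listing URL taken from the article; the page number is appended as a query parameter.
BASE_URL = 'https://cd.ixiangzhu.com/House/lists.html'


def build_page_url(page):
    """Return the listing URL for a given 1-based page number."""
    return f'{BASE_URL}?page={page}'


def fetch_html(url, timeout=10):
    """Download a page and decode the response bytes as UTF-8 text."""
    with request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode('utf-8')


# Show the URLs for the first three pages (the page count is an arbitrary example).
for page in range(1, 4):
    print(build_page_url(page))
```

Calling fetch_html(build_page_url(2)) would then download the second page. How many pages the site actually has is not stated in the article, so a real crawler would need its own stopping condition.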

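Step 2: analyze the fetched HTML source and extract the listing data from it. The BeautifulSoup package helps here; it makes it easy to extract tags from a page.

(1) Inspecting the fetched HTML source shows that all of the listing data is contained in <div class="house_lists">, as in the following excerpt: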

<div class="house_lists">
                                                <div class="house_item clearfix" data-temp-id="WJ000795">
                            <div class="f_left house_item_img">
                                <a href="https://cd.ixiangzhu.com/House/detail/WJ000795.html"><img src="https://www.51xiangzhu.com:6080/app/file/img.do?img=/usr/local/img/property/WJ000795/1320x240.JPG" width="230" height="160" /></a>
                            </div>
                            <div class="f_left house_item_info">
                                <div class="title clearfix">
                                    <a class="f_left" href="https://cd.ixiangzhu.com/House/detail/WJ000795.html">金科星耀天都</a>
                                                                         <span class="price f_right">
                                        <span class="money">11892
                                                                                    元/m²
                                                                                </span>
                                    </span>
                                                                        </div>
                                <div class="info">
                                    <span>在售</span>
                                    <span class="info-price"></span>
                                </div>
                                <div class="area">
                                    面积区间: 37m²
                                </div>
                                <div class="address">
                                    <span>[成华区 - 驷马桥]</span>
                                    成华成都市成华区驷马桥昭觉寺南路12号
                                </div>
                                <div class="labels">
                                                                                                                                                                                                            <a>多条地铁</a>
                                                                                                                                                                            <a>读书方便</a>
                                                                                                                                                                            <a>熙悦广场</a>
                                                                                                                                                                            <a>便利社区商业</a>
                                                                                                                                                                                                                                            </div>
                            </div>
                            <!-- 客服头像 -->
                            <div class="head_img">
                                <a href='javascript:;' class='im ' onclick='easemobim.bind({tenantId:38251})'><p class="text">在线咨询</p></a>
 
                            </div>
                            <!--加入对比-->
                            <div class="add_compare ">
                                <p class="text" data-id="WJ000795"  data-img="https://www.51xiangzhu.com:6080/app/file/img.do?img=/usr/local/img/property/WJ000795/1320x240.JPG"><img src="https://cd.ixiangzhu.com/foreground/imgs/icon/add.png " alt="" > 添加对比</p>
                                <p class="have_add" style="display: none">已添加</p>
                            </div>
                        </div>
                                                <div class="house_item clearfix" data-temp-id="WJ000489">
                            <div class="f_left house_item_img">
                                <a href="https://cd.ixiangzhu.com/House/detail/WJ000489.html"><img src="https://www.51xiangzhu.com:6080/app/file/img.do?img=/usr/local/img/property/WJ000489/1320x240.JPG" width="230" height="160" /></a>
                            </div>
                            <div class="f_left house_item_info">
                                <div class="title clearfix">
                                    <a class="f_left" href="https://cd.ixiangzhu.com/House/detail/WJ000489.html">黄龙溪谷</a>
                                                                         <span class="price f_right">
                                        <span class="money">15747
                                                                                    元/m²
                                                                                </span>
                                    </span>
                                                                        </div>
                                <div class="info">
                                    <span>在售</span>
                                    <span class="info-price"></span>
                                </div>
                                <div class="area">
                                    面积区间: 167-328m²
                                </div>
                                <div class="address">
                                    <span>[眉山 - 彭山]</span>
                                    剑南大道南延线黄龙古镇旁
                                </div>
                                <div class="labels">
                                                                                                                                                                                                            <a>私家车出行方便</a>
                                                                                                                                                                            <a>读书方便</a>
                                                                                                                                                                                                                                            </div>
                            </div>
                            <!-- 客服头像 -->
                            <div class="head_img">
                                <a href='javascript:;' class='im ' onclick='easemobim.bind({tenantId:38251})'><p class="text">在线咨询</p></a>
 
                            </div>
                            <!--加入对比-->
                            <div class="add_compare ">
                                <p class="text" data-id="WJ000489"  data-img="https://www.51xiangzhu.com:6080/app/file/img.do?img=/usr/local/img/property/WJ000489/1320x240.JPG"><img src="https://cd.ixiangzhu.com/foreground/imgs/icon/add.png " alt="" > 添加对比</p>
                                <p class="have_add" style="display: none">已添加</p>
                            </div>
                        </div>
                                                <div class="house_item clearfix" data-temp-id="WJ000811">
                            <div class="f_left house_item_img">
                                <a href="https://cd.ixiangzhu.com/House/detail/WJ000811.html"><img src="https://www.51xiangzhu.com:6080/app/file/img.do?img=/usr/local/img/property/WJ000811/1320x240.JPG" width="230" height="160" /></a>
                            </div>

(2) Looking at the code above, the name of each listing is contained in an <a> tag, for example: <a class="f_left" href="https://cd.ixiangzhu.com/House/detail/WJ000795.html">金科星耀天都</a>

(3) The price of each listing is contained in a <span class="money"> tag, for example: <span class="money">11892元/m² </span>

(4) Use BeautifulSoup to extract the <div class="house_lists"> tag that contains all of the listing data:

from bs4 import BeautifulSoup as bs
soup = bs(html_data, 'html.parser')
house_div = soup.find_all('div', class_='house_lists')

(5) Next, extract all of the <a> tags that contain listing names:

house_list = house_div[0].find_all('a', class_='f_left')

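Here house_list is a list of the <a> tags for all listings; the text of each tag is the listing name.

(6) Then extract all of the <span class="money"> tags that contain listing prices: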

price_list = house_div[0].find_all('span', class_='money')

Here price_list is a list of the <span class="money"> tags for all listings; the text of each tag contains the listing price.
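
To see what these two lists hold, the same calls can be run against a small inline snippet instead of the live page. This is a sketch only: sample_html is a hand-trimmed version of the excerpt above, and the variable names name_tags, price_tags, names, and raw_prices are introduced here for illustration.

```python
from bs4 import BeautifulSoup

# Hand-trimmed version of the article's HTML excerpt (one listing only).
sample_html = '''
<div class="house_lists">
  <div class="house_item clearfix">
    <a class="f_left" href="https://cd.ixiangzhu.com/House/detail/WJ000795.html">金科星耀天都</a>
    <span class="price f_right"><span class="money">11892
        元/m²
    </span></span>
  </div>
</div>
'''

soup = BeautifulSoup(sample_html, 'html.parser')
house_div = soup.find_all('div', class_='house_lists')
name_tags = house_div[0].find_all('a', class_='f_left')
price_tags = house_div[0].find_all('span', class_='money')

# find_all returns Tag objects; the visible text is read with get_text().
names = [tag.get_text(strip=True) for tag in name_tags]
raw_prices = [tag.get_text() for tag in price_tags]

print(names)
print(raw_prices)
```

The raw price text still carries the unit and the whitespace from the page source, which is why a separate extraction step is needed.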

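(7) The price itself still has to be extracted from each element of price_list, because the tag text contains the price together with its unit and surrounding whitespace, for example "11892 元/m²".

Only the digits are needed, so a dedicated function is used to extract them.

(8) The function that extracts the listing price is as follows: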

def getPrice(text):
    digits = '1234567890'
    start_position = 0
    # Find the position of the first digit
    for i, c in enumerate(text):
        if c in digits:
            start_position = i
            break

    # Slice from the first digit to the end of the string
    tempstr = text[start_position:]

    end_pos = len(tempstr)
    # Find the position of the first non-digit character after the number
    for i, c in enumerate(tempstr):
        if c not in digits:
            end_pos = i
            break

    # Slice out the run of digits, which is the listing price
    price_str = tempstr[:end_pos]
    return price_str
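
The same "first run of digits" logic can also be written with a regular expression. This is an alternative sketch, not the article's method; extract_price is a hypothetical name, and the pattern [0-9]+ is chosen so that only ASCII digits match, as in the function above.

```python
import re


def extract_price(text):
    """Return the first run of ASCII digits in text, or '' if there is none."""
    match = re.search(r'[0-9]+', text)
    return match.group() if match else ''


# Text shaped like the content of a <span class="money"> tag on the page.
print(extract_price('11892\n        元/m²\n'))
```

Like the loop-based version, this returns an empty string when the text contains no digits.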

(9) Iterate over both lists to extract each listing's name and price and store them in a dictionary:

housedictionary = {}
for (ahref, aprice) in zip(house_list, price_list):
    housedictionary[ahref.text] = getPrice(aprice.text)
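
Pairing two separately collected lists with zip relies on every listing having both a name and a price; if one listing lacked a price tag, later names and prices would shift out of alignment. One possible alternative, sketched below, is to walk each <div class="house_item"> block and read the name and price from inside it. The function name parse_listings is hypothetical, and the second listing in sample_html has its price removed on purpose to illustrate the missing-price case; it is not real page data.

```python
import re
from bs4 import BeautifulSoup


def parse_listings(html):
    """Map each listing name to its price digits, pairing them per house_item block."""
    soup = BeautifulSoup(html, 'html.parser')
    result = {}
    for item in soup.find_all('div', class_='house_item'):
        name_tag = item.find('a', class_='f_left')
        price_tag = item.find('span', class_='money')
        if name_tag is None:
            continue  # skip blocks without a name link
        price_text = price_tag.get_text() if price_tag else ''
        match = re.search(r'[0-9]+', price_text)
        result[name_tag.get_text(strip=True)] = match.group() if match else ''
    return result


# Trimmed sample; the second item deliberately has no price (hypothetical case).
sample_html = '''
<div class="house_lists">
  <div class="house_item clearfix">
    <a class="f_left" href="#">金科星耀天都</a>
    <span class="money">11892 元/m²</span>
  </div>
  <div class="house_item clearfix">
    <a class="f_left" href="#">黄龙溪谷</a>
  </div>
</div>
'''

print(parse_listings(sample_html))
```

In this sketch a listing without a price is stored with an empty string rather than being dropped.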

Step 3: save the data in the dictionary to a file:

with open('allhouse.txt', 'w', encoding='utf-8') as f:
    # The with block closes the file automatically, so no explicit close() is needed
    f.write(str(housedictionary) + '\n')
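
Writing str(housedictionary) produces a Python dictionary literal, which other tools cannot easily read back. If the data needs to be reloaded later, JSON is one option. The sketch below uses the standard json module; the function names save_listings and load_listings are illustrative and not from the original article.

```python
import json


def save_listings(listings, path):
    """Write the name-to-price dictionary to path as UTF-8 JSON."""
    with open(path, 'w', encoding='utf-8') as f:
        # ensure_ascii=False keeps the Chinese listing names readable in the file.
        json.dump(listings, f, ensure_ascii=False, indent=2)


def load_listings(path):
    """Read a dictionary previously written by save_listings."""
    with open(path, encoding='utf-8') as f:
        return json.load(f)
```

For example, save_listings(housedictionary, 'allhouse.json') would write the dictionary built in step (9), and load_listings('allhouse.json') would read it back.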

In the end, all of the listing data is saved to the file allhouse.txt as a string.

The code above ran successfully on Python 3.6.3.
