Scraping Beijing Bus Routes with Python

Reposted

编程小达 2023-10-13 22:22:52


I will keep the introduction short: once you understand the basics of Scrapy, you can complete this project.

1. Create the Scrapy project:


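Open a console and run scrapy startproject beibus (beibus is the project name; you can change it).

Enter the project folder and create the spider with scrapy genspider (spider name) (domain). For the spider used in this article, that is scrapy genspider bei_bus beijing.8684.cn.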


Look inside the beibus project. If it contains a .py file named after the spider you just created, the spider was created successfully.


2. Configure settings

Open settings.py and change the value of ROBOTSTXT_OBEY to False.


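Also uncomment the DEFAULT_REQUEST_HEADERS setting and add a User-Agent entry to it. You can copy this value from the request headers your browser sends to any website; the author used the headers from a request to www.baidu.com.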
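As a rough illustration, the two settings described above might look like the following in settings.py. This is only a sketch: the User-Agent string is a placeholder assumption, not the value the author used, and the Accept entries are the ones the Scrapy project template ships commented out.

```python
# settings.py (fragment)

# Do not consult robots.txt before crawling
ROBOTSTXT_OBEY = False

# Headers sent with every request; the User-Agent value below is a placeholder,
# replace it with the string your own browser sends
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
}
```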

3. Create and write the start_requests method

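import scrapy

class BeiBusSpider(scrapy.Spider):
    name = 'bei_bus'
    allowed_domains = ['beijing.8684.cn']
    # Base URL kept as a plain string; start_requests builds the real start URLs from it
    start_urls = 'http://beijing.8684.cn'

    def start_requests(self):
        # Build the first-level list pages: /list1 and /list2
        for page in range(2):
            url = '{url}/list{page}'.format(url=self.start_urls, page=(page + 1))
            # A plain GET request is enough here; FormRequest is meant for form submissions
            yield scrapy.Request(url, callback=self.parse_index)

    def parse_index(self, response):
        pass

    def parse(self, response):
        pass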

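The name of this method is fixed: Scrapy calls it first when the spider starts.

Once a first-level URL has been requested, the callback argument tells Scrapy to pass the response to the parse_index method.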
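To see which first-level URLs the loop in start_requests produces, the same format expression can be run on its own. A minimal sketch; base_url is a stand-in name for the spider's start_urls string.

```python
# Stand-in for the spider's start_urls string
base_url = 'http://beijing.8684.cn'

# Same expression as in start_requests, collected into a list
urls = ['{url}/list{page}'.format(url=base_url, page=(page + 1))
        for page in range(2)]

print(urls)
# ['http://beijing.8684.cn/list1', 'http://beijing.8684.cn/list2']
```

Changing range(2) to a larger range would request more list pages in the same way.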

4. Create and write the parse_index method

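    # Requires "from urllib.parse import urljoin" at the top of bei_bus.py
    def parse_index(self, response):
        # hrefs of the individual route pages listed on this index page
        beijing = response.xpath('//div[@class="cc-content service-area"]/div[2]/a/@href').extract()
        for href in beijing:
            # Turn the relative href into an absolute second-level URL
            url2 = urljoin(self.start_urls, href)
            yield scrapy.Request(url2, callback=self.parse_detail)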

Use response.xpath to extract the data you need. Here the author extracts the second-level URLs, which point to the individual bus route pages.


urljoin joins the base URL with each scraped href to form the second-level URL, and the callback then passes each response to the parse_detail method.
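The joining step can be checked in isolation with the standard library. In this small sketch the href value is a made-up example, not one taken from the site.

```python
from urllib.parse import urljoin

base_url = 'http://beijing.8684.cn'
href = '/x_example'  # hypothetical relative link, as an href attribute might contain

# urljoin resolves the relative href against the base URL
url2 = urljoin(base_url, href)
print(url2)
# http://beijing.8684.cn/x_example
```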

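5. Extract the target data in the parse_detail method

The data the author scrapes here is the route name, the start and end points, and the stops along the route.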


    def parse_detail(self, response):
        # Each extract() call returns a list of strings
        # Route name
        name = response.xpath('//h1[@class="title"]/text()').extract()
        # Start and end points
        trip = response.xpath('//div[@class="trip"]/text()').extract()
        # Stops along the route
        luxian = response.xpath('//div[@class="bus-lzlist mb15"]/ol/li/a/text()').extract()
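To get a feel for what these three selections return, the same paths can be tried on a small fragment with the standard library. This is only a sketch under stated assumptions: the HTML below is a simplified, made-up fragment (the real page markup may differ), and xml.etree.ElementTree stands in for Scrapy's selector, so element text is read with .text instead of /text().

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified page fragment; not copied from the real site
PAGE = """
<html><body>
  <h1 class="title">Route 1</h1>
  <div class="trip">Stop A - Stop C</div>
  <div class="bus-lzlist mb15">
    <ol>
      <li><a href="/z_1">Stop A</a></li>
      <li><a href="/z_2">Stop B</a></li>
      <li><a href="/z_3">Stop C</a></li>
    </ol>
  </div>
</body></html>
""".strip()

root = ET.fromstring(PAGE)

# Each selection yields a list, just as extract() does in the spider
name = [el.text for el in root.findall(".//h1[@class='title']")]
trip = [el.text for el in root.findall(".//div[@class='trip']")]
luxian = [el.text for el in root.findall(".//div[@class='bus-lzlist mb15']/ol/li/a")]

print(name, trip, luxian)
```

Note that even name and trip come back as one-element lists, which matters later when the values are written to the database.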

[Screenshots: sample output of the name, trip, and luxian values]

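At this point the scraping itself is complete. The following steps store the scraped data in a database.

1. Open items.py and edit the item class so that it declares the fields to store. Judging by the code in the later steps, these are name, trip, and luxian, each declared as scrapy.Field().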


2. At the end of the parse_detail method in bei_bus.py, add the following statements to copy the extracted values into an item

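        # Requires "from beibus.items import BeibusItem" at the top of bei_bus.py
        bus_item = BeibusItem()
        for field in bus_item.fields:
            # Each field name matches a local variable above (name, trip, luxian)
            bus_item[field] = eval(field)
        yield bus_item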
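The eval(field) call works because every item field has a local variable of the same name. If you prefer to avoid eval, the same idea can be written with an explicit mapping. A minimal sketch, in which a plain dict stands in for BeibusItem and build_item is a hypothetical helper, not part of Scrapy:

```python
def build_item(fields, values):
    """Copy only the declared fields from a dict of extracted values."""
    return {field: values[field] for field in fields}


# Stand-ins for bus_item.fields and for the values extracted in parse_detail
fields = ['name', 'trip', 'luxian']
extracted = {
    'name': ['Route 1'],
    'trip': ['Stop A - Stop C'],
    'luxian': ['Stop A', 'Stop B', 'Stop C'],
    'scratch': 'not a declared field, so it is left out',
}

item = build_item(fields, extracted)
print(item)
```

A missing value raises a KeyError immediately, which is easier to diagnose than a NameError raised from inside eval.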

3. In MySQL, create a businfo table in the bus database


4. Add the following parameters at the end of settings.py

DB_HOST = 'localhost'
DB_USER = 'root'
DB_PWD = 'your_password'
DB_CHARSET='UTF8'
DB = 'bus'

(Change these values to match your own setup.)
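If you would rather not keep a database password in settings.py, one option is to read these values from environment variables. This is a sketch of that alternative, not the author's setup; the BUS_DB_* variable names are hypothetical.

```python
import os

# Hypothetical environment variable names; the fallbacks mirror the values in this article
DB_HOST = os.environ.get('BUS_DB_HOST', 'localhost')
DB_USER = os.environ.get('BUS_DB_USER', 'root')
DB_PWD = os.environ.get('BUS_DB_PWD', '')  # no password is stored in the file
DB_CHARSET = os.environ.get('BUS_DB_CHARSET', 'UTF8')
DB = os.environ.get('BUS_DB_NAME', 'bus')
```

The rest of the project can keep reading settings.DB_HOST and the other names unchanged.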

5. In pipelines.py, rename the class to MysqlPipeline, add an initializer that reads host, user, password, db, and charset from settings, and add a connect() method that establishes the database connection

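import pymysql
# Assumes the project package is named beibus, as created by startproject
from beibus import settings

class MysqlPipeline(object):
    def __init__(self):
        # Connection parameters defined at the end of settings.py
        self.host = settings.DB_HOST
        self.user = settings.DB_USER
        self.pwd = settings.DB_PWD
        self.db = settings.DB
        self.charset = settings.DB_CHARSET
        self.connect()

    def connect(self):
        # Open the connection and keep one cursor for all inserts
        self.conn = pymysql.connect(host=self.host,
                                    user=self.user,
                                    password=self.pwd,
                                    database=self.db,
                                    charset=self.charset)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        return item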

6. Add a close_spider method to close the MySQL database connection

    def close_spider(self, spider):
        # Close the cursor before the connection it belongs to
        self.cursor.close()
        self.conn.close()

7. Implement the process_item() method to insert the data into the database

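    def process_item(self, item, spider):
        # Parameterized query: pymysql escapes the values, so quotes in the data cannot break the SQL
        sql = 'insert into businfo values (%s, %s, %s)'
        # extract() returns lists, so convert each field to text before storing it
        values = (str(item['name']), str(item['trip']), str(item['luxian']))
        self.cursor.execute(sql, values)
        self.conn.commit()
        return item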
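The insert step can be tried without a MySQL server. The sketch below swaps in the standard library's sqlite3 module purely for illustration, so two details differ from the pipeline: sqlite3 uses ? placeholders where pymysql uses %s, and the three TEXT columns are an assumed schema for businfo, since the article shows the table only as a screenshot.

```python
import sqlite3

# In-memory database standing in for the MySQL "bus" database
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()

# Assumed schema: one text column per item field
cursor.execute('CREATE TABLE businfo (name TEXT, trip TEXT, luxian TEXT)')

# Made-up item; the quote characters would break naive string formatting
item = {
    'name': ['Route "1"'],
    'trip': ['Stop A - Stop C'],
    'luxian': ['Stop A', 'Stop B', 'Stop C'],
}

# Convert the list-valued fields to text and let the driver do the escaping
values = tuple(str(item[field]) for field in ('name', 'trip', 'luxian'))
cursor.execute('INSERT INTO businfo VALUES (?, ?, ?)', values)
conn.commit()

rows = cursor.execute('SELECT * FROM businfo').fetchall()
print(rows)

cursor.close()
conn.close()
```

Passing the values separately from the SQL text is what keeps quotes inside route or stop names from being interpreted as part of the statement.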

8. In settings.py, uncomment the ITEM_PIPELINES setting and change its content to 'beibus.pipelines.MysqlPipeline': 300:

ITEM_PIPELINES = {
    'beibus.pipelines.MysqlPipeline': 300,
}

9. Run the command "scrapy crawl bei_bus" to start the spider

Finally, check the data in MySQL

select * from businfo;

The Scrapy crawl of the Beijing bus routes is complete!

