A Scraper for COVID-19 Outbreak Data

Reposted

limiyq 2020-09-04 21:46:31


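Hi there! This post is not about chasing a trending topic; I only want to share an approach.

Since getting home on Lunar New Year's Eve I have stayed put. Like everyone else, I have spent the holiday checking the outbreak numbers every day, scrolling WeChat Moments, and chatting in group after group, feeling tense and worried. Beyond that, the main thing is self-discipline: do not go out. Today I want to share the scraper I wrote on the nights of the 29th and 30th, just before the New Year: a scraper for COVID-19 outbreak data. I also noticed today that the curves for daily new cases and cumulative confirmed cases are starting to flatten. Wherever you are reading this, I hope you stay healthy.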

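1. Approach and final result

When I wrote this scraper, DXY (丁香医生) had not yet launched its data curves, so I wanted to accumulate a dataset of my own. The design had to meet these requirements:

Store the data in CSV files so it can be used for data-analysis practice later;
Refresh on a fixed schedule, for example once an hour;
Skip the write when the data has not changed, and append to the CSV when it has;
When the data changes, send me the national total and the per-province totals over WeChat, tagged with a timestamp. This saves me from checking manually and gets the numbers to me as soon as they change.
It worked in the end. Two screenshots showed the result. Screenshot 1: data updating automatically overnight. Screenshot 2: the log of China's total case count.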
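
The "write only when something changed" requirement above can be sketched on its own, separately from any scraping. This is a minimal illustration rather than the code used in this post: the helper names `read_last_timestamp` and `append_if_changed` are hypothetical, and it assumes, as the functions later in the post do, that the last cell of each CSV row holds the update time.

```python
import csv
import os


def read_last_timestamp(path):
    """Return the last cell of the last non-empty row, or None if there is none."""
    if not os.path.exists(path):
        return None
    last = None
    with open(path, "r", newline="", encoding="utf8") as f:
        for row in csv.reader(f):
            if row:
                last = row[-1]
    return last


def append_if_changed(path, values, timestamp):
    """Append values plus timestamp as one row unless that timestamp is already stored.

    Returns True when a row was written and False when the write was skipped.
    """
    if read_last_timestamp(path) == timestamp:
        return False
    with open(path, "a", newline="", encoding="utf8") as f:
        csv.writer(f).writerow(list(values) + [timestamp])
    return True
```

Calling `append_if_changed("./data-sum.csv", ["China", "100", "2"], "2020-01-24 09:00")` twice in a row (with made-up sample values) would write one row and then skip the second call, because the stored timestamp has not moved.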

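2. Scraping the data

There are two data sources, and the reason is a story that is amusing now but frustrated me at the time. Scraping DXY's outbreak page worked fine at first. After my third version, DXY added a city-level breakdown under each province (and a district-level breakdown for Beijing), and my scraper stopped picking up any changes. I ended up switching to NetEase's data instead, which is why there are two sources.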

import re, time, csv
import requests, itchat
from datetime import datetime
from bs4 import BeautifulSoup

# Timestamp taken once, when the script starts
now_time = datetime.now().strftime('%Y-%m-%d %H:%M')

data_url= "https://3g.dxy.cn/newh5/view/pneumonia_peopleapp"  # DXY page
data_url2 = "http://news.163.com/special/epidemic/"  # NetEase page
f_path = "./data.csv"  # per-region rows
sf_path = "./data-sum.csv"  # national totals

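A note on the timestamp: it is printed whenever data is stored or sent, so that I can see when a refresh happened. Without it I could not tell whether the script had refreshed at all.

2.1. Updating the totals

The totals are fetched by one function. (This was my first attempt at an OOP-flavoured structure in a scraper, although it is really just functions. It worked well: each part is easy to switch on or off.) The idea: first read the update time stored in the last cell of the CSV. If it equals the time shown on the freshly loaded page (the timestamp visible in the screenshot), do nothing. If it differs, scrape the numbers and store them. The code: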

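# Fetch the national totals
def get_sums():
    # Read the update time saved in the last cell of the last CSV row
    late_time = None  # stays None when the file has no rows yet
    with open(sf_path, 'r', newline='', encoding='utf8') as f:
        rcsv = csv.reader(f)
        for rows in rcsv:
            if rows:
                late_time = rows[-1]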

    res_page = requests.get(data_url)  # fetch the page
    res_page.encoding = 'utf-8'
    bs_page = BeautifulSoup(res_page.text, 'html.parser')  # parse the HTML
    # print(bs_page)
    # Extract the "data as of" timestamp shown on the page
    time_data = bs_page.find_all(
        'p', class_="mapTitle___2QtRg")  # the <p> that holds the timestamp text
    time_str = time_data[0].text
    time_num = re.findall("截至 (.+?) 数据统计", time_str)
    time_num = time_num[0]
    # Round-trip through strptime/strftime to validate and normalize the format
    times = time.strftime(
        '%Y-%m-%d %H:%M', time.strptime(time_num, "%Y-%m-%d %H:%M"))
    # print(times)
    times_str = str(times)
    if times_str != late_time:
        # Overall figures
        sums_msg = ''
        sum_data = bs_page.find_all(
            'span', class_="content___2hIPS")  # spans that hold the summary text
        for data in sum_data:
            # Keys: confirmed, suspected, cured, deaths, update time
            heads = {"确诊": 0, "疑似": 0,
                        "治愈": 0, "死亡": 0, "更新日期": ""}
            area_msg = data.text
            area_msg = area_msg.replace(" ", "")
            print(area_msg)
            # Raw strings keep \d from being read as a string escape;
            # \d+ requires at least one digit after the label
            recon_num = re.findall(r"确诊(\d+)", area_msg)
            spy_num = re.findall(r"疑似(\d+)", area_msg)
            cure_num = re.findall(r"治愈(\d+)", area_msg)
            deth_num = re.findall(r"死亡(\d+)", area_msg)
            # Keep the default of 0 when a field is missing from the text
            if recon_num and recon_num[0]:
                heads['确诊'] = recon_num[0]
            if spy_num and spy_num[0]:
                heads['疑似'] = spy_num[0]
            if cure_num and cure_num[0]:
                heads['治愈'] = cure_num[0]
            if deth_num and deth_num[0]:
                heads['死亡'] = deth_num[0]
            heads['更新日期'] = times
            # print(recon_num, spy_num, cure_num, deth_num,times)
            # print(area_msg)
            # print(heads)
            confirmed = int(heads['确诊'] or 0)
            deaths = int(heads['死亡'] or 0)
            # Fatality rate in percent; avoid dividing by zero when nothing was parsed
            rate = round(deaths / confirmed * 100, 2) if confirmed else 0.0
            sums_msg = (str(list(heads.keys())) + "\n" + str(list(heads.values()))
                        + "\n" + "致死率:" + str(rate) + "%\n")
            # Append the row; sf_path is the path of the totals CSV

            with open(sf_path, 'a', newline='', encoding='utf8') as f:
                w = csv.writer(f, delimiter=',')
                w.writerow(list(heads.values()))
            print(sums_msg)
            # itchat.send(sums_msg, toUserName='filehelper')
        return times_str, "★★★★★总数已更新"  # "totals updated"
    else:
        return times_str, "     总数没有更新"  # "totals not updated"

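I used a lot of print calls to check things along the way and have left those traces in. In short: when the totals have changed, the result is meant to be sent to me over WeChat (the itchat.send call is commented out in this listing) and the function returns ★★★★★总数已更新 ("totals updated"); when nothing has changed it returns 总数没有更新 ("totals not updated"). The leading spaces are there so that the two messages line up when printed.

2.2. Updating the provinces and cities

The approach is the same as above. One weakness of mine: the way I convert the timestamp back and forth between formats is clumsier than it should be. If you know a better way, please tell me.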
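
The field extraction in get_sums can be tried in isolation with a few lines. The sketch below is a hedged illustration, not the code that ran against the live page: `parse_counts` and `fatality_rate` are hypothetical helper names, and the sample text in the usage note is made up, since the real page markup may differ.

```python
import re

# Labels as they appear in the scraped text: confirmed, suspected, cured, deaths
FIELDS = ("确诊", "疑似", "治愈", "死亡")


def parse_counts(text):
    """Return the integer that follows each label; labels that are absent map to 0."""
    text = text.replace(" ", "")
    counts = {}
    for label in FIELDS:
        match = re.search(label + r"(\d+)", text)
        counts[label] = int(match.group(1)) if match else 0
    return counts


def fatality_rate(counts):
    """Deaths as a percentage of confirmed cases; 0.0 when nothing is confirmed."""
    confirmed = counts["确诊"]
    if confirmed == 0:
        return 0.0
    return round(counts["死亡"] / confirmed * 100, 2)
```

For instance, `parse_counts("确诊 100 例 疑似 50 例 治愈 3 例 死亡 2 例")` yields 100, 50, 3 and 2 for the four labels. Using `re.search` with `\d+` means a label with no number after it simply leaves that field at 0 instead of raising.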

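# Fetch the per-region figures
def get_heads():
    # Read the update time saved in the last cell of the last CSV row
    late_time2 = None  # stays None when the file has no rows yet
    with open(f_path, 'r', newline='', encoding='utf8') as f:
        rcsv = csv.reader(f)
        for rows in rcsv:
            if rows:
                late_time2 = rows[-1]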
    # print("time in file", late_time2)
    res_page = requests.get(data_url2)  # fetch the page
    # res_page.encoding='utf-8'
    bs_page = BeautifulSoup(res_page.text,'html.parser')  # parse the HTML
    # print(bs_page)
    # Extract the "as of" timestamp shown on the page
    time_data = bs_page.find_all('div', class_="tit")  # the div that holds the timestamp text
    time_str= time_data[0].text
    time_num = re.findall("截止(.+)?", time_str)
    times_str2 = str(time_num[0])  # kept as the raw string scraped from the page
    print("scraped time", times_str2)
    if times_str2 != late_time2:
        print("timestamps differ")
        citys_msg=''
        # National total
        sums_data = bs_page.find_all('div',class_="cover_tit_des")  # div with the national summary
        for data in sums_data:
            print(data.text)
            retext=data.text
            retext=retext.replace(" ","")
            china = re.findall("(.+)确诊", retext)[0]
            quezhen = re.findall("确诊(.+)例，", retext)[0]  # confirmed
            siwang = re.findall("死亡(.+)例", retext)[0]  # deaths
            # Fatality rate in percent; avoid dividing by zero
            siwanglv = round(int(siwang)/int(quezhen)*100,2) if int(quezhen) else 0.0

            wrlist = [china, quezhen, siwang, siwanglv, times_str2]
            # Note: this appends to the same totals file that get_sums() reads
            with open(sf_path, 'a', newline='', encoding='utf8') as f:
                w = csv.writer(f, delimiter=',')
                w.writerow(wrlist)
            nation=china+"\n确诊:"+quezhen+"\n死亡:"+siwang+"\n死亡率:"+str(siwanglv)+"%\n"
            print(nation)
            itchat.send(nation, toUserName='filehelper')
        # Per-region figures
        list_data = bs_page.find_all('li')  # every <li>; non-matching ones are skipped below
        for data in list_data:
            city_match = re.findall("(.+)确诊",data.text)
            num_match = re.findall("确诊(.+)例", data.text)
            if not city_match or not num_match:
                continue  # this <li> is not a region entry
            city = city_match[0]
            # print(city)
            con_num = num_match[0]
            # print(con_num)

            write_list = [city, con_num, times_str2]
            print(write_list)
            # Append the row; f_path is the path of the per-region CSV

            with open(f_path, 'a', newline='', encoding='utf8') as f:
                w = csv.writer(f, delimiter=',')
                w.writerow(write_list)
            citys_msg = citys_msg+str(write_list)+'\n'
        print(citys_msg)
        itchat.send(citys_msg, toUserName='filehelper')
        return times_str2, "◆◆◆◆◆各省数据已更新"  # "province data updated"
    else:
        return times_str2, "     各省没有更新"  # "provinces not updated"
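
On the timestamp juggling mentioned at the start of section 2.2, one option is to funnel every scraped time string through a single helper built on `datetime.strptime`. This is a sketch under assumptions: `normalize_timestamp` is a hypothetical name, and the candidate input formats are guesses for illustration, since the exact strings the two pages emit are not shown in this post.

```python
from datetime import datetime

STAMP_FORMAT = "%Y-%m-%d %H:%M"  # the format used for now_time in this post

# Assumed input formats; adjust to whatever the pages really emit
CANDIDATE_FORMATS = ("%Y-%m-%d %H:%M", "%Y.%m.%d %H:%M", "%Y/%m/%d %H:%M")


def normalize_timestamp(raw, candidates=CANDIDATE_FORMATS):
    """Return raw reformatted as STAMP_FORMAT, or the stripped input if nothing matches."""
    raw = raw.strip()
    for fmt in candidates:
        try:
            return datetime.strptime(raw, fmt).strftime(STAMP_FORMAT)
        except ValueError:
            continue  # try the next candidate format
    # Fall back to the raw text so the "has it changed?" comparison still works
    return raw
```

With one canonical format, the value stored in the CSV and the value scraped on the next pass can be compared as plain strings regardless of which source produced them.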

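3. Logging in to WeChat and starting the loop

Once the two scraping functions are ready, the script can run. It logs in to WeChat first and then enters a loop: two variables call the functions above, and their return values are printed. The break is something I used while debugging: it ends the loop after a single pass so I do not have to stop the script by hand, which gives me time to check for bugs or rethink the approach. Once everything works, just comment the break out. I stopped running the script after I got home, so there is no data after that point. If you are interested, feel free to try it out yourself.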

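# Log in to WeChat
itchat.auto_login()
# Send the "login successful" notice once, before the loop starts
itchat.send("★★登录成功★★", toUserName='filehelper')
while True:
    a = get_sums()
    b = get_heads()
    # Recompute the timestamp on every pass; the module-level value is fixed at startup
    now_time = datetime.now().strftime('%Y-%m-%d %H:%M')
    print(a,"fin",now_time)
    print(b,'fin',now_time)
    print("\n")
    time.sleep(1200)  # refresh every 20 minutes
    # break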
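
The refresh loop can also be written so that it is testable without WeChat or the network. The sketch below is one possible shape, not the script that was actually run: `run_polling` and its parameters are hypothetical, and the scraping functions are passed in as plain callables. It takes a fresh timestamp on every round and keeps going when one source raises.

```python
import time
from datetime import datetime


def run_polling(tasks, interval_seconds, max_rounds=None, sleep=time.sleep):
    """Call every task once per round and return the log lines that were printed.

    max_rounds=None loops forever; a number plays the role of the debugging break.
    """
    log = []
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        stamp = datetime.now().strftime("%Y-%m-%d %H:%M")  # fresh on each round
        for task in tasks:
            try:
                result = task()
            except Exception as exc:  # one failing source should not stop the loop
                result = ("error", repr(exc))
            line = f"{result} fin {stamp}"
            print(line)
            log.append(line)
        rounds += 1
        if max_rounds is None or rounds < max_rounds:
            sleep(interval_seconds)
    return log
```

`run_polling([get_sums, get_heads], 1200)` would mirror the loop above, while `max_rounds=1` gives the single debugging pass that the commented-out break was used for.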

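Looking back at the data I scraped on the 24th, the count was still only a little over 440. Time flies. Friends, please take good care of yourselves.

Also, 回形针 (PaperClip) made a science explainer video about this outbreak that is well worth watching.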
