Scraping all original Weibo images with Python: fetching Weibo content with Python

Reposted

mob6454cc6d3e23 2023-06-27 11:32:30


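This post is for learning and discussion only.

It scrapes the posts published by a Weibo user you like.

(The scraping here targets the mobile version of the page, which is easier to work with.)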

Analyzing the page


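We take the JSON route here: open the browser's developer tools and switch to the Network tab to capture the requests the page makes.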


The data turns out to be loaded by one of the captured requests, so inspect what that request returns.


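The page does not load all of its content at once; later content is loaded as you scroll down.

Click Preview on the request to see the fields we are after.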
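Once the request has been spotted in the Network tab, it can be replayed from Python. Below is a minimal sketch of that step, not the original author's code. The helper name `fetch_json`, the header values and the timeout are assumptions for illustration; `session` is expected to behave like a `requests.Session`.

```python
# Assumed header values: an ordinary mobile browser User-Agent.
# They are illustrative and do not come from the original post.
MOBILE_HEADERS = {
    "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X) AppleWebKit/605.1.15",
    "Accept": "application/json, text/plain, */*",
}


def fetch_json(session, url, timeout=10):
    """Replay a captured request through `session` and return the decoded JSON.

    `session` should behave like requests.Session: its get() accepts
    headers= and timeout= and returns an object that offers
    raise_for_status() and json().
    """
    response = session.get(url, headers=MOBILE_HEADERS, timeout=timeout)
    response.raise_for_status()  # raise on HTTP error status codes
    return response.json()
```

With the `requests` library this would be called as `fetch_json(requests.Session(), url)`, where `url` is the request URL copied from the Network tab. Passing the session in keeps the helper easy to try out with a stand-in object.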


Extracting the page data

import requests

def start(url):
    response = requests.get(url=url)
    data = response.json()
    # pprint.pprint(data)  # pretty-print the response so it is easier to read
    cardlist_info = data['data']['cardlistInfo']  # paging info (holds since_id)
    cards = data['data']['cards']
    for card in cards:
        # print(card)
        mblog = card.get('mblog', None)
        if mblog:
            # only extract when the card actually contains a post
            mid = mblog.get('mid', None)  # post id
            text = mblog.get('text', None)  # post text
            source = mblog.get('source', None)  # client the post was sent from
            author_name = mblog.get('user', {}).get('screen_name', None)  # author
            created_at = mblog.get('created_at', None)  # publication time
            print([text, source, author_name, mid])

This gives us the information for each post the blogger has published.
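The extraction step can also be tried without touching the network. The sketch below runs the same field lookups over a made-up response and returns dictionaries instead of printing. The names `strip_html` and `extract_posts` and the sample data are hypothetical; stripping tags is an extra step, added on the assumption that the `text` field carries HTML markup.

```python
import re
from html import unescape

_TAG_RE = re.compile(r"<[^>]+>")


def strip_html(text):
    """Remove HTML tags and decode entities; None or '' becomes ''."""
    if not text:
        return ""
    return unescape(_TAG_RE.sub("", text)).strip()


def extract_posts(data):
    """Return one dict per card that carries an 'mblog' entry."""
    posts = []
    for card in data.get("data", {}).get("cards", []):
        mblog = card.get("mblog")
        if not mblog:
            continue  # cards without 'mblog' are not posts
        posts.append({
            "mid": mblog.get("mid"),
            "text": strip_html(mblog.get("text")),
            "source": mblog.get("source"),
            "author_name": mblog.get("user", {}).get("screen_name"),
            "created_at": mblog.get("created_at"),
        })
    return posts


# Made-up sample that only mimics the shape of the captured response.
sample = {
    "data": {
        "cardlistInfo": {"since_id": 123},
        "cards": [
            {"mblog": {"mid": "1", "text": "hello <a href='#'>#tag#</a> &amp; more",
                       "source": "iPhone", "user": {"screen_name": "alice"},
                       "created_at": "Mon"}},
            {"card_type": 11},  # a card without a post is skipped
        ],
    }
}

for post in extract_posts(sample):
    print(post)
```

Returning dictionaries rather than printing makes it easier to save or filter the posts later.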

Next, work out how to get the comments: click a post's comments and capture the request.


Open the comments page


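So all we need is a way to get from the blogger's page to the comments page automatically.

Looking for a link between the two pages, we find that the "mid" in the blogger's page matches the "mid" in the comments URL. It is therefore enough to take the "mid" from the blogger's page and splice it into the URL string to reach the matching comments.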

            # comments under this post (this continues inside the `if mblog:` block)
            comments_url = 'https://m.weibo.cn/comments/hotflow?id=' + mid + '&mid=' + mid + '&max_id_type=0'
            # fetch the comments once; the bare duplicate request has been dropped
            comments_response = requests.get(comments_url)
            comments_data = comments_response.json()
            # print(comments_data)

            if comments_data['ok'] == 1:  # only process posts that have comments
                data_list = comments_data['data']['data']
                for comment in data_list:
                    comment_text = comment.get('text', None)
                    comment_mid = comment.get('mid', None)
                    username = comment['user']['screen_name']
                    print([comment_mid, username, comment_text])

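This gives us one of the blogger's posts together with its comments.

Next we look at how the pages of posts are linked, so that we can collect the information for every post.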
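The two halves of the comment step, building the URL from a "mid" and reading the rows out of the response, can be separated into small functions. This is only a sketch: the function names are hypothetical, the endpoint is the one used in the post, and `urlencode` replaces manual string concatenation so that the value is escaped properly.

```python
from urllib.parse import urlencode

COMMENTS_ENDPOINT = "https://m.weibo.cn/comments/hotflow"


def build_comments_url(mid):
    """Build the comments URL for one post from its mid."""
    query = urlencode({"id": mid, "mid": mid, "max_id_type": 0})
    return f"{COMMENTS_ENDPOINT}?{query}"


def extract_comments(comments_data):
    """Return [mid, username, text] rows, or [] when 'ok' is not 1."""
    if comments_data.get("ok") != 1:
        return []
    rows = []
    for comment in comments_data.get("data", {}).get("data", []):
        user = comment.get("user") or {}  # tolerate a missing user entry
        rows.append([comment.get("mid"), user.get("screen_name"), comment.get("text")])
    return rows


# Made-up response that only mimics the shape seen in the Preview tab.
fake_response = {
    "ok": 1,
    "data": {"data": [{"mid": "9", "text": "nice", "user": {"screen_name": "bob"}}]},
}
print(build_comments_url("4912345"))
print(extract_comments(fake_response))
```

Using `.get()` for the user entry avoids a `KeyError` if a comment arrives without one.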

Scroll down and capture the newly loaded request; its URL contains one extra parameter.


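Comparing it with the previous request shows the pattern: take the "since_id" from the previous response and append it to the original request URL as a since_id parameter, and you have the URL of the next request.

Expressed as a function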
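One detail is worth being careful about: if the parameter is appended to a URL that already has a since_id, the URL ends up with several of them. A small sketch of a helper that sets the parameter instead of stacking it follows. The name `with_since_id` and the example URL are hypothetical, since the post does not show the real captured URL.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_since_id(url, since_id):
    """Return `url` with its since_id query parameter set to `since_id`.

    Any since_id already present is dropped first, so repeated calls
    never accumulate duplicate parameters.
    """
    parts = urlsplit(url)
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != "since_id"]
    query.append(("since_id", str(since_id)))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Placeholder URL for illustration only.
first = "https://example.invalid/api?type=uid&value=1"
second = with_since_id(first, 42)
third = with_since_id(second, 99)
print(second)
print(third)
```

Here `third` carries only the newest since_id, which is what a request for the following page needs.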

import requests

def start(url):
    next_url = url
    while next_url:
        response = requests.get(url=next_url)
        data = response.json()
        # pprint.pprint(data)  # pretty-print the response so it is easier to read
        cardlist_info = data['data']['cardlistInfo']
        since_id = cardlist_info.get('since_id', None)
        cards = data['data']['cards']
        for card in cards:
            # print(card)
            mblog = card.get('mblog', None)
            if mblog:
                # only extract when the card actually contains a post
                mid = mblog.get('mid', None)  # post id
                text = mblog.get('text', None)  # post text
                source = mblog.get('source', None)  # client the post was sent from
                author_name = mblog.get('user', {}).get('screen_name', None)  # author
                created_at = mblog.get('created_at', None)  # publication time
                print([text, source, author_name, mid])
                # print(created_at)

                # comments under this post
                comments_url = 'https://m.weibo.cn/comments/hotflow?id=' + mid + '&mid=' + mid + '&max_id_type=0'
                comments_response = requests.get(comments_url)
                comments_data = comments_response.json()
                # print(comments_data)

                if comments_data['ok'] == 1:  # only process posts that have comments
                    data_list = comments_data['data']['data']
                    for comment in data_list:
                        comment_text = comment.get('text', None)
                        comment_mid = comment.get('mid', None)
                        username = comment['user']['screen_name']
                        print([comment_mid, username, comment_text])
        # Append since_id to the original URL; stop once no since_id is returned.
        next_url = (url + '&since_id=' + str(since_id)) if since_id else None

This way we can capture all of the blogger's posts along with their comments.
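Following since_id from page to page can also be written as a loop with an explicit stopping point. The sketch below is one possible shape, not the original author's method. The names `crawl`, `fetch`, `max_pages` and `delay` are hypothetical, and `fetch` is any callable that takes a URL and returns the decoded JSON.

```python
import time


def crawl(first_url, fetch, max_pages=5, delay=2.0):
    """Yield every 'mblog' dict, following since_id from page to page.

    Stops when a response carries no since_id or after `max_pages` pages.
    """
    url = first_url
    separator = "&" if "?" in first_url else "?"
    for page in range(max_pages):
        if page:
            time.sleep(delay)  # pause between pages
        data = fetch(url).get("data") or {}
        for card in data.get("cards", []):
            mblog = card.get("mblog")
            if mblog:
                yield mblog
        since_id = (data.get("cardlistInfo") or {}).get("since_id")
        if not since_id:
            break  # no further page advertised
        # Always build on the first URL so since_id never piles up.
        url = f"{first_url}{separator}since_id={since_id}"


# Offline demo with two made-up pages standing in for real responses.
fake_pages = {
    "https://example.invalid/api?uid=1": {
        "data": {"cardlistInfo": {"since_id": 7},
                 "cards": [{"mblog": {"mid": "a"}}, {"card_type": 11}]}},
    "https://example.invalid/api?uid=1&since_id=7": {
        "data": {"cardlistInfo": {}, "cards": [{"mblog": {"mid": "b"}}]}},
}
for mblog in crawl("https://example.invalid/api?uid=1", fake_pages.__getitem__, delay=0):
    print(mblog["mid"])
```

Against the real site, `fetch` could be something like `lambda u: requests.get(u, timeout=10).json()`. The page cap and the pause are there so that a test run stays short and polite.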


Complete code

import requests
import pprint  # only needed for the commented-out debug line below
import time

def start(url):
    next_url = url
    while next_url:
        response = requests.get(url=next_url)
        data = response.json()
        # pprint.pprint(data)  # pretty-print the response so it is easier to read
        cardlist_info = data['data']['cardlistInfo']
        since_id = cardlist_info.get('since_id', None)
        cards = data['data']['cards']
        for card in cards:
            # print(card)
            mblog = card.get('mblog', None)
            if mblog:
                # only extract when the card actually contains a post
                mid = mblog.get('mid', None)  # post id
                text = mblog.get('text', None)  # post text
                source = mblog.get('source', None)  # client the post was sent from
                author_name = mblog.get('user', {}).get('screen_name', None)  # author
                created_at = mblog.get('created_at', None)  # publication time
                # visible_type = mblog.get('visible', {}).get('type', None)
                # print(visible_type)
                print([text, source, author_name, mid])
                # print(created_at)

                # comments under this post
                comments_url = 'https://m.weibo.cn/comments/hotflow?id=' + mid + '&mid=' + mid + '&max_id_type=0'
                comments_response = requests.get(comments_url)
                comments_data = comments_response.json()
                # print(comments_data)

                if comments_data['ok'] == 1:  # only process posts that have comments
                    data_list = comments_data['data']['data']
                    for comment in data_list:
                        comment_text = comment.get('text', None)
                        comment_mid = comment.get('mid', None)
                        username = comment['user']['screen_name']
                        print([comment_mid, username, comment_text])
        if since_id:
            time.sleep(2)  # pause between pages
            next_url = url + '&since_id=' + str(since_id)
        else:
            next_url = None  # no since_id returned: last page reached

if __name__ == '__main__':
    url = 'URL of the first captured request'
    start(url)
