Scraping Xiaohongshu with Python


mob649e815e258d 2024-07-26 11:22:06


© Copyright belongs to the author. This is an original work by 51CTO blog author mob649e815e258d; contact the author for permission to reprint.

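An Introduction to Scraping Xiaohongshu Data with Python

Xiaohongshu is a popular social e-commerce platform where users share shopping tips and moments from daily life. This article shows how to scrape Xiaohongshu data with Python, including user information and note content.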

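Environment setup

Before you start, make sure the following libraries are installed in your Python environment:

requests: sends HTTP requests.
BeautifulSoup: parses HTML documents.
pandas: processes and exports data.

You can install them with the following command: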

pip install requests beautifulsoup4 pandas

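Scraping Xiaohongshu user information

We start with user information as an example of how to scrape data with Python.

Send an HTTP request to fetch the HTML of the user profile page.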

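import requests

url = '...'  # replace with the user's profile URL, including the actual username
response = requests.get(url, timeout=10)  # a timeout keeps a stalled connection from hanging the script
response.raise_for_status()  # raise on a 4xx/5xx status instead of parsing an error page
html = response.text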
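The request step can be wrapped in a small reusable helper. The following is a minimal sketch, not a definitive implementation: the names `build_session` and `fetch_html` and the User-Agent string are assumptions made for illustration, and the default headers may need adjusting for the actual site.

```python
import requests

# Hypothetical browser-like default headers; adjust to your needs.
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
}


def build_session(headers=None):
    """Return a requests.Session with the default headers applied."""
    session = requests.Session()
    session.headers.update(DEFAULT_HEADERS)
    if headers:
        session.headers.update(headers)
    return session


def fetch_html(session, url, timeout=10.0):
    """Fetch url and return the response body, or None on any request error."""
    try:
        response = session.get(url, timeout=timeout)
        response.raise_for_status()  # treat 4xx/5xx responses as failures
    except requests.RequestException as exc:
        print(f"request failed: {exc}")
        return None
    return response.text
```

Reusing one session keeps headers and cookies consistent across requests, and returning None on failure lets the caller skip a page instead of crashing.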

Parse the HTML with BeautifulSoup and extract the user information.

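from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
# The tag and class names below are placeholders; adjust them to the actual page structure.
user_info = soup.find('div', class_='user-info')
if user_info is None:
    raise ValueError('user-info block not found; check the selectors')
username = user_info.find('h2').text.strip()
followers = user_info.find('span', class_='count').text.strip()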
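Because `find` returns None when nothing matches, a chained `.text` call fails as soon as the page structure differs from what the code expects. The sketch below shows one defensive way to write the extraction step. The sample markup is invented to mirror the selectors used above, and `text_or_none` and `parse_user_info` are hypothetical helper names.

```python
from bs4 import BeautifulSoup

# Invented markup that mirrors the placeholder selectors in the article.
SAMPLE_HTML = """
<div class="user-info">
  <h2> demo_user </h2>
  <span class="count">1,234</span>
</div>
"""


def text_or_none(parent, name, **kwargs):
    """Return the stripped text of the first matching descendant, or None."""
    if parent is None:
        return None
    node = parent.find(name, **kwargs)
    return node.get_text(strip=True) if node is not None else None


def parse_user_info(html):
    """Extract username and follower count text; missing fields become None."""
    soup = BeautifulSoup(html, "html.parser")
    user_info = soup.find("div", class_="user-info")
    return {
        "username": text_or_none(user_info, "h2"),
        "followers": text_or_none(user_info, "span", class_="count"),
    }


print(parse_user_info(SAMPLE_HTML))
```

With this shape, a page that lacks the expected elements yields None values that can be filtered out later, rather than an AttributeError in the middle of a run.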

Store the scraped data in a pandas DataFrame.

import pandas as pd

data = {'username': [username], 'followers': [followers]}
df = pd.DataFrame(data)
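The follower count is stored here as the text scraped from the page, so it cannot be sorted or summed as a number. As a rough sketch of a cleaning step, the helper below converts displayed counts to integers. It assumes counts may appear with thousands separators or with a suffix such as "万" (ten thousand) or "k"; that display format is an assumption rather than something confirmed for the actual pages, and `parse_count` is a hypothetical name.

```python
from decimal import Decimal, InvalidOperation

# Assumed suffix multipliers; extend if the page uses other units.
UNIT_MULTIPLIERS = {"万": 10_000, "k": 1_000}


def parse_count(text):
    """Convert a displayed count such as '1,234' or '1.2万' to an int, or None."""
    if text is None:
        return None
    cleaned = text.strip().replace(",", "").lower()
    multiplier = 1
    if cleaned and cleaned[-1] in UNIT_MULTIPLIERS:
        multiplier = UNIT_MULTIPLIERS[cleaned[-1]]
        cleaned = cleaned[:-1]
    try:
        # Decimal avoids float rounding surprises when scaling values like 1.2
        return int(Decimal(cleaned) * multiplier)
    except (InvalidOperation, ValueError, OverflowError):
        return None
```

It could be applied to the column with something like `df['followers'] = df['followers'].map(parse_count)`.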

Scraping Xiaohongshu note content

Next, we look at how to scrape the content of a user's notes.

Build the URL of the user's notes list page.

notes_url = f'...'  # replace with the URL of the user's notes list page

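Send an HTTP request to fetch the HTML of the notes list page.

response = requests.get(notes_url, timeout=10)
response.raise_for_status()  # stop early on a 4xx/5xx status
notes_html = response.text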

Parse the HTML with BeautifulSoup and extract the note links.

notes_soup = BeautifulSoup(notes_html, 'html.parser')
notes_links = notes_soup.find_all('a', class_='note-item')
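Links found in a page are often relative, and the same note may be linked more than once. The following sketch turns the matched anchors into a list of unique absolute URLs using `urllib.parse.urljoin`. The function name `extract_note_urls` and the example.com URLs are placeholders for illustration, and the `note-item` class is the same placeholder selector used above.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_note_urls(list_html, base_url):
    """Return unique absolute URLs of all note links, in page order."""
    soup = BeautifulSoup(list_html, "html.parser")
    urls = []
    # href=True skips anchors that carry the class but have no href attribute
    for anchor in soup.find_all("a", class_="note-item", href=True):
        absolute = urljoin(base_url, anchor["href"])
        if absolute not in urls:
            urls.append(absolute)
    return urls


sample = (
    '<a class="note-item" href="/note/1">first</a>'
    '<a class="note-item" href="https://example.com/note/2">second</a>'
    '<a class="note-item">no link</a>'
)
print(extract_note_urls(sample, "https://example.com/user/demo"))
```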

Loop over the note links and scrape the full content of each note.

from urllib.parse import urljoin

for link in notes_links:
    # Resolve the href against the list page URL in case it is relative
    note_url = urljoin(notes_url, link['href'])
    note_response = requests.get(note_url, timeout=10)
    note_html = note_response.text

    note_soup = BeautifulSoup(note_html, 'html.parser')
    title = note_soup.find('h1').text.strip()
    content = note_soup.find('div', class_='note-content').text.strip()

    # Append the scraped note to the DataFrame
    note_data = {'title': title, 'content': content}
    note_df = pd.DataFrame([note_data])
    df = pd.concat([df, note_df], ignore_index=True)
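Calling `pd.concat` inside the loop copies the whole DataFrame on every iteration, and sending requests back to back puts unnecessary load on the server. One alternative, sketched below under stated assumptions, is to collect rows in a list, pause between requests, and build the DataFrame once at the end. The names `parse_note` and `collect_notes` are hypothetical, and `fetch` stands for any callable that takes a URL and returns HTML text or None.

```python
import time

import pandas as pd
from bs4 import BeautifulSoup


def parse_note(note_html):
    """Extract title and content from one note page; missing parts become None."""
    soup = BeautifulSoup(note_html, "html.parser")
    title = soup.find("h1")
    content = soup.find("div", class_="note-content")
    return {
        "title": title.get_text(strip=True) if title is not None else None,
        "content": content.get_text(strip=True) if content is not None else None,
    }


def collect_notes(note_urls, fetch, delay_seconds=1.0):
    """Fetch and parse each note URL, returning one DataFrame with all rows."""
    rows = []
    for index, url in enumerate(note_urls):
        if index > 0:
            time.sleep(delay_seconds)  # pause between consecutive requests
        html = fetch(url)
        if html is None:  # skip pages that could not be downloaded
            continue
        row = parse_note(html)
        row["url"] = url
        rows.append(row)
    return pd.DataFrame(rows, columns=["url", "title", "content"])
```

Keeping the notes in their own DataFrame also avoids mixing note rows with the single user-information row, which would otherwise leave most cells empty.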

Exporting the data

Finally, export the scraped data to a CSV file.

df.to_csv('xiaohongshu_data.csv', index=False)
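Since the scraped text is mostly Chinese, the file encoding matters if the CSV will be opened in a spreadsheet application. The sketch below writes the file as UTF-8 with a byte-order mark, which Excel uses to detect the encoding. The helper name `export_csv` is an assumption made for illustration.

```python
import pandas as pd


def export_csv(df, path):
    """Write df to path as UTF-8 with a byte-order mark, omitting the index."""
    df.to_csv(path, index=False, encoding="utf-8-sig")
```

For example, `export_csv(df, 'xiaohongshu_data.csv')` would replace the call above. When reading the file back with pandas, pass `encoding='utf-8-sig'` so the mark is not treated as part of the first column name.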

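Project schedule management

Sensible schedule management matters in a data scraping project. The following Gantt chart, written in Mermaid syntax, shows the main project phases and their timing.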

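gantt
    title Xiaohongshu data scraping project schedule
    dateFormat  YYYY-MM-DD
    section Environment setup
    Install libraries      :done,    des1, 2024-01-01,2024-01-02
    Configure environment  :active,  des2, after des1, 3d

    section Data scraping
    Scrape user information  :         des3, after des2, 5d
    Scrape note content      :         des4, after des3, 7d

    section Data processing
    Data cleaning  :         des5, after des4, 3d
    Data export    :         des6, after des5, 1d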