Hello, future pillars of the nation~ You probably spend a lot of time hunting for resources online, and for any anime fan, pictures are essential spiritual nourishment! So here is a super simple image-scraping mini project written in Python. Without further ado, the source code is below! (If it helps, a like or a bookmark would be much appreciated~)


1. Preparation.

  All you need is PyCharm and a working network connection. After creating a new project in PyCharm, be sure to create an empty folder named image in the same directory as the .py source file; the scraped images will be saved into that image folder.
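If you'd rather not create the folder by hand, a tiny snippet (assuming the script is run from the project root) can do it for you:

```python
import os

# Make the image folder next to the script if it doesn't exist yet;
# exist_ok=True means running this twice is harmless.
os.makedirs("image", exist_ok=True)
```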



2. Create a Python file and paste the code below straight into it. (Serve it up~)

import requests

"""
version1
"""
# https://image.baidu.com/search/index?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&fm=result&fr=&sf=1&fmq=1670164898311_R&pv=&ic=&nc=1&z=&hd=&latest=©right=&se=1&showtab=0&fb=0&width=&height=&face=0&istype=2&dyTabStr=MCwzLDYsNCw1LDEsOCw3LDIsOQ%3D%3D&ie=utf-8&sid=&word=%E5%8A%A8%E6%BC%AB

# the headers disguise the script as a browser so the server accepts the request;
# the Cookie is tied to one browser session, so replace it with your own if requests start failing
headers = {
    "Host": "image.baidu.com",
    "Cookie": "BDqhfp=%E5%8A%A8%E6%BC%AB%26%26-10-1undefined%26%268364%26%269; BIDUPSID=633DBE996B8FF3C8C468EEC5725F44F1; PSTM=1667711202; BAIDUID=633DBE996B8FF3C8033B690572DCDCE2:FG=1; BAIDUID_BFESS=633DBE996B8FF3C8033B690572DCDCE2:FG=1; ZFY=:AZLVAVUXbk0EZCCK98sVRIwyMPfIvN3lto4EyxILmRc:C; H_PS_PSSID=37855_36554_37840_37766_37760_26350_22158_37882; delPer=0; PSINO=6; BA_HECTOR=04240g058500a4al2l018oua1hopajr1g; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; BDRCVFR[dG2JNJb_ajR]=mk3SLVN4HKm; BDRCVFR[-pGxjrCMryR]=mk3SLVN4HKm; BDRCVFR[X_XKQks0S63]=mk3SLVN4HKm; firstShowTip=1; indexPageSugList=%5B%22%E5%8A%A8%E6%BC%AB%22%5D; cleanHistoryStatus=0; userFrom=null; ab_sr=1.0.1_ZmNiNzk0NWRiMjhjOGRiZDY1NjMwODkxYjdlMDAwMGJmMDhjYjMwNGM5MGQwOGFjMjJmN2Y1NTNmNzkxZjEzOWM3Yjg0NWQ3ZGVjNzMwOWMzMTQ2MzI1ZTcwY2JkYWIxOWNiZDMzNTk1ZjVkMWEzNWRiNjI2MGMyNTVkMjlkODVhOWY3Njk0MzFlOGI5N2UzNDQyNzNkZGUzMWQ1YzM2OQ==",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36"
}

# starting number for the image file names; defaults to 1 (change if you like)
num = 1
# number of result pages to scrape; the bigger it is, the more images you get
page = 11



def spider_picture(num, page):
    for i in range(1, page):  # one result page per iteration; the pn={i * 30} offset in the URL below is the pattern found by inspecting each page's request
        url = f"https://image.baidu.com/search/acjson?tn=resultjson_com&logid=8419943923068396246&ipn=rj&ct=201326592&is=&fp=result&fr=&word=%E5%8A%A8%E6%BC%AB&queryWord=%E5%8A%A8%E6%BC%AB&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=&hd=&latest=©right=&s=&se=&tab=&width=&height=&face=0&istype=2&qc=&nc=1&expermode=&nojc=&isAsync=&pn={i * 30}&rn=30&gsm=14a&1670165078825="
        response = requests.get(url, headers=headers, timeout=10)
        # print(response.text)
        # parse the response body as JSON so Python can work with it
        json_data = response.json()
        # the value under the "data" key holds the image links we need
        data_list = json_data["data"]
        for data in data_list[:-1]:  # the last element is an empty dict, so slice it off
            # print(data)
            # pull the URL of each image; some entries carry no hoverURL, so skip those
            picture_url = data.get("hoverURL")
            if not picture_url:
                continue
            # fetch the image itself; .content gives the raw binary data
            image_data = requests.get(picture_url, timeout=10).content
            # save the image into the image folder
            with open(f"image/pic_{num}.jpg", "wb") as fp:
                fp.write(image_data)
            num += 1


if __name__ == "__main__":
    spider_picture(1, 5)

3. The requests library is required here; install it if you don't already have it.

  • Option 1: hover over import requests in the code and click PyCharm's install prompt.
  • Option 2: open a cmd console and run the following:
pip install requests

4. Things to note.

  •   The parameter num sets the starting number for the image file names; it must be a number and defaults to 1, so the files come out as pic_1, pic_2, pic_3, and so on. (Change it as you like.)
  •   The parameter page is the number of result pages to scrape; the larger it is, the more images you get. (Change it as you like.)
  • Important: if you have already run the script once, move the downloaded images out of the image folder (or delete them) before running it a second time. Numbering restarts at pic_1, so any file with the same name will be overwritten; only files with new names survive. (Each image is the raw binary from .content, written to a .jpg file.)
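To sidestep the overwriting problem entirely, you could start numbering from the highest existing file instead of from 1. A small sketch (next_index is a hypothetical helper, not part of the script above):

```python
import os
import re


def next_index(folder="image"):
    # Look at the pic_N.jpg files already in the folder and return N+1
    # for the largest N found, or 1 if the folder has none.
    pattern = re.compile(r"pic_(\d+)\.jpg$")
    numbers = [int(m.group(1))
               for name in os.listdir(folder)
               if (m := pattern.match(name))]
    return max(numbers, default=0) + 1
```

Calling spider_picture(next_index(), page) would then append new images instead of clobbering the old ones.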

5. Extras.

  Below is a second version. If you feel like it, try implementing the music and video scraping methods declared in its interface as practice. (Nervous-newbie.jpg) Serving up version2 for you.

import requests
"""
version2
"""
# https://image.baidu.com/search/index?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&fm=result&fr=&sf=1&fmq=1670164898311_R&pv=&ic=&nc=1&z=&hd=&latest=©right=&se=1&showtab=0&fb=0&width=&height=&face=0&istype=2&dyTabStr=MCwzLDYsNCw1LDEsOCw3LDIsOQ%3D%3D&ie=utf-8&sid=&word=%E5%8A%A8%E6%BC%AB

headers = {
    "Host": "image.baidu.com",
    "Cookie": "BDqhfp=%E5%8A%A8%E6%BC%AB%26%26-10-1undefined%26%268364%26%269; BIDUPSID=633DBE996B8FF3C8C468EEC5725F44F1; PSTM=1667711202; BAIDUID=633DBE996B8FF3C8033B690572DCDCE2:FG=1; BAIDUID_BFESS=633DBE996B8FF3C8033B690572DCDCE2:FG=1; ZFY=:AZLVAVUXbk0EZCCK98sVRIwyMPfIvN3lto4EyxILmRc:C; H_PS_PSSID=37855_36554_37840_37766_37760_26350_22158_37882; delPer=0; PSINO=6; BA_HECTOR=04240g058500a4al2l018oua1hopajr1g; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; BDRCVFR[dG2JNJb_ajR]=mk3SLVN4HKm; BDRCVFR[-pGxjrCMryR]=mk3SLVN4HKm; BDRCVFR[X_XKQks0S63]=mk3SLVN4HKm; firstShowTip=1; indexPageSugList=%5B%22%E5%8A%A8%E6%BC%AB%22%5D; cleanHistoryStatus=0; userFrom=null; ab_sr=1.0.1_ZmNiNzk0NWRiMjhjOGRiZDY1NjMwODkxYjdlMDAwMGJmMDhjYjMwNGM5MGQwOGFjMjJmN2Y1NTNmNzkxZjEzOWM3Yjg0NWQ3ZGVjNzMwOWMzMTQ2MzI1ZTcwY2JkYWIxOWNiZDMzNTk1ZjVkMWEzNWRiNjI2MGMyNTVkMjlkODVhOWY3Njk0MzFlOGI5N2UzNDQyNzNkZGUzMWQ1YzM2OQ==",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36"
}

num = 1
page = 11


class spider_sp:
    """Interface sketch: implement the music and video methods as practice."""

    def spider_picture(self, num, page):
        pass

    def spider_music(self):
        pass

    def spider_video(self):
        pass


class spider(object):
    __flag = True
    __instance = None

    def __new__(cls, *args, **kwargs):
        # singleton: build the instance on the first call only, then always
        # hand back that same object
        if cls.__instance is None:
            cls.__instance = super().__new__(cls)
        return cls.__instance

    def __init__(self, file_number, page_number):
        # initialise only once, even though __init__ runs again on every
        # spider(...) construction
        if self.__flag:
            self.num = file_number
            self.page = page_number
            self.__flag = False

    def spider_picture(self):
        # read the settings stored on the instance by __init__
        num, page = self.num, self.page
        for i in range(1, page):
            url = f"https://image.baidu.com/search/acjson?tn=resultjson_com&logid=8419943923068396246&ipn=rj&ct=201326592&is=&fp=result&fr=&word=%E5%8A%A8%E6%BC%AB&queryWord=%E5%8A%A8%E6%BC%AB&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=&hd=&latest=©right=&s=&se=&tab=&width=&height=&face=0&istype=2&qc=&nc=1&expermode=&nojc=&isAsync=&pn={i * 30}&rn=30&gsm=14a&1670165078825="
            response = requests.get(url, headers=headers)
            # print(response.text)
            json_data = response.json()
            data_list = json_data["data"]
            for data in data_list[:-1]:  # the last element is an empty dict
                # print(data)
                # skip entries that carry no hoverURL
                picture_url = data.get("hoverURL")
                if not picture_url:
                    continue
                image_data = requests.get(picture_url, timeout=10).content
                with open(f"image/pic_{num}.jpg", "wb") as fp:
                    fp.write(image_data)
                num += 1


if __name__ == "__main__":
    worker_1 = spider(1, 5)
    try:
        worker_1.spider_picture()
    except Exception:
        print("Error while scraping images! Check the headers or the URL!")
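The spider class above is built around the singleton idea: create the object once, then have every later construction hand back the same instance. Stripped of the scraping details, a minimal correct sketch of the pattern looks like this:

```python
class Singleton:
    _instance = None

    def __new__(cls, *args, **kwargs):
        # Build the instance only on the first call; afterwards every
        # Singleton(...) expression returns the same object.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance
```

Because both constructions return the same object, Singleton() is Singleton() is True, which is why the script only ever holds one worker no matter how many times it is "constructed".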

  All done~ Clocking off~

Note: the images scraped here come from Baidu's public image gallery; the original page appears in the block comments in the code. I'm still learning, and this post is just practice; I'd love to improve together with the pros.


Hehe~ the code smells so nice, all soft and fluffy~ hehe~

I really won't take that like, ah, don't click it~ I said I didn't want it, but if you insist on giving it, well, I suppose there's nothing I can do.
