如何利用python造需要的数据集的数据 python创建一个数据集

转载

mob64ca13feda16 2023-10-21 18:03:35

文章标签 机器学习深度学习数据集 Python 人工智能 文章分类 Python 后端开发

机器学习或深度学习的第一步是获取数据集，一般我们使用业务数据集或公共数据集。本文将介绍使用 Bing Image Search API 和 Python 脚本，快速的建立自己的图片数据集。

1. 快速建立图片数据集，我们将使用 Bing Image Search API 建立自己的图片数据集。

首先进入 Bing Image Search API 网站：点击链接

如何利用python造需要的数据集的数据 python创建一个数据集_Python

点击“Get API Key”按钮

如何利用python造需要的数据集的数据 python创建一个数据集_Python_02

选择7天试用，点击“Get start”按钮

如何利用python造需要的数据集的数据 python创建一个数据集_Python_03

同意微软服务条款和勾选地区，点击“Next”按钮

如何利用python造需要的数据集的数据 python创建一个数据集_数据集_04

可以使用你的 Microsoft, Facebook, LinkedIn, 或 GitHub 账号登陆，我使用我的 GitHub 账号登陆。注册完成，进入Your APIs 页面。如下图所示：

如何利用python造需要的数据集的数据 python创建一个数据集_深度学习_05

向下拖动，可以查看可以使用的API列表和API Keys，注意红框部分，将在后面部分使用到。

如何利用python造需要的数据集的数据 python创建一个数据集_人工智能_06

至此，你已经有一个Bing Image Search API账号，并可以使用 Bing Image Search API 了。你可以访问：

了解更多关于 Bing Image Search API 如何使用的信息。下面将介绍编写Python脚本，使用 Bing Image Search API 下载图片。

2. 编写Python脚本下载图片

首先安装 requests 包，在终端执行命令

$ pip install requests

新建一个文件，命名为 search_bing_api.py，插入以下代码

# import the necessary packages
from requests import exceptions
import argparse
import requests
import os
import cv2

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-q", "--query", required=True,
	help="search query to search Bing Image API for")
args = vars(ap.parse_args())

query = args["query"]
output = "/Users/simon/AI/dataset/" + query

# set your Microsoft Cognitive Services API key along with (1) the
# maximum number of results for a given search and (2) the group size
# for results (maximum of 50 per request)
API_KEY = "YOUR Bing Image Search API Key"
MAX_RESULTS = 250
GROUP_SIZE = 50
 
# set the endpoint API URL
URL = "https://api.cognitive.microsoft.com/bing/v7.0/images/search"

# when attempting to download images from the web both the Python
# programming language and the requests library have a number of
# exceptions that can be thrown so let's build a list of them now
# so we can filter on them
EXCEPTIONS = set([IOError, FileNotFoundError,exceptions.RequestException, exceptions.HTTPError,exceptions.ConnectionError, exceptions.Timeout])


# store the search term in a convenience variable then set the
# headers and search parameters
term = query
headers = {"Ocp-Apim-Subscription-Key" : API_KEY}
params = {"q": term, "offset": 0, "count": GROUP_SIZE}
 
# make the search
print("[INFO] searching Bing API for '{}'".format(term))
search = requests.get(URL, headers=headers, params=params)
search.raise_for_status()
 
# grab the results from the search, including the total number of
# estimated results returned by the Bing API
results = search.json()
estNumResults = min(results["totalEstimatedMatches"], MAX_RESULTS)
print("[INFO] {} total results for '{}'".format(estNumResults,term))
 
# initialize the total number of images downloaded thus far
total = 0

# loop over the estimated number of results in `GROUP_SIZE` groups
for offset in range(0, estNumResults, GROUP_SIZE):
	# update the search parameters using the current offset, then
	# make the request to fetch the results
	print("[INFO] making request for group {}-{} of {}...".format(
		offset, offset + GROUP_SIZE, estNumResults))
	params["offset"] = offset
	search = requests.get(URL, headers=headers, params=params)
	search.raise_for_status()
	results = search.json()
	print("[INFO] saving images for group {}-{} of {}...".format(
		offset, offset + GROUP_SIZE, estNumResults))

# loop over the results
	for v in results["value"]:
		# try to download the image
		try:
			# make a request to download the image
			print("[INFO] fetching: {}".format(v["contentUrl"]))
			r = requests.get(v["contentUrl"], timeout=30)
 
			# build the path to the output image
			ext = v["contentUrl"][v["contentUrl"].rfind("."):]
			p = os.path.sep.join([output, "{}{}".format(str(total).zfill(8), ext)])
 
			# write the image to disk
			f = open(p, "wb")
			f.write(r.content)
			f.close()
			image = cv2.imread(p)
 
			# if the image is `None` then we could not properly load the
			# image from disk (so it should be ignored)
			if image is None:
				print("[INFO] deleting: {}".format(p))
				os.remove(p)
				continue
 
		# catch any errors that would not unable us to download the
		# image
		except Exception as e:
			# check to see if our exception is in our list of
			# exceptions to check for
			if type(e) in EXCEPTIONS:
				print("[INFO] skipping: {}".format(v["contentUrl"]))
				continue
 
		# update the counter
		total += 1

以上为所有的Python下载图片代码，注意以下红框部分替换为自己的文件目录和自己的 Bing Image Search API Key。

如何利用python造需要的数据集的数据 python创建一个数据集_数据集_07

3. 运行下载脚本，下载图片

创建图片存储主目录，在终端执行命令

$ mkdir dataset

创建当前下载内容的存储目录，在终端执行命令

$ mkdir dataset/pikachu

终端执行命令如下命令，开始下载图片

$ python search_bing_api.py --query "pikachu"

[INFO] searching Bing API for 'pikachu'
[INFO] 250 total results for 'pikachu'
[INFO] making request for group 0-50 of 250...
[INFO] saving images for group 0-50 of 250...
[INFO] fetching: http://images5.fanpop.com/image/photos/29200000/PIKACHU-pikachu-29274386-861-927.jpg
[INFO] skipping: http://images5.fanpop.com/image/photos/29200000/PIKACHU-pikachu-29274386-861-927.jpg
[INFO] fetching: http://images6.fanpop.com/image/photos/33000000/pikachu-pikachu-33005706-895-1000.png
[INFO] skipping: http://images6.fanpop.com/image/photos/33000000/pikachu-pikachu-33005706-895-1000.png
[INFO] fetching: http://images5.fanpop.com/image/photos/31600000/Pikachu-with-pokeball-pikachu-31615402-2560-2245.jpg

按照相同的方法下载其他图片：charmander，squirtle，bulbasaur，mewtwo

下载 charmander

$ mkdir dataset/charmander

$ python search_bing_api.py --query "charmander"

下载 squirtle

$ mkdir dataset/squirtle

$ python search_bing_api.py --query "squirtle"

下载 bulbasaur

$ mkdir dataset/bulbasaur

$ python search_bing_api.py --query "bulbasaur"

下载 mewtwo

$ mkdir dataset/mewtwo

$ python search_bing_api.py --query "mewtwo"

下载的图片如下图所示

如何利用python造需要的数据集的数据 python创建一个数据集_数据集_08

下载全部完成大约需要30多分钟时间，最终五个文件夹下的图片内容如下

如何利用python造需要的数据集的数据 python创建一个数据集_Python_09

为了更好的训练模型，我们应该进行图片筛选，将不合适的图片删除掉。比如在某一个分类文件夹下，将不属于这个分类的图片删除掉，将包含了其他分类的图片删除等。筛选方法为，打开文件夹，浏览图片，手工进行筛选。

至此我们已经建立自己的图片数据集了。

本文章为转载内容，我们尊重原作者对文章享有的著作权。如有内容错误或侵权问题，欢迎原作者联系我们进行内容更正或删除文章。

上一篇：java中的超文本如何存储在数据库超文本实例

下一篇：python代码块的关系 python用什么代表代码块

提问和评论都可以，用心的回复会被更多人看到评论

发布评论

相关文章

官方博客	全部文章	热门标签	班级博客
了解我们	网站地图	意见反馈

鸿蒙开发者社区	51CTO学堂
51CTO	软考资讯