Introduction to the semantic kitti dataset

Reposted

mob64ca140651e5 2024-07-11 18:02:16


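Contents

Scikit-learn datasets

Overview of the datasets
Toy datasets

Diabetes dataset

About the diabetes dataset
Using the dataset

Generating datasets that follow a given probability distribution

1. Single-label datasets
2. Datasets with noise
3. Gaussian-distributed datasets
4. Generating binary classification data

Other datasets

Image datasets
Datasets in svmlight/libsvm format
Downloading datasets from openml.org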

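Scikit-learn datasets

Before using the scikit-learn datasets, import the corresponding package, sklearn.datasets. For the full details of each dataset, refer to the official documentation.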

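Overview of the datasets

The dataset loaders. These load the small datasets bundled with the sklearn package, such as the toy datasets.
The dataset fetchers. These download and load larger datasets.

Both kinds of function return a dictionary-like object with at least two items: the key data holds an array of shape n_samples * n_features, and the key target holds a numpy array of length n_samples containing the labels.

To get just the tuple (data, target), set the input parameter return_X_y to True.

The dataset generation functions. These generate common synthetic datasets, for example data that follows a Gaussian distribution. They return a tuple (X, y), where X has shape n_samples * n_features and y holds the corresponding targets.
Other datasets are also supported, such as images and datasets in svmlight/libsvm format.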
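
As a minimal sketch of the two return styles described above, the snippet below calls a loader both ways. load_iris is used purely as an illustrative choice, since it ships with scikit-learn and needs no download.

```python
from sklearn.datasets import load_iris

# Default call: a dictionary-like object with at least "data" and "target".
iris = load_iris()
print(iris.data.shape)    # (n_samples, n_features)
print(iris.target.shape)  # (n_samples,)

# return_X_y=True: just the (data, target) tuple.
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)
```

The dictionary-like form is handy for exploring metadata, while the tuple form plugs straight into an estimator's fit method.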

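Toy datasets

Loader	Description	Task	Size (samples*features)
load_boston([return_X_y])	Boston house prices (removed in scikit-learn 1.2)	Regression	506*13
load_iris([return_X_y])	Iris dataset	Classification	150*4
load_diabetes([return_X_y])	Diabetes dataset	Regression	442*10
load_digits([n_class, return_X_y])	Handwritten digits dataset	Classification	1797*64
load_linnerud([return_X_y])	Linnerud physical exercise dataset	Multi-output regression	20*3
load_wine([return_X_y])	Wine dataset	Classification	178*13
load_breast_cancer([return_X_y])	Breast cancer dataset	Classification	569*30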

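Diabetes dataset

About the diabetes dataset

This diabetes dataset has 442 rows and 10 attributes: Age, Sex, Body mass index, Average blood pressure, and S1~S6, six blood serum measurements. The target is a quantitative measure of disease progression one year after baseline.

Using the dataset

The following statements print information about the dataset: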

from sklearn import datasets
diabetes = datasets.load_diabetes()                         # load the dataset
print(diabetes.data)                                         # feature matrix
print(diabetes.target)                                       # target values
print(diabetes.feature_names)                                # feature names
print('Rows: ', len(diabetes.data), len(diabetes.target))    # number of rows in data and target
print('Features: ', len(diabetes.data[0]))                   # number of features per row
print('Shape: ', diabetes.data.shape)                        # array shape (rows, features)
print(type(diabetes.data), type(diabetes.target))            # container types

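You can also read the dataset's DESCR attribute for its full description; some datasets additionally provide feature_names and target_names.

To get the data directly, use diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y = True). Return values: X, of shape [n_samples, n_features], is the feature matrix; y, of shape [n_samples], holds the target value for each sample.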

# Diabetes example: import the package
from sklearn import datasets
# Load the diabetes dataset
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y = True)
# Size of the dataset: (442, 10)
print(diabetes_X.shape)
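
Since the diabetes data is a regression task, one natural next step is to fit a linear model on it. The following is only a sketch under stated assumptions: the 80/20 split, the random_state value, and the choice of LinearRegression are illustrative and do not come from the original article.

```python
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split

# Load features and targets as plain arrays.
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)

# Hold out 20% of the rows for evaluation (split ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    diabetes_X, diabetes_y, test_size=0.2, random_state=0
)

# Fit ordinary least squares on the training rows.
model = linear_model.LinearRegression()
model.fit(X_train, y_train)

# One coefficient per feature; score() returns R^2 on the held-out rows.
print(model.coef_.shape)
print(model.score(X_test, y_test))
```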

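Generating datasets that follow a given probability distribution

1. Single-label datasets

make_blobs generates multi-class datasets and gives fine control over the center and standard deviation of each class.

Signature: sklearn.datasets.make_blobs(n_samples=100, n_features=2, centers=3, cluster_std=1.0, center_box=(-10.0, 10.0), shuffle=True, random_state=None)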

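Parameter	Type	Default	Description
n_samples	int	optional (default=100)	Total number of points, divided equally among the clusters
n_features	int	optional (default=2)	Number of features per sample
centers	int or array of center coordinates	optional (default=3)	Number of centers to generate, or the fixed center locations
cluster_std	float or sequence of floats	optional (default=1.0)	Standard deviation of the clusters
center_box	pair of floats (min, max)	optional (default=(-10.0, 10.0))	Bounding box for each cluster center when centers are generated at random
shuffle	boolean	optional (default=True)	Shuffle the samples
random_state	int, RandomState instance or None	optional (default=None)	If int, random_state is the seed used by the random number generator; if a RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random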

Example: generate two classes of points around two cluster centers at (-3, -3) and (3, 3), with standard deviations 0.5 and 0.7; there are 1,000 samples, each with 2 features.

from sklearn.datasets import make_blobs
centers = [(-3, -3), (3, 3)]
cluster_std = [0.5, 0.7]
X, y = make_blobs(n_samples=1000, centers=centers, n_features=2, random_state=0, cluster_std=cluster_std)

# In a Jupyter notebook, uncomment the next line to render plots inline
# %matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.figure(figsize=(20, 5))
plt.subplot(1, 2, 1)                               # left: scatter of the two features, colored by class
plt.scatter(X[:, 0], X[:, 1], c=y, alpha=0.7)
plt.subplot(1, 2, 2)                               # right: histogram of the class labels
plt.hist(y)
plt.show()

Example: generate three classes of points around three cluster centers, with standard deviations 0.5, 0.7 and 0.5, and 2,000 samples.

from sklearn.datasets import make_blobs
centers = [(-3, -3), (0, 0), (3, 3)]
cluster_std = [0.5, 0.7, 0.5]
X, y = make_blobs(n_samples=2000, centers=centers, n_features=2, random_state=0, cluster_std=cluster_std)

# In a Jupyter notebook, uncomment the next line to render plots inline
# %matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.figure(figsize=(20, 5))
plt.subplot(1, 2, 1)                               # left: scatter of the two features, colored by class
plt.scatter(X[:, 0], X[:, 1], c=y, alpha=0.7)
plt.subplot(1, 2, 2)                               # right: histogram of the class labels
plt.hist(y)
plt.show()
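
To check numerically, rather than by eye, that the generated blobs follow the requested centers and standard deviations, a sketch along these lines can help. It reuses the two-center configuration from the first example; the summary statistics printed here are an illustrative addition.

```python
import numpy as np
from sklearn.datasets import make_blobs

centers = [(-3, -3), (3, 3)]
cluster_std = [0.5, 0.7]
X, y = make_blobs(n_samples=1000, centers=centers, n_features=2,
                  cluster_std=cluster_std, random_state=0)

# Samples are split evenly across the centers.
counts = np.bincount(y)
print(counts)

# Per-cluster sample mean and standard deviation, to compare with the inputs.
for label in np.unique(y):
    points = X[y == label]
    print(label, points.mean(axis=0), points.std(axis=0))
```

The label of each sample is the index of its center in the centers list, so the printed means should sit close to (-3, -3) and (3, 3).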

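2. Datasets with noise

Both make_blobs and make_classification create multi-class datasets by allocating each class one or more normally distributed clusters of points. make_blobs gives more control over the center and standard deviation of each cluster, which makes it convenient for demonstrating clustering. make_classification specializes in introducing noise by way of: correlated, redundant and uninformative features; multiple Gaussian clusters per class; and linear transformations of the feature space.

Signature: sklearn.datasets.make_classification(n_samples=100, n_features=20, n_informative=2, n_redundant=2, n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None)

Return values:
* X: array of shape [n_samples, n_features], the feature matrix
* y: array of shape [n_samples], the integer class label for each row of the matrix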

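Parameter	Type	Default	Description
n_samples	int	optional (default=100)	Number of samples
n_features	int	optional (default=20)	Total number of features. These comprise n_informative informative features, n_redundant redundant features, n_repeated duplicated features, and n_features-n_informative-n_redundant-n_repeated useless features drawn at random
n_informative	int	optional (default=2)	Number of informative features
n_redundant	int	optional (default=2)	Number of redundant features
n_repeated	int	optional (default=0)	Number of duplicated features
n_classes	int	optional (default=2)	Number of classes (labels)
n_clusters_per_class	int	optional (default=2)	Number of clusters per class
weights	list of floats or None	(default=None)	Proportion of samples assigned to each class
flip_y	float	optional (default=0.01)	Fraction of samples whose class is assigned at random
class_sep	float	optional (default=1.0)	Factor multiplying the hypercube size; larger values spread the clusters apart and make classification easier
hypercube	boolean	optional (default=True)	If True, clusters are placed on the vertices of a hypercube. If False, they are placed on the vertices of a random polytope
shift	float, array of shape [n_features] or None	optional (default=0.0)	Shift features by the specified value. If None, features are shifted by a random value drawn in [-class_sep, class_sep]
scale	float, array of shape [n_features] or None	optional (default=1.0)	Multiply features by the specified value. If None, features are scaled by a random value drawn in [1, 100]. Note that scaling happens after shifting
shuffle	boolean	optional (default=True)	Shuffle the samples and the features
random_state	int, RandomState instance or None	optional (default=None)	If int, random_state is the seed used by the random number generator; if a RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random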

from sklearn.datasets import make_classification
X, y = make_classification(n_samples=2000, n_features=10, n_informative=4, n_classes=4, random_state=0)

# In a Jupyter notebook, uncomment the next line to render plots inline
# %matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.figure(figsize=(20, 5))
plt.subplot(1, 2, 1)                               # left: first two of the ten features, colored by class
plt.scatter(X[:, 0], X[:, 1], c=y, alpha=0.7)
plt.subplot(1, 2, 2)                               # right: histogram of the class labels
plt.hist(y)
plt.show()
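
The difference between informative and redundant features is easier to see with shuffle=False. According to the scikit-learn documentation, the columns are then stacked in order: informative features first, then redundant ones (linear combinations of the informative features), then repeated ones, then noise. The sketch below checks this through the matrix rank; the parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification

n_informative, n_redundant = 4, 2
X, y = make_classification(n_samples=2000, n_features=10,
                           n_informative=n_informative,
                           n_redundant=n_redundant, n_classes=4,
                           shuffle=False, random_state=0)

# Informative and redundant columns come first when shuffle=False.
block = X[:, :n_informative + n_redundant]

# Redundant columns are linear combinations of the informative ones,
# so they add no rank to the block.
rank = np.linalg.matrix_rank(block)
print(block.shape, rank)
print(np.unique(y))
```

With the default shuffle=True the same columns exist but are permuted, so they can no longer be located by position.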

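3. Gaussian-distributed datasets

Signature: sklearn.datasets.make_gaussian_quantiles(mean=None, cov=1.0, n_samples=100, n_features=2, n_classes=3, shuffle=True, random_state=None)

Parameter	Type	Default	Description
mean	array of shape [n_features]	optional (default=None)	Mean of the multi-dimensional normal distribution. If None, the origin (0, 0, …) is used
cov	float	optional (default=1.)	The covariance matrix is cov times the identity matrix, so this dataset only produces symmetric normal distributions
n_samples	int	optional (default=100)	Total number of points, divided equally among the classes
n_features	int	optional (default=2)	Number of features per sample
n_classes	int	optional (default=3)	Number of classes
shuffle	boolean	optional (default=True)	Shuffle the samples
random_state	int, RandomState instance or None	optional (default=None)	If int, random_state is the seed used by the random number generator; if a RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random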

from sklearn.datasets import make_gaussian_quantiles
X, y = make_gaussian_quantiles(mean=(1, 1), cov=1.0, n_samples=1000, n_features=2, n_classes=2, shuffle=True, random_state=None)

# In a Jupyter notebook, uncomment the next line to render plots inline
# %matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.figure(figsize=(20, 5))
plt.subplot(1, 2, 1)                               # left: scatter of the two features, colored by class
plt.scatter(X[:, 0], X[:, 1], c=y, alpha=0.7)
plt.subplot(1, 2, 2)                               # right: histogram of the class labels
plt.hist(y)
plt.show()
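
The parameter table does not say how the labels are assigned. According to the scikit-learn documentation, make_gaussian_quantiles labels points by quantile of their distance from the mean, so the classes form nested concentric shells with roughly equal sample counts. A minimal sketch to check this follows; random_state=0 is an assumption made here for reproducibility.

```python
import numpy as np
from sklearn.datasets import make_gaussian_quantiles

mean = np.array([1.0, 1.0])
X, y = make_gaussian_quantiles(mean=mean, cov=1.0, n_samples=1000,
                               n_features=2, n_classes=2, random_state=0)

# Squared distance of every point from the distribution mean.
sq_dist = np.sum((X - mean) ** 2, axis=1)

counts = np.bincount(y)
inner_max = sq_dist[y == 0].max()  # farthest point of the inner class
outer_min = sq_dist[y == 1].min()  # nearest point of the outer class
print(counts, inner_max, outer_min)
```

Because the class boundary is a circle rather than a line, this generator is a convenient test case for classifiers that must learn non-linear decision boundaries.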

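4. Generating binary classification data

make_hastie_10_2 generates data for binary classification.

Parameter	Type	Default	Description
n_samples	int	optional (default=12000)	Number of samples
random_state	int, RandomState instance or None	optional (default=None)	If int, random_state is the seed used by the random number generator; if a RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random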

from sklearn.datasets import make_hastie_10_2
X, y = make_hastie_10_2(n_samples=1000, random_state=None)

# In a Jupyter notebook, uncomment the next line to render plots inline
# %matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.figure(figsize=(20, 5))
plt.subplot(1, 2, 1)                               # left: first two of the ten features, colored by class
plt.scatter(X[:, 0], X[:, 1], c=y, alpha=0.7)
plt.subplot(1, 2, 2)                               # right: histogram of the class labels
plt.hist(y)
plt.show()
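
For context, the scikit-learn documentation describes this dataset as ten independent standard Gaussian features, with the target set to 1 when the sum of squared features exceeds 9.34 and to -1 otherwise. The sketch below recomputes the labels from that rule; random_state=0 is an assumption added for reproducibility.

```python
import numpy as np
from sklearn.datasets import make_hastie_10_2

X, y = make_hastie_10_2(n_samples=1000, random_state=0)

# Recompute the labels from the documented rule.
expected = np.where(np.sum(X ** 2, axis=1) > 9.34, 1.0, -1.0)

print(X.shape)                      # ten features per sample
print(np.unique(y))                 # labels are -1 and 1
print(np.array_equal(y, expected))  # generated labels follow the rule
```

Note that the labels are -1 and 1 rather than 0 and 1, which matters for code that assumes non-negative class indices.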

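Other datasets

Image datasets

scikit-learn ships with a couple of JPEG images that can be used to test algorithms on 2D data.

Loader	Description
load_sample_images()	Loads the sample images; returns the 2 bundled images
load_sample_image(image_name)	Loads a single bundled sample image by name and returns it as a numpy array

Images are stored with the uint8 dtype by default. Machine learning algorithms often work best if the input is first converted to a floating point representation. Also, if you plan to use matplotlib.pyplot.imshow, don't forget to scale the values to the range 0-1.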
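
The conversion itself is a one-liner. In the sketch below a small synthetic uint8 array stands in for a bundled photo (an assumption, so that the snippet needs only NumPy); an array returned by load_sample_image would be handled the same way.

```python
import numpy as np

# Synthetic 4x6 RGB image standing in for a bundled photo (assumption).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(4, 6, 3), dtype=np.uint8)

# Convert to float and scale into [0, 1].
image_float = image.astype(np.float64) / 255

# Flatten to one row per pixel, the layout estimators expect.
rows, cols, depth = image_float.shape
pixels = image_float.reshape(rows * cols, depth)
print(image_float.dtype, image_float.min(), image_float.max(), pixels.shape)
```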

# Authors: Robert Layton <robertlayton@gmail.com>
#          Olivier Grisel <olivier.grisel@ensta.org>
#          Mathieu Blondel <mathieu@mblondel.org>
#
# License: BSD 3 clause

print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin
from sklearn.datasets import load_sample_image
from sklearn.utils import shuffle
from time import time

n_colors = 64

# Load the Summer Palace photo
china = load_sample_image("china.jpg")

# Convert to floats instead of the default 8-bit integer coding. Dividing by
# 255 is important so that plt.imshow works well on float data (which needs to
# be in the range [0-1])
china = np.array(china, dtype=np.float64) / 255

# Load Image and transform to a 2D numpy array.
w, h, d = original_shape = tuple(china.shape)
assert d == 3
image_array = np.reshape(china, (w * h, d))

print("Fitting model on a small sub-sample of the data")
t0 = time()
image_array_sample = shuffle(image_array, random_state=0)[:1000]
kmeans = KMeans(n_clusters=n_colors, random_state=0).fit(image_array_sample)
print("done in %0.3fs." % (time() - t0))

# Get labels for all points
print("Predicting color indices on the full image (k-means)")
t0 = time()
labels = kmeans.predict(image_array)
print("done in %0.3fs." % (time() - t0))


codebook_random = shuffle(image_array, random_state=0)[:n_colors]
print("Predicting color indices on the full image (random)")
t0 = time()
labels_random = pairwise_distances_argmin(codebook_random,
                                          image_array,
                                          axis=0)
print("done in %0.3fs." % (time() - t0))


def recreate_image(codebook, labels, w, h):
    """Recreate the (compressed) image from the code book & labels"""
    d = codebook.shape[1]
    image = np.zeros((w, h, d))
    label_idx = 0
    for i in range(w):
        for j in range(h):
            image[i][j] = codebook[labels[label_idx]]
            label_idx += 1
    return image

# Display all results, alongside original image
plt.figure(1)
plt.clf()
plt.axis('off')
plt.title('Original image (96,615 colors)')
plt.imshow(china)

plt.figure(2)
plt.clf()
plt.axis('off')
plt.title('Quantized image (64 colors, K-Means)')
plt.imshow(recreate_image(kmeans.cluster_centers_, labels, w, h))

plt.figure(3)
plt.clf()
plt.axis('off')
plt.title('Quantized image (64 colors, Random)')
plt.imshow(recreate_image(codebook_random, labels_random, w, h))
plt.show()
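
The nested loops in recreate_image can also be written with NumPy indexing. The sketch below compares both versions on a small synthetic codebook; recreate_image_vectorized is a hypothetical helper name, and the array sizes are assumptions chosen to keep the example fast.

```python
import numpy as np

def recreate_image(codebook, labels, w, h):
    """Loop version, as in the listing above."""
    d = codebook.shape[1]
    image = np.zeros((w, h, d))
    label_idx = 0
    for i in range(w):
        for j in range(h):
            image[i][j] = codebook[labels[label_idx]]
            label_idx += 1
    return image

def recreate_image_vectorized(codebook, labels, w, h):
    """Hypothetical helper: same result via NumPy fancy indexing."""
    return codebook[labels].reshape(w, h, -1)

# Small synthetic codebook and labels (assumed values for illustration).
rng = np.random.default_rng(0)
codebook = rng.random((5, 3))             # 5 colors, RGB
labels = rng.integers(0, 5, size=4 * 6)   # one color index per pixel
same = np.array_equal(recreate_image(codebook, labels, 4, 6),
                      recreate_image_vectorized(codebook, labels, 4, 6))
print(same)
```

Indexing the codebook with the whole label array looks up every pixel's color in one step, and the reshape restores the image layout in the same row-major order the loops use.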

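Datasets in svmlight/libsvm format

SVMlight is an SVM toolkit that also supports semi-supervised (transductive) SVMs. LIBSVM is a simple, easy-to-use, fast and effective software package for SVM pattern recognition and regression, developed by Prof. Chih-Jen Lin of National Taiwan University and colleagues.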

from sklearn.datasets import load_svmlight_file, load_svmlight_files
# Load a single dataset
X_train, y_train = load_svmlight_file("/path/to/train_dataset.txt")
# Load several datasets at once
X_train, y_train, X_test, y_test = load_svmlight_files(("/path/to/train_dataset.txt", "/path/to/test_dataset.txt"))
# Make sure X_test has the same number of features as X_train
X_test, y_test = load_svmlight_file("/path/to/test_dataset.txt", n_features=X_train.shape[1])
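
The paths in the listing above are placeholders. For a version that runs as-is, the sketch below first writes a tiny file with dump_svmlight_file and then reads it back; the matrix values and the file name are assumptions chosen for illustration. It also shows why n_features matters: the format stores only non-zero entries, so trailing all-zero columns cannot be inferred from the file.

```python
import os
import tempfile

import numpy as np
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

# Tiny dense matrix with zeros; the last column is empty on purpose.
X = np.array([[1.5, 0.0, 0.0],
              [0.0, 2.0, 0.0]])
y = np.array([0.0, 1.0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.svmlight")  # hypothetical file name
    dump_svmlight_file(X, y, path, zero_based=True)

    # Without n_features the empty last column would be dropped.
    X_loaded, y_loaded = load_svmlight_file(path, n_features=X.shape[1],
                                            zero_based=True)

print(X_loaded.shape)      # sparse matrix with the requested width
print(X_loaded.toarray())
print(y_loaded)
```

The loader returns X as a SciPy sparse matrix, which is why toarray() is called before printing.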

Downloading datasets from openml.org

openml.org is a public repository for machine learning data and experiments.

import numpy as np
from sklearn.datasets import fetch_openml
mice = fetch_openml(name='miceprotein', version=4)
print(mice.data.shape) # (1080, 77)
print(mice.target.shape) # (1080,)
print(np.unique(mice.target)) # 'c-CS-m', 'c-CS-s', 'c-SC-m', 'c-SC-s', 't-CS-m', 't-CS-s', 't-SC-m', 't-SC-s'
print(mice.DESCR)
print(mice.details)
print(mice.url)

