1、简介

scikit-learn是一个建立在Scipy基础上的用于机器学习的python模块,而在不同的领域中已经发展出为数众多的基于Scipy的工具包,它们被统一称为Scikits,而在所有的分支版本中,scikit-learn是最有名的。它是开源的,任何人都可以免费地使用它或者进行二次发行。

scikit-learn包含众多定级机器学习算法,它主要有6大类的基本功能,分别是分类,回归,聚类,数据降维,模型选择和数据预处理。

机器学习官方API链接

sklearn dataset 模块学习

2、重点函数讲解

sklearn.datasets.make_blobs(n_samples=100n_features=2centers=Nonecluster_std=1.0center_box=(-10.010.0)shuffle=Truerandom_state=None)[source]   此函数常用来生成测试数据集

Generate isotropic Gaussian blobs for clustering.

Read more in the User Guide.

Parameters:

n_samples : int or array-like, optional (default=100)

If int, it is the total number of points equally divided among clusters. If array-like, each element of the sequence indicates the number of samples per cluster.

n_features : int, optional (default=2)

The number of features for each sample.代表每个物体的特性数,可以决定输出X中的列数

centers : int or array of shape [n_centers, n_features], optional

(default=None) The number of centers to generate, or the fixed center locations. If n_samples is an int and centers is None, 3 centers are generated. If n_samples is array-like, centers must be either None or an array of length equal to the length of n_samples.表示生成数据在图上绘制出几个集合

cluster_std : float or sequence of floats, optional (default=1.0)

The standard deviation of the clusters.生成数据的标准差大小,标准差越大,则数据点越离散,否则则相反,默认给标准差大小为1,与默认给的center_box的比较合适,如果想调整这个大小,则与之相对应的center_box大小成成正比调整,到时绘制的点离散度比较合适,不然就会造成生成数据的点过于离散或者过于聚合

center_box : pair of floats (min, max), optional (default=(-10.0, 10.0))

The bounding box for each cluster center when centers are generated at random.调整生成数据的边界值

shuffle : boolean, optional (default=True)

Shuffle the samples.相当于打乱顺序

random_state : int, RandomState instance or None (default)

Determines random number generation for dataset creation. Pass an int for reproducible output across multiple function calls. See Glossary.设置生成数据的随机值,如果想控制每次产生的数据值是一样的,则使用这个参数传递一个合适的随机值,可以保证每次生成的数据值都一样,有利于重复试验;如果不传递随机值,则每次生成的数据则不一样;其余的函数传递的随机值含义也一样

Returns:

X : array of shape [n_samples, n_features]

The generated samples.生成数据的shape为(n_sample,centers)

y : array of shape [n_samples]

The integer labels for cluster membership of each sample.如果选择的特性数n,则生成数据值由0到n-1一维数组组成

# 使用示例
X, y = make_blobs(n_samples=100, n_features=2, centers=2, random_state=0, cluster_std=1.0)

 sklearn.model_selection.train_test_split(*arrays**options)[source]   交叉生成训练数据集和测试数据集的函数

Split arrays or matrices into random train and test subsets

Quick utility that wraps input validation and next(ShuffleSplit().split(X, y)) and application to input data into a single call for splitting (and optionally subsampling) data in a oneliner.

Read more in the User Guide.

Parameters:

*arrays : sequence of indexables with same length / shape[0]  

Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.  传入生成数据集,X,y

test_size : float, int or None, optional (default=0.25)  现在推荐使用test_size而不是train_size;指定划分数据集中测试数据集所占的比率

If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. By default, the value is set to 0.25. The default will change in version 0.21. It will remain 0.25 only if train_size is unspecified, otherwise it will complement the specified train_size.

train_size : float, int, or None, (default=None) 指定划分训练数据集的比率,与test_size可以同时使用,但是同时使用的比较少

If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.

random_state : int, RandomState instance or None, optional (default=None)   指定随机划分时的随机种子,如果想要划分的数据集每次都一样的话,就指定一个随机值参数

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

shuffle : boolean, optional (default=True)  打乱数据集

Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None.

stratify : array-like or None (default=None) 一般传递y数组值,按照y中各类数据的比例分配给train和test

If not None, data is split in a stratified fashion, using this as the class labels.

Returns:

splitting : list, length=2 * len(arrays)

List containing train-test split of inputs.

New in version 0.16: If the input is sparse, the output will be a scipy.sparse.csr_matrix. Else, output type is the same as the input type.

#使用示例: 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=10, stratify=y)

sklearn.datasets.make_regression(n_samples=100n_features=100n_informative=10n_targets=1bias=0.0effective_rank=Nonetail_strength=0.5noise=0.0shuffle=Truecoef=Falserandom_state=None)[source]  生成回归预测数据集

Generate a random regression problem.

The input set can either be well conditioned (by default) or have a low rank-fat tail singular profile. See make_low_rank_matrix for more details.

The output is generated by applying a (potentially biased) random linear regression model with n_informative nonzero regressors to the previously generated input and some gaussian centered noise with some adjustable scale.

Read more in the User Guide.

Parameters:

n_samples : int, optional (default=100)  生成数据的个数

The number of samples.

n_features : int, optional (default=100)  生成数据的特性数

The number of features.

n_informative : int, optional (default=10)  生成数据参与建模的特性个数

The number of informative features, i.e., the number of features used to build the linear model used to generate the output.

n_targets : int, optional (default=1)  目标因变量的个数

The number of regression targets, i.e., the dimension of the y output vector associated with a sample. By default, the output is a scalar.

bias : float, optional (default=0.0)  偏差(截距)

The bias term in the underlying linear model.

effective_rank : int or None, optional (default=None)

if not None:

The approximate number of singular vectors required to explain most of the input data by linear combinations. Using this kind of singular spectrum in the input allows the generator to reproduce the correlations often observed in practice.

if None:

The input set is well conditioned, centered and gaussian with unit variance.

tail_strength : float between 0.0 and 1.0, optional (default=0.5)

The relative importance of the fat noisy tail of the singular values profile if effective_rank is not None.

noise : float, optional (default=0.0) 噪音值,也就是标准差

The standard deviation of the gaussian noise applied to the output.

shuffle : boolean, optional (default=True)

Shuffle the samples and the features.

coef : boolean, optional (default=False)  是否输出coef标识,默认不输出

If True, the coefficients of the underlying linear model are returned.

random_state : int, RandomState instance or None (default)

Determines random number generation for dataset creation. Pass an int for reproducible output across multiple function calls. See Glossary.

Returns:

X : array of shape [n_samples, n_features]

The input samples.

y : array of shape [n_samples] or [n_samples, n_targets]

The output values.

coef : array of shape [n_features] or [n_features, n_targets], optional

The coefficient of the underlying linear model. It is returned only if coef is True.

3、函数使用简要说明

sklearn相关函数

函数

使用说明

sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto')

n_neighbors调节临近点的个数,一般调整预测值需要用这个参数;weights调整权重,uniform表示初始权重全部一样的;algorithm更换训练算法,auto表示尝试选择一个最佳的算法进行预测

x_train, x_test, y_train, y_test = sklearn.cross_validation.train_test_split(x, y, test_size = 0.2,random_state=0)

将原始数据划分成训练数据集合测试数据集,根据test_size参数调整测试数据集合训练数据集的数据各占用总数据的比率