1.OverFitting

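Overfitting is a very common problem when training models. A model overfits when it performs well on the training set but poorly on the test set. In practice there are many ways to reduce overfitting and improve generalization. One common approach is to hold out part of the available data as a test set: the data is split into X_train and X_test, the model is trained on X_train, and X_test is used to validate and adjust it.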
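
To make the train/test gap concrete, here is a minimal sketch. The synthetic data set, the decision tree, and all parameter values are assumptions chosen for illustration; they are not from the original experiment.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data: flip_y randomly reassigns 20% of the labels.
X, y = make_classification(n_samples=200, n_features=20, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# A tree with no depth limit can memorise the training set, noise included.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print("train accuracy:", train_acc)
print("test accuracy: ", test_acc)
```

A large gap between the two numbers is the typical sign of overfitting.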

2.train_test_split

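train_test_split in sklearn is the usual way to split data into a training set and a test set. It randomly splits the data into two parts according to the given test_size. The difference from Cross Validation, today's topic, is that train_test_split splits the data only once.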

A quick test:

def train_split_test_demo():
    from sklearn.model_selection import train_test_split
    import numpy as np

    x = np.random.randint(1, 100, 20).reshape(10, 2)
    y = np.random.randint(0, 2, 10).reshape(10, 1)

    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
    print("x_train is: ", x_train)
    print("y_train is: ", y_train)
    print("x_test is: ", x_test)
    print("y_test is: ", y_test)


train_split_test_demo()

The output is:

x_train is:  [[39 29]
 [65 32]
 [51 29]
 [19 64]
 [94 98]
 [42 26]
 [13 63]
 [95 52]]
y_train is:  [[1]
 [1]
 [0]
 [0]
 [1]
 [0]
 [1]
 [0]]
x_test is:  [[20 91]
 [55 32]]
y_test is:  [[1]
 [1]]

3.Exhaustive Cross Validation

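Exhaustive cross validation iterates over every non-empty proper subset A of the full data set X. If X contains n samples, the number of non-empty proper subsets is $2^n - 2$, so the cost is exponential, which is clearly unacceptable in practice.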
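
The count can be checked by brute force on a tiny set. This is only a sketch; the helper name nonempty_proper_subsets and the sample data are made up for illustration.

```python
from itertools import combinations


def nonempty_proper_subsets(items):
    """Yield every subset A of items with 0 < len(A) < len(items)."""
    for size in range(1, len(items)):
        yield from combinations(items, size)


data = ["a", "b", "c", "d"]  # n = 4
subsets = list(nonempty_proper_subsets(data))
print(len(subsets), 2 ** len(data) - 2)

# The count explodes quickly as n grows.
for n in (10, 20, 30):
    print(n, 2 ** n - 2)
```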

4.Leave-p-out Cross Validation

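Leave-p-out validation holds out p elements of the full data set X as the test set and uses the remaining n-p elements as the training set. From basic combinatorics, the number of ways to do this is $\binom{n}{p} = \frac{n!}{p!\,(n-p)!}$. For p=1 this is exactly n. For larger p the formula shows that the count grows very quickly, so this is not acceptable in practice either.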
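
sklearn ships a LeavePOut splitter, so the count can be compared against the formula directly. A small sketch, assuming 6 samples and p=2 (both values are arbitrary):

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeavePOut

X = np.arange(12).reshape(6, 2)  # n = 6 samples
p = 2

lpo = LeavePOut(p=p)
n_splits = lpo.get_n_splits(X)
print(n_splits, comb(len(X), p))  # both are C(6, 2)

# Every split holds out exactly p samples.
splits = list(lpo.split(X))
first_train, first_test = splits[0]
print(first_train, first_test)
```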

5.KFold

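In practice, KFold Cross Validation is what we use most. K-fold cross validation splits the samples into K parts (usually of equal size). In each round, K-1 parts are used as the training set and the remaining part as the test set. The rounds continue until every part has served as the test set exactly once. Finally, we use the results to choose the best model and parameters.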

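Compared with train_test_split, KFold Cross Validation reduces the chance effects of a single random split, which gives a more reliable estimate of how well the model generalizes.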
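
The "every part is the test set exactly once" property can be checked with a short sketch. The array shapes here are arbitrary placeholders.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)
kfold = KFold(n_splits=5)

# Count how many times each sample ends up in a test fold.
test_counts = np.zeros(len(X), dtype=int)
overlaps = []
for train_index, test_index in kfold.split(X):
    test_counts[test_index] += 1
    # Training and test indices should never overlap within one round.
    overlaps.append(len(np.intersect1d(train_index, test_index)))

print(test_counts)
print(overlaps)
```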

class KFold(_BaseKFold):
    """K-Folds cross-validator

    Provides train/test indices to split data in train/test sets. Split
    dataset into k consecutive folds (without shuffling by default).

    Each fold is then used once as a validation while the k - 1 remaining
    folds form the training set.

    Read more in the :ref:`User Guide <cross_validation>`.

    Parameters
    ----------
    n_splits : int, default=5
        Number of folds. Must be at least 2.

        .. versionchanged:: 0.22
            ``n_splits`` default value changed from 3 to 5.

    shuffle : boolean, optional
        Whether to shuffle the data before splitting into batches.

    random_state : int, RandomState instance or None, optional, default=None
        If int, random_state is the seed used by the random number generator;
        If RandomState instance, random_state is the random number generator;
        If None, the random number generator is the RandomState instance used
        by `np.random`. Only used when ``shuffle`` is True. This should be left
        to None if ``shuffle`` is False.

The docstring of KFold explains clearly how it works.
Again, a short piece of code to test it.

def kfoldtest():
    from sklearn.model_selection import KFold
    import numpy as np

    kfold = KFold(n_splits=5)
    x = np.random.randint(1, 100, 10).reshape(5, 2)
    y = np.random.randint(0, 2, 5).reshape(5, 1)

    for train_index, test_index in kfold.split(x, y):
        print("index is: ", train_index, test_index)
        print("train data is: ", x[train_index], y[train_index])
        print("test data is: ", x[test_index], y[test_index])
        print()

kfoldtest()
index is:  [1 2 3 4] [0]
train data is:  [[87 84]
 [63 12]
 [73  8]
 [ 1 96]] [[1]
 [0]
 [1]
 [1]]
test data is:  [[92  7]] [[1]]

index is:  [0 2 3 4] [1]
train data is:  [[92  7]
 [63 12]
 [73  8]
 [ 1 96]] [[1]
 [0]
 [1]
 [1]]
test data is:  [[87 84]] [[1]]

index is:  [0 1 3 4] [2]
train data is:  [[92  7]
 [87 84]
 [73  8]
 [ 1 96]] [[1]
 [1]
 [1]
 [1]]
test data is:  [[63 12]] [[0]]

index is:  [0 1 2 4] [3]
train data is:  [[92  7]
 [87 84]
 [63 12]
 [ 1 96]] [[1]
 [1]
 [0]
 [1]]
test data is:  [[73  8]] [[1]]

index is:  [0 1 2 3] [4]
train data is:  [[92  7]
 [87 84]
 [63 12]
 [73  8]] [[1]
 [1]
 [0]
 [1]]
test data is:  [[ 1 96]] [[1]]

If we open the source of the split method:

def split(self, X, y=None, groups=None):
        """Generate indices to split data into training and test set.

        Parameters
        ----------
        X : array-like, shape (n_samples, n_features)
            Training data, where n_samples is the number of samples
            and n_features is the number of features.

        y : array-like, shape (n_samples,)
            The target variable for supervised learning problems.

        groups : array-like, with shape (n_samples,), optional
            Group labels for the samples used while splitting the dataset into
            train/test set.

        Yields
        ------
        train : ndarray
            The training set indices for that split.

        test : ndarray
            The testing set indices for that split.
        """

As you can see, split returns a generator, and what it yields are the indices of the training set and the test set.
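
A quick way to confirm this is to inspect the returned object. A minimal sketch with placeholder data:

```python
import types

import numpy as np
from sklearn.model_selection import KFold

x = np.arange(10).reshape(5, 2)
splits = KFold(n_splits=5).split(x)

# Nothing is computed until we iterate.
is_generator = isinstance(splits, types.GeneratorType)
print(is_generator)

# Pull the first split by hand; the rest stay in the generator.
train_index, test_index = next(splits)
print(train_index, test_index)

remaining = list(splits)
print(len(remaining))
```

Because it is a generator, it can only be consumed once. Call split again if the folds are needed a second time.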

6.StratifiedKFold

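One problem with KFold is that it ignores the class labels when splitting. In many real scenarios the samples are imbalanced. CTR prediction is a typical case: positive samples are rare and the vast majority are negative. With KFold, one or more folds can easily end up containing only negatives and no positives. For imbalanced data we can use the stratified variant StratifiedKFold, which keeps roughly the same class proportions in every fold as in the original data set.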

class StratifiedKFold(_BaseKFold):
    """Stratified K-Folds cross-validator

    Provides train/test indices to split data in train/test sets.

    This cross-validation object is a variation of KFold that returns
    stratified folds. The folds are made by preserving the percentage of
    samples for each class.

    Read more in the :ref:`User Guide <cross_validation>`.

The key sentence is this one:
The folds are made by preserving the percentage of samples for each class.

def stratifiedtest():
    from sklearn.model_selection import StratifiedKFold
    import numpy as np

    x = np.random.randint(1, 100, 20).reshape(10, 2)
    y = np.random.randint(0, 2, 10).reshape(10, 1)
    # random_state only takes effect with shuffle=True; recent scikit-learn versions raise an error otherwise
    skfold = StratifiedKFold(n_splits=5)
    for train_index, test_index in skfold.split(x, y):
        print("index is: ", train_index, test_index)
        print("train data is: ", x[train_index], y[train_index])
        print("test data is: ", x[test_index], y[test_index])
        print()

stratifiedtest()
index is:  [1 3 4 5 6 7 8 9] [0 2]
train data is:  [[99 64]
 [29 77]
 [ 6 46]
 [88 34]
 [96 82]
 [93  2]
 [80 13]
 [14 49]] [[1]
 [0]
 [1]
 [1]
 [1]
 [0]
 [0]
 [0]]
test data is:  [[93 52]
 [ 2 62]] [[0]
 [0]]

index is:  [0 2 4 5 6 7 8 9] [1 3]
train data is:  [[93 52]
 [ 2 62]
 [ 6 46]
 [88 34]
 [96 82]
 [93  2]
 [80 13]
 [14 49]] [[0]
 [0]
 [1]
 [1]
 [1]
 [0]
 [0]
 [0]]
test data is:  [[99 64]
 [29 77]] [[1]
 [0]]

index is:  [0 1 2 3 5 6 8 9] [4 7]
train data is:  [[93 52]
 [99 64]
 [ 2 62]
 [29 77]
 [88 34]
 [96 82]
 [80 13]
 [14 49]] [[0]
 [1]
 [0]
 [0]
 [1]
 [1]
 [0]
 [0]]
test data is:  [[ 6 46]
 [93  2]] [[1]
 [0]]

index is:  [0 1 2 3 4 6 7 9] [5 8]
train data is:  [[93 52]
 [99 64]
 [ 2 62]
 [29 77]
 [ 6 46]
 [96 82]
 [93  2]
 [14 49]] [[0]
 [1]
 [0]
 [0]
 [1]
 [1]
 [0]
 [0]]
test data is:  [[88 34]
 [80 13]] [[1]
 [0]]

index is:  [0 1 2 3 4 5 7 8] [6 9]
train data is:  [[93 52]
 [99 64]
 [ 2 62]
 [29 77]
 [ 6 46]
 [88 34]
 [93  2]
 [80 13]] [[0]
 [1]
 [0]
 [0]
 [1]
 [1]
 [0]
 [0]]
test data is:  [[96 82]
 [14 49]] [[1]
 [0]]
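
To see the difference from KFold more clearly, here is a sketch on a deliberately imbalanced label vector. The 90/10 split and the sorted label order are assumptions picked to exaggerate the effect.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# 90 negatives followed by 10 positives; the features are irrelevant here.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))


def positives_per_fold(splitter):
    """Count the positive samples in each test fold."""
    return [int(y[test_index].sum()) for _, test_index in splitter.split(X, y)]


kfold_counts = positives_per_fold(KFold(n_splits=5))
stratified_counts = positives_per_fold(StratifiedKFold(n_splits=5))

print("KFold:          ", kfold_counts)
print("StratifiedKFold:", stratified_counts)
```

With plain KFold all the positives land in the last fold, while StratifiedKFold spreads them evenly.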

7.cross_validate

Finally, we get to cross_validate.

def cross_validate(estimator, X, y=None, groups=None, scoring=None, cv=None,
                   n_jobs=None, verbose=0, fit_params=None,
                   pre_dispatch='2*n_jobs', return_train_score=False,
                   return_estimator=False, error_score=np.nan):
    """Evaluate metric(s) by cross-validation and also record fit/score times.

    Read more in the :ref:`User Guide <multimetric_cross_validation>`.

    Parameters
    ----------
    estimator : estimator object implementing 'fit'
        The object to use to fit the data.

    X : array-like
        The data to fit. Can be for example a list, or an array.

    y : array-like, optional, default: None
        The target variable to try to predict in the case of
        supervised learning.

    groups : array-like, with shape (n_samples,), optional
        Group labels for the samples used while splitting the dataset into
        train/test set. Only used in conjunction with a "Group" :term:`cv`
        instance (e.g., :class:`GroupKFold`).

    scoring : string, callable, list/tuple, dict or None, default: None
        A single string (see :ref:`scoring_parameter`) or a callable
        (see :ref:`scoring`) to evaluate the predictions on the test set.

        For evaluating multiple metrics, either give a list of (unique) strings
        or a dict with names as keys and callables as values.

        NOTE that when using custom scorers, each scorer should return a single
        value. Metric functions returning a list/array of values can be wrapped
        into multiple scorers that return one value each.

        See :ref:`multimetric_grid_search` for an example.

        If None, the estimator's score method is used.

    cv : int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:

        - None, to use the default 5-fold cross validation,
        - integer, to specify the number of folds in a `(Stratified)KFold`,
        - :term:`CV splitter`,
        - An iterable yielding (train, test) splits as arrays of indices.

        For integer/None inputs, if the estimator is a classifier and ``y`` is
        either binary or multiclass, :class:`StratifiedKFold` is used. In all
        other cases, :class:`KFold` is used.

        Refer :ref:`User Guide <cross_validation>` for the various
        cross-validation strategies that can be used here.

        .. versionchanged:: 0.22
            ``cv`` default value if None changed from 3-fold to 5-fold.
      ...............

First, a simple test:

def cvtest():
    from sklearn import datasets, svm
    from sklearn.model_selection import cross_val_score, cross_validate

    iris = datasets.load_iris()
    clf = svm.SVC(kernel='linear', C=1)
    scores = cross_val_score(clf, iris.data, iris.target, cv=5)
    print('scores for kfold=5: ', scores)

    scoring = ['precision_macro', 'recall_macro']
    scores = cross_validate(clf, iris.data, iris.target, scoring=scoring, cv=5, return_train_score=True)
    # scores is a dict holding the fit times, the score times, and the test/train scores of each metric
    print('results:', scores)

cvtest()
scores for kfold=5:  [0.96666667 1.         0.96666667 0.96666667 1.        ]
results: {'fit_time': array([0.00034523, 0.00037003, 0.00032854, 0.00036478, 0.00033736]), 'score_time': array([0.00103903, 0.00091934, 0.00091314, 0.000916  , 0.00120878]), 'test_precision_macro': array([0.96969697, 1.        , 0.96969697, 0.96969697, 1.        ]), 'train_precision_macro': array([0.97674419, 0.97674419, 0.99186992, 0.98412698, 0.98333333]), 'test_recall_macro': array([0.96666667, 1.        , 0.96666667, 0.96666667, 1.        ]), 'train_recall_macro': array([0.975     , 0.975     , 0.99166667, 0.98333333, 0.98333333])}
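
The raw dict is hard to read, so in practice the per-fold scores are usually reduced to a mean and a standard deviation. A minimal sketch of that; the summary dict is just one possible layout, not something cross_validate provides.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_validate

iris = datasets.load_iris()
clf = svm.SVC(kernel="linear", C=1)
results = cross_validate(clf, iris.data, iris.target,
                         scoring=["precision_macro", "recall_macro"], cv=5)

# Keep only the test metrics and reduce each to (mean, std) over the 5 folds.
summary = {name: (values.mean(), values.std())
           for name, values in results.items() if name.startswith("test_")}

for name, (mean, std) in summary.items():
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```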

Note the cv parameter:
int, cross-validation generator or an iterable

For integer/None inputs, if the estimator is a classifier and ``y`` 
is either binary or multiclass, :class:`StratifiedKFold` is used. In all
other cases, :class:`KFold` is used.

If cv is an integer or is not passed, and the estimator is a classifier with a binary or multiclass y, then StratifiedKFold is used. In all other cases KFold is used.
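
This rule can be observed with check_cv from sklearn.model_selection, which resolves a cv argument into a splitter object. As far as I know it is the same helper cross_validate relies on internally, but treat that as an assumption. The label arrays below are placeholders.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, check_cv

y_class = np.array([0, 1] * 5)          # binary labels
y_cont = np.linspace(0.0, 1.0, 10)      # continuous target

# Integer cv + classifier + binary y -> stratified splitter.
cv_classifier = check_cv(5, y_class, classifier=True)
# Integer cv + non-classifier -> plain KFold.
cv_regressor = check_cv(5, y_cont, classifier=False)

print(type(cv_classifier).__name__)
print(type(cv_regressor).__name__)
```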