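There are three ways to evaluate the quality of a model's predictions: the estimator's score method, the scoring parameter of the cross-validation tools, and the metric functions in sklearn.metrics.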

Finally, Dummy estimators are introduced. They provide random-guess strategies that can serve as a baseline for judging prediction quality.

(See section 6.)

See also

 

For “pairwise” metrics, between samples and not estimators or predictions, see the ​Pairwise metrics, Affinities and Kernels ​ section.


Details will be filled in when time permits.


1、The scoring parameter: defining model evaluation rules

Model selection and evaluation using tools, such as ​​grid_search.GridSearchCV​​ and ​​cross_validation.cross_val_score​​, take a scoring parameter that controls what metric they apply to the estimators evaluated.

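1)Predefined values

All scorers follow the convention that higher values are better. Metrics that measure the distance between the model and the data, such as mean_absolute_error and mean_squared_error, are therefore returned as negative values.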

Scoring                   Function                          Comment

Classification
‘accuracy’                metrics.accuracy_score
‘average_precision’       metrics.average_precision_score
‘f1’                      metrics.f1_score                  for binary targets
‘f1_micro’                metrics.f1_score                  micro-averaged
‘f1_macro’                metrics.f1_score                  macro-averaged
‘f1_weighted’             metrics.f1_score                  weighted average
‘f1_samples’              metrics.f1_score                  by multilabel sample
‘log_loss’                metrics.log_loss                  requires predict_proba support
‘precision’ etc.          metrics.precision_score           suffixes apply as with ‘f1’
‘recall’ etc.             metrics.recall_score              suffixes apply as with ‘f1’
‘roc_auc’                 metrics.roc_auc_score

Clustering
‘adjusted_rand_score’     metrics.adjusted_rand_score

Regression
‘mean_absolute_error’     metrics.mean_absolute_error
‘mean_squared_error’      metrics.mean_squared_error
‘median_absolute_error’   metrics.median_absolute_error
‘r2’                      metrics.r2_score

An example:


>>> from sklearn import svm, cross_validation, datasets
>>> iris = datasets.load_iris()
>>> X, y = iris.data, iris.target
>>> model = svm.SVC()
>>> cross_validation.cross_val_score(model, X, y, scoring='wrong_choice')
Traceback (most recent call last):
ValueError: 'wrong_choice' is not a valid scoring value. Valid options are ['accuracy', 'adjusted_rand_score', 'average_precision', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'log_loss', 'mean_absolute_error', 'mean_squared_error', 'median_absolute_error', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc']
>>> clf = svm.SVC(probability=True, random_state=0)
>>> cross_validation.cross_val_score(clf, X, y, scoring='log_loss')
array([-0.07..., -0.16..., -0.06...])


3)Implementing your own scoring object

For a callable to be a scorer, it must follow two rules:

  • It can be called with parameters (estimator, X, y), where estimator is the model that should be evaluated, X is validation data, and y is the ground truth target for X (in the supervised case) or None (in the unsupervised case).
  • It returns a floating point number that quantifies the estimator prediction quality on X, with reference to y. Again, by convention higher numbers are better, so if your scorer returns loss, that value should be negated.
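The two rules above can be sketched without any library code. The example below is a minimal illustration only: `MeanPredictor` is a hypothetical stand-in for a fitted estimator, and `neg_mae_scorer` is an illustrative name, not part of scikit-learn. Because mean absolute error is a loss, the scorer negates it so that higher values are better.

```python
class MeanPredictor:
    """Hypothetical stand-in for an estimator: always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def neg_mae_scorer(estimator, X, y):
    """Scorer following the (estimator, X, y) protocol.

    Mean absolute error is a loss, so it is negated: higher is better.
    """
    predictions = estimator.predict(X)
    mae = sum(abs(t - p) for t, p in zip(y, predictions)) / len(y)
    return -mae


# Toy data, made up for this illustration.
X_train, y_train = [[0], [1], [2], [3]], [1.0, 2.0, 3.0, 4.0]
X_valid, y_valid = [[4], [5]], [2.0, 4.0]

model = MeanPredictor().fit(X_train, y_train)
score = neg_mae_scorer(model, X_valid, y_valid)
print(score)  # -1.0
```

A callable with this signature can be passed as the scoring argument in place of one of the predefined strings; the estimator would then have to be a real scikit-learn estimator rather than this toy class.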






2、Classification metrics

The ​​sklearn.metrics​​ module implements several loss, score, and utility functions to measure classification performance.



Some of these are restricted to the binary classification case:

  • matthews_corrcoef(y_true, y_pred): Compute the Matthews correlation coefficient (MCC) for binary classes.
  • precision_recall_curve(y_true, probas_pred): Compute precision-recall pairs for different probability thresholds.
  • roc_curve(y_true, y_score[, pos_label, ...]): Compute Receiver operating characteristic (ROC).

Others also work in the multiclass case:

  • confusion_matrix(y_true, y_pred[, labels]): Compute confusion matrix to evaluate the accuracy of a classification.
  • hinge_loss(y_true, pred_decision[, labels, ...]): Average hinge loss (non-regularized).

Some also work in the multilabel case:

  • accuracy_score(y_true, y_pred[, normalize, ...]): Accuracy classification score.
  • classification_report(y_true, y_pred[, ...]): Build a text report showing the main classification metrics.
  • f1_score(y_true, y_pred[, labels, ...]): Compute the F1 score, also known as balanced F-score or F-measure.
  • fbeta_score(y_true, y_pred, beta[, labels, ...]): Compute the F-beta score.
  • hamming_loss(y_true, y_pred[, classes]): Compute the average Hamming loss.
  • jaccard_similarity_score(y_true, y_pred[, ...]): Jaccard similarity coefficient score.
  • log_loss(y_true, y_pred[, eps, normalize, ...]): Log loss, aka logistic loss or cross-entropy loss.
  • precision_recall_fscore_support(y_true, y_pred): Compute precision, recall, F-measure and support for each class.
  • precision_score(y_true, y_pred[, labels, ...]): Compute the precision.
  • recall_score(y_true, y_pred[, labels, ...]): Compute the recall.
  • zero_one_loss(y_true, y_pred[, normalize, ...]): Zero-one classification loss.

And some work with binary and multilabel (but not multiclass) problems:

  • average_precision_score(y_true, y_score[, ...]): Compute average precision (AP) from prediction scores.
  • roc_auc_score(y_true, y_score[, average, ...]): Compute Area Under the Curve (AUC) from prediction scores.

In the following sub-sections, we will describe each of those functions, preceded by some notes on common API and metric definition.



2)accuracy score:

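The accuracy_score function computes the accuracy. By default it returns the fraction of correct predictions; with normalize=False it returns the number of correct predictions instead. An example makes this clear: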

>>> import numpy as np
>>> from sklearn.metrics import accuracy_score
>>> y_pred = [0, 2, 1, 3]
>>> y_true = [0, 1, 2, 3]
>>> accuracy_score(y_true, y_pred)
0.5
>>> accuracy_score(y_true, y_pred, normalize=False)
2


In multilabel classification, a sample counts as correct only if every one of its labels is predicted correctly.

For example:

>>> accuracy_score(np.array([[0, 1], [1, 1]]), np.ones((2, 2)))
0.5
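To make the counting rule concrete, here is a plain-Python sketch of the same computation. The function name `accuracy` is illustrative and this is not scikit-learn's implementation; it only reproduces the numbers from the two examples above.

```python
def accuracy(y_true, y_pred, normalize=True):
    """Fraction (or count, if normalize is False) of exactly matching samples.

    For multilabel input each sample is a list of labels, and the whole
    list must match for the sample to count as correct.
    """
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true) if normalize else correct


# Multiclass case: two of the four predictions are right.
multiclass_fraction = accuracy([0, 1, 2, 3], [0, 2, 1, 3])
multiclass_count = accuracy([0, 1, 2, 3], [0, 2, 1, 3], normalize=False)

# Multilabel case: only the second sample has every label right.
multilabel_fraction = accuracy([[0, 1], [1, 1]], [[1, 1], [1, 1]])

print(multiclass_fraction, multiclass_count, multilabel_fraction)  # 0.5 2 0.5
```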



3)confusion matrix:

The confusion_matrix function evaluates classification accuracy by computing the confusion matrix. An example:

>>> from sklearn.metrics import confusion_matrix
>>> y_true = [2, 0, 2, 2, 0, 1]
>>> y_pred = [0, 0, 2, 2, 0, 2]
>>> confusion_matrix(y_true, y_pred)
array([[2, 0, 0],
       [0, 0, 1],
       [1, 0, 2]])

(Note: rows correspond to true labels and columns to predicted labels.)
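The row and column convention can be checked by building the matrix by hand. The sketch below uses only plain Python; `confusion_counts` is an illustrative helper, not a scikit-learn function, and it assumes the labels are sorted in ascending order.

```python
def confusion_counts(y_true, y_pred):
    """Return (labels, matrix) where matrix[i][j] counts the samples whose
    true label is labels[i] and whose predicted label is labels[j]."""
    labels = sorted(set(y_true) | set(y_pred))
    index = {label: k for k, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1  # row = true label, column = prediction
    return labels, matrix


y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
labels, matrix = confusion_counts(y_true, y_pred)
for row in matrix:
    print(row)
# [2, 0, 0]
# [0, 0, 1]
# [1, 0, 2]
```

Reading the last row: of the three samples whose true label is 2, one was predicted as 0 and two were predicted as 2.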



4)classification report:

The classification_report function builds a text report showing the main classification metrics. An example:

>>> from sklearn.metrics import classification_report
>>> y_true = [0, 1, 2, 2, 0]
>>> y_pred = [0, 0, 2, 2, 0]
>>> target_names = ['class 0', 'class 1', 'class 2']
>>> print(classification_report(y_true, y_pred, target_names=target_names))
             precision    recall  f1-score   support

    class 0       0.67      1.00      0.80         2
    class 1       0.00      0.00      0.00         1
    class 2       1.00      1.00      1.00         2

avg / total       0.67      0.80      0.72         5




The metrics below are used less often, so they are only listed briefly, without much explanation or translation:

5)hamming loss:

If $\hat{y}_j$ is the predicted value for the $j$-th label of a given sample, $y_j$ is the corresponding true value, and $n_\text{labels}$ is the number of classes or labels, then the Hamming loss $L_{Hamming}$ between two samples is defined as:

$$L_{Hamming}(y, \hat{y}) = \frac{1}{n_\text{labels}} \sum_{j=0}^{n_\text{labels} - 1} 1(\hat{y}_j \neq y_j)$$

where $1(x)$ is the indicator function.
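In words, the Hamming loss between two label vectors is the fraction of label positions that disagree. A minimal plain-Python sketch (the function name and the data are illustrative):

```python
def hamming_loss_sample(y_true, y_pred):
    """Fraction of label positions where prediction and truth differ."""
    mismatches = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return mismatches / len(y_true)


# Four labels, two of them predicted wrongly (positions 1 and 3).
loss = hamming_loss_sample([1, 0, 1, 1], [1, 1, 1, 0])
print(loss)  # 0.5
```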

6)jaccard similarity coefficient score:

The Jaccard similarity coefficient of the $i$-th sample, with a ground truth label set $y_i$ and predicted label set $\hat{y}_i$, is defined as

$$J(y_i, \hat{y}_i) = \frac{|y_i \cap \hat{y}_i|}{|y_i \cup \hat{y}_i|}$$
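The Jaccard coefficient of one sample is the size of the intersection of the two label sets divided by the size of their union. A small sketch with Python sets follows; the function name is illustrative, and returning 1.0 for two empty sets is an assumption made here, not something the text specifies.

```python
def jaccard(true_labels, predicted_labels):
    """Size of the intersection divided by size of the union of two label sets."""
    true_set, pred_set = set(true_labels), set(predicted_labels)
    union = true_set | pred_set
    if not union:
        return 1.0  # assumption: two empty label sets count as identical
    return len(true_set & pred_set) / len(union)


# Shared labels {1, 2}, union {0, 1, 2, 3}.
similarity = jaccard([0, 1, 2], [1, 2, 3])
print(similarity)  # 0.5
```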

7)precision, recall, F-measures:

Several functions allow you to analyze the precision, recall and F-measures score:

  • average_precision_score(y_true, y_score[, ...]): Compute average precision (AP) from prediction scores.
  • f1_score(y_true, y_pred[, labels, ...]): Compute the F1 score, also known as balanced F-score or F-measure.
  • fbeta_score(y_true, y_pred, beta[, labels, ...]): Compute the F-beta score.
  • precision_recall_curve(y_true, probas_pred): Compute precision-recall pairs for different probability thresholds.
  • precision_recall_fscore_support(y_true, y_pred): Compute precision, recall, F-measure and support for each class.
  • precision_score(y_true, y_pred[, labels, ...]): Compute the precision.
  • recall_score(y_true, y_pred[, labels, ...]): Compute the recall.

Note that the ​​precision_recall_curve​​ function is restricted to the binary case. The ​ ​average_precision_score​​ function works only in binary classification and multilabel indicator format.
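For a single class treated as positive, these quantities follow from the counts of true positives, false positives and false negatives. The sketch below is a plain-Python illustration, not scikit-learn's implementation; the function name is made up, and returning 0.0 when a denominator is zero is an assumption of this sketch. It reuses the data from the classification report example with class 0 as the positive class.

```python
def precision_recall_fbeta(y_true, y_pred, positive, beta=1.0):
    """Precision, recall and F-beta for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Assumption: undefined ratios (zero denominator) are reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = beta ** 2 * precision + recall
    fbeta = (1 + beta ** 2) * precision * recall / denominator if denominator else 0.0
    return precision, recall, fbeta


y_true = [0, 1, 2, 2, 0]
y_pred = [0, 0, 2, 2, 0]
precision, recall, f1 = precision_recall_fbeta(y_true, y_pred, positive=0)
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.67 1.0 0.8
```

The printed values match the class 0 row of the classification report shown earlier.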

8)hinge loss:

9)log loss:

10)matthews correlation coefficient:

11)receiver operating characteristic(ROC):

12)zero one loss:

3、Multilabel ranking metrics


In multilabel learning, each sample can have any number of ground truth labels associated with it. The goal is to give high scores and better rank to the ground truth labels.

1)coverage error:
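Coverage error is usually described as how far down the ranked list of scores one has to go, on average, to cover all true labels of a sample. The sketch below follows that description in plain Python; the function name and the data are illustrative, and it assumes every sample has at least one true label.

```python
def coverage_error_sketch(y_true, y_score):
    """Average number of top-scored labels needed to include every true label."""
    total = 0
    for truth, scores in zip(y_true, y_score):
        # Score of the worst-ranked true label of this sample.
        lowest_true_score = min(s for t, s in zip(truth, scores) if t)
        # Its rank: how many labels score at least as high.
        total += sum(1 for s in scores if s >= lowest_true_score)
    return total / len(y_true)


y_true = [[1, 0, 0], [0, 0, 1]]
y_score = [[0.75, 0.5, 1.0], [1.0, 0.2, 0.1]]
coverage = coverage_error_sketch(y_true, y_score)
print(coverage)  # 2.5
```

In the first sample the true label is ranked second; in the second sample it is ranked last (third), which gives an average of 2.5.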

2)label ranking average precision:



4、Regression metrics

The sklearn.metrics module implements several loss, score, and utility functions to measure regression performance.

Some of those have been enhanced to handle the multioutput case: ​​mean_absolute_error​​, ​​mean_squared_error​​, ​​median_absolute_error​​ and ​​r2_score​​.

1)explained variance score:

If $\hat{y}$ is the estimated target output, $y$ the corresponding (correct) target output, and $Var$ is Variance, the square of the standard deviation, then the explained variance is estimated as follows:

$$\text{explained\_variance}(y, \hat{y}) = 1 - \frac{Var\{y - \hat{y}\}}{Var\{y\}}$$
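Explained variance compares the variance of the residuals with the variance of the targets. The following plain-Python sketch shows the arithmetic; the helper names and the toy data are illustrative, and the population variance (dividing by the number of samples) is assumed.

```python
def variance(values):
    """Population variance: mean squared deviation from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def explained_variance(y_true, y_pred):
    """1 - Var(residuals) / Var(targets)."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1 - variance(residuals) / variance(y_true)


y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
ev = explained_variance(y_true, y_pred)
print(round(ev, 3))  # 0.957

# A constant offset leaves the residual variance at zero, so the score stays 1.
shifted = explained_variance(y_true, [t + 1 for t in y_true])
```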

2)mean absolute error:

If $\hat{y}_i$ is the predicted value of the $i$-th sample, and $y_i$ is the corresponding true value, then the mean absolute error (MAE) estimated over $n_\text{samples}$ is defined as

$$\text{MAE}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples} - 1} \left| y_i - \hat{y}_i \right|$$

3)mean squared error:

If $\hat{y}_i$ is the predicted value of the $i$-th sample, and $y_i$ is the corresponding true value, then the mean squared error (MSE) estimated over $n_\text{samples}$ is defined as

$$\text{MSE}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples} - 1} (y_i - \hat{y}_i)^2$$

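4)R² score, the coefficient of determination:

If $\hat{y}_i$ is the predicted value of the $i$-th sample and $y_i$ is the corresponding true value, then the score R² estimated over $n_\text{samples}$ is defined as

$$R^2(y, \hat{y}) = 1 - \frac{\sum_{i=0}^{n_\text{samples} - 1} (y_i - \hat{y}_i)^2}{\sum_{i=0}^{n_\text{samples} - 1} (y_i - \bar{y})^2}$$

where $\bar{y} = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples} - 1} y_i$.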
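The three error-based regression metrics in this section (MAE, MSE and R²) can be computed side by side in a few lines. This is a plain-Python sketch with illustrative function names and toy data, not scikit-learn's implementation.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares around the mean)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(mae, mse, round(r2, 3))  # 0.5 0.375 0.949
```

MAE and MSE are losses (lower is better), while R² is a score (higher is better, with 1.0 for a perfect fit).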


5、Clustering metrics

The ​​sklearn.metrics​​ module implements several loss, score, and utility functions. For more information see the ​ Clustering performance evaluation ​ section for instance clustering, and ​Biclustering evaluation ​ for biclustering.


6、Dummy estimators



In supervised learning, comparing against randomly generated predictions is a very easy baseline check.

DummyClassifier provides several simple strategies for producing such results:


  • stratified generates random predictions by respecting the training set class distribution.
  • most_frequent always predicts the most frequent label in the training set.
  • uniform generates predictions uniformly at random.
  • constant always predicts a constant label that is provided by the user. (A major motivation of this method is F1-scoring, when the positive class is in the minority.)

Note that with all these strategies, the predict method completely ignores the input data!
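To see why such a baseline ignores the input data, here is a minimal plain-Python sketch of the most_frequent idea. The class `MostFrequentBaseline` is hypothetical and is not scikit-learn's DummyClassifier; it only mimics the fit/predict/score pattern on made-up data.

```python
from collections import Counter


class MostFrequentBaseline:
    """Illustrative baseline: always predicts the most common training label."""

    def fit(self, X, y):
        # Only the labels are used; the features in X are never looked at.
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]

    def score(self, X, y):
        """Accuracy of the constant prediction."""
        predictions = self.predict(X)
        return sum(1 for t, p in zip(y, predictions) if t == p) / len(y)


# Imbalanced toy labels: three samples of class -1, one of class 1.
X_train, y_train = [[0.1], [0.4], [0.2], [0.9]], [-1, -1, -1, 1]
X_test, y_test = [[0.3], [0.8], [0.5], [0.0]], [-1, 1, -1, -1]

baseline = MostFrequentBaseline().fit(X_train, y_train)
baseline_accuracy = baseline.score(X_test, y_test)
print(baseline.label_, baseline_accuracy)  # -1 0.75
```

On imbalanced data the constant prediction already reaches a high accuracy, which is why a real classifier should be compared against it.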

A simple example:

First, let’s create an imbalanced dataset:




>>> from sklearn.datasets import load_iris
>>> from sklearn.cross_validation import train_test_split
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> y[y != 1] = -1
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)


Next, let’s compare the accuracy of SVC and most_frequent:




>>> from sklearn.dummy import DummyClassifier
>>> from sklearn.svm import SVC
>>> clf = SVC(kernel='linear', C=1).fit(X_train, y_train)
>>> clf.score(X_test, y_test)
0.63...
>>> clf = DummyClassifier(strategy='most_frequent',random_state=0)
>>> clf.fit(X_train, y_train)
DummyClassifier(constant=None, random_state=0, strategy='most_frequent')
>>> clf.score(X_test, y_test)
0.57...


We see that SVC doesn’t do much better than a dummy classifier. Now, let’s change the kernel:




>>> clf = SVC(kernel='rbf', C=1).fit(X_train, y_train)
>>> clf.score(X_test, y_test)
0.97...


Likewise, for regression problems:

​DummyRegressor​​ also implements four simple rules of thumb for regression:


  • mean always predicts the mean of the training targets.
  • median always predicts the median of the training targets.
  • quantile always predicts a user provided quantile of the training targets.
  • constant always predicts a constant value that is provided by the user.

In all these strategies, the predict method completely ignores the input data.
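The regression strategies can be sketched the same way. The helper below is a hypothetical plain-Python illustration, not scikit-learn's DummyRegressor; it covers the mean, median and constant rules and leaves out quantile, whose interpolation details are not given here.

```python
import statistics


def fit_constant_baseline(y_train, strategy="mean", constant=None):
    """Return a predict function that outputs one fixed value for any input."""
    if strategy == "mean":
        value = statistics.mean(y_train)
    elif strategy == "median":
        value = statistics.median(y_train)
    elif strategy == "constant":
        value = constant
    else:
        raise ValueError("unknown strategy: %r" % strategy)
    # The returned function ignores the content of X and only uses its length.
    return lambda X: [value for _ in X]


y_train = [1.0, 2.0, 3.0, 10.0]
X_new = [[0.0], [5.0]]

mean_predictions = fit_constant_baseline(y_train, "mean")(X_new)
median_predictions = fit_constant_baseline(y_train, "median")(X_new)
constant_predictions = fit_constant_baseline(y_train, "constant", constant=7.0)(X_new)
print(mean_predictions, median_predictions, constant_predictions)
# [4.0, 4.0] [2.5, 2.5] [7.0, 7.0]
```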