Overview

Logistic regression is a classification algorithm; it is not a subtype of linear regression.

sklearn: Diagnosing Breast Cancer with Logistic Regression

We use the breast cancer dataset that ships with sklearn and apply logistic regression to diagnose whether a sample is positive or negative. The dataset has 569 samples with 30 features each: 357 positive samples (y=1) and 212 negative samples (y=0).
This example uses 90% of the samples for training and 10% for testing.

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
import numpy as np

cancer = load_breast_cancer()
X = cancer.data
y = cancer.target
print('data shape:{0}, no. positive:{1}, no. negative:{2}'.format(X.shape,
                                                                  y[y==1].shape[0],
                                                                  y[y==0].shape[0]))
print(cancer.data[0])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=3)

# use the liblinear solver explicitly: it was sklearn's historical default
# and converges cleanly on this unscaled data
model = LogisticRegression(solver='liblinear')
model.fit(X_train, y_train)

train_score = model.score(X_train, y_train)
cv_score = model.score(X_test, y_test)

print('train_score:{0:.6f}, cv_score:{1:.6f}'.format(train_score, cv_score))

y_pre = model.predict(X_test)
y_pre_proba = model.predict_proba(X_test)

# count how many test predictions match the ground truth
print('matches:{0}/{1}'.format(np.equal(y_pre, y_test).sum(), y_test.shape[0]))
print('y_pre:{}, \ny_pre_proba:{}'.format(y_pre, y_pre_proba))

There are two prediction methods: predict and predict_proba. The former returns the most likely class for each sample; the latter returns two values per sample, the probabilities of the negative (y=0) and positive (y=1) class respectively (the column order follows model.classes_).
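The two are consistent: taking the column of predict_proba with the highest probability and mapping it through model.classes_ reproduces predict. A minimal check, assuming the fitted model and X_test from the program above:

# Sanity check: predict equals the argmax over the predict_proba columns,
# mapped through model.classes_ (assumes `model` and `X_test` from above).
proba = model.predict_proba(X_test)                       # shape (n_samples, 2)
manual_pre = model.classes_[np.argmax(proba, axis=1)]
print(np.array_equal(manual_pre, model.predict(X_test)))  # True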

The main program's output is:

data shape:(569, 30), no. positive:357, no. negative:212
[1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01 3.001e-01
 1.471e-01 2.419e-01 7.871e-02 1.095e+00 9.053e-01 8.589e+00 1.534e+02
 6.399e-03 4.904e-02 5.373e-02 1.587e-02 3.003e-02 6.193e-03 2.538e+01
 1.733e+01 1.846e+02 2.019e+03 1.622e-01 6.656e-01 7.119e-01 2.654e-01
 4.601e-01 1.189e-01]
train_score:0.955078, cv_score:0.929825
matches:53/57
y_pre:[1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 0 1 1 1 0 1 1 1 0 0 1 0 1 1 1 1 1 1
 1 0 0 0 1 1 0 1 1 1 1 0 1 0 1 1 0 0 1 1], 
y_pre_proba:[[8.21202992e-03 9.91787970e-01]
 [4.04380984e-01 5.95619016e-01]
 [2.42761767e-02 9.75723823e-01]
 [1.77815187e-04 9.99822185e-01]
 [9.99997954e-01 2.04594259e-06]
 [3.00971068e-03 9.96990289e-01]
 [1.63342687e-02 9.83665731e-01]
 [3.09411958e-04 9.99690588e-01]
 [1.71770493e-03 9.98282295e-01]
 [1.47130324e-03 9.98528697e-01]
.....

 [6.32776963e-02 9.36722304e-01]
 [1.77330319e-01 8.22669681e-01]
 [9.76807864e-01 2.31921358e-02]
 [4.93697088e-03 9.95063029e-01]
 [9.99999997e-01 2.51495507e-09]
 [7.23925608e-02 9.27607439e-01]
 [7.50731975e-02 9.24926802e-01]
 [9.99561579e-01 4.38421213e-04]
 [9.99999740e-01 2.59726453e-07]
 [1.36884160e-03 9.98631158e-01]
 [4.93115526e-04 9.99506884e-01]]

Note that the score on the test set is not 1. For a classifier, score is the mean accuracy on the given data: 53 of the 57 test predictions are correct, and 53/57 ≈ 0.929825, which is exactly the cv_score printed above.
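For a classifier, score is the same number that sklearn.metrics.accuracy_score computes; a quick check using the variables from the run above:

# model.score for classifiers is mean accuracy; accuracy_score gives
# the same value (uses y_test and y_pre from the example above).
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pre))   # 0.9298..., identical to cv_score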

Improving Logistic Regression with Polynomial Features

As with linear regression earlier, polynomial features can also be used in logistic regression to improve the model.

Besides the polynomial degree, L1 and L2 regularization are tested as well; with degree-1 and degree-2 polynomials this gives four combinations in total.

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

def polynomial_model(degree=1, **kwargs):
    # chain polynomial feature expansion with logistic regression
    polynomial_features = PolynomialFeatures(degree=degree, include_bias=False)
    logistic_regression = LogisticRegression(**kwargs)
    pipeline = Pipeline([('polynomial_features', polynomial_features),
                         ('logistic_regression', logistic_regression)])
    return pipeline

cancer = load_breast_cancer()
X = cancer.data
y = cancer.target
print('data shape:{0}, no. positive:{1}, no. negative:{2}'.format(X.shape,
                                                                  y[y==1].shape[0],
                                                                  y[y==0].shape[0]))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3)

penaltys = ['l1', 'l2']
for degree in range(1, 3):
    for penalty in penaltys:
        # liblinear supports both the l1 and l2 penalties (newer default
        # solvers such as lbfgs reject penalty='l1')
        model = polynomial_model(degree=degree, penalty=penalty, solver='liblinear')
        model.fit(X_train, y_train)

        train_score = model.score(X_train, y_train)
        cv_score = model.score(X_test, y_test)

        print('penalty:{}, degree:{}, train_score:{:.6f}, cv_score:{:.6f}'.format(penalty, degree, train_score, cv_score))

The test results are:

data shape:(569, 30), no. positive:357, no. negative:212
penalty:l1, degree:1, train_score:0.960440, cv_score:0.929825
penalty:l2, degree:1, train_score:0.958242, cv_score:0.921053
penalty:l1, degree:2, train_score:0.995604, cv_score:0.973684
penalty:l2, degree:2, train_score:0.969231, cv_score:0.938596

As the results show, the combination of L1 regularization and degree-2 polynomial features performs best.

The fitted logistic regression step inside the pipeline exposes a coef_ attribute holding the model parameters. Adding degree-2 polynomial features grows the feature count from 30 to 495, but with the L1 regularization term most of these parameters end up as 0.
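A sketch of how to verify this, refitting the best (L1, degree-2) combination and counting the non-zero entries of coef_ (reuses polynomial_model and the data split defined above):

# Sketch: refit the L1 + degree-2 pipeline and inspect parameter sparsity.
model = polynomial_model(degree=2, penalty='l1', solver='liblinear')
model.fit(X_train, y_train)
coef = model.named_steps['logistic_regression'].coef_   # shape (1, 495)
print('features:{}, non-zero:{}'.format(coef.shape[1], np.count_nonzero(coef)))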

Regularization

Overfitting means the cost function is very small on the training samples but large on new data; the model's generalization is poor.

The main ways to prevent overfitting are reducing the number of features and regularization.

The basic idea of regularization is that smaller parameter values correspond to a simpler hypothesis model, which is less prone to overfitting.

When solving for the hypothesis parameters with gradient descent, regularization is achieved by adding a penalty term, the L1 norm or the L2 norm of the parameters, to the cost function (written out after the list below):

  • With the L1 norm as the regularization term, model parameters become sparse: as many entries of the parameter vector as possible are driven to 0.
  • With the L2 norm as the regularization term, model parameters are kept small, but they generally do not become exactly 0.
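As a concrete form (a standard textbook version of the regularized cost, with m samples, hypothesis h_θ, and regularization weight λ; sklearn's internal objective is scaled differently but is equivalent, with C playing the role of 1/λ as discussed below):

J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[ y^{(i)}\log h_\theta(x^{(i)}) + \big(1-y^{(i)}\big)\log\big(1-h_\theta(x^{(i)})\big) \Big] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^{2}

For the L1 variant, the penalty term is instead \frac{\lambda}{m}\sum_{j=1}^{n}\lvert\theta_j\rvert.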

In most cases, the L2 norm is the regularization term of choice for combating overfitting, and it is also sklearn's default.

Other Parameters

In sklearn, the regularization weight parameter C corresponds to 1/λ in the formula above: the larger C is, the smaller the regularization weight and the more easily the model overfits; the smaller C is, the larger the regularization weight and the more easily the model underfits.
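A short sketch of this effect on the same data split as above (the C values here are arbitrary illustrations, not tuned settings):

# Sketch: vary C (the inverse regularization strength) and compare scores.
# Uses X_train, X_test, y_train, y_test from the examples above; the C
# values are arbitrary illustrations.
for C in [0.01, 1.0, 100.0]:
    model = LogisticRegression(C=C, solver='liblinear')
    model.fit(X_train, y_train)
    print('C:{}, train_score:{:.6f}, cv_score:{:.6f}'.format(
        C, model.score(X_train, y_train), model.score(X_test, y_test)))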