While studying machine learning recently I ran into the concept of a "penalized linear regression" model. What exactly is it? Here is what I found.

Penalized linear regression

Linear regression can be thought of as "curve fitting". It is usually solved with ordinary least squares (OLS), which searches for the parameters that give a good fit to the data, with fit quality typically measured by the mean squared error (MSE):

MSE = (1/n) * Σ_i (y_i − x_i·β)²
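
To make this concrete, here is a minimal sketch of my own (not from the original post) of an OLS fit and its MSE with scikit-learn; the toy data below is made up purely for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

#toy data: 100 samples, 3 attributes (invented for illustration)
rng = np.random.RandomState(0)
X = rng.randn(100, 3)
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.randn(100)

#ordinary least squares fit
ols = LinearRegression()
ols.fit(X, y)

#mean squared error of the fit on the training data
print("MSE:", mean_squared_error(y, ols.predict(X)))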

Fitting, however, comes with the problem of overfitting: if the model fits too aggressively during regression, it will do extremely well on the training data but poorly on the test data, i.e., the model generalizes badly. To avoid this, penalized linear regression was proposed: the model should still fit the training data well, but it must also keep the "energy" (the magnitude) of its coefficients under control.

What counts as fitting the training data well?

A small training error, usually measured as a small MSE.

What is the "energy" of the coefficients?

The estimated coefficients

β_1, β_2, ..., β_p

should each be as small as possible. Borrowing from Lyapunov energy theory, the Euclidean distance (the 2-norm) can be taken as the intrinsic "energy" of the parameters in multidimensional space. Here the sum of squares of the Euclidean distance L2 is used (the Manhattan distance L1 is another option):

Σ_j β_j²

——— penalty term applied to the coefficients (the ridge penalty, based on the Euclidean distance L2)

Σ_j |β_j|

——— penalty term applied to the coefficients (the lasso penalty, based on the Manhattan distance L1)
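
As a quick illustration (my addition, not from the original post), the two penalty terms can be computed directly from a coefficient vector; the beta values below are invented:

import numpy as np

#hypothetical coefficient vector (invented for illustration)
beta = np.array([0.8, -1.2, 0.05, 0.0, 2.3])

ridge_penalty = np.sum(beta ** 2)      #L2 penalty: sum of squared coefficients
lasso_penalty = np.sum(np.abs(beta))   #L1 penalty: sum of absolute coefficients

print("ridge (L2) penalty:", ridge_penalty)
print("lasso (L1) penalty:", lasso_penalty)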

Ridge regression

Ridge regression is one kind of penalized linear regression; its defining feature is that it changes the OLS objective to

min_β Σ_i (y_i − x_i·β)² + α * Σ_j β_j²

The second term is known as the ridge penalty

α * Σ_j β_j²

With this objective, the model accounts for both good learning ability and small coefficient energy. The question then becomes how to set alpha. By trying alphas of different sizes, computing the predictions and the ROC curve for each, and comparing the areas under the ROC curves (AUC), we can pick the optimal alpha; the larger the AUC, the better. (AUC is appropriate here because the labels in the example below are binary: mine vs. rock.) Alternatively, people often just take alpha = 0.85.

Ridge regression Python code:

__author__ = 'mike-bowles'
import urllib.request
import numpy
from sklearn import linear_model
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

#read data from uci data repository
target_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data"
data = urllib.request.urlopen(target_url)

#arrange data into list for labels and list of lists for attributes
xList = []
labels = []
for line in data:
    #decode bytes and split on comma
    row = line.decode().strip().split(",")
    #assign label 1.0 for "M" (mine) and 0.0 for "R" (rock)
    if row[-1] == 'M':
        labels.append(1.0)
    else:
        labels.append(0.0)
    #remove label from row
    row.pop()
    #convert remaining attributes to floats
    floatRow = [float(num) for num in row]
    xList.append(floatRow)

#divide attribute matrix and label vector into training (2/3 of data) and test sets (1/3 of data)
indices = range(len(xList))
xListTest = [xList[i] for i in indices if i % 3 == 0]
xListTrain = [xList[i] for i in indices if i % 3 != 0]
labelsTest = [labels[i] for i in indices if i % 3 == 0]
labelsTrain = [labels[i] for i in indices if i % 3 != 0]

#convert lists of lists into numpy arrays to match the input expected by the scikit-learn linear model
xTrain = numpy.array(xListTrain)
yTrain = numpy.array(labelsTrain)
xTest = numpy.array(xListTest)
yTest = numpy.array(labelsTest)

#candidate alpha values, from 10^3 down to 10^-5
alphaList = [0.1**i for i in [-3, -2, -1, 0, 1, 2, 3, 4, 5]]

aucList = []
for alph in alphaList:
    rocksVMinesRidgeModel = linear_model.Ridge(alpha=alph)
    rocksVMinesRidgeModel.fit(xTrain, yTrain)
    fpr, tpr, thresholds = roc_curve(yTest, rocksVMinesRidgeModel.predict(xTest))
    roc_auc = auc(fpr, tpr)
    aucList.append(roc_auc)  #record the AUC obtained at each alpha

print("AUC             alpha")
for i in range(len(aucList)):
    print(aucList[i], alphaList[i])

#plot AUC values versus -log10(alpha)
x = [-3, -2, -1, 0, 1, 2, 3, 4, 5]
plt.plot(x, aucList)
plt.xlabel('-log(alpha)')
plt.ylabel('AUC')
plt.show()

#visualize the performance of the best classifier
indexBest = aucList.index(max(aucList))  #index of the largest AUC
alph = alphaList[indexBest]
rocksVMinesRidgeModel = linear_model.Ridge(alpha=alph)  #refit the ridge model at the best alpha
rocksVMinesRidgeModel.fit(xTrain, yTrain)

#scatter plot of predicted vs actual labels on the test set
plt.scatter(rocksVMinesRidgeModel.predict(xTest), yTest, s=100, alpha=0.25)
plt.xlabel("Predicted Value")
plt.ylabel("Actual Value")
plt.show()
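
As a side note (my addition, not part of the original code): scikit-learn's RidgeCV can perform the alpha search in one call; a minimal sketch, assuming the xTrain and yTrain arrays from the script above:

from sklearn.linear_model import RidgeCV

#search the same alpha grid, scoring each candidate alpha by cross-validated AUC
ridgeCV = RidgeCV(alphas=[0.1**i for i in range(-3, 6)], scoring='roc_auc', cv=5)
ridgeCV.fit(xTrain, yTrain)
print("best alpha:", ridgeCV.alpha_)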

How to solve the penalized linear regression problem

Solving a penalized linear regression problem is equivalent to solving a constrained optimization problem, which can be tackled with any number of numerical optimization algorithms:

min_β Σ_i (y_i − x_i·β)²

       subject to

Σ_j |β_j| ≤ t

Here we use the least angle regression (LARS) algorithm, which has a compact structure, is easy to understand, and is implemented in code below. Note: the method is clever and fast. It starts from all-zero coefficients and, after a preset number of small iterative steps, keeps the result with the smallest root-mean-square error while keeping the coefficients as small as possible, which is exactly what the penalty term is for.

The LARS algorithm can be understood as a refined form of forward stepwise regression. Its steps (see the full implementation below) are:

1. Initialize all coefficients to 0;

2. At each step:

a) determine which attribute has the largest correlation (in absolute value) with the residuals;    b) if that correlation is positive, increase the corresponding coefficient by a small amount; if it is negative, decrease it by a small amount.

__author__ = 'mike-bowles'

from math import sqrt
import matplotlib.pyplot as plot

#read data from a local copy of the wine-quality data set (semicolon-separated text file)
file = 'winequality-red.csv'

xList = []
labels = []
names = []
firstLine = True
with open(file, 'r') as f:
    for line in f.readlines():
        if firstLine:
            #first line holds the column names
            names = line.strip().split(";")
            firstLine = False
        else:
            #split on semicolon
            row = line.strip().split(";")
            #put labels (wine quality score, last column) in a separate array
            labels.append(float(row[-1]))
            #remove label from row
            row.pop()
            #convert row to floats
            floatRow = [float(num) for num in row]
            xList.append(floatRow)

#Normalize columns in x and labels
nrows = len(xList)
ncols = len(xList[0])

#calculate means and standard deviations of the attribute columns
xMeans = []
xSD = []
for i in range(ncols):
    col = [xList[j][i] for j in range(nrows)]
    mean = sum(col)/nrows
    xMeans.append(mean)
    colDiff = [(xList[j][i] - mean) for j in range(nrows)]
    sumSq = sum([d * d for d in colDiff])
    stdDev = sqrt(sumSq/nrows)
    xSD.append(stdDev)

#use the calculated means and standard deviations to normalize xList
xNormalized = []
for i in range(nrows):
    rowNormalized = [(xList[i][j] - xMeans[j])/xSD[j] for j in range(ncols)]
    xNormalized.append(rowNormalized)

#Normalize labels (z-score / Gaussian normalization)
meanLabel = sum(labels)/nrows
sdLabel = sqrt(sum([(labels[i] - meanLabel) * (labels[i] - meanLabel) for i in range(nrows)])/nrows)
labelNormalized = [(labels[i] - meanLabel)/sdLabel for i in range(nrows)]

#initialize a vector of coefficients beta (one entry per attribute)
beta = [0.0] * ncols

#initialize matrix that records beta after each step
betaMat = []
betaMat.append(list(beta))

#number of steps to take and how much a coefficient moves per step
nSteps = 400
stepSize = 0.004
nzList = []   #attributes that have become non-zero, in the order they enter the model

for i in range(nSteps):
    #calculate residuals of the current model
    residuals = [0.0] * nrows
    for j in range(nrows):
        labelsHat = sum([xNormalized[j][k] * beta[k] for k in range(ncols)])  #current prediction for row j
        residuals[j] = labelNormalized[j] - labelsHat

    #calculate correlation between each normalized attribute column and the residuals
    corr = [0.0] * ncols
    for j in range(ncols):
        corr[j] = sum([xNormalized[k][j] * residuals[k] for k in range(nrows)]) / nrows

    #find the attribute with the largest absolute correlation
    iStar = 0
    corrStar = corr[0]
    for j in range(1, ncols):
        if abs(corrStar) < abs(corr[j]):
            iStar = j; corrStar = corr[j]

    #nudge that coefficient by stepSize in the direction of the correlation
    beta[iStar] += stepSize * corrStar / abs(corrStar)
    betaMat.append(list(beta))

    #record the order in which attributes become non-zero
    nzBeta = [index for index in range(ncols) if beta[index] != 0.0]
    for q in nzBeta:
        if q not in nzList:
            nzList.append(q)

#attribute names in the order they entered the model
nameList = [names[nzList[i]] for i in range(len(nzList))]
print(nameList)

#plot the coefficient curve of each attribute versus the number of steps taken
for i in range(ncols):
    coefCurve = [betaMat[k][i] for k in range(nSteps)]
    xaxis = range(nSteps)
    plot.plot(xaxis, coefCurve)

plot.xlabel("Steps Taken")
plot.ylabel("Coefficient Values")
plot.show()

#coefficients after the final step
print(betaMat[-1])
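
For comparison (my own addition), scikit-learn also ships a library version of the same idea; a minimal sketch using lars_path on the normalized data produced by the script above (xNormalized, labelNormalized, and names are assumed to be in scope):

import numpy as np
from sklearn.linear_model import lars_path

X = np.array(xNormalized)
y = np.array(labelNormalized)

#compute the full LARS/lasso coefficient path in one call
alphas, active, coefs = lars_path(X, y, method='lasso')
print("attributes active at the end of the path:", [names[i] for i in active])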

Selecting the best model from the LARS results

Expanding on the problem: how do we arrive at a good penalized linear regression model? This is where cross-validation comes in. 10-fold cross-validation splits the input data into 10 roughly equal parts (10 folds means 10% of the data is used for testing at a time; 5 folds means 20%). One part is held out as the test set, the remaining parts are used to train the model, and the model is then tested on the held-out part. Cycling through all 10 parts, holding out a different one each time, gives an estimate of the error and of how much the predictions vary. Using too many folds makes training too expensive in time; using too few leaves less data for training and lowers the accuracy of the estimate.
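
To make the splitting concrete, here is a small sketch of my own (not from the original post) that assigns rows to folds with a simple modulo rule, which is what the cross-validation loop below does as well:

nrows = 12   #pretend data set size, invented for illustration
nxval = 10   #number of cross-validation folds

for ixval in range(nxval):
    #rows whose index equals ixval modulo nxval form the test fold; the rest train the model
    idxTest = [a for a in range(nrows) if a % nxval == ixval]
    idxTrain = [a for a in range(nrows) if a % nxval != ixval]
    print("fold", ixval, "test rows:", idxTest)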

__author__ = 'mike-bowles'

import urllib.request
from math import sqrt
import matplotlib.pyplot as plot


#read data into iterable
target_url = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
data = urllib.request.urlopen(target_url)  #download the data set

xList = []
labels = []
names = []
firstLine = True
for line in data:
    if firstLine:
        #first line holds the column names
        names = line.decode().strip().split(";")
        firstLine = False
    else:
        #split on semicolon
        row = line.decode().strip().split(";")
        #put labels in a separate array
        labels.append(float(row[-1]))
        #remove label from row
        row.pop()
        #convert row to floats
        floatRow = [float(num) for num in row]
        xList.append(floatRow)

#Normalize columns in x and labels

nrows = len(xList)
ncols = len(xList[0])

#calculate means and standard deviations
xMeans = []
xSD = []
for i in range(ncols):
    col = [xList[j][i] for j in range(nrows)]
    mean = sum(col)/nrows
    xMeans.append(mean)
    colDiff = [(xList[j][i] - mean) for j in range(nrows)]
    sumSq = sum([d * d for d in colDiff])
    stdDev = sqrt(sumSq/nrows)
    xSD.append(stdDev)

#use calculated means and standard deviations to normalize xList
xNormalized = []
for i in range(nrows):
    rowNormalized = [(xList[i][j] - xMeans[j])/xSD[j] for j in range(ncols)]
    xNormalized.append(rowNormalized)

#Normalize labels (z-score / Gaussian normalization)
meanLabel = sum(labels)/nrows
sdLabel = sqrt(sum([(labels[i] - meanLabel) * (labels[i] - meanLabel) for i in range(nrows)])/nrows)
labelNormalized = [(labels[i] - meanLabel)/sdLabel for i in range(nrows)]

#Build cross-validation loop to determine best coefficient values.

#number of cross-validation folds
nxval = 10

#number of steps and step size
nSteps = 350
stepSize = 0.004

#errors[iStep] collects, over all folds, the out-of-sample errors of the model after iStep steps
errors = []
for i in range(nSteps):
    b = []
    errors.append(b)


for ixval in range(nxval):
    #Define test and training index sets: every nxval-th row, offset by ixval, forms the test fold
    idxTest = [a for a in range(nrows) if a % nxval == ixval]
    idxTrain = [a for a in range(nrows) if a % nxval != ixval]

    #Define test and training attribute and label sets
    xTrain = [xNormalized[r] for r in idxTrain]
    xTest = [xNormalized[r] for r in idxTest]
    labelTrain = [labelNormalized[r] for r in idxTrain]
    labelTest = [labelNormalized[r] for r in idxTest]

    #Train LARS regression on the training data
    nrowsTrain = len(idxTrain)
    nrowsTest = len(idxTest)

    #initialize a vector of coefficients beta
    beta = [0.0] * ncols

    #initialize matrix of betas at each step
    betaMat = []
    betaMat.append(list(beta))

    for iStep in range(nSteps):
        #calculate residuals on the training fold
        residuals = [0.0] * nrowsTrain
        for j in range(nrowsTrain):
            labelsHat = sum([xTrain[j][k] * beta[k] for k in range(ncols)])
            residuals[j] = labelTrain[j] - labelsHat

        #calculate correlation between each attribute column and the residuals
        corr = [0.0] * ncols
        for j in range(ncols):
            corr[j] = sum([xTrain[k][j] * residuals[k] for k in range(nrowsTrain)]) / nrowsTrain

        #find the attribute with the largest absolute correlation
        iStar = 0
        corrStar = corr[0]
        for j in range(1, ncols):
            if abs(corrStar) < abs(corr[j]):
                iStar = j; corrStar = corr[j]

        #nudge that coefficient and record the new beta
        beta[iStar] += stepSize * corrStar / abs(corrStar)
        betaMat.append(list(beta))

        #Use the beta just calculated to accumulate out-of-sample error on the held-out fold
        for j in range(nrowsTest):
            labelsHat = sum([xTest[j][k] * beta[k] for k in range(ncols)])
            err = labelTest[j] - labelsHat
            errors[iStep].append(err)   #errors of every test sample at this step, across all folds

#mean squared error at each step, pooled over all folds
cvCurve = []
for errVect in errors:
    mse = sum([x*x for x in errVect])/len(errVect)
    cvCurve.append(mse)

minMse = min(cvCurve)            #smallest cross-validated MSE
minPt = cvCurve.index(minMse)    #number of steps at which it occurs
print("Minimum Mean Square Error", minMse)
print("Index of Minimum Mean Square Error", minPt)

xaxis = range(len(cvCurve))
plot.plot(xaxis, cvCurve)

plot.xlabel("Steps Taken")
plot.ylabel("Mean Square Error")
plot.show()

Evaluating model performance with 10-fold cross-validation actually trains 10 models, one per fold. The code first runs the cross-validation loop, iterating LARS on each training fold and recording the error of every test sample at every step in errors[iStep]; the cross-validation result is then the step at which cvCurve (the MSE pooled over all folds) is smallest, and that index tells us how many LARS steps to use for the final model.
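
As a closing sketch of my own (not in the original post): to actually use the cross-validated result, one would re-run LARS on the full normalized data set, as in the earlier stand-alone script, and read off the coefficients at step minPt; betaMat and names below are assumed to come from such a full-data run with at least minPt steps:

#coefficients of the model selected by cross-validation,
#taken from a LARS run over the full data set (betaMat from the earlier script)
bestBeta = betaMat[minPt]

#print each attribute that has entered the model together with its coefficient
for name, b in zip(names, bestBeta):
    if b != 0.0:
        print(name, b)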