logistic回归人口增长预测 python logistic回归代码

转载

mob64ca13f7419f 2024-01-18 20:11:46

文章标签 人工智能算法机器学习线性回归阶跃函数 ci 文章分类 Python 后端开发

一、Logistic回归原理

（1）从线性回归到Logistic回归

假设我们给定d个属性描述的样本 x=(x1,x2,,...,xd) x = ( x 1 , x 2 , , . . . , x d ) ，则对应的的线性回归模型为：

f(x)=θ1x1+θ2x2+...+θdxd+b(1) (1) f ( x ) = θ 1 x 1 + θ 2 x 2 + . . . + θ d x d + b

令 b=θ0、x0=1 b = θ 0 、 x 0 = 1 ，则上式可以写成：

f(x)=θ0x0+θ1x1+θ2x2+...+θdxd=θTx(2)(3) (2) f ( x ) = θ 0 x 0 + θ 1 x 1 + θ 2 x 2 + . . . + θ d x d (3) = θ T x

，其中θT=(θ0,θ1,θ2,...,θd)T,x=(x1,x2,,...,xd)(4) (4) ，其中 θ T = ( θ 0 , θ 1 , θ 2 , . . . , θ d ) T , x = ( x 1 , x 2 , , . . . , x d )

现在考虑二分类问题，样本的标记 y∈{0,1} y ∈ { 0 , 1 } ，而线性回归函数 f(x) f ( x ) 输出的是实值，于是我们希望将线性回归的输出映射到0/1值，最理想的映射就是“单位阶跃函数”(unit-step funciton)：

y=⎧⎩⎨0,z<0;0.5,z=0;1,z>0,其中z就是线性回归函数f(x).(5) (5) y = { 0 , z < 0 ; 0.5 , z = 0 ; 1 , z > 0 , 其中 z 就是线性回归函数 f ( x ) .

但是单位阶跃函数不连续，不利于我们使用机器学习中的优化迭代算法求最优解，因此我们使用单调可微的对数几率函数（logistic function）作为单位阶跃函数的代替函数：

y=11+e−z(6) (6) y = 1 1 + e − z

它们对应的图为：

logistic回归人口增长预测 python logistic回归代码_机器学习

于是我们使用对数几率函数对线性回归方程进行映射，令 z=f(x) z = f ( x ) ，得到Logistic回归的表达式：

hθ(x)=11+e−f(x)=11+e−θTx(7)(8) (7) h θ ( x ) = 1 1 + e − f ( x ) (8) = 1 1 + e − θ T x

如果 hθ(x)>0.5 h θ ( x ) > 0.5 ，则分类结果 y=1 y = 1 ；如果 hθ(x)≤0.5 h θ ( x ) ≤ 0.5 ，则分类结果 y=0 y = 0 .

（2）用最大似然估计法MLE（Maximum Likelihood Estimate）求最优解 θ∗ θ ∗

在给定某组参数 θ=(θ0,θ1,θ2,...,θd) θ = ( θ 0 , θ 1 , θ 2 , . . . , θ d ) 时，样本 x=(x1,x2,,...,xd) x = ( x 1 , x 2 , , . . . , x d ) 的分类结果 y=1 y = 1 的概率为：

P(y=1|x;θ)=hθ(x)(9) (9) P ( y = 1 | x ; θ ) = h θ ( x )

相应的分类结果 y=0 y = 0 的概率为：

P(y=0|x;θ)=1−hθ(x)(10) (10) P ( y = 0 | x ; θ ) = 1 − h θ ( x )

由于现在是二分类问题，即是0/1分布，所以样本 x x 属于类别 y(y只能取0或1) y ( y 只能取 0 或 1 ) 的概率为：

P(y|x;θ)=(hθ(x))y(1−hθ(x))1−y(11) (11) P ( y | x ; θ ) = ( h θ ( x ) ) y ( 1 − h θ ( x ) ) 1 − y

假设我们有m个样本 ((x1,y1),(x2,y2),,...,(xm,ym)) ( ( x 1 , y 1 ) , ( x 2 , y 2 ) , , . . . , ( x m , y m ) ) ，则对应的似然函数为：

L(θ)=∏i=0mP(yi|xi;θ)=∏i=0m(hθ(xi))yi(1−hθ(xi))1−yi(12)(13) (12) L ( θ ) = ∏ i = 0 m P ( y i | x i ; θ ) (13) = ∏ i = 0 m ( h θ ( x i ) ) y i ( 1 − h θ ( x i ) ) 1 − y i

似然函数取对数，得到对数似然函数：

ℓ(θ)=ln(L(θ))=∑i=0myiln(hθ(xi))+(1−yi)ln(1−hθ(xi))(14)(15) (14) ℓ ( θ ) = l n ( L ( θ ) ) (15) = ∑ i = 0 m y i l n ( h θ ( x i ) ) + ( 1 − y i ) l n ( 1 − h θ ( x i ) )

使用梯度上升法求解最大似然估计值：
为了方便求解，令 ℓ(θ)=∑mi=0ki(θ) ℓ ( θ ) = ∑ i = 0 m k i ( θ ) 。这里只是简单的符号变换，先忽略上标 i i ，并将 hθ(x)写成h(θTx) h θ ( x ) 写成 h ( θ T x ) ，则 k(θ)=yln(h(θTx))+(1−y)ln(1−h(θTx)) k ( θ ) = y l n ( h ( θ T x ) ) + ( 1 − y ) l n ( 1 − h ( θ T x ) ) ，所以函数 k(θ) k ( θ ) 沿第 j 个参数 θj θ j 方向的梯度为：

∂∂θjk(θ)=(y1h(θTx)−(1−y)11−h(θTx))∂∂θjh(θTx)=(y1h(θTx)−(1−y)11−h(θTx))h(θTx)(1−h(θTx))∂∂θjθTx=(y(1−h(θTx))−(1−y)h(θTx))xj=(y−h(θTx))xj=(y−hθ(x))xj(16)(17)(18)(19)(20) (16) ∂ ∂ θ j k ( θ ) = ( y 1 h ( θ T x ) − ( 1 − y ) 1 1 − h ( θ T x ) ) ∂ ∂ θ j h ( θ T x ) (17) = ( y 1 h ( θ T x ) − ( 1 − y ) 1 1 − h ( θ T x ) ) h ( θ T x ) ( 1 − h ( θ T x ) ) ∂ ∂ θ j θ T x (18) = ( y ( 1 − h ( θ T x ) ) − ( 1 − y ) h ( θ T x ) ) x j (19) = ( y − h ( θ T x ) ) x j (20) = ( y − h θ ( x ) ) x j

结合 ∂∂θjk(θ) ∂ ∂ θ j k ( θ ) ，并添加上标，则某个样本 (xi,yi)=((x1,x2,....,xd)i,yi) ( x i , y i ) = ( ( x 1 , x 2 , . . . . , x d ) i , y i ) 的对数似然函数 ℓ(θ) ℓ ( θ ) 沿第 j 个参数 θj θ j 方向的梯度为：

∂∂θjℓ(θ)=∂∂θj(∑i=0mki(θ))=∑i=0m(yi−hθ(xi))xj(21)(22) (21) ∂ ∂ θ j ℓ ( θ ) = ∂ ∂ θ j ( ∑ i = 0 m k i ( θ ) ) (22) = ∑ i = 0 m ( y i − h θ ( x i ) ) x j

其中 j j 是 d d 个属性的下标， j∈(1,2,....,d) j ∈ ( 1 , 2 , . . . . , d ) ，由于是线性模型，因此样本 x x 是多少维，就有多少个参数 θ θ ； i i 是 m m 个样本的数量， i∈(1,2,...,m) i ∈ ( 1 , 2 , . . . , m )

1.使用MLE的角度求解最优参数组合 θ∗ θ ∗

因为目的是求对数似然函数 ℓ(θ) ℓ ( θ ) 的最大值，沿梯度上升的方向逐步逼近极大值点即可。
下面分别列出3种梯度下降法对应的参数 θj θ j 更新公式（虽然使用的是梯度下降法，但是求MLE是沿梯度方向上升求最大值，所以公式中是使用加法 ‘+’）：
假设我们有m个样本 (xi,yi)=((x1,x2,....,xd)i,yi),i∈(1,2,...,m) ( x i , y i ) = ( ( x 1 , x 2 , . . . . , x d ) i , y i ) , i ∈ ( 1 , 2 , . . . , m ) ，每个样本都有 d d

批量梯度下降法BGD（Batch Gradient Descent）
批量梯度下降法，在更新参数时使用所有的样本来进行更新。由于我们有m个样本，这里求梯度的时候就用了所有m个样本的梯度数据。
θj=θj+α∂∂θjℓ(θ)=θj+α∑i=0m(yi−hθ(xi))xj(23)(24) (23) θ j = θ j + α ∂ ∂ θ j ℓ ( θ ) (24) = θ j + α ∑ i = 0 m ( y i − h θ ( x i ) ) x j
随机梯度下降法SGD（Stochastic Gradient Descent）
与BGD的区别在于求梯度时，没有用所有的m个样本的数据，而是仅仅随机选取一个样本 xi x i 来求梯度。
θj=θj+α(yi−hθ(xi))xj(25) (25) θ j = θ j + α ( y i − h θ ( x i ) ) x j
小批量梯度下降法MBGD（Mini-batch Gradient Descent）
小批量梯度下降法是批量梯度下降法和随机梯度下降法的折衷，也就是从m个样本中，随机选取一部分来迭代。假设从m个样本中，随机选取了 T 个样本，则对应的公式为：
θj=θj+α∑t=1T(yt−hθ(xt))xj(26) (26) θ j = θ j + α ∑ t = 1 T ( y t − h θ ( x t ) ) x j

2.使用损失函数 loss(yi,y^i) l o s s ( y i , y ^ i ) 的角度求解最优参数组合 θ∗ θ ∗

损失函数 loss(yi,y^i) l o s s ( y i , y ^ i ) 等于负对数似然函数NLL(Negative Log-Likelihood)：

loss(yi,y^i)=−ℓ(θ)(27) (27) l o s s ( y i , y ^ i ) = − ℓ ( θ )

所以求解最优参数组合

θ∗ θ ∗ 就是沿 loss(yi,y^i) l o s s ( y i , y ^ i ) 梯度下降的方向求极小值（与MLE角度求解相反，因为二者是负号’-‘的关系）。因此对应的三种梯度下降公式对应地使用负号’-‘：

批量梯度下降法BGD（Batch Gradient Descent）
θj=θj−α(−∂∂θjℓ(θ))=θj−α∑i=0m(hθ(xi)−yi)xj(28)(29) (28) θ j = θ j − α ( − ∂ ∂ θ j ℓ ( θ ) ) (29) = θ j − α ∑ i = 0 m ( h θ ( x i ) − y i ) x j
随机梯度下降法SGD（Stochastic Gradient Descent）
θj=θj−α(hθ(xi)−yi)xj(30) (30) θ j = θ j − α ( h θ ( x i ) − y i ) x j
小批量梯度下降法MBGD（Mini-batch Gradient Descent）
θj=θj−α∑t=1T(hθ(xt)−yt)xj(31) (31) θ j = θ j − α ∑ t = 1 T ( h θ ( x t ) − y t ) x j

二、代码实战

这一部分是参考《机器学习实战》的代码，这里就公式符号及最优解求解过程给出必要的解释：
1. 《机器学习实战》使用的 w w 代表本文的 θ θ ，相应的线性回归方程为：

f(x)=w0x0+w1x1+w2x2+...+wdxd=wTx(32)(33) (32) f ( x ) = w 0 x 0 + w 1 x 1 + w 2 x 2 + . . . + w d x d (33) = w T x
2. 对应的梯度上升公式为：
a. 批量梯度下降法BGD

wj=wj+α∂∂θjℓ(w)=wj+α∑i=0m(yi−hw(xi))xj(34)(35) (34) w j = w j + α ∂ ∂ θ j ℓ ( w ) (35) = w j + α ∑ i = 0 m ( y i − h w ( x i ) ) x j
b. 随机梯度下降法SGD

wj=wj+α(yi−hw(xi))xj(36) (36) w j = w j + α ( y i − h w ( x i ) ) x j
c. 小批量梯度下降法MBGD

wj=wj+α∑t=1T(yt−hw(xt))xj(37) (37) w j = w j + α ∑ t = 1 T ( y t − h w ( x t ) ) x j

（1）读入含有100个样本的数据集，并建立Logistic方程，使用批量梯度上升降法BGD求最优解 w∗ w ∗

1.数据集的样子

logistic回归人口增长预测 python logistic回归代码_机器学习_02

2.读入数据集

'''
    由于目标函数是z=b+w1x1+w2x2，令x0=1，b=w0x0。将每个样本的第一个特征作为x1，第二个特征作为x2。
    所以矩阵dataMat的第一列都为1（代表x0=1）
'''
def loadDataSet():
    dataMat = []; labelMat = []
    fr = open('testSet.txt')
    for line in fr.readlines():
        lineArr = line.strip().split()
        dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])
        labelMat.append(int(lineArr[2]))
    return dataMat,labelMat

if __name__=='__main__':
    dataMat, labelMat=loadDataSet()
    print(dataMat)
    print(labelMat)

logistic回归人口增长预测 python logistic回归代码_线性回归_03

3.BGD求最优解 w∗ w ∗

'''
    def sigmoid(inX):如果inX是一个数字，则返回sigmoid(x)。如果inX是一个向量如[1,2,3]，则返回的
    是[sigmoid(1),sigmoid(2),sigmoid(3)]
'''
def sigmoid(inX):
    return 1.0/(1+exp(-inX))

'''
   def gradAscent(dataMatIn, classLabels): 使用Logistic回归的BGD公式，设置步长为0.001，迭代次数为500
'''
def gradAscent(dataMatIn, classLabels):
    dataMatrix = mat(dataMatIn)             #convert to NumPy matrix
    labelMat = mat(classLabels).transpose() #convert to NumPy matrix
    m,n = shape(dataMatrix)
    alpha = 0.001 #梯度上升的步长设置为0.001
    maxCycles = 500 #重复迭代500次
    weights = ones((n,1))
    for k in range(maxCycles):              #heavy on matrix operations
        h = sigmoid(dataMatrix*weights)     #matrix mult
        error = (labelMat - h)              #vector subtraction
        weights = weights + alpha * dataMatrix.transpose()* error #matrix mult
    return weights

if __name__=='__main__':
    dataMat, labelMat=loadDataSet()
    best_weights=gradAscent(dataMat,labelMat)
    print('最优W*=[w0,w1,w2]^T：\n',best_weights)

logistic回归人口增长预测 python logistic回归代码_机器学习_04

4.画出BGD求出的决策边界

因为当sigmoid函数 hw(xi)>0.5 h w ( x i ) > 0.5 ，则分类为 y=1 y = 1 ，当 hw(xi)≤0.5 h w ( x i ) ≤ 0.5 ，则分类为 y=0 y = 0 ，所以决策边界为 hw(xi)=0.5 h w ( x i ) = 0.5 ，即 0=w0x0+w1x1+w2x2=w0+w1x1+w2x2(x0=1) 0 = w 0 x 0 + w 1 x 1 + w 2 x 2 = w 0 + w 1 x 1 + w 2 x 2 ( x 0 = 1 ) 。当 w∗=[w0,w1,w2]T w ∗ = [ w 0 , w 1 , w 2 ] T 时，对应的决策边界直线为： x2=(−w0−w1x1)/w2 x 2 = ( − w 0 − w 1 x 1 ) / w 2

'''
    def plotBestFit(weights):先将文件'testSet.txt'中的数据画在图中，并根据输入的weights画出决策边界
'''
def plotBestFit(weights):
    import matplotlib.pyplot as plt
    weights=weights.getA()
    dataMat,labelMat=loadDataSet()
    dataArr = array(dataMat)
    n = shape(dataArr)[0] 
    xcord1 = []; ycord1 = []
    xcord2 = []; ycord2 = []
    for i in range(n):
        if int(labelMat[i])== 1:
            xcord1.append(dataArr[i,1]); ycord1.append(dataArr[i,2])
        else:
            xcord2.append(dataArr[i,1]); ycord2.append(dataArr[i,2])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xcord1, ycord1, s=30, c='red', marker='s')
    ax.scatter(xcord2, ycord2, s=30, c='green')
    x = arange(-3.0, 3.0, 0.1)
    y = (-weights[0]-weights[1]*x)/weights[2]
    ax.plot(x, y)
    plt.xlabel('X1'); plt.ylabel('X2');
    plt.show()

if __name__=='__main__':
    dataMat, labelMat=loadDataSet()
    weights=gradAscent(dataMat,labelMat)
    plotBestFit(weights)

logistic回归人口增长预测 python logistic回归代码_机器学习_05

（2）使用SGD求最优解 w∗ w ∗

1. 画出迭代步长固定为0.01的SGD对应的决策边界

def stocGradAscent0(dataMatIn, classLabels):
    dataMatrix = mat(dataMatIn)
    m,n = shape(dataMatrix)
    alpha = 0.01
    weights = ones((n,1))  #initialize to all ones
    for i in range(m):
        h = sigmoid(dataMatrix[i]*weights)
        error = classLabels[i] - h
        weights = weights + alpha *(dataMatrix[i].transpose())*error
    return weights

if __name__=='__main__':
    dataMat, labelMat=loadDataSet()
    best_weights=stocGradAscent0(dataMat,labelMat)
    print('最优W*=[w0,w1,w2]^T：\n', best_weights)
    plotBestFit(best_weights)

logistic回归人口增长预测 python logistic回归代码_ci_06

2.画出迭代步长逐渐减小的SGD对应的决策边界

def stocGradAscent1(dataMatIn, classLabels, numIter=150):
    dataMatrix = mat(dataMatIn)
    m,n = shape(dataMatrix)
    weights = ones((n, 1))  # initialize to all ones
    for j in range(numIter):
        dataIndex = list(range(m))
        for i in range(m):
            alpha = 4/(1.0+j+i)+0.0001    #apha decreases with iteration, does not go to 0 because of the constant
            randIndex = int(random.uniform(0,len(dataIndex)))
            h = sigmoid(sum(dataMatrix[randIndex]*weights))
            error = classLabels[randIndex] - h
            weights = weights + alpha*(dataMatrix[randIndex].transpose())* error
            del(dataIndex[randIndex])
    return weights

if __name__=='__main__':
    dataMat, labelMat=loadDataSet()
    best_weights=stocGradAscent1(dataMat,labelMat)
    print('最优W*=[w0,w1,w2]^T：\n', best_weights)
    plotBestFit(best_weights)

logistic回归人口增长预测 python logistic回归代码_线性回归_07

（3）从疝气病症预测病马的死亡率

1.疝气病数据集介绍

logistic回归人口增长预测 python logistic回归代码_人工智能算法_08

logistic回归人口增长预测 python logistic回归代码_人工智能算法_09

2.处理数据中的缺失值

logistic回归人口增长预测 python logistic回归代码_线性回归_10

logistic回归人口增长预测 python logistic回归代码_ci_11

3.使用Logistic回归分类函数进行分类

'''
    def classifyVector(inX, weights):inX是待分类样本的特征组成的向量，weights是Logistic回归的参数。
    如果Logistic回归结果>0.5则分类为1；如果回归结果<=0.5，则分类为0
'''
def classifyVector(inX, weights):
    prob = sigmoid(sum(inX*weights))
    if prob > 0.5:
        return 1.0
    else:
        return 0.0

def colicTest():
    frTrain = open('horseColicTraining.txt'); frTest = open('horseColicTest.txt')
    trainingSet = []; trainingLabels = []
    for line in frTrain.readlines():
        currLine = line.strip().split('\t')
        lineArr =[]
        for i in range(21):
            lineArr.append(float(currLine[i]))
        trainingSet.append(lineArr)
        trainingLabels.append(float(currLine[21]))
    trainWeights = stocGradAscent1(trainingSet, trainingLabels, 1000)
    errorCount = 0; numTestVec = 0.0
    for line in frTest.readlines():
        numTestVec += 1.0
        currLine = line.strip().split('\t')
        lineArr =[]
        for i in range(21):
            lineArr.append(float(currLine[i]))
        if int(classifyVector(array(lineArr), trainWeights))!= int(currLine[21]):
            errorCount += 1
    errorRate = (float(errorCount)/numTestVec)
    print ("the error rate of this test is: %f" % errorRate)
    return errorRate

'''
    def multiTest():调用函数colicTest()10次求结果的平均值
'''
def multiTest():
    numTests = 10; errorSum=0.0
    for k in range(numTests):
        errorSum += colicTest()
    print ("after %d iterations the average error rate is: %f" % (numTests, errorSum/float(numTests)) )


if __name__=='__main__':
    multiTest()

logistic回归人口增长预测 python logistic回归代码_阶跃函数_12