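• Overview


  • Algorithm principle

        Logistic regression has a great deal in common with multiple linear regression. The main difference lies in the dependent variable, which is why both can be placed in the same family: the generalized linear model (GLM). Models in this family share essentially the same form and differ in the distribution of the dependent variable. If it is continuous, the model is multiple linear regression; if it follows a binomial distribution, it is logistic regression; if it follows a Poisson distribution, it is Poisson regression; if it follows a negative binomial distribution, it is negative binomial regression; and so on. Telling them apart comes down to identifying the dependent variable. [1]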
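
        The decision-boundary rules that follow rely on the hypothesis hθ(x) and the function g(z). As a reminder, and assuming the standard sigmoid form (the same one the sigmoid function in the demo code implements), they can be written as:

```latex
h_\theta(x) = g(\theta^T x), \qquad g(z) = \frac{1}{1 + e^{-z}}
```

        Under this assumption, hθ(x) is read as the estimated probability that y = 1 for the input x.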











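  • Decision boundary


  • When hθ(x) ≥ 0.5, predict y = 1;
  • When hθ(x) < 0.5, predict y = 0.

        For the sigmoid function g(z):

  • when z = 0, g(z) = 0.5;
  • when z > 0, g(z) > 0.5;
  • when z < 0, g(z) < 0.5.

        Since z = θᵀx, it follows that:

  • when θᵀx ≥ 0, predict y = 1;
  • when θᵀx < 0, predict y = 0.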
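
        The equivalence between thresholding the probability at 0.5 and checking the sign of z can be verified numerically. The following is a minimal sketch; the sample z values are hypothetical and chosen only to cover both sides of zero.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sample of z values on both sides of zero.
z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
g = sigmoid(z)

# Thresholding the probability at 0.5 ...
pred_from_prob = (g >= 0.5).astype(int)
# ... gives the same labels as checking the sign of z directly.
pred_from_z = (z >= 0).astype(int)

print(pred_from_prob)
print(pred_from_z)
```

        Because the two rules agree, a prediction never needs the sigmoid itself; the sign of θᵀx is enough.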

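        Suppose we have the model hθ(x) = g(θ0 + θ1x1 + θ2x2) with the parameter vector θ = [-3, 1, 1]. Then whenever -3 + x1 + x2 ≥ 0, that is, whenever x1 + x2 ≥ 3, we predict y = 1. The line x1 + x2 = 3 is therefore the decision boundary of this model, as shown in the figure below: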

[Figure: the decision boundary x1 + x2 = 3 in the (x1, x2) plane]
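
        The worked example with θ = [-3, 1, 1] can be reproduced in a few lines. This is only a sketch: the predict helper and the three test points are illustrative assumptions, picked to fall below, on, and above the line x1 + x2 = 3.

```python
import numpy as np

theta = np.array([-3.0, 1.0, 1.0])  # [theta0, theta1, theta2] from the text

def predict(x1, x2):
    """Return 1 if theta0 + theta1*x1 + theta2*x2 >= 0, else 0."""
    z = theta[0] + theta[1] * x1 + theta[2] * x2
    return int(z >= 0)

# Hypothetical test points: below, on, and above the line x1 + x2 = 3.
for point in [(1.0, 1.0), (1.0, 2.0), (2.0, 2.0)]:
    print(point, predict(*point))
```

        A point lying exactly on the line is assigned y = 1 because the rule uses ≥ rather than >.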

  • Loss function




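        In his course, Andrew Ng presents the cross-entropy Cost function and the overall loss J(θ) over the whole dataset directly, without a detailed derivation. He only argues that this function is a reasonable measure of how well hθ(x) predicts.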
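
        For reference, the standard cross-entropy form of these two functions, which matches the cost computed by calculate_grads_cost in the demo code, can be written as follows. Here m is the number of training examples, and the superscript (i) indexing is notation assumed for this sketch.

```latex
\mathrm{Cost}(h_\theta(x), y) =
\begin{cases}
-\log\bigl(h_\theta(x)\bigr)     & \text{if } y = 1 \\
-\log\bigl(1 - h_\theta(x)\bigr) & \text{if } y = 0
\end{cases}

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m}
\Bigl[\, y^{(i)} \log h_\theta(x^{(i)})
      + (1 - y^{(i)}) \log\bigl(1 - h_\theta(x^{(i)})\bigr) \Bigr]
```

        The cost is close to 0 when the predicted probability agrees with the label and grows without bound as a confident prediction turns out to be wrong, which is the sense in which it measures prediction quality.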


  • Algorithm demo


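# -*- coding:utf-8 -*-
# Saved as plotBestFit.py, since the training script imports it by that name
import matplotlib.pyplot as plt
import numpy as np

# Plot the samples and the fitted decision line
def plotBestFit(w, b, dataMat, labelMat):
    m = dataMat.shape[0]
    # Flatten the labels so that each entry is a scalar
    labels = np.ravel(labelMat)
    # Coordinates of the samples labelled 1
    xcord1 = []
    ycord1 = []
    # Coordinates of the samples labelled 0
    xcord2 = []
    ycord2 = []
    for i in range(m):
        if int(labels[i]) == 1:
            xcord1.append(dataMat[i, 0])
            ycord1.append(dataMat[i, 1])
        else:
            xcord2.append(dataMat[i, 0])
            ycord2.append(dataMat[i, 1])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xcord1, ycord1, s=30, c='red', marker='x')
    ax.scatter(xcord2, ycord2, s=30, c='green', marker='o')
    # Decision line: w[0] * x + w[1] * y + b = 0, solved for y
    x = np.arange(-3.0, 3.0, 0.1)
    y = (-w[0] * x - b) / w[1]
    ax.plot(x, y)
    plt.show()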




# -*- coding:utf-8 -*-
import numpy as np
from plotBestFit import plotBestFit

# Sigmoid function
def sigmoid(z):
    s = 1.0 / (1 + np.exp(-z))
    return s

# Compute the gradients and the cost
def calculate_grads_cost(w, b, X, Y):
    m = X.shape[0]
    A = sigmoid(np.dot(X, w) + b)
    cost = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    # Average over the m samples so the gradients match the averaged cost
    dw = np.dot(X.T, A - Y) / m
    db = np.sum(A - Y) / m
    grads = {
        "dw": dw,
        "db": db
    }
    return grads, cost

# Minimize the cost with gradient descent
def optimize(w, b, X, Y, num_iterations, learning_rate, print_cost=False):
    costs = []
    for i in range(num_iterations):
        grads, cost = calculate_grads_cost(w, b, X, Y)
        dw = grads["dw"]
        db = grads["db"]
        w = w - learning_rate * dw
        b = b - learning_rate * db
        # Record the cost every 100 iterations
        if i % 100 == 0:
            costs.append(cost)
        if print_cost and i % 100 == 0:
            print("Cost after iteration %d: %f" % (i, cost))
    params = {
        "w": w,
        "b": b
    }
    grads = {
        "dw": dw,
        "db": db
    }
    return params, grads, costs

if __name__ == "__main__":
    # Each row of testSet.txt holds two features followed by a 0/1 label
    data = np.loadtxt("testSet.txt")
    dataMat = data[:, 0:2]
    labelMat = data[:, 2]
    m, n = dataMat.shape
    labelMat = labelMat.reshape(m, 1)
    w = np.zeros((n, 1))
    b = 0
    params, grads, costs = optimize(w, b, dataMat, labelMat, 200, 0.1)
    w = params["w"]
    b = params["b"]
    # print(w)
    # print(b)
    # Visualize the result
    plotBestFit(w, b, dataMat, labelMat)
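
One way to sanity-check gradient code such as calculate_grads_cost is to compare the analytic gradients with finite differences. The sketch below re-implements the cost with gradients averaged over the m samples, so that they are the derivatives of the averaged cost, and uses synthetic data in place of testSet.txt. The helper name cost_and_grads, the random data, and the step size eps are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grads(w, b, X, Y):
    """Averaged cross-entropy cost and its analytic gradients."""
    m = X.shape[0]
    A = sigmoid(X @ w + b)
    cost = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    dw = X.T @ (A - Y) / m
    db = np.sum(A - Y) / m
    return cost, dw, db

# Synthetic data (an assumption; it stands in for testSet.txt).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
Y = (X[:, [0]] + X[:, [1]] > 0).astype(float)
w = rng.normal(size=(2, 1))
b = 0.3

_, dw, db = cost_and_grads(w, b, X, Y)

# Central finite differences, one parameter at a time.
eps = 1e-6
num_dw = np.zeros_like(w)
for j in range(w.shape[0]):
    step = np.zeros_like(w)
    step[j, 0] = eps
    num_dw[j, 0] = (cost_and_grads(w + step, b, X, Y)[0]
                    - cost_and_grads(w - step, b, X, Y)[0]) / (2 * eps)
num_db = (cost_and_grads(w, b + eps, X, Y)[0]
          - cost_and_grads(w, b - eps, X, Y)[0]) / (2 * eps)

# Largest disagreement between analytic and numerical gradients.
print(np.max(np.abs(dw - num_dw)), abs(db - num_db))
```

If the two printed differences are not tiny, the gradient and the cost are probably scaled differently, for example one averaged over m and the other summed.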



scikit-learn demo:


# -*- coding: utf-8 -*-
"""
Logistic Regression 3-class Classifier

Shown below are a logistic-regression classifier's decision boundaries on the
first two dimensions (sepal length and width) of the `iris
<https://en.wikipedia.org/wiki/Iris_flower_data_set>`_ dataset. The datapoints
are colored according to their labels.
"""

# Code source: Gaël Varoquaux
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn import datasets

# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features.
Y = iris.target

# Create an instance of Logistic Regression Classifier and fit the data.
# multi_class='multinomial' is omitted here: recent scikit-learn releases
# deprecate that parameter, and lbfgs already fits a multinomial model
# for multiclass targets by default.
logreg = LogisticRegression(C=1e5, solver='lbfgs')
logreg.fit(X, Y)

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
h = .02  # step size in the mesh
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1, figsize=(4, 3))
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=Y, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.show()
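
Besides hard class labels, the fitted classifier also exposes per-class probabilities, which ties the demo back to the probabilistic reading of hθ(x). The following is a minimal sketch on the same two iris features. The max_iter=1000 setting is an assumption made here to give lbfgs room to converge under weak regularization; it is not part of the demo.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
X = iris.data[:, :2]  # sepal length and width only, as in the demo
Y = iris.target

# max_iter=1000 is an assumption, not taken from the demo.
clf = LogisticRegression(C=1e5, solver="lbfgs", max_iter=1000)
clf.fit(X, Y)

proba = clf.predict_proba(X)  # shape (n_samples, n_classes)
pred = clf.predict(X)         # class with the highest probability

print(proba[:3].round(3))
print("training accuracy:", clf.score(X, Y))
```

Each row of proba sums to 1, and predict returns the class whose probability is largest in that row.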


