Linear and Nonlinear Regression Analysis R: Linear Regression and Nonlinear Regression

Reposted

架构设计师 2024-05-28 22:17:01


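This series follows the hands-on sklearn machine learning course from DataWhale's December 2022 team-study program. The course is divided into eight task chapters, and the open-source material is available online (link in the original post). Let's get started.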

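Linear Regression

Linear: the relationship between two variables is a first-degree function, whose graph is a straight line.
Nonlinear: the relationship between two variables is not a first-degree function, so its graph is not a straight line.
Regression: because of practical limitations, a measurement yields an observed value rather than the true value of a quantity. To approach the true value, one measures repeatedly and uses the accumulated measurements to "regress" back toward it. This is where the term comes from.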

Linear regression uses samples $D = \{(x_i, y_i)\}$, $i = 1, 2, \ldots, N$, where $x_i$ is the feature data (a single feature or several). Through supervised learning it learns a mapping $h$ from $x$ to $y$ and uses that mapping to make predictions on unseen data. Because $y$ is continuous, this is a regression problem.

We start with univariate linear regression, which has the form $y = ax + b$. Suppose the true function is $y = 1.5x + 0.2$. We define it in code as the function true_fun, then add random noise to build the training set, which contains 30 sample points.

import numpy as np
import matplotlib.pyplot as plt

def true_fun(X):
    return 1.5*X + 0.2

np.random.seed(0)  # fix the random seed for reproducibility
n_samples = 30
# Generate random data to use as the training set
X_train = np.sort(np.random.rand(n_samples))
y_train = (true_fun(X_train) + np.random.randn(n_samples) * 0.05).reshape(n_samples, 1)

Plotted, the training set looks like this:

[Figure: scatter plot of the training set]

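The true model behind this training set is the true_fun function defined above. In practice we would not know it, so we train a univariate linear regression on the data and try to recover the true model.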

A real model includes a bias term $b$. In the univariate case, $y = wx + b$. When using gradient descent, we can add one extra feature dimension and set $x_0 = 1$, so that the learned weight $w_0$ equals $b$.
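As a minimal sketch of this trick (the names X_aug and theta are illustrative and do not come from the original), prepending a constant column of ones lets a single matrix product compute both the weighted input and the bias:

```python
import numpy as np

x = np.array([0.0, 0.5, 1.0])   # three sample inputs
w, b = 1.5, 0.2                 # slope and bias of the true function used in this article

# Prepend x_0 = 1 to every sample, giving rows of the form [1, x]
X_aug = np.column_stack([np.ones_like(x), x])

# With theta = [w_0, w_1] = [b, w], one matrix product yields w*x + b
theta = np.array([b, w])
y_aug = X_aug @ theta
y_direct = w * x + b

print(y_aug)                          # same values as y_direct
print(np.allclose(y_aug, y_direct))   # True
```

The bias then needs no special handling in the update rule: it is simply the weight attached to the constant feature.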

Model parameters are often found with gradient descent. For the details of the method, see the document linked in the original post.

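Briefly, suppose the model, i.e. the hypothesis function of univariate linear regression, is $h_\theta(x) = \theta_0 + \theta_1 x$, with objective (loss) function $J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x_i) - y_i\bigr)^2$, where $m$ is the number of samples. The goal is to make $J(\theta)$ as small as possible. A factor of $\frac{1}{2}$ is added to simplify the derivative later, giving $J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\bigl(h_\theta(x_i) - y_i\bigr)^2$. The gradient is then $\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x_i) - y_i\bigr)x_{ij}$, where $x_{ij}$ is feature $j$ of sample $i$ and $x_{i0} = 1$.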
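For reference, the gradient follows from the chain rule in one step. This is a sketch assuming the halved loss $J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x_i) - y_i)^2$ and the notation $x_{ij}$ for feature $j$ of sample $i$, with $x_{i0} = 1$ (a notational assumption made here):

```latex
\begin{aligned}
\frac{\partial J(\theta)}{\partial \theta_j}
  &= \frac{1}{2m}\sum_{i=1}^{m} 2\,\bigl(h_\theta(x_i) - y_i\bigr)\,
     \frac{\partial h_\theta(x_i)}{\partial \theta_j} \\
  &= \frac{1}{m}\sum_{i=1}^{m} \bigl(h_\theta(x_i) - y_i\bigr)\,x_{ij},
  \qquad \text{since } h_\theta(x_i) = \sum_{j} \theta_j x_{ij}.
\end{aligned}
```

The factor 2 produced by differentiating the square cancels the $\frac{1}{2}$, which is the only reason the $\frac{1}{2}$ is introduced.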

Let $x$ be an $(m, n)$ matrix and $y$ an $(m, 1)$ matrix, and let $h_\theta$ be the predicted values, with the same shape as $y$. In matrix form the gradient is $\nabla J(\theta) = \frac{1}{m}x^{T}(h_\theta - y)$.
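To check that the matrix form is the same quantity as the per-parameter sum, the following sketch computes both on small random data. The sizes and random values are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 2                                  # 5 samples, 2 features (toy sizes)
X = rng.normal(size=(m, n))                  # feature matrix, shape (m, n)
y = rng.normal(size=(m, 1))                  # targets, shape (m, 1)
theta = rng.normal(size=(n, 1))              # parameters, shape (n, 1)

h = X @ theta                                # predictions, same shape as y

# Element-wise form: (1/m) * sum_i (h_i - y_i) * x_ij for each j
grad_loop = np.zeros((n, 1))
for j in range(n):
    for i in range(m):
        grad_loop[j, 0] += (h[i, 0] - y[i, 0]) * X[i, j]
grad_loop /= m

# Matrix form: (1/m) * X^T (h - y)
grad_matrix = X.T @ (h - y) / m

print(np.allclose(grad_loop, grad_matrix))   # True
```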

The gradient descent procedure

[Figure: steps of the gradient descent algorithm]
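As a sketch of the procedure in its usual batch form, each iteration moves all parameters one step against the gradient. The learning rate $\alpha$, the iteration index $t$, and the iteration count $T$ are notation assumed here:

```latex
\begin{aligned}
&\text{initialize } \theta^{(0)} \\
&\text{for } t = 0, 1, \ldots, T-1: \\
&\qquad \theta^{(t+1)} = \theta^{(t)} - \alpha\,\nabla J\bigl(\theta^{(t)}\bigr)
   = \theta^{(t)} - \frac{\alpha}{m}\,x^{T}\bigl(x\,\theta^{(t)} - y\bigr)
\end{aligned}
```

A larger $\alpha$ takes bigger steps but can overshoot the minimum, while a smaller $\alpha$ is more stable but needs more iterations.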

A from-scratch implementation looks like this:

# Add a column of ones so that the bias is learned as weights[0]
data_X = []
for x in X_train:
    data_X.append([1, x])
data_X = np.array(data_X)

m, p = np.shape(data_X)  # m: number of samples, p: number of features
max_iter = 1000  # number of iterations
weights = np.ones((p, 1))  # initialize the weight vector
alpha = 0.1  # learning rate
for i in range(max_iter):
    error = np.dot(data_X, weights) - y_train  # prediction error, shape (m, 1)
    gradient = data_X.transpose().dot(error) / m
    weights = weights - alpha * gradient
print("Parameter w:", weights[1])  # slope w
print("Parameter b:", weights[0])  # bias b

Parameter w: [1.445439]
Parameter b: [0.22683262]
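Gradient descent with a fixed number of iterations stops near the minimizer, not necessarily exactly at it. As a cross-check that is not part of the original, the sketch below solves the same least-squares problem directly with np.linalg.lstsq and confirms that the gradient vanishes at that solution. The variable names weights_ls and gradient are chosen here for illustration:

```python
import numpy as np

def true_fun(X):
    return 1.5 * X + 0.2

# Same data-generation steps as in the article
np.random.seed(0)
n_samples = 30
X_train = np.sort(np.random.rand(n_samples))
y_train = (true_fun(X_train) + np.random.randn(n_samples) * 0.05).reshape(n_samples, 1)

# Feature matrix with a leading column of ones: rows are [1, x]
data_X = np.column_stack([np.ones(n_samples), X_train])

# Direct least-squares solution of data_X @ weights ~ y_train
weights_ls, *_ = np.linalg.lstsq(data_X, y_train, rcond=None)

# At the least-squares solution the gradient of the loss is numerically zero
gradient = data_X.T @ (data_X @ weights_ls - y_train) / n_samples

print("w (least squares):", weights_ls[1])
print("b (least squares):", weights_ls[0])
print("gradient is ~0:", np.allclose(gradient, 0.0, atol=1e-10))
```

Since sklearn's LinearRegression also solves an ordinary least-squares problem, these values should agree closely with the sklearn output shown later, and the small gap to the gradient descent result reflects the remaining distance to convergence.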

The trained model, the true true_fun model, and the training set can be visualized as follows:

X_test = np.linspace(0, 1, 100)
plt.plot(X_test, X_test*weights[1][0]+weights[0][0], label="Model") 
plt.plot(X_test, true_fun(X_test), label="True function")
plt.scatter(X_train, y_train)  # plot the training points
plt.legend(loc="best")
plt.show()

[Figure: fitted model, true function, and training points]

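The plot shows that the model fitted by gradient descent is very close to the true model (w ≈ 1.445 against 1.5, b ≈ 0.227 against 0.2), so the fit is good.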

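Implementation with sklearn

scikit-learn, sklearn for short, is an open-source machine learning toolkit for Python. It builds on Python numerical libraries such as NumPy, SciPy, and Matplotlib to provide efficient implementations, and it covers almost all mainstream machine learning algorithms.

Official website: https://scikit-learn.org

The same procedure implemented with sklearn is shown below. The difference is that the code calls the LinearRegression class from sklearn directly to define the linear regression model. The number of features in the model is determined by the number of features in the training data. In this example the training set has one feature, so the model is a univariate linear regression.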

import numpy as np
from sklearn.linear_model import LinearRegression  # import the linear regression model
import matplotlib.pyplot as plt

def true_fun(X):
    return 1.5*X + 0.2

np.random.seed(0)  # fix the random seed for reproducibility
n_samples = 30
# Generate random data to use as the training set
X_train = np.sort(np.random.rand(n_samples))
y_train = (true_fun(X_train) + np.random.randn(n_samples) * 0.05).reshape(n_samples, 1)

model = LinearRegression()  # define the model
model.fit(X_train[:, np.newaxis], y_train)  # train the model

print("Parameter w:", model.coef_)  # slope w
print("Parameter b:", model.intercept_)  # bias b

X_test = np.linspace(0, 1, 100)
plt.plot(X_test, model.predict(X_test[:, np.newaxis]), label="Model")
plt.plot(X_test, true_fun(X_test), label="True function")
plt.scatter(X_train, y_train)  # plot the training points
plt.legend(loc="best")
plt.show()

Parameter w: [[1.4474774]]
Parameter b: [0.22557542]
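The double brackets in the printed w come from the shape of the targets: y_train was reshaped to (n_samples, 1), so sklearn treats it as a one-column multi-output problem. The sketch below, on made-up noise-free data, shows how the shapes of coef_ and intercept_ follow the shape of y:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])   # one feature, shape (4, 1)
y_1d = 1.5 * X.ravel() + 0.2                 # targets, shape (4,)
y_2d = y_1d.reshape(-1, 1)                   # targets, shape (4, 1)

model_1d = LinearRegression().fit(X, y_1d)
model_2d = LinearRegression().fit(X, y_2d)

print(model_1d.coef_.shape, np.shape(model_1d.intercept_))  # (1,) ()
print(model_2d.coef_.shape, np.shape(model_2d.intercept_))  # (1, 1) (1,)
```

Passing a one-dimensional y is usually the simpler choice when there is a single target.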


Multiple Linear Regression

Take linear regression with three features as an example: $y = w_1 x_1 + w_2 x_2 + w_3 x_3 + b$.

from sklearn.linear_model import LinearRegression

X_train = [[1,1,1],[1,1,2],[1,2,1]]
y_train = [[6],[9],[8]]
 
model = LinearRegression()
model.fit(X_train, y_train)
print("Parameters w:", model.coef_)  # w1, w2, w3
print("Parameter b:", model.intercept_)  # bias b
test_X = [[1, 3, 5]]
pred_y = model.predict(test_X)
print("Prediction:", pred_y)

Parameters w: [[0. 2. 3.]]
Parameter b: [1.]
Prediction: [[22.]]
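To see where the prediction comes from, the printed parameters can be plugged back into $y = w_1 x_1 + w_2 x_2 + w_3 x_3 + b$ by hand. The following sketch hard-codes the printed values instead of refitting the model:

```python
import numpy as np

X_train = np.array([[1, 1, 1], [1, 1, 2], [1, 2, 1]])
y_train = np.array([6, 9, 8])

w = np.array([0.0, 2.0, 3.0])   # coefficients printed above
b = 1.0                         # intercept printed above

# The fitted parameters reproduce the three training targets exactly
print(X_train @ w + b)          # [6. 9. 8.]

# Prediction for the test sample [1, 3, 5]: 0*1 + 2*3 + 3*5 + 1
print(np.array([1, 3, 5]) @ w + b)   # 22.0
```

Note that three samples cannot pin down four unknowns, and the first feature is constant in the training data, so these parameters are one of many sets that fit the data exactly.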

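All the regression models above are linear. Real problems often involve more complex relationships that a linear model cannot capture. In those cases we can turn to polynomial regression.

Polynomial Regression

Polynomial regression is typically used to fit more complex curves.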
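Polynomial regression can be viewed as linear regression on expanded features: each input x is replaced by its powers, and a linear model is fitted on top. A small sketch of the expansion step with sklearn's PolynomialFeatures, using made-up inputs:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0], [3.0]])   # two samples, one feature

# degree=3 without the constant column: each x becomes [x, x^2, x^3]
poly = PolynomialFeatures(degree=3, include_bias=False)
X_poly = poly.fit_transform(X)

print(X_poly)
# [[ 2.  4.  8.]
#  [ 3.  9. 27.]]
```

The model stays linear in its parameters, which is why LinearRegression can still be used as the final step.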

import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def true_fun(X):
    return np.cos(1.5 * np.pi * X)

np.random.seed(0)

n_samples = 30
degrees = [1, 4, 15]  # highest polynomial degree for each model

X = np.sort(np.random.rand(n_samples)) 
y = true_fun(X) + np.random.randn(n_samples) * 0.1

plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)
    plt.setp(ax, xticks=(), yticks=())

    polynomial_features = PolynomialFeatures(degree=degrees[i],
                                             include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("linear_regression", linear_regression)])  # chain the two steps with a Pipeline
    pipeline.fit(X[:, np.newaxis], y)

    # Evaluate the model with 10-fold cross-validation
    scores = cross_val_score(pipeline, X[:, np.newaxis], y,
                             scoring="neg_mean_squared_error", cv=10)
    X_test = np.linspace(0, 1, 100)
    plt.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    plt.plot(X_test, true_fun(X_test), label="True function")
    plt.scatter(X, y, edgecolor='b', s=20, label="Samples")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim((0, 1))
    plt.ylim((-2, 2))
    plt.legend(loc="best")
    plt.title("Degree {}\nMSE = {:.2e}(+/- {:.2e})".format(
        degrees[i], -scores.mean(), scores.std()))
plt.show()

[Figure: polynomial fits of degree 1, 4, and 15, each titled with its cross-validated MSE]
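The cross-validation scores in the code above can also be compared numerically instead of being read off the plot titles. The sketch below repeats the same data generation and pipelines, collects the mean cross-validated MSE per degree, and reports the degree with the lowest value. The names cv_mse and best_degree are introduced here for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

def true_fun(X):
    return np.cos(1.5 * np.pi * X)

# Same data as in the example above
np.random.seed(0)
n_samples = 30
X = np.sort(np.random.rand(n_samples))
y = true_fun(X) + np.random.randn(n_samples) * 0.1

cv_mse = {}
for degree in [1, 4, 15]:
    pipeline = Pipeline([
        ("polynomial_features", PolynomialFeatures(degree=degree, include_bias=False)),
        ("linear_regression", LinearRegression()),
    ])
    # The scorer returns the negative MSE (higher is better), so flip the sign
    scores = cross_val_score(pipeline, X[:, np.newaxis], y,
                             scoring="neg_mean_squared_error", cv=10)
    cv_mse[degree] = -scores.mean()

best_degree = min(cv_mse, key=cv_mse.get)
for degree, mse in cv_mse.items():
    print("degree {:2d}: cross-validated MSE = {:.2e}".format(degree, mse))
print("lowest cross-validated MSE at degree", best_degree)
```

A degree that is too low underfits the cosine curve, while a very high degree follows the noise in the training folds and generalizes poorly to the held-out fold.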

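Logistic Regression

Logistic regression is used for classification. The linear regression output $Y$ is passed through a nonlinear transformation, the Sigmoid function, which yields a number $S$ in the range $(0, 1)$. $S$ can be interpreted as a probability: with a probability threshold of 0.5, samples with $S$ greater than 0.5 are treated as positive and samples with $S$ below 0.5 as negative, which gives a classifier.

The essence of logistic regression: maximum likelihood estimation
The activation function of logistic regression: Sigmoid
The cost function of logistic regression: cross-entropy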

The Sigmoid Function

The formula is:

$S(t) = \frac{1}{1 + e^{-t}}$

[Figure: the S-shaped curve of the Sigmoid function]

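Whatever value $t$ takes, the result lies in the interval $(0, 1)$. Recall that a binary classification problem has two answers, "yes" and "no", so let 0 stand for "no" and 1 for "yes". But the output is a value in $(0, 1)$, so how do we get just 0 and 1? We pick a classification threshold, say 0.5: values above 0.5 are assigned to class 1 and values below 0.5 to class 0. The threshold can be chosen freely.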
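A short sketch of the function and the thresholding step described here. The sample inputs are arbitrary, and assigning a value exactly at the threshold to class 0 is a choice made for this example:

```python
import numpy as np

def sigmoid(t):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

t = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
s = sigmoid(t)
print(np.round(s, 3))            # [0.007 0.269 0.5   0.731 0.993]

threshold = 0.5                  # adjustable decision threshold
labels = (s > threshold).astype(int)
print(labels)                    # [0 0 0 1 1]
```

Raising the threshold makes the classifier more conservative about predicting class 1, and lowering it does the opposite.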

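Next, substituting $aX + b$ for $t$ gives the general model equation of logistic regression:

Hypothesis function of logistic regression: $H(a, b) = \frac{1}{1 + e^{-(aX + b)}}$. The result $P = H(a, b)$ can also be interpreted as a probability. In other words, samples with probability greater than 0.5 belong to class 1 and samples with probability below 0.5 belong to class 0, which achieves the goal of classification.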

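Loss Function

The loss function of logistic regression is the log-likelihood loss (log loss). Writing the hypothesis in terms of a parameter vector $\theta$ as $h_\theta(x) = \frac{1}{1 + e^{-\theta^{T}x}}$, the two class probabilities are $P(y = 1 \mid x; \theta) = h_\theta(x)$ and $P(y = 0 \mid x; \theta) = 1 - h_\theta(x)$. Combining the two expressions gives the probability distribution $P(y \mid x; \theta) = h_\theta(x)^{y}\bigl(1 - h_\theta(x)\bigr)^{1 - y}$. Maximizing the likelihood over $m$ samples is equivalent to maximizing the log-likelihood, whose expression is:

$\ell(\theta) = \sum_{i=1}^{m}\bigl[y_i \log h_\theta(x_i) + (1 - y_i)\log\bigl(1 - h_\theta(x_i)\bigr)\bigr]$. Negating it and averaging over the samples gives the loss function: $J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\bigl[y_i \log h_\theta(x_i) + (1 - y_i)\log\bigl(1 - h_\theta(x_i)\bigr)\bigr]$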
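To get a feel for this loss, the sketch below evaluates the average negative log-likelihood for three sets of predicted probabilities on the same labels. The function name binary_log_loss and the sample values are chosen here for illustration:

```python
import numpy as np

def binary_log_loss(y_true, p_pred):
    """Average negative log-likelihood for binary labels and predicted P(y=1)."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    return -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

y = [1, 0, 1, 0]

confident_right = binary_log_loss(y, [0.9, 0.1, 0.9, 0.1])   # -log(0.9), about 0.105
uninformative = binary_log_loss(y, [0.5, 0.5, 0.5, 0.5])     # -log(0.5), about 0.693
confident_wrong = binary_log_loss(y, [0.1, 0.9, 0.1, 0.9])   # -log(0.1), about 2.303

print(confident_right, uninformative, confident_wrong)
```

The loss grows quickly when the model is confident and wrong, which is the behavior that pushes the parameters toward well-calibrated probabilities.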

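Its gradient is $\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x_i) - y_i\bigr)x_{ij}$. The derivation is as follows: $\frac{\partial J(\theta)}{\partial \theta_j} = -\frac{1}{m}\sum_{i=1}^{m}\left(\frac{y_i}{h_\theta(x_i)} - \frac{1 - y_i}{1 - h_\theta(x_i)}\right)\frac{\partial h_\theta(x_i)}{\partial \theta_j}$. Because $\frac{\partial h_\theta(x_i)}{\partial \theta_j} = h_\theta(x_i)\bigl(1 - h_\theta(x_i)\bigr)x_{ij}$, it follows that $\frac{\partial J(\theta)}{\partial \theta_j} = -\frac{1}{m}\sum_{i=1}^{m}\bigl(y_i - h_\theta(x_i)\bigr)x_{ij} = \frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x_i) - y_i\bigr)x_{ij}$.

Logistic Regression with NumPy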

import sys
from pathlib import Path
curr_path = str(Path().absolute())
parent_path = str(Path().absolute().parent)
sys.path.append(parent_path)  # add the parent directory to sys.path so the local Mnist package can be imported

import numpy as np
from Mnist.load_data import load_local_mnist

(x_train, y_train), (x_test, y_test) = load_local_mnist(one_hot=False)

# print(np.shape(x_train),np.shape(y_train))

# Append a column of ones so that the last entry of theta acts as the bias
x_train_modified = np.append(x_train, np.ones((len(x_train), 1)), axis=1)
x_test_modified = np.append(x_test, np.ones((len(x_test), 1)), axis=1)

# print(np.shape(x_train_modified))

# MNIST has ten labels (0-9). For a binary task, relabel digit 1 as the
# positive class (1) and every other digit as 0.
y_train_modified = np.array([1 if y_train[i] == 1 else 0 for i in range(len(y_train))])
y_test_modified = np.array([1 if y_test[i] == 1 else 0 for i in range(len(y_test))])
n_iters = 10  # number of passes over the training set

theta = np.zeros(x_train_modified.shape[1])  # one weight per feature plus the bias
lr = 0.01  # learning rate

def sigmoid(x):
    '''Sigmoid function.'''
    return 1.0 / (1 + np.exp(-x))

for i_iter in range(n_iters):
    for n in range(len(x_train_modified)):
        # Stochastic gradient step on a single sample
        hypothesis = sigmoid(np.dot(x_train_modified[n], theta))
        error = y_train_modified[n] - hypothesis
        grad = error * x_train_modified[n]
        theta += lr * grad
    print('LogisticRegression Model(learning_rate={},i_iter={})'.format(
        lr, i_iter + 1))
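The loop above trains theta but never evaluates it, and it depends on a local Mnist loader that is not shown here. As a self-contained sketch, the same per-sample update can be run on synthetic data (two Gaussian blobs, an assumption made purely for illustration) and followed by an accuracy check with the 0.5 threshold:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Synthetic stand-in for the MNIST features: two Gaussian blobs in 2-D
n_per_class = 100
x_pos = rng.normal(loc=2.0, scale=1.0, size=(n_per_class, 2))    # class 1
x_neg = rng.normal(loc=-2.0, scale=1.0, size=(n_per_class, 2))   # class 0
x = np.vstack([x_pos, x_neg])
y = np.concatenate([np.ones(n_per_class), np.zeros(n_per_class)])

# Shuffle so that the updates do not see one class after the other
order = rng.permutation(len(x))
x, y = x[order], y[order]

# Append the constant column so that the last weight is the bias
x_modified = np.append(x, np.ones((len(x), 1)), axis=1)

theta = np.zeros(x_modified.shape[1])
lr = 0.01
n_iters = 10

for i_iter in range(n_iters):
    for n in range(len(x_modified)):
        hypothesis = sigmoid(np.dot(x_modified[n], theta))
        error = y[n] - hypothesis
        theta += lr * error * x_modified[n]   # same per-sample update as above

# Classify with the 0.5 threshold and measure accuracy on the training data
y_pred = (sigmoid(x_modified @ theta) > 0.5).astype(int)
accuracy = np.mean(y_pred == y)
print("training accuracy:", accuracy)
```

For the MNIST version, the analogous check would apply the same thresholding to x_test_modified and compare against y_test_modified.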

Logistic Regression with sklearn

import sys
from pathlib import Path
curr_path = str(Path().absolute())
parent_path = str(Path().absolute().parent)
sys.path.append(parent_path)  # add the parent directory to sys.path so the local Mnist package can be imported

from Mnist.load_data import load_local_mnist

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

(X_train, y_train), (X_test, y_test) = load_local_mnist(normalize = False,one_hot = False)

X_train, y_train= X_train[:2000], y_train[:2000] 
X_test, y_test = X_test[:200],y_test[:200]

# solver: the optimizer to use. lbfgs: a quasi-Newton method; sag: stochastic average gradient
model = LogisticRegression(solver='lbfgs', max_iter=500)  # lbfgs: quasi-Newton method
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))  # print the classification report
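The example above needs the local Mnist loader. For readers who do not have it, a comparable sketch can use the small digits dataset bundled with sklearn (load_digits, 8x8 images) as a stand-in. The dataset swap, the pixel scaling, and the 80/20 split are choices made here, not part of the original:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# load_digits ships with sklearn: 1797 images of size 8x8, labels 0-9
X, y = load_digits(return_X_y=True)
X = X / 16.0  # scale pixel values from [0, 16] to [0, 1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LogisticRegression(solver='lbfgs', max_iter=500)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred))  # per-class precision, recall, f1
accuracy = accuracy_score(y_test, y_pred)
print("accuracy:", accuracy)
```

Because the labels cover ten digits, LogisticRegression fits a multiclass model here rather than the binary one built by hand in the NumPy section.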

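Differences Between Linear Regression and Logistic Regression

In linear regression the output of a sample is a continuous value, $y \in (-\infty, +\infty)$, whereas in logistic regression $y \in \{0, 1\}$ can only take the values 0 and 1.
The fitted functions also differ fundamentally:

Linear regression: $f(x) = \theta^{T}x = \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$
Logistic regression: $f(x) = P(y = 1 \mid x; \theta) = g(\theta^{T}x)$, where $g(z) = \frac{1}{1 + e^{-z}}$

In linear regression, $f(x)$ fits the output variable $y$ directly, while in logistic regression the fitted function models the probability that a sample belongs to class 1.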

Why fit the probability of class 1, and why is that valid? $\theta^{T}x = 0$ acts as the decision boundary between class 1 and class 0. When $\theta^{T}x > 0$, $y > 0.5$, and as $\theta^{T}x \to +\infty$, $y \to 1$, so $y$ is class 1. When $\theta^{T}x < 0$, $y < 0.5$, and as $\theta^{T}x \to -\infty$, $y \to 0$, so $y$ is class 0.
This is where the difference shows: in linear regression $\theta^{T}x$ is the fitted function for the predicted value, while in logistic regression $\theta^{T}x$ defines the decision boundary. The table below summarizes the differences between linear regression and logistic regression.
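The equivalence between the sign of $\theta^{T}x$ and the 0.5 probability threshold can be checked directly. In this sketch the parameter vector and the four sample points are hypothetical values chosen so the arithmetic is easy to follow:

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([1.0, -2.0])          # hypothetical parameters
X = np.array([[3.0, 1.0],              # theta^T x =  1  -> class 1
              [1.0, 1.0],              # theta^T x = -1  -> class 0
              [4.0, 0.5],              # theta^T x =  3  -> class 1
              [0.0, 2.0]])             # theta^T x = -4  -> class 0

z = X @ theta
by_sign = (z > 0).astype(int)          # which side of the boundary theta^T x = 0
by_prob = (g(z) > 0.5).astype(int)     # thresholding the probability at 0.5

print(z)          # [ 1. -1.  3. -4.]
print(by_sign)    # [1 0 1 0]
print(by_prob)    # [1 0 1 0]
```

Both rules give the same labels, because g is increasing and g(0) = 0.5.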

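	Linear regression	Logistic regression
Purpose	Prediction	Classification
$y$	Unknown (any real value)	(0,1)
Function	Fitted function	Prediction function
Parameter estimation	Least squares	Maximum likelihood estimation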

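Some further explanation:

How are the fitted function and the prediction function related? Put simply, the prediction function is the fitted function passed through a logistic transformation, so that $y \in (0, 1)$.
Can least squares and maximum likelihood estimation be substituted for each other? In general, no. Consider the principle behind each: maximum likelihood estimation finds the parameters that make the observed data most probable, so it rests on probability, whereas least squares minimizes the squared-error loss.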
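One caveat is worth noting: the two principles coincide in a special case. If the targets of a linear model are assumed to carry independent Gaussian noise with a fixed variance $\sigma^2$ (an assumption added here, not stated in the original), the log-likelihood is

```latex
\begin{aligned}
y_i &= \theta^{T}x_i + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2) \\
\log L(\theta) &= \sum_{i=1}^{m} \log\!\left[\frac{1}{\sqrt{2\pi}\,\sigma}
    \exp\!\left(-\frac{(y_i - \theta^{T}x_i)^2}{2\sigma^2}\right)\right] \\
  &= -m\log\!\bigl(\sqrt{2\pi}\,\sigma\bigr)
     - \frac{1}{2\sigma^2}\sum_{i=1}^{m}\bigl(y_i - \theta^{T}x_i\bigr)^2
\end{aligned}
```

Only the last sum depends on $\theta$, so maximizing the likelihood under this assumption is the same as minimizing the sum of squared errors. For logistic regression the labels are Bernoulli rather than Gaussian, and maximum likelihood leads to the cross-entropy loss instead.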