Using matrix calculus, this article studies how the connection weights and bias terms of a BP neural network (Backpropagation Neural Network) with multiple hidden layers are updated during learning. Along with the matrix-form formulas, a possible Python implementation is given.
1 Neural Network Model
The network studied here has \(m\) layers: layer 1 is the input layer, which receives the data; layer \(m\) is the output layer, which produces the result; layers \(2\sim(m-1)\) are hidden layers. Layer \(k\) has \(n_k\) neurons \((k=1,2,\cdots,m)\).
A BP neural network is fully connected, and each neuron carries two quantities: its input and its output. Layer \(k\) therefore has an input vector \(I_k\) and an output vector \(O_k\), both of length \(n_k\). In addition, each layer applies a bias to the data coming from the previous layer: \(I_{k+1}\) is the weighted sum of \(O_k\) minus a bias vector \(\bm{b}_k\). In other words, the bias vector \(\bm{b}_k\) acts on layer \(k+1\), so there are only \(m-1\) bias vectors, and \(\bm{b}_k\) has length \(n_{k+1}\).
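To make the bookkeeping concrete, here is a minimal sketch of these quantities for an assumed toy network with layer sizes \(n_1=4\), \(n_2=5\), \(n_3=3\) (the sizes are arbitrary and only for illustration):
import numpy as np

sizes = [4, 5, 3]                               # n_1, n_2, n_3 (assumed example)
inputs = [np.zeros((1, n)) for n in sizes]      # I_k, one row vector of length n_k per layer
outputs = [np.zeros((1, n)) for n in sizes]     # O_k, same length as I_k
biases = [np.zeros((1, n)) for n in sizes[1:]]  # b_k acts on layer k+1, so it has length n_{k+1}
print(len(biases))                              # m - 1 = 2 bias vectors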
2 Forward Propagation
The data enters the network as the vector \(I_1\); the activation function then gives \(O_1=f(I_1)\). Next, the connection weights \(W_1\) between layer 1 and layer 2 form a weighted sum of \(O_1\), the bias is applied, and the result becomes the input \(I_2\) of layer 2; this continues until the output vector \(O_m\) is obtained.
Taking \(O_k=(o_1,o_2,\cdots,o_{n_k})\) as an example, \(I_{k+1}\) is computed as:
\[\begin{align}I_{k+1}&=\begin{bmatrix}o_1&o_2&\cdots&o_{n_k}\end{bmatrix}\begin{bmatrix}w_{11}&w_{12}&\cdots&w_{1n_{k+1}}\\w_{21}&w_{22}&\cdots&w_{2n_{k+1}}\\\vdots&\vdots&\ddots&\vdots\\w_{n_k1}&w_{n_k2}&\cdots&w_{n_kn_{k+1}}\end{bmatrix}-\begin{bmatrix}b_{k,1}&b_{k,2}&\cdots&b_{k,n_{k+1}}\end{bmatrix}\nonumber\\\nonumber\\&=O_k W_k-\bm{b}_k\nonumber\end{align}\]
It follows that the connection weight matrix \(W_k\) has \(n_k\) rows and \(n_{k+1}\) columns.
From the above, we obtain the recurrence:
\[\begin{cases}I_{k+1}&=O_k W_k-\bm{b}_k&,k=1,2,\cdots,m-1\\\\O_k&=f(I_k)&,k=1,2,\cdots,m\end{cases}\]
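A minimal NumPy sketch of this recurrence (the sigmoid activation, random weights, and layer sizes used here are assumptions purely for illustration):
import numpy as np

def forward(i_1, weights, biases):
    # weights[k] is W_{k+1} and biases[k] is b_{k+1} in the text's 1-based notation
    f = lambda z: 1 / (1 + np.exp(-z))     # assumed activation function
    o = f(i_1)                             # O_1 = f(I_1)
    for w, b in zip(weights, biases):
        o = f(o @ w - b)                   # O_{k+1} = f(O_k W_k - b_k)
    return o                               # O_m

rng = np.random.default_rng(0)
ws = [rng.normal(size=(3, 4)), rng.normal(size=(4, 2))]   # assumed sizes n = (3, 4, 2)
bs = [np.zeros((1, 4)), np.zeros((1, 2))]
print(forward(np.array([[0.1, 0.2, 0.3]]), ws, bs))       # a 1x2 output row vector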
The output vector \(O_m\) differs from the label \(y\); the loss function \(E(O_m)\) is used to measure this error.
3 Backpropagation
Once the model loss is obtained, updating the weights and biases along the negative gradient direction reduces the model error. This process is called backpropagation, and its core idea is gradient descent. Since the loss function \(E\) is a scalar, the following uses scalar-by-matrix differentiation to find its derivatives with respect to the weights \(W_k\) and biases \(\bm{b}_k\).
For a scalar function of a matrix variable, the derivative and the differential are related by:
\[\mathrm{d}y=tr\left(\left(\frac{\partial y}{\partial X}\right)^T\mathrm{d}X\right)\]
For a vector function of a vector variable, the derivative and the differential are related by:
\[\mathrm{d}\bm{y}=\left(\frac{\partial \bm{y}}{\partial \bm{x}}\right)^T\mathrm{d}\bm{x}\]
where both \(\bm{y}\) and \(\bm{x}\) are column vectors, and \(\dfrac{\partial \bm{y}}{\partial \bm{x}}\) is the matrix whose \((i,j)\) entry is \(\partial y_j/\partial x_i\) (denominator layout).
In the first relation, \(tr\) denotes the trace of a matrix. The following facts also hold:
- Since \(O_k=f(I_k)\) (with \(O_k\) and \(I_k\) written as row vectors), applying the vector relation above to their transposes gives \((\mathrm{d}O_k)^T=\left(\dfrac{\partial O_k}{\partial I_k}\right)^T(\mathrm{d}I_k)^T\), i.e. \(\mathrm{d}O_k=(\mathrm{d}I_k)\dfrac{\partial O_k}{\partial I_k}\).
- Since \(I_{k+1}=O_k W_k-\bm{b}_k\), its differential is \(\mathrm{d}I_{k+1}=(\mathrm{d}O_k)W_k+O_k(\mathrm{d}W_k)-\mathrm{d}\bm{b}_k\).
Then, using the cyclic property of the trace \(tr(ABC)=tr(CAB)\) and the transpose rule \((AB)^T=B^T A^T\), we have:
\[\begin{align}\mathrm{d}E&=tr\left(\left(\frac{\partial E}{\partial O_{k+1}}\right)^T\mathrm{d}O_{k+1}\right)\nonumber\\\nonumber\\&=tr\left(\left(\frac{\partial E}{\partial O_{k+1}}\right)^T(\mathrm{d}I_{k+1})\frac{\partial O_{k+1}}{\partial I_{k+1}}\right)\nonumber\\\nonumber\\&=tr\left(\frac{\partial O_{k+1}}{\partial I_{k+1}}\left(\frac{\partial E}{\partial O_{k+1}}\right)^T\mathrm{d}I_{k+1}\right)\nonumber\\\nonumber\\&=tr\left(\left(\frac{\partial E}{\partial O_{k+1}}\left(\frac{\partial O_{k+1}}{\partial I_{k+1}}\right)^T\right)^T\mathrm{d}I_{k+1}\right)\nonumber\end{align}\]
To find \(\dfrac{\partial E}{\partial W_k}\), hold \(O_k\) and \(\bm{b}_k\) fixed, i.e. \(\mathrm{d}O_k=0\) and \(\mathrm{d}\bm{b}_k=0\); then \(\mathrm{d}I_{k+1}=O_k\mathrm{d}W_k\), and therefore
\[\begin{align}\mathrm{d}E&=tr\left(\left(\frac{\partial E}{\partial O_{k+1}}\left(\frac{\partial O_{k+1}}{\partial I_{k+1}}\right)^T\right)^T\mathrm{d}I_{k+1}\right)\nonumber\\\nonumber\\&=tr\left(\left(\frac{\partial E}{\partial O_{k+1}}\left(\frac{\partial O_{k+1}}{\partial I_{k+1}}\right)^T\right)^T(O_k\mathrm{d}W_k)\right)\nonumber\\\nonumber\\&=tr\left(\left(O_k^T\frac{\partial E}{\partial O_{k+1}}\left(\frac{\partial O_{k+1}}{\partial I_{k+1}}\right)^T\right)^T\mathrm{d}W_k\right)\nonumber\end{align}\]
Similarly, for \(\dfrac{\partial E}{\partial \bm{b}_k}\) (now holding \(O_k\) and \(W_k\) fixed),
\[\begin{align}\mathrm{d}E&=tr\left(\left(\frac{\partial E}{\partial O_{k+1}}\left(\frac{\partial O_{k+1}}{\partial I_{k+1}}\right)^T\right)^T(-\mathrm{d}\bm{b}_k)\right)\nonumber\\\nonumber\\&=tr\left(\left(-\frac{\partial E}{\partial O_{k+1}}\left(\frac{\partial O_{k+1}}{\partial I_{k+1}}\right)^T\right)^T\mathrm{d}\bm{b}_k\right)\nonumber\end{align}\]
And for \(\dfrac{\partial E}{\partial O_k}\) (holding \(W_k\) and \(\bm{b}_k\) fixed),
\[\begin{align}\mathrm{d}E&=tr\left(\left(\frac{\partial E}{\partial O_{k+1}}\left(\frac{\partial O_{k+1}}{\partial I_{k+1}}\right)^T\right)^T(\mathrm{d}O_k)W_k\right)\nonumber\\\nonumber\\&=tr\left(W_k\left(\frac{\partial E}{\partial O_{k+1}}\left(\frac{\partial O_{k+1}}{\partial I_{k+1}}\right)^T\right)^T\mathrm{d}O_k\right)\nonumber\\\nonumber\\&=tr\left(\left(\frac{\partial E}{\partial O_{k+1}}\left(\frac{\partial O_{k+1}}{\partial I_{k+1}}\right)^T W_k^T\right)^T\mathrm{d}O_k\right)\nonumber\end{align}\]
This yields the gradients of the loss function:
\[\begin{cases}\dfrac{\partial E}{\partial W_k}&=O_k^T\left(\dfrac{\partial E}{\partial O_{k+1}}\left(\dfrac{\partial O_{k+1}}{\partial I_{k+1}}\right)^T\right)\\\\\dfrac{\partial E}{\partial \bm{b}_k}&=-\dfrac{\partial E}{\partial O_{k+1}}\left(\dfrac{\partial O_{k+1}}{\partial I_{k+1}}\right)^T\\\\\dfrac{\partial E}{\partial O_k}&=\left(\dfrac{\partial E}{\partial O_{k+1}}\left(\dfrac{\partial O_{k+1}}{\partial I_{k+1}}\right)^T\right)W_k^T\end{cases}\]
Note that the key term \(\dfrac{\partial E}{\partial O_{k+1}}\) appearing in these expressions can be obtained recursively from the third equation above, starting from \(\dfrac{\partial E}{\partial O_m}=E'(O_m)\) at the output layer.
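With these gradients, one gradient-descent step updates the parameters as follows (here \(\eta\) denotes the learning rate, a symbol not used elsewhere in this article):
\[W_k \leftarrow W_k-\eta\,\frac{\partial E}{\partial W_k},\qquad \bm{b}_k \leftarrow \bm{b}_k-\eta\,\frac{\partial E}{\partial \bm{b}_k}\]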
4 Activation Functions and Loss Functions
Sigmoid
When the activation function is the sigmoid \(\sigma(x)=1/\left(1+\mathrm{e}^{-x}\right)\), its derivative is
\[\begin{align}\sigma'(x)&=-\frac1{(1+\mathrm{e}^{-x})^2}\cdot\mathrm{e}^{-x}\cdot(-1)\nonumber\\\nonumber\\&=\frac1{1+\mathrm{e}^{-x}}\cdot\frac{\mathrm{e}^{-x}}{1+\mathrm{e}^{-x}}\nonumber\\\nonumber\\&=\sigma(x)\left(1-\sigma(x)\right)\nonumber\end{align}\]
In this case,
\[\begin{align}\frac{\partial O_k}{\partial I_k}&=\begin{bmatrix}\dfrac{\partial O_{k,1}}{\partial I_{k,1}}&\dfrac{\partial O_{k,1}}{\partial I_{k,2}}&\cdots&\dfrac{\partial O_{k,1}}{\partial I_{k,n_k}}\\\dfrac{\partial O_{k,2}}{\partial I_{k,1}}&\dfrac{\partial O_{k,2}}{\partial I_{k,2}}&\cdots&\dfrac{\partial O_{k,2}}{\partial I_{k,n_k}}\\\vdots&\vdots&\ddots&\vdots\\\dfrac{\partial O_{k,n_k}}{\partial I_{k,1}}&\dfrac{\partial O_{k,n_k}}{\partial I_{k,2}}&\cdots&\dfrac{\partial O_{k,n_k}}{\partial I_{k,n_k}}\end{bmatrix}\nonumber\\\nonumber\\&=\begin{bmatrix}O_{k,1}(1-O_{k,1})&0&\cdots&0\\0&O_{k,2}(1-O_{k,2})&\cdots&0\\\vdots&\vdots&\ddots&\vdots\\0&0&\cdots&O_{k,n_k}(1-O_{k,n_k})\end{bmatrix}\nonumber\end{align}\]
Tanh
When the activation function is \(\tanh(x)=\left(\mathrm{e}^{x}-\mathrm{e}^{-x}\right)/\left(\mathrm{e}^{x}+\mathrm{e}^{-x}\right)=2\sigma(2x)-1\), its derivative is
\[\begin{align}\tanh'(x)&=2\sigma'(2x)\cdot2\nonumber\\&=4\sigma(2x)\left[1-\sigma(2x)\right]\nonumber\\&=1-\tanh^2(x)\nonumber\end{align}\]
In this case,
\[\begin{align}\frac{\partial O_k}{\partial I_k}&=\begin{bmatrix}\dfrac{\partial O_{k,1}}{\partial I_{k,1}}&\dfrac{\partial O_{k,1}}{\partial I_{k,2}}&\cdots&\dfrac{\partial O_{k,1}}{\partial I_{k,n_k}}\\\dfrac{\partial O_{k,2}}{\partial I_{k,1}}&\dfrac{\partial O_{k,2}}{\partial I_{k,2}}&\cdots&\dfrac{\partial O_{k,2}}{\partial I_{k,n_k}}\\\vdots&\vdots&\ddots&\vdots\\\dfrac{\partial O_{k,n_k}}{\partial I_{k,1}}&\dfrac{\partial O_{k,n_k}}{\partial I_{k,2}}&\cdots&\dfrac{\partial O_{k,n_k}}{\partial I_{k,n_k}}\end{bmatrix}\nonumber\\\nonumber\\&=\begin{bmatrix}1-O_{k,1}^2&0&\cdots&0\\0&1-O_{k,2}^2&\cdots&0\\\vdots&\vdots&\ddots&\vdots\\0&0&\cdots&1-O_{k,n_k}^2\end{bmatrix}\nonumber\end{align}\]
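Because these Jacobians are diagonal, the product \(\dfrac{\partial E}{\partial O_{k+1}}\left(\dfrac{\partial O_{k+1}}{\partial I_{k+1}}\right)^T\) from Section 3 reduces to an elementwise multiplication of \(\dfrac{\partial E}{\partial O_{k+1}}\) with \(f'(I_{k+1})\), which is how the implementation in Section 5 computes it. A quick numeric sanity check of the two derivative formulas (a sketch; the test point is arbitrary):
import numpy as np

x, eps = 0.3, 1e-6                        # arbitrary test point and step size
sig = lambda z: 1 / (1 + np.exp(-z))
# central finite differences vs. the closed-form derivatives above
print((sig(x + eps) - sig(x - eps)) / (2 * eps), sig(x) * (1 - sig(x)))
print((np.tanh(x + eps) - np.tanh(x - eps)) / (2 * eps), 1 - np.tanh(x) ** 2)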
Softmax
In multi-class classification tasks, the output layer often uses the Softmax function \(S(x)_i=\mathrm{e}^{x_i}/\sum_j\mathrm{e}^{x_j}\); in this case
\[\frac{\partial O_k}{\partial I_k}=\begin{bmatrix}\dfrac{\partial O_{k,1}}{\partial I_{k,1}}&\dfrac{\partial O_{k,1}}{\partial I_{k,2}}&\cdots&\dfrac{\partial O_{k,1}}{\partial I_{k,n_k}}\\\dfrac{\partial O_{k,2}}{\partial I_{k,1}}&\dfrac{\partial O_{k,2}}{\partial I_{k,2}}&\cdots&\dfrac{\partial O_{k,2}}{\partial I_{k,n_k}}\\\vdots&\vdots&\ddots&\vdots\\\dfrac{\partial O_{k,n_k}}{\partial I_{k,1}}&\dfrac{\partial O_{k,n_k}}{\partial I_{k,2}}&\cdots&\dfrac{\partial O_{k,n_k}}{\partial I_{k,n_k}}\end{bmatrix}\]
There are two cases:
- If \(i=j\), then\[\begin{align}\frac{\partial O_{k,i}}{\partial I_{k,j}}&=\frac{\mathrm{e}^{I_{k,i}}\left(\sum\mathrm{e}^{I_{k,i}}\right)-\mathrm{e}^{I_{k,i}}\cdot\mathrm{e}^{I_{k,i}}}{\left(\sum\mathrm{e}^{I_{k,i}}\right)^2}\nonumber\\\nonumber\\&=\frac{\mathrm{e}^{I_{k,i}}}{\left(\sum\mathrm{e}^{I_{k,i}}\right)}\frac{\left(\sum\mathrm{e}^{I_{k,i}}\right)-\mathrm{e}^{I_{k,i}}}{\left(\sum\mathrm{e}^{I_{k,i}}\right)}\nonumber\\\nonumber\\&=S(I_{k,i})\left(1-S(I_{k,i})\right)\nonumber\\\nonumber\\&=O_{k,i}(1-O_{k,i})\nonumber\end{align}\]
- If \(i\neq j\), then\[\begin{align}\frac{\partial O_{k,i}}{\partial I_{k,j}}&=\frac{0-\mathrm{e}^{I_{k,j}}\cdot\mathrm{e}^{I_{k,i}}}{\left(\sum\mathrm{e}^{I_{k,i}}\right)^2}\nonumber\\\nonumber\\&=-\frac{\mathrm{e}^{I_{k,j}}}{\left(\sum\mathrm{e}^{I_{k,i}}\right)}\frac{\mathrm{e}^{I_{k,i}}}{\left(\sum\mathrm{e}^{I_{k,i}}\right)}\nonumber\\\nonumber\\&=-S(I_{k,j})S(I_{k,i})\nonumber\\\nonumber\\&=-O_{k,j}O_{k,i}\nonumber\end{align}\]
Therefore, for the Softmax function,
\[\frac{\partial O_k}{\partial I_k}=\begin{bmatrix}O_{k,1}(1-O_{k,1})&-O_{k,2}O_{k,1}&\cdots&-O_{k,n_k}O_{k,1}\\-O_{k,1}O_{k,2}&O_{k,2}(1-O_{k,2})&\cdots&-O_{k,n_k}O_{k,2}\\\vdots&\vdots&\ddots&\vdots\\-O_{k,1}O_{k,n_k}&-O_{k,2}O_{k,n_k}&\cdots&O_{k,n_k}(1-O_{k,n_k})\end{bmatrix}\]
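For a row vector \(O_k\), this matrix can be written compactly as \(\mathrm{diag}(O_k)-O_k^T O_k\); a minimal NumPy sketch constructing it (the probability vector is an arbitrary example):
import numpy as np

o = np.array([0.2, 0.3, 0.5])              # arbitrary Softmax output, sums to 1
jacobian = np.diag(o) - np.outer(o, o)     # entry (i, j) = O_i * (delta_ij - O_j)
print(jacobian)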
MSE
If the chosen loss function is the Mean Squared Error loss \(MSE(\hat{y})=\frac{1}{2}\sum(y_i-\hat{y}_i)^2\), its derivative is
\[E'(\hat{y})=-(y-\hat{y})\]
where \(y\) denotes the label vector and \(\hat{y}\) the network's output.
Cross Entropy
Similarly, if the chosen loss function is the Cross Entropy loss \(CE(\hat{y})=-\sum y_i\ln(\hat{y}_i)\), its derivative is
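A quick numeric check of this gradient against central finite differences (a sketch; the label and prediction below are arbitrary):
import numpy as np

y = np.array([1.0, 0.0, 0.0])              # arbitrary one-hot label
y_hat = np.array([0.7, 0.2, 0.1])          # arbitrary prediction
mse = lambda p: 0.5 * np.sum((y - p) ** 2)
eps = 1e-6
num_grad = np.array([(mse(y_hat + eps * e) - mse(y_hat - eps * e)) / (2 * eps)
                     for e in np.eye(3)])
print(num_grad, -(y - y_hat))              # both approximately [-0.3, 0.2, 0.1]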
\[E'(\hat{y})=-\frac{y}{\hat{y}}\]
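In particular, combining the Softmax output layer with the cross-entropy loss makes the output-layer term from Section 3 collapse to a very simple expression. Writing \(\mathrm{diag}(O_m)\) for the diagonal matrix built from \(O_m\), and assuming a one-hot label so that \(\sum_i y_i=1\):
\[\frac{\partial E}{\partial O_m}\left(\frac{\partial O_m}{\partial I_m}\right)^T=\left(-\frac{y}{O_m}\right)\left(\mathrm{diag}(O_m)-O_m^T O_m\right)=-y+\left(\sum_i y_i\right)O_m=O_m-y\]
where the division \(y/O_m\) is elementwise. This is the form used to start the backpropagation recursion in the implementation below.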
5 A Possible Python Implementation
The following implements a BP neural network for multi-class classification on the Iris dataset. The hidden neurons use the sigmoid activation \(\sigma(x)=1/\left(1+\mathrm{e}^{-x}\right)\), the output layer uses the Softmax function \(S(x)_i=\mathrm{e}^{x_i}/\sum_j\mathrm{e}^{x_j}\), and the loss function is the cross entropy \(E(\hat{y})=-\sum y_i\ln(\hat{y}_i)\). The implementation is written in Python and requires NumPy, Pandas, and scikit-learn (sklearn).
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
def sigmod(x, *, derivate=False):
    # Sigmoid activation. With derivate=True, x is assumed to already be
    # sigmoid(z), so sigmoid'(z) = x * (1 - x) is returned.
    if derivate:
        return x * (1 - x)
    else:
        return 1 / (1 + np.exp(-x))
def softmax(x):
    # Softmax over a single row vector of scores
    return np.exp(x) / np.sum(np.exp(x))
def cross_entropy(predict, label, *, multi_classify=True):
    # Cross-entropy loss; the 1e-6 term avoids log(0)
    if multi_classify:
        return -np.sum(label * np.log(predict + 1e-6))
    elif label == 1:
        return -np.sum(np.log(predict + 1e-6))
    else:
        return -np.sum(np.log(1 - predict + 1e-6))
class BPNN:
    def __init__(self, input_: int, output: int, hiden: list, *,
                 learn_rate=0.005):
        # Network structure: sizes of the input, hidden and output layers
        self.struct = [input_] + hiden + [output]
        self.__layer = len(hiden) + 2  # number of layers, so struct need not be traversed
        # Neuron outputs, one row vector per layer
        self.cells = [np.ones(shape=(1, self.struct[i]))
                      for i in range(self.__layer)]
        self.labels = None
        __xavier = np.sqrt(6 / (input_ + output + 6))  # Xavier-style initialization bound
        self.weights = [
            np.random.uniform(
                low=-__xavier, high=__xavier,
                size=(self.struct[i], self.struct[i + 1])
            )
            for i in range(self.__layer - 1)
        ]
        self.bias = [np.ones(shape=(1, self.struct[i]))
                     for i in range(1, self.__layer)]
        # Other settings
        self.learn_rate = learn_rate
        self.active_func = sigmod
        self.loss_func = cross_entropy
    def load_data(self, source_data: list, source_label: list):
        # Copy one sample into the input layer and keep its one-hot label
        for i, data in enumerate(source_data):
            self.cells[0][0, i] = data
        self.labels = np.array([source_label])
    def predict(self, *, outside_data: list = None):
        vec_afunc = np.vectorize(self.active_func)
        if outside_data is not None:
            self.cells[0] = np.array([outside_data])
        # Hidden layers use the sigmoid activation
        for i in range(self.__layer - 2):
            self.cells[i + 1] = vec_afunc(
                np.matmul(self.cells[i], self.weights[i]) - self.bias[i])
        # The output layer uses Softmax
        self.cells[self.__layer - 1] = softmax(
            np.matmul(
                self.cells[self.__layer - 2],
                self.weights[self.__layer - 2]
            ) -
            self.bias[self.__layer - 2])
        return self.cells[-1]
    def back_propagate(self):
        vec_afunc = np.vectorize(self.active_func)
        # For the Softmax output layer with cross-entropy loss,
        # dE/dI_m simplifies to O_m - y (see Section 4)
        __delta_cache = self.cells[-1] - self.labels
        for i in range(self.__layer - 2, -1, -1):
            # dE/dO_k for the previous layer, using the not-yet-updated weights
            delta = np.matmul(__delta_cache, self.weights[i].T)
            self.weights[i] -= self.learn_rate * np.matmul(self.cells[i].T,
                                                           __delta_cache)
            self.bias[i] += self.learn_rate * __delta_cache
            if i > 0:
                # Hidden layers: the sigmoid Jacobian is diagonal, so the
                # matrix product reduces to an elementwise product
                __delta_cache = delta * vec_afunc(self.cells[i], derivate=True)
    def train(self, data, label, *, limit=10001):
        source_index = np.arange(0, len(data))
        for i in range(limit):
            # np.random.shuffle(source_index)  # optionally reshuffle each epoch
            for j in source_index:
                self.load_data(data[j, :], label[j, :])
                self.predict()
                self.back_propagate()
            if i % 200 == 0:
                # Loss of the last sample of the epoch, as a rough progress indicator
                loss = self.loss_func(self.cells[-1], self.labels)
                print(f'Round {i}, Loss: {loss:.8f}')
    def test(self, data_t, label_t):
        source_len = len(data_t)
        for i in range(source_len):
            self.load_data(data_t[i, :], label_t[i, :])
            output = self.predict()
            res = output.argmax(axis=1)
            print(f'Output: {np.around(output, decimals=4)}, ',
                  f'Result: {res}, ',
                  f'Label: {label_t[i, :].argmax(axis=0)}')
if __name__ == '__main__':
    X_n, Y_n = 4, 3
    source = pd.DataFrame(data=load_iris().data,
                          columns=load_iris().feature_names)
    target = pd.Series(data=load_iris().target)
    # One-hot encode the three classes as additional columns
    source.insert(loc=4, column='0', value=(target == 0).astype(int))
    source.insert(loc=5, column='1', value=(target == 1).astype(int))
    source.insert(loc=6, column='2', value=(target == 2).astype(int))
    # Min-max normalization, then shuffle the samples
    s_max, s_min = source.max(axis=0), source.min(axis=0)
    source = (source - s_min) / (s_max - s_min)
    source = source.sample(frac=1).reset_index(drop=True)
    # First 130 samples for training, the remaining 20 for testing
    data, data_t = source.values[:130, :4], source.values[130:, :4]
    label, label_t = source.values[:130, 4:], source.values[130:, 4:]
    bp_model = BPNN(X_n, Y_n, [10, 3], learn_rate=0.003)
    bp_model.train(data, label, limit=2001)
    bp_model.test(data_t, label_t)
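After training, a single new sample (normalized in the same way as the training data) can be classified through the `outside_data` parameter of `predict`; for example (the feature values are arbitrary placeholders):
probs = bp_model.predict(outside_data=[0.5, 0.4, 0.6, 0.3])
print(probs, probs.argmax(axis=1))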