Using matrix calculus, this article studies how the connection weights and bias terms of a BP neural network with multiple hidden layers are updated during learning. Alongside the matrix-form formulas, a possible Python implementation is given.

We consider a BP neural network (Backpropagation Neural Network) containing multiple hidden layers and examine how its connection weights and bias terms are updated as it learns.

1 Neural Network Model

The network studied here has \(m\) layers: layer 1 is the input layer, which receives the data; layer \(m\) is the output layer, which produces the result; and layers \(2\sim(m-1)\) are hidden layers. Layer \(k\) has \(n_k\) neurons \((k=1,2,\cdots,m)\).



The BP neural network is fully connected, and each neuron carries two pieces of data: its input and its output. Layer \(k\) therefore has an input vector \(I_k\) and an output vector \(O_k\), both of length \(n_k\). In addition, each layer applies a bias to the data coming from the previous layer: \(I_{k+1}\) is obtained by subtracting a bias vector \(\bm{b}_k\) from the weighted sum of \(O_k\). In other words, the bias vector \(\bm{b}_k\) acts on layer \(k+1\), so there are only \(m-1\) bias vectors, each of length \(n_{k+1}\).

2 Forward Propagation

The data enters the network as the vector \(I_1\); the activation function then gives \(O_1=f(I_1)\). Next, the connection weights \(W_1\) between layer 1 and layer 2 are used to form a weighted sum of \(O_1\), the bias is subtracted, and the result becomes the input \(I_2\) of layer 2; the process repeats until the output vector \(O_m\) is obtained.

Taking \(O_k=(o_1,o_2,\cdots,o_{n_k})\) as an example, \(I_{k+1}\) is computed as:

\[\begin{align}I_{k+1}&=\begin{bmatrix}o_1&o_2&\cdots&o_{n_k}\end{bmatrix}\begin{bmatrix}w_{11}&w_{12}&\cdots&w_{1n_{k+1}}\\w_{21}&w_{22}&\cdots&w_{2n_{k+1}}\\\vdots&\vdots&\ddots&\vdots\\w_{n_k1}&w_{n_k2}&\cdots&w_{n_kn_{k+1}}\end{bmatrix}-\begin{bmatrix}b_{k,1}&b_{k,2}&\cdots&b_{k,n_{k+1}}\end{bmatrix}\nonumber\\\nonumber\\&=O_k W_k-\bm{b}_k\nonumber\end{align}\]

It follows that the weight matrix \(W_k\) has \(n_k\) rows and \(n_{k+1}\) columns.
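
To make these dimensions concrete, here is a tiny NumPy sketch; the layer sizes are made up purely for illustration:

import numpy as np

# Illustrative sizes only: n_1 = 4, n_2 = 10, n_3 = 3 (so m = 3 layers)
n = [4, 10, 3]
weights = [np.zeros((n[k], n[k + 1])) for k in range(len(n) - 1)]  # W_k: n_k x n_{k+1}
biases = [np.zeros((1, n[k + 1])) for k in range(len(n) - 1)]      # b_k: length n_{k+1}
print([W.shape for W in weights])  # [(4, 10), (10, 3)]
print([b.shape for b in biases])   # [(1, 10), (1, 3)]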

From the above, we obtain the following recurrences:

\[\begin{cases}I_{k+1}&=O_k W_k-\bm{b}_k&,k=1,2,\cdots,m-1\\\\O_k&=f(I_k)&,k=1,2,\cdots,m\end{cases}\]

There is an error between the output vector \(O_m\) and the label \(y\); a loss function \(E(O_m)\) is used to measure this error.
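
Before turning to backpropagation, here is a minimal sketch of the forward pass defined by this recurrence. It assumes sigmoid activations and uses made-up sizes and names, so it is an illustration rather than the implementation given in Section 5.

import numpy as np

def f(x):
    # Activation function; the sigmoid is assumed here purely for illustration
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, weights, biases):
    # Follows the recurrence O_k = f(I_k), I_{k+1} = O_k W_k - b_k
    O = f(x.reshape(1, -1))             # O_1 = f(I_1)
    outputs = [O]
    for W, b in zip(weights, biases):
        O = f(O @ W - b)                # O_{k+1} = f(O_k W_k - b_k)
        outputs.append(O)
    return outputs                      # [O_1, ..., O_m]

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 10)), rng.normal(size=(10, 3))]
bs = [np.zeros((1, 10)), np.zeros((1, 3))]
print(forward(rng.normal(size=4), Ws, bs)[-1].shape)  # (1, 3)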

3 Backpropagation

Once the model loss is obtained, updating the weights and biases along the negative gradient direction reduces the model error. This process is called backpropagation, and its core idea is gradient descent (Gradient Descent). Since the loss function \(E\) is a scalar, we use scalar-by-matrix differentiation below to find its derivatives with respect to the weights \(W_k\) and the biases \(\bm{b}_k\).

For a scalar function of a matrix variable, the derivative and the differential are related by:

\[\mathrm{d}y=tr\left(\left(\frac{\partial y}{\partial X}\right)^T\mathrm{d}X\right)\]

For a vector-valued function of a vector variable, the derivative and the differential are related by:

\[\mathrm{d}\bm{y}=\left(\frac{\partial \bm{y}}{\partial \bm{x}}\right)^T\mathrm{d}\bm{x}\]

where \(\bm{y}\) and \(\bm{x}\) are both column vectors, and \(\dfrac{\partial \bm{y}}{\partial \bm{x}}\) is the Jacobian matrix in the layout implied by this identity, i.e. its \((i,j)\) entry is \(\partial y_j/\partial x_i\).

Here, \(tr\) denotes the trace of a matrix. The following facts also clearly hold:

  1. \(O_k=f(I_k)\), so the differential satisfies \((\mathrm{d}O_k)^T=\left(\dfrac{\partial O_k}{\partial I_k}\right)^T(\mathrm{d}I_k)^T\), and therefore \(\mathrm{d}O_k=(\mathrm{d}I_k)\dfrac{\partial O_k}{\partial I_k}\).
  2. \(I_{k+1}=O_k W_k-\bm{b}_k\), so the differential is \(\mathrm{d}I_{k+1}=(\mathrm{d}O_k)W_k+O_k(\mathrm{d}W_k)-\mathrm{d}\bm{b}_k\).

Then, using the cyclic property of the trace \(tr(ABC)=tr(CAB)\) and the identity \((AB)^T=B^T A^T\), we have:

\[\begin{align}\mathrm{d}E&=tr\left(\left(\frac{\partial E}{\partial O_{k+1}}\right)^T\mathrm{d}O_{k+1}\right)\nonumber\\\nonumber\\&=tr\left(\left(\frac{\partial E}{\partial O_{k+1}}\right)^T(\mathrm{d}I_{k+1})\frac{\partial O_{k+1}}{\partial I_{k+1}}\right)\nonumber\\\nonumber\\&=tr\left(\frac{\partial O_{k+1}}{\partial I_{k+1}}\left(\frac{\partial E}{\partial O_{k+1}}\right)^T\mathrm{d}I_{k+1}\right)\nonumber\\\nonumber\\&=tr\left(\left(\frac{\partial E}{\partial O_{k+1}}\left(\frac{\partial O_{k+1}}{\partial I_{k+1}}\right)^T\right)^T\mathrm{d}I_{k+1}\right)\nonumber\end{align}\]

To find \(\dfrac{\partial E}{\partial W_k}\), set \(\mathrm{d}O_k=0\) and \(\mathrm{d}\bm{b}_k=0\); then \(\mathrm{d}I_{k+1}=O_k\mathrm{d}W_k\), and therefore

\[\begin{align}\mathrm{d}E&=tr\left(\left(\frac{\partial E}{\partial O_{k+1}}\left(\frac{\partial O_{k+1}}{\partial I_{k+1}}\right)^T\right)^T\mathrm{d}I_{k+1}\right)\nonumber\\\nonumber\\&=tr\left(\left(\frac{\partial E}{\partial O_{k+1}}\left(\frac{\partial O_{k+1}}{\partial I_{k+1}}\right)^T\right)^T(O_k\mathrm{d}W_k)\right)\nonumber\\\nonumber\\&=tr\left(\left(O_k^T\frac{\partial E}{\partial O_{k+1}}\left(\frac{\partial O_{k+1}}{\partial I_{k+1}}\right)^T\right)^T\mathrm{d}W_k\right)\nonumber\end{align}\]

Similarly, for \(\dfrac{\partial E}{\partial \bm{b}_k}\),

\[\begin{align}\mathrm{d}E&=tr\left(\left(\frac{\partial E}{\partial O_{k+1}}\left(\frac{\partial O_{k+1}}{\partial I_{k+1}}\right)^T\right)^T(-\mathrm{d}\bm{b}_k)\right)\nonumber\\\nonumber\\&=tr\left(\left(-\frac{\partial E}{\partial O_{k+1}}\left(\frac{\partial O_{k+1}}{\partial I_{k+1}}\right)^T\right)^T\mathrm{d}\bm{b}_k\right)\nonumber\end{align}\]

For \(\dfrac{\partial E}{\partial O_k}\),

\[\begin{align}\mathrm{d}E&=tr\left(\left(\frac{\partial E}{\partial O_{k+1}}\left(\frac{\partial O_{k+1}}{\partial I_{k+1}}\right)^T\right)^T(\mathrm{d}O_k)W_k\right)\nonumber\\\nonumber\\&=tr\left(W_k\left(\frac{\partial E}{\partial O_{k+1}}\left(\frac{\partial O_{k+1}}{\partial I_{k+1}}\right)^T\right)^T\mathrm{d}O_k\right)\nonumber\\\nonumber\\&=tr\left(\left(\frac{\partial E}{\partial O_{k+1}}\left(\frac{\partial O_{k+1}}{\partial I_{k+1}}\right)^T W_k^T\right)^T\mathrm{d}O_k\right)\nonumber\end{align}\]

This yields the gradients of the loss function:

\[\begin{cases}\dfrac{\partial E}{\partial W_k}&=O_k^T\left(\dfrac{\partial E}{\partial O_{k+1}}\left(\dfrac{\partial O_{k+1}}{\partial I_{k+1}}\right)^T\right)\\\\\dfrac{\partial E}{\partial \bm{b}_k}&=-\dfrac{\partial E}{\partial O_{k+1}}\left(\dfrac{\partial O_{k+1}}{\partial I_{k+1}}\right)^T\\\\\dfrac{\partial E}{\partial O_k}&=\left(\dfrac{\partial E}{\partial O_{k+1}}\left(\dfrac{\partial O_{k+1}}{\partial I_{k+1}}\right)^T\right)W_k^T\end{cases}\]

Note that the key term \(\dfrac{\partial E}{\partial O_{k+1}}\) in these computations can be obtained recursively from the third equation above, starting from \(\dfrac{\partial E}{\partial O_m}=E'(O_m)\), the derivative of the loss function with respect to the network output.
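
The recursion translates directly into code. The following is a minimal sketch that assumes sigmoid activations in every layer, so that \(\partial O_{k+1}/\partial I_{k+1}\) is diagonal and the matrix product becomes an element-wise multiplication; all names are illustrative.

import numpy as np

def backward(outputs, weights, dE_dOm):
    # outputs = [O_1, ..., O_m] from the forward pass; dE_dOm = E'(O_m).
    # Assumes sigmoid activations, so dO_{k+1}/dI_{k+1} is diagonal and the
    # matrix product reduces to an element-wise multiplication.
    grads_W, grads_b = [], []
    dE_dO = dE_dOm
    for k in range(len(weights) - 1, -1, -1):
        O_next = outputs[k + 1]
        dE_dI = dE_dO * O_next * (1 - O_next)    # dE/dO_{k+1} (dO_{k+1}/dI_{k+1})^T
        grads_W.insert(0, outputs[k].T @ dE_dI)  # dE/dW_k = O_k^T (...)
        grads_b.insert(0, -dE_dI)                # dE/db_k = -(...)
        dE_dO = dE_dI @ weights[k].T             # dE/dO_k = (...) W_k^T
    return grads_W, grads_b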

4 Activation Functions and Loss Functions

Sigmoid

When the activation function is the sigmoid \(\sigma(x)=1/\left(1+\mathrm{e}^{-x}\right)\), its derivative is

\[\begin{align}\sigma'(x)&=-\frac1{(1+\mathrm{e}^{-x})^2}\cdot\mathrm{e}^{-x}\cdot(-1)\nonumber\\\nonumber\\&=\frac1{1+\mathrm{e}^{-x}}\cdot\frac{\mathrm{e}^{-x}}{1+\mathrm{e}^{-x}}\nonumber\\\nonumber\\&=\sigma(x)\left(1-\sigma(x)\right)\nonumber\end{align}\]

In this case,

\[\begin{align}\frac{\partial O_k}{\partial I_k}&=\begin{bmatrix}\dfrac{\partial O_{k,1}}{\partial I_{k,1}}&\dfrac{\partial O_{k,1}}{\partial I_{k,2}}&\cdots&\dfrac{\partial O_{k,1}}{\partial I_{k,n_k}}\\\dfrac{\partial O_{k,2}}{\partial I_{k,1}}&\dfrac{\partial O_{k,2}}{\partial I_{k,2}}&\cdots&\dfrac{\partial O_{k,2}}{\partial I_{k,n_k}}\\\vdots&\vdots&\ddots&\vdots\\\dfrac{\partial O_{k,n_k}}{\partial I_{k,1}}&\dfrac{\partial O_{k,n_k}}{\partial I_{k,2}}&\cdots&\dfrac{\partial O_{k,n_k}}{\partial I_{k,n_k}}\end{bmatrix}\nonumber\\\nonumber\\&=\begin{bmatrix}O_{k,1}(1-O_{k,1})&0&\cdots&0\\0&O_{k,2}(1-O_{k,2})&\cdots&0\\\vdots&\vdots&\ddots&\vdots\\0&0&\cdots&O_{k,n_k}(1-O_{k,n_k})\end{bmatrix}\nonumber\end{align}\]
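
Because this Jacobian is diagonal, multiplying a (row) gradient vector by it is equivalent to an element-wise product, which is what implementations usually do. A quick numerical check (illustrative names):

import numpy as np

rng = np.random.default_rng(0)
O = 1.0 / (1.0 + np.exp(-rng.normal(size=4)))      # O_k = sigma(I_k)
grad_O = rng.normal(size=(1, 4))                    # some upstream dE/dO_k

jacobian = np.diag(O * (1 - O))                     # dO_k/dI_k, diagonal as above
print(np.allclose(grad_O @ jacobian.T, grad_O * O * (1 - O)))  # True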

Tanh

When the activation function is \(\tanh(x)=\left(\mathrm{e}^{x}-\mathrm{e}^{-x}\right)/\left(\mathrm{e}^{x}+\mathrm{e}^{-x}\right)=2\sigma(2x)-1\), its derivative is

\[\begin{align}\tanh'(x)&=2\sigma'(2x)\cdot2\nonumber\\&=4\sigma(2x)\left[1-\sigma(2x)\right]\nonumber\\&=1-\tanh^2(x)\nonumber\end{align}\]

In this case,

\[\begin{align}\frac{\partial O_k}{\partial I_k}&=\begin{bmatrix}\dfrac{\partial O_{k,1}}{\partial I_{k,1}}&\dfrac{\partial O_{k,1}}{\partial I_{k,2}}&\cdots&\dfrac{\partial O_{k,1}}{\partial I_{k,n_k}}\\\dfrac{\partial O_{k,2}}{\partial I_{k,1}}&\dfrac{\partial O_{k,2}}{\partial I_{k,2}}&\cdots&\dfrac{\partial O_{k,2}}{\partial I_{k,n_k}}\\\vdots&\vdots&\ddots&\vdots\\\dfrac{\partial O_{k,n_k}}{\partial I_{k,1}}&\dfrac{\partial O_{k,n_k}}{\partial I_{k,2}}&\cdots&\dfrac{\partial O_{k,n_k}}{\partial I_{k,n_k}}\end{bmatrix}\nonumber\\\nonumber\\&=\begin{bmatrix}1-O_{k,1}^2&0&\cdots&0\\0&1-O_{k,2}^2&\cdots&0\\\vdots&\vdots&\ddots&\vdots\\0&0&\cdots&1-O_{k,n_k}^2\end{bmatrix}\nonumber\end{align}\]

Softmax

In multi-class classification tasks, the output layer often uses the Softmax function \(S(x_i)=\mathrm{e}^{x_i}/\left(\sum_j\mathrm{e}^{x_j}\right)\). In this case,

\[\frac{\partial O_k}{\partial I_k}=\begin{bmatrix}\dfrac{\partial O_{k,1}}{\partial I_{k,1}}&\dfrac{\partial O_{k,1}}{\partial I_{k,2}}&\cdots&\dfrac{\partial O_{k,1}}{\partial I_{k,n_k}}\\\dfrac{\partial O_{k,2}}{\partial I_{k,1}}&\dfrac{\partial O_{k,2}}{\partial I_{k,2}}&\cdots&\dfrac{\partial O_{k,2}}{\partial I_{k,n_k}}\\\vdots&\vdots&\ddots&\vdots\\\dfrac{\partial O_{k,n_k}}{\partial I_{k,1}}&\dfrac{\partial O_{k,n_k}}{\partial I_{k,2}}&\cdots&\dfrac{\partial O_{k,n_k}}{\partial I_{k,n_k}}\end{bmatrix}\]

There are two cases:

  1. If \(i=j\), then\[\begin{align}\frac{\partial O_{k,i}}{\partial I_{k,j}}&=\frac{\mathrm{e}^{I_{k,i}}\left(\sum\mathrm{e}^{I_{k,i}}\right)-\mathrm{e}^{I_{k,i}}\cdot\mathrm{e}^{I_{k,i}}}{\left(\sum\mathrm{e}^{I_{k,i}}\right)^2}\nonumber\\\nonumber\\&=\frac{\mathrm{e}^{I_{k,i}}}{\left(\sum\mathrm{e}^{I_{k,i}}\right)}\frac{\left(\sum\mathrm{e}^{I_{k,i}}\right)-\mathrm{e}^{I_{k,i}}}{\left(\sum\mathrm{e}^{I_{k,i}}\right)}\nonumber\\\nonumber\\&=S(I_{k,i})\left(1-S(I_{k,i})\right)\nonumber\\\nonumber\\&=O_{k,i}(1-O_{k,i})\nonumber\end{align}\]
  2. If \(i\neq j\), then\[\begin{align}\frac{\partial O_{k,i}}{\partial I_{k,j}}&=\frac{0-\mathrm{e}^{I_{k,j}}\cdot\mathrm{e}^{I_{k,i}}}{\left(\sum\mathrm{e}^{I_{k,i}}\right)^2}\nonumber\\\nonumber\\&=-\frac{\mathrm{e}^{I_{k,j}}}{\left(\sum\mathrm{e}^{I_{k,i}}\right)}\frac{\mathrm{e}^{I_{k,i}}}{\left(\sum\mathrm{e}^{I_{k,i}}\right)}\nonumber\\\nonumber\\&=-S(I_{k,j})S(I_{k,i})\nonumber\\\nonumber\\&=-O_{k,j}O_{k,i}\nonumber\end{align}\]

Therefore, for the Softmax function,

\[\frac{\partial O_k}{\partial I_k}=\begin{bmatrix}O_{k,1}(1-O_{k,1})&-O_{k,2}O_{k,1}&\cdots&-O_{k,n_k}O_{k,1}\\-O_{k,1}O_{k,2}&O_{k,2}(1-O_{k,2})&\cdots&-O_{k,n_k}O_{k,2}\\\vdots&\vdots&\ddots&\vdots\\-O_{k,1}O_{k,n_k}&-O_{k,2}O_{k,n_k}&\cdots&O_{k,n_k}(1-O_{k,n_k})\end{bmatrix}\]
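
This Jacobian can be checked numerically. The sketch below (illustrative names) builds it as \(\mathrm{diag}(O_k)-O_k^T O_k\) and compares it with a central finite-difference estimate:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())                  # subtract the max for stability
    return e / e.sum()

rng = np.random.default_rng(1)
I = rng.normal(size=4)
O = softmax(I)
analytic = np.diag(O) - np.outer(O, O)       # entry (i, j) = dO_{k,i} / dI_{k,j}

eps, numeric = 1e-6, np.empty((4, 4))
for j in range(4):
    d = eps * np.eye(4)[j]
    numeric[:, j] = (softmax(I + d) - softmax(I - d)) / (2 * eps)
print(np.allclose(analytic, numeric, atol=1e-6))   # True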

MSE

If the chosen loss function is the mean squared error (MSE) loss \(MSE(\hat{y})=1/2\sum(y_i-\hat{y}_i)^2\), its derivative (taken element-wise) is

\[E'(\hat{y})=-(y-\hat{y})\]

where \(y\) is the label (ground-truth) vector and \(\hat{y}\) is the network output \(O_m\).
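
As a quick sanity check of this derivative (illustrative names, not part of the implementation in Section 5):

import numpy as np

def mse(y, y_hat):
    return 0.5 * np.sum((y - y_hat) ** 2)

rng = np.random.default_rng(2)
y, y_hat, eps = rng.random(4), rng.random(4), 1e-6
numeric = np.array([(mse(y, y_hat + eps * np.eye(4)[i]) -
                     mse(y, y_hat - eps * np.eye(4)[i])) / (2 * eps)
                    for i in range(4)])
print(np.allclose(numeric, -(y - y_hat), atol=1e-6))  # True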

Cross Entropy

Similarly, if the chosen loss function is the cross-entropy (Cross Entropy) loss \(CE(\hat{y})=-\sum(y_i\ln(\hat{y}_i))\), its derivative (taken element-wise) is

\[E'(\hat{y})=-\frac{y}{\hat{y}}\]
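
Combining this derivative with the Softmax Jacobian from above yields a well-known simplification: when the label components sum to 1 (e.g. a one-hot label), \(\dfrac{\partial E}{\partial I_m}=\dfrac{\partial E}{\partial O_m}\left(\dfrac{\partial O_m}{\partial I_m}\right)^T=O_m-y\). The implementation in Section 5 relies on this. A small numerical sketch (illustrative names):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(3)
I = rng.normal(size=5)                       # output-layer input I_m
y = np.eye(5)[2]                             # one-hot label
O = softmax(I)                               # network output O_m

dE_dO = -y / O                               # cross-entropy derivative E'(O_m)
jacobian = np.diag(O) - np.outer(O, O)       # Softmax Jacobian dO_m/dI_m
print(np.allclose(dE_dO @ jacobian, O - y))  # True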

5 A Possible Python Implementation

The implementation below builds a BP neural network for multi-class classification on the Iris data set. The neurons use the sigmoid activation \(\sigma(x)=1/\left(1+\mathrm{e}^{-x}\right)\), the output layer uses the Softmax function \(S(x_i)=\mathrm{e}^{x_i}/\left(\sum_j\mathrm{e}^{x_j}\right)\), and the loss function is the cross entropy \(E(\hat{y})=-\sum_{i=1}^{3}y_i\ln(\hat{y}_i)\). It is written in Python and requires NumPy, Pandas, and scikit-learn.

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris


def sigmod(x, *, derivate=False):
    # When derivate=True, x is assumed to already be the sigmoid output,
    # so the derivative is simply x * (1 - x)
    if derivate:
        return x * (1 - x)
    else:
        return 1 / (1 + np.exp(-x))


def softmax(x):
    # Softmax over a row vector: S(x_i) = exp(x_i) / sum_j exp(x_j)
    return np.exp(x) / np.sum(np.exp(x))


def cross_entropy(predict, label, *, multi_classify=True):
    # 1e-6 avoids log(0); the two lower branches handle binary classification
    if multi_classify:
        return -np.sum(label * np.log(predict + 1e-6))
    elif label == 1:
        return -np.sum(np.log(predict + 1e-6))
    else:
        return -np.sum(np.log(1 - predict + 1e-6))


class BPNN:
    def __init__(self, input_: int, output: int, hiden: list, *,
                 learn_rate=0.005):
        # Describe the network structure
        self.struct = [input_] + hiden + [output]
        self.__layer = len(hiden) + 2  # number of layers, kept to avoid re-deriving it from struct

        # Storage for each layer's neuron outputs O_k
        self.cells = [np.ones(shape=(1, self.struct[i]))
                      for i in range(self.__layer)]
        
        self.labels = None
        __xavier = np.sqrt(6 / (input_ + output + 6))  # Xavier initialization
        self.weights = [
            np.random.uniform(
                low=-__xavier, high=__xavier,
                size=(self.struct[i], self.struct[i + 1])
            )
            for i in range(self.__layer - 1)
        ]
        
        self.bias = [np.ones(shape=(1, self.struct[i]))
                     for i in range(1, self.__layer)]

        # Other settings
        self.learn_rate = learn_rate
        self.active_func = sigmod
        self.loss_func = cross_entropy

    def load_data(self, source_data: list, source_label: list):
        for i, data in enumerate(source_data):
            self.cells[0][0, i] = data
        self.labels = np.array([source_label])

    def predict(self, *, outside_data: list = None):
        vec_afunc = np.vectorize(self.active_func)
        if outside_data is not None:
            self.cells[0] = np.array([outside_data])
        # Hidden layers: I_{k+1} = O_k W_k - b_k, O_{k+1} = sigmoid(I_{k+1})
        for i in range(self.__layer - 2):
            self.cells[i + 1] = vec_afunc(
                np.matmul(self.cells[i], self.weights[i]) - self.bias[i])

        # Output layer uses Softmax instead of the sigmoid activation
        self.cells[self.__layer - 1] = softmax(
            np.matmul(
                self.cells[self.__layer - 2],
                self.weights[self.__layer - 2]
            ) -
            self.bias[self.__layer - 2])

        return self.cells[-1]

    def back_propagate(self):
        vec_afunc = np.vectorize(self.active_func)
        # For a Softmax output layer with cross-entropy loss, the gradient of E
        # with respect to the output-layer input simplifies to (y_hat - y)
        __delta_cache = self.cells[-1] - self.labels
        for i in range(self.__layer - 2, -1, -1):
            # dE/dO_k = (dE/dI_{k+1}) W_k^T, propagated to the layer below
            delta = np.matmul(__delta_cache, self.weights[i].T)
            # dE/dW_k = O_k^T (dE/dI_{k+1}), dE/db_k = -(dE/dI_{k+1})
            self.weights[i] -= self.learn_rate * np.matmul(self.cells[i].T,
                                                           __delta_cache)
            self.bias[i] += self.learn_rate * __delta_cache
            if i > 0:
                # Hidden layers use sigmoid: the Jacobian is diagonal, so the
                # matrix product reduces to an element-wise multiplication
                __delta_cache = delta * vec_afunc(self.cells[i], derivate=True)

    def train(self,  data, label, *, limit=10001):
        source_index = np.arange(0, len(data))
        for i in range(limit):
            # np.random.shuffle(source_index)
            for j in source_index:
                self.load_data(data[j, :], label[j, :])
                self.predict()
                self.back_propagate()

            if i % 200 == 0:
                loss = self.loss_func(self.cells[-1], self.labels)
                print(f'Round {i}, Loss: {loss:.8f}')

    def test(self, data_t, label_t):
        source_len = len(data_t)
        for i in range(source_len):
            self.load_data(data_t[i, :], label_t[i, :])
            output = self.predict()
            res = output.argmax(axis=1)
            print(f'Output: {np.around(output, decimals=4)}, ',
                  f'Result: {res}, ',
                  f'Label: {label_t[i, :].argmax(axis=0)}')


if __name__ == '__main__':
    X_n, Y_n = 4, 3
    source = pd.DataFrame(data=load_iris().data,
                          columns=load_iris().feature_names)
    target = pd.Series(data=load_iris().target)
    source.insert(loc=4, column='0', value=(target == 0).astype(int))
    source.insert(loc=5, column='1', value=(target == 1).astype(int))
    source.insert(loc=6, column='2', value=(target == 2).astype(int))
    s_max, s_min = source.max(axis=0), source.min(axis=0)
    source = (source - s_min) / (s_max - s_min)
    source = source.sample(frac=1).reset_index(drop=True)

    data, data_t = source.values[:130, :4], source.values[130:, :4]
    label, label_t = source.values[:130, 4:], source.values[130:, 4:]

    bp_model = BPNN(X_n, Y_n, [10, 3], learn_rate=0.003)
    bp_model.train(data, label, limit=2001)
    bp_model.test(data_t, label_t)
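
As a small possible extension (not part of the original script), the test accuracy can be summarized instead of printing every sample; the snippet below assumes it is appended to the end of the if __name__ == '__main__': block above:

    correct = sum(
        int(bp_model.predict(outside_data=list(data_t[i, :])).argmax()
            == label_t[i, :].argmax())
        for i in range(len(data_t))
    )
    print(f'Accuracy: {correct / len(data_t):.2%}')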
