MC Bayesian Neural Networks: How Bayesian Neural Networks Work

Reposted

mob64ca13fe1aa6 2023-09-04 12:49:25


Contents

Preface
What is a Bayesian neural network
How to train BNN
The mathematics behind BNN
Implementing BNN in PyTorch
References

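Preface

This article summarizes Bayesian neural networks. First, I briefly introduce what a Bayesian neural network (BNN) is; next, I describe how a BNN is trained; then I explain the mathematics behind it; finally, I give a PyTorch implementation of a BNN.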

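What is a Bayesian neural network

[Figure: a backpropagation network (top) and a Bayesian neural network (bottom)]

As the figure shows, the top half is a backpropagation (BP) network and the bottom half is a Bayesian neural network. Once a BP network has been optimized, each weight is a fixed value. A Bayesian neural network instead treats each weight as a Gaussian random variable with mean $\mu$ and standard deviation $\sigma$, and every weight has its own Gaussian. A BP network optimizes the weights themselves; a Bayesian neural network optimizes the mean and standard deviation of each weight, so it has twice as many parameters to optimize.

At prediction time, the BNN samples a value for each weight from its Gaussian. With the weights fixed in this way, the BNN behaves exactly like a BP network. It can also sample several times to obtain several predictions and average them into the final prediction, much like an ensemble model.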
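The averaging step described above can be sketched in plain Python. This is only an illustration under stated assumptions: the model is a hypothetical one-weight linear function `w * x + b`, and the names `predict_once` and `predict_mean` and all numeric values are invented for the example rather than taken from the article.

```python
import random

def predict_once(x, mu_w, sigma_w, mu_b, sigma_b, rng):
    """One forward pass of a toy linear model with freshly sampled parameters."""
    w = rng.gauss(mu_w, sigma_w)  # sample the weight from its Gaussian
    b = rng.gauss(mu_b, sigma_b)  # sample the bias from its Gaussian
    return w * x + b

def predict_mean(x, n_samples, rng, mu_w=2.0, sigma_w=0.1, mu_b=1.0, sigma_b=0.1):
    """Average several sampled predictions, much like averaging an ensemble."""
    preds = [predict_once(x, mu_w, sigma_w, mu_b, sigma_b, rng)
             for _ in range(n_samples)]
    return sum(preds) / n_samples

rng = random.Random(0)  # fixed seed so the run is repeatable
single = predict_once(3.0, 2.0, 0.1, 1.0, 0.1, rng)
averaged = predict_mean(3.0, 5000, rng)
print("one sampled prediction:", single)
print("average of 5000 sampled predictions:", averaged)
```

A single pass gives a noisy prediction, while the average settles near the prediction made with the mean parameters (here 2 * 3 + 1 = 7).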

How to train BNN

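Let the training set be $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$. The $i$-th weight $w_i$ follows a Gaussian distribution $q(w_i \mid \theta_i)$ with mean $\mu_i$ and standard deviation $\sigma_i$, where $\theta = (\mu, \sigma)$ collects the parameters of all weights. The output of the BNN for input $x_j$ defines the likelihood $p(y_j \mid x_j, w)$, and the BNN contains $n$ weights. The loss function of the BNN is then

$\text{loss} = \sum_{i=1}^{n} \log q(w_i \mid \theta_i) - \sum_{i=1}^{n} \log p(w_i) - \log p(y_j \mid x_j, w) \qquad (1.0)$

The BNN assumes that $q(w_i \mid \theta_i)$, $p(w_i)$, and $p(y_j \mid x_j, w)$ are all Gaussian, which is a fairly reasonable assumption in light of the central limit theorem. The parameters of the prior $p(w_i)$ and the standard deviation of $p(y_j \mid x_j, w)$ are hyperparameters that must be specified by hand.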

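We obtain a concrete value of $w_i$ by sampling and use it to evaluate each term of (1.0). Gradients cannot be propagated through a direct draw from $q(w_i \mid \theta_i)$, so we first draw $\epsilon_i$ from the standard normal distribution and then compute $w_i = \mu_i + \sigma_i \epsilon_i$, which yields a sample distributed according to $q(w_i \mid \theta_i)$ (the reparameterization trick).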
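A minimal sketch of this sampling step, in plain Python rather than PyTorch so that it runs without dependencies. The function name `sample_weight` and the values of `mu` and `sigma` are assumptions made for the illustration. The check at the end confirms empirically that the transformed draws have the requested mean and standard deviation.

```python
import math
import random

def sample_weight(mu, sigma, rng):
    """Reparameterized draw: w = mu + sigma * eps with eps ~ N(0, 1)."""
    eps = rng.gauss(0.0, 1.0)  # noise that does not depend on mu or sigma
    return mu + sigma * eps    # deterministic function of (mu, sigma) given eps

rng = random.Random(0)
mu, sigma = 1.5, 0.5  # illustrative values only
draws = [sample_weight(mu, sigma, rng) for _ in range(20000)]

sample_mean = sum(draws) / len(draws)
sample_std = math.sqrt(sum((d - sample_mean) ** 2 for d in draws) / len(draws))
print("sample mean:", sample_mean)
print("sample std:", sample_std)
```

Because the randomness lives entirely in `eps`, the sample is a differentiable function of `mu` and `sigma`, which is what lets an autograd framework push gradients back to them.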

Note that the loss (1.0) is computed for a single training sample $(x_j, y_j)$. We can optimize $\theta$ by backpropagation, but backpropagation may drive the standard deviation $\sigma_i$ below 0, so the BNN reparameterizes $\sigma_i$ as follows:
$\sigma_i = \log(1 + e^{\rho_i})$
The parameters optimized by the BNN therefore become $\theta = (\mu, \rho)$.
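The mapping from $\rho$ to $\sigma$ is the softplus function. The sketch below, with hypothetical helper names `softplus` and `inverse_softplus`, shows that the output stays positive for negative inputs and how one might pick an initial $\rho$ for a desired $\sigma$. The overflow-safe rewriting is a standard identity, not something the article specifies.

```python
import math

def softplus(rho):
    """sigma = log(1 + exp(rho)), written in an overflow-safe form."""
    return max(rho, 0.0) + math.log1p(math.exp(-abs(rho)))

def inverse_softplus(sigma):
    """Return rho such that softplus(rho) == sigma, for sigma > 0."""
    return math.log(math.expm1(sigma))

for rho in (-5.0, 0.0, 5.0):
    # every printed sigma is positive, even when rho is negative
    print(rho, "->", softplus(rho))

# choose rho so that the initial standard deviation is 0.1 (illustrative value)
rho_init = inverse_softplus(0.1)
print("rho for sigma = 0.1:", rho_init)
```

Optimizing $\rho$ without constraints therefore never produces an invalid standard deviation.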

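The mathematics behind BNN

A neural network models the distribution $p(y \mid x)$, that is, the distribution of the prediction $y$ given the data $x$. In classification this distribution is the set of class probabilities; in regression it is usually taken to be a Gaussian with fixed standard deviation, whose mean serves as the prediction. Let the training set be $D$. Making the dependence on $D$ explicit, we rewrite $p(y \mid x)$ as follows:

$p(y \mid x, D) = \int p(y \mid x, w)\, p(w \mid D)\, dw = E_{p(w \mid D)}\left[ p(y \mid x, w) \right]$

$p(y \mid x, w)$ is the distribution of the output $y$ given the weights $w$ and the input $x$, which is exactly the neural network. We only need to model the weight distribution $p(w \mid D)$ from the training set $D$. Then, by the Monte Carlo method, we draw $m$ samples $w^{(1)}, \dots, w^{(m)}$ from $p(w \mid D)$ and compute $\frac{1}{m} \sum_{k=1}^{m} p(y \mid x, w^{(k)})$ to obtain an estimate of $p(y \mid x)$.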

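However, $p(w \mid D)$ is usually intractable, so the BNN uses variational inference: a distribution $q(w \mid \theta)$ is used to approximate $p(w \mid D)$, and the KL divergence measures how far apart the two distributions $q(w \mid \theta)$ and $p(w \mid D)$ are. This gives

$\theta^* = \arg\min_\theta \mathrm{KL}\left[ q(w \mid \theta) \,\|\, p(w \mid D) \right] = \arg\min_\theta E_{q(w \mid \theta)}\left[ \log \frac{q(w \mid \theta)}{p(w \mid D)} \right] = \arg\min_\theta E_{q(w \mid \theta)}\left[ \log \frac{q(w \mid \theta)\, p(D)}{p(D \mid w)\, p(w)} \right]$

Since $p(D)$ does not depend on $\theta$, it suffices to minimize

$\theta^* = \arg\min_\theta E_{q(w \mid \theta)}\left[ \log q(w \mid \theta) - \log p(D \mid w) - \log p(w) \right] \qquad (2.0)$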

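A training set usually consists of labels $Y$ and data $X$, and the data $X$ do not depend on the weights, so
$p(D \mid w) = p(X, Y \mid w) = p(Y \mid X, w)\, p(X \mid w) = p(Y \mid X, w)\, p(X) \qquad (3.0)$

Since $p(X)$ does not depend on $\theta$, substituting (3.0) into (2.0) turns the objective into

$\theta^* = \arg\min_\theta E_{q(w \mid \theta)}\left[ \log q(w \mid \theta) - \log p(Y \mid X, w) - \log p(w) \right] \qquad (4.0)$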

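Given one training sample $(x_j, y_j)$, applying Monte Carlo estimation to (4.0) with $m$ samples $w^{(1)}, \dots, w^{(m)}$ drawn from $q(w \mid \theta)$ gives

$\frac{1}{m} \sum_{k=1}^{m} \left[ \log q(w^{(k)} \mid \theta) - \log p(y_j \mid x_j, w^{(k)}) - \log p(w^{(k)}) \right] \qquad (5.0)$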
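To see that averaging log-density differences over samples from $q$ really approximates an expectation under $q$, here is a small self-contained check. It is a sketch under simplifying assumptions: a single scalar weight, and one fixed Gaussian `p` standing in for the target density, so that the exact KL divergence is available in closed form for comparison. All names and numbers are illustrative.

```python
import math
import random

def log_normal_pdf(x, mu, sigma):
    """Log density of N(mu, sigma^2) evaluated at x."""
    return (-0.5 * math.log(2.0 * math.pi) - math.log(sigma)
            - 0.5 * ((x - mu) / sigma) ** 2)

def kl_closed_form(mu_q, sigma_q, mu_p, sigma_p):
    """Exact KL[q || p] between two univariate Gaussians."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
            - 0.5)

def kl_monte_carlo(mu_q, sigma_q, mu_p, sigma_p, m, rng):
    """Estimate E_q[log q(w) - log p(w)] with m samples drawn from q."""
    total = 0.0
    for _ in range(m):
        w = rng.gauss(mu_q, sigma_q)
        total += log_normal_pdf(w, mu_q, sigma_q) - log_normal_pdf(w, mu_p, sigma_p)
    return total / m

rng = random.Random(0)
exact = kl_closed_form(0.5, 0.8, 0.0, 1.0)
estimate = kl_monte_carlo(0.5, 0.8, 0.0, 1.0, 50000, rng)
print("closed form:", exact)
print("Monte Carlo:", estimate)
```

With many samples the estimate approaches the exact value; with $m = 1$, as used in training, each estimate is noisy but still unbiased.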

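Assume $m = 1$ in (5.0). The BNN contains $n$ weights, the $i$-th weight is $w_i$, and the parameters of its Gaussian are $\theta_i$. Since $q(w \mid \theta)$ factorizes over the weights, we have

$\log q(w \mid \theta) = \log \prod_{i=1}^{n} q(w_i \mid \theta_i) = \sum_{i=1}^{n} \log q(w_i \mid \theta_i) \qquad (6.0)$

Because the $w_i$ are mutually independent, we also have
$\log p(w) = \log \prod_{i=1}^{n} p(w_i) = \sum_{i=1}^{n} \log p(w_i) \qquad (7.0)$

With $m = 1$, substituting (6.0) and (7.0) into (5.0) gives
$\sum_{i=1}^{n} \log q(w_i \mid \theta_i) - \sum_{i=1}^{n} \log p(w_i) - \log p(y_j \mid x_j, w) \qquad (8.0)$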

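(8.0) is exactly (1.0), which completes the derivation of the BNN loss function. What remains is the sampling problem. $w$ follows the distribution $q(w \mid \theta)$. Because each weight follows its own Gaussian and the weights are mutually independent, $q(w \mid \theta)$ is a multivariate normal distribution with independent components, whose density is the product of the individual normal densities. Combining samples drawn from each of these normal distributions yields a sample from that multivariate normal, so the weight $w_i$ only needs to be sampled from $q(w_i \mid \theta_i)$ (in a program one can sample directly from the multivariate normal; it is described this way only for ease of understanding).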
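Putting the pieces together, the single-sample loss can be sketched for a tiny model in plain Python. This is an illustration under stated assumptions, not the article's implementation: the model is a hypothetical two-weight linear function `w_0 * x + w_1`, the prior is a zero-mean Gaussian, and the names `single_sample_loss`, `prior_sigma`, and `noise_sigma` are invented here.

```python
import math
import random

def softplus(rho):
    """sigma = log(1 + exp(rho)), overflow-safe."""
    return max(rho, 0.0) + math.log1p(math.exp(-abs(rho)))

def log_normal_pdf(x, mu, sigma):
    """Log density of N(mu, sigma^2) evaluated at x."""
    return (-0.5 * math.log(2.0 * math.pi) - math.log(sigma)
            - 0.5 * ((x - mu) / sigma) ** 2)

def single_sample_loss(x_j, y_j, mus, rhos, prior_sigma, noise_sigma, rng):
    """sum_i log q(w_i|theta_i) - sum_i log p(w_i) - log p(y_j|x_j, w), one draw of w."""
    log_q = 0.0
    log_p = 0.0
    weights = []
    for mu_i, rho_i in zip(mus, rhos):
        sigma_i = softplus(rho_i)
        w_i = mu_i + sigma_i * rng.gauss(0.0, 1.0)      # sample each weight independently
        log_q += log_normal_pdf(w_i, mu_i, sigma_i)     # variational posterior term
        log_p += log_normal_pdf(w_i, 0.0, prior_sigma)  # prior term
        weights.append(w_i)
    prediction = weights[0] * x_j + weights[1]          # toy network: w_0 * x + w_1
    log_like = log_normal_pdf(y_j, prediction, noise_sigma)
    return log_q - log_p - log_like

rng = random.Random(0)
loss = single_sample_loss(x_j=1.0, y_j=3.0, mus=[2.0, 1.0], rhos=[-3.0, -3.0],
                          prior_sigma=1.0, noise_sigma=0.1, rng=rng)
print("single-sample loss:", loss)
```

Each call draws new weights, so repeated calls with the same parameters return different loss values; that noise is the price of using $m = 1$.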

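Implementing BNN in PyTorch

The BNN implemented in this section is a neural network with a single hidden layer, input size 1, and output size 1. It is used to fit the function $y = -x^4 + 3x^2 + 1$ as a regression task.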

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.distributions import Normal
import numpy as np

# BNN layer, analogous to the Linear layer of a BP network: it consists of a weight and a bias, and each of them has a mean (mu) and a scale parameter (rho)
class Linear_BBB(nn.Module):
    """
        Layer of our BNN.
    """
    def __init__(self, input_features, output_features, prior_var=1.):
        """
            Initialization of our layer: the prior is a normal distribution
            centered at 0 whose scale is given by prior_var.
        """
        # initialize layers
        super().__init__()
        # set input and output dimensions
        self.input_features = input_features
        self.output_features = output_features

        # initialize mu and rho parameters for the weights of the layer
        self.w_mu = nn.Parameter(torch.zeros(output_features, input_features))
        self.w_rho = nn.Parameter(torch.zeros(output_features, input_features))

        #initialize mu and rho parameters for the layer's bias
        self.b_mu =  nn.Parameter(torch.zeros(output_features))
        self.b_rho = nn.Parameter(torch.zeros(output_features))

        #initialize weight samples (these will be calculated whenever the layer makes a prediction)
        self.w = None
        self.b = None

        # initialize prior distribution for all of the weights and biases
        # note: Normal takes (loc, scale), so prior_var is used as the prior's standard deviation
        self.prior = torch.distributions.Normal(0,prior_var)

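    def forward(self, input):
        """
          Forward pass: sample the weights and bias, record the log prior and
          log variational posterior of the sample, and apply the linear map.
        """
        # sample weights
        # draw epsilon from a standard normal distribution
        w_epsilon = Normal(0,1).sample(self.w_mu.shape)
        # reparameterize: a sample from the normal with mean mu and standard deviation softplus(rho)
        self.w = self.w_mu + F.softplus(self.w_rho) * w_epsilon

        # sample bias
        # same procedure as for the weights
        b_epsilon = Normal(0,1).sample(self.b_mu.shape)
        self.b = self.b_mu + F.softplus(self.b_rho) * b_epsilon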

        # record log prior by evaluating log pdf of prior at sampled weight and bias
        # this is log p(w), used later to compute the loss
        w_log_prior = self.prior.log_prob(self.w)
        b_log_prior = self.prior.log_prob(self.b)
        self.log_prior = torch.sum(w_log_prior) + torch.sum(b_log_prior)

        # record log variational posterior by evaluating the log pdf of the normal
        # distribution defined by (mu, softplus(rho)) at the sampled values
        # this is log q(w|theta), used later to compute the loss
        # mu is passed as a parameter (not detached) so that gradients flow through it
        self.w_post = Normal(self.w_mu, F.softplus(self.w_rho))
        self.b_post = Normal(self.b_mu, F.softplus(self.b_rho))
        self.log_post = self.w_post.log_prob(self.w).sum() + self.b_post.log_prob(self.b).sum()

        # once the weights are sampled, the layer is used like a BP network layer
        return F.linear(input, self.w, self.b)

class MLP_BBB(nn.Module):
    def __init__(self, hidden_units, noise_tol=.1,  prior_var=1.):

        # initialize the network like you would with a standard multilayer perceptron, but using the BBB layer
        super().__init__()
        # BNN with input size 1, output size 1, and a single hidden layer
        self.hidden = Linear_BBB(1,hidden_units, prior_var=prior_var)
        self.out = Linear_BBB(hidden_units, 1, prior_var=prior_var)
        self.noise_tol = noise_tol # we will use the noise tolerance to calculate our likelihood

    def forward(self, x):
        # again, this is equivalent to a standard multilayer perceptron
        # sigmoid is used as the activation function
        x = torch.sigmoid(self.hidden(x))
        x = self.out(x)
        return x

    def log_prior(self):
        # calculate the log prior over all the layers
        return self.hidden.log_prior + self.out.log_prior

    def log_post(self):
        # calculate the log posterior over all the layers
        return self.hidden.log_post + self.out.log_post

    # compute the loss
    def sample_elbo(self, input, target, samples):
        # we calculate the negative elbo, which will be our loss function
        #initialize tensors
        outputs = torch.zeros(samples, target.shape[0])
        log_priors = torch.zeros(samples)
        log_posts = torch.zeros(samples)
        log_likes = torch.zeros(samples)
        # make predictions and calculate prior, posterior, and likelihood for a given number of samples

        # Monte Carlo approximation
        for i in range(samples):
            outputs[i] = self(input).reshape(-1) # make predictions
            log_priors[i] = self.log_prior() # get log prior
            log_posts[i] = self.log_post() # get log variational posterior
            log_likes[i] = Normal(outputs[i], self.noise_tol).log_prob(target.reshape(-1)).sum() # calculate the log likelihood
        # calculate monte carlo estimate of prior posterior and likelihood
        log_prior = log_priors.mean()
        log_post = log_posts.mean()
        log_like = log_likes.mean()
        # calculate the negative elbo (which is our loss function)
        loss = log_post - log_prior - log_like
        return loss

def toy_function(x):
    return -x**4 + 3*x**2 + 1

# toy dataset we can start with
x = torch.tensor([-2, -1.8, -1, 1, 1.8, 2]).reshape(-1,1)
y = toy_function(x)

net = MLP_BBB(32, prior_var=10)
optimizer = optim.Adam(net.parameters(), lr=.1)
epochs = 2000
for epoch in range(epochs):  # loop over the dataset multiple times
    optimizer.zero_grad()
    # forward + backward + optimize
    loss = net.sample_elbo(x, y, 1)
    loss.backward()
    optimizer.step()
    if epoch % 10 == 0:
        print('epoch: {}/{}'.format(epoch+1,epochs))
        print('Loss:', loss.item())
print('Finished Training')

samples = 100
x_tmp = torch.linspace(-5,5,100).reshape(-1,1)
y_samp = np.zeros((samples,100))
for s in range(samples):
    y_tmp = net(x_tmp).detach().numpy()
    y_samp[s] = y_tmp.reshape(-1)

print("test result:",np.mean(y_samp, axis = 0))
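The loop above keeps every sampled prediction in `y_samp` but prints only the mean. The spread of those samples can also be read as a rough uncertainty estimate. The sketch below shows the idea in plain Python on a hand-made table of predictions; the function name `summarize` and the numbers are illustrative assumptions, and with the NumPy array from the article the same quantities would be `np.mean(y_samp, axis=0)` and `np.std(y_samp, axis=0)`.

```python
import statistics

def summarize(samples_by_pass):
    """samples_by_pass[s][k] is the prediction for input k in forward pass s."""
    per_input = list(zip(*samples_by_pass))  # regroup predictions by input point
    means = [statistics.fmean(col) for col in per_input]
    stds = [statistics.pstdev(col) for col in per_input]  # population std, like np.std
    return means, stds

# three forward passes over two inputs (made-up numbers)
toy_samples = [[1.0, 4.0],
               [1.2, 2.0],
               [0.8, 6.0]]
means, stds = summarize(toy_samples)
print("predictive mean:", means)
print("predictive std:", stds)
```

In this toy table the second input has the larger standard deviation, which is how one would flag inputs where the sampled networks disagree.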