Introduction
Batch Normalization is now a standard technique in deep learning for speeding up the convergence of deep network training. It also has a regularizing effect, to the point where Dropout can be removed or the probability of dropping neurons reduced.
Principle
The Difficulty of Training Deep Neural Networks
Why is training deep neural networks difficult and slow to converge? Earlier work attacked this problem from the angle of new activation functions, e.g. ReLU, Maxout, PReLU, and from the angle of weight initialization schemes, e.g. Xavier initialization and Kaiming initialization, but the gains were relatively modest.
This problem forces one to be very careful when choosing the learning rate, the initialization scheme, and the activation function. Its essence: during training, even tiny changes in the parameters of earlier layers are amplified as the network gets deeper, and these changes shift the distribution of every layer's input, so each layer must keep adapting to a moving distribution. The paper calls this phenomenon Internal Covariate Shift (ICS).
From another angle, treat an intermediate layer of the network as a sub-network with a sigmoid activation z = g(Wu + b), where u is the layer input and W and b are learned parameters. As |x| grows, g'(x) tends to 0, which is the familiar vanishing-gradient problem (in essence, the input falls into the saturated region). ReLU and certain initialization schemes later alleviated this problem fairly well.
Put differently, the immediate cause of this problem is the growth of |x|, i.e. a change in the input distribution, and the ICS phenomenon above is likewise a change in each layer's input distribution. So the natural idea: if the input distribution can be kept stable, inputs will rarely fall into the saturated region, the vanishing-gradient problem is well mitigated, and training converges faster.
That sounds simple enough, but it raises two questions:
- How do we keep the input distribution stable, i.e. how should the Normalization be done?
- Normalization keeps inputs out of the saturated region, which, conversely, means confining them to the linear region of the activation function. Doesn't the network then lose its nonlinear expressive power, and how can that be compensated?
These two questions are the core of the Batch Normalization paper.
The Solution
Whitening is an important data preprocessing method that makes model training converge faster. It generally has two goals:
- remove correlations between features (feature independence);
- give all features the same mean and variance (identical distributions).
Whitening standardizes the model input to zero mean and unit variance, so a natural idea is to extend it to the input of every layer, stabilizing each layer's distribution. However, full whitening is expensive, especially since we also want the whitening operation to be differentiable, so that it can be trained through backpropagation; writing it as

$$\hat{x} = \mathrm{Norm}(x, \mathcal{X}),$$

where $\mathcal{X}$ is the entire training set, backpropagation would need the Jacobians $\partial\,\mathrm{Norm}(x, \mathcal{X})/\partial x$ and $\partial\,\mathrm{Norm}(x, \mathcal{X})/\partial \mathcal{X}$.
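To see why full whitening is expensive, here is a minimal NumPy sketch of ZCA whitening (the array shapes and the 1e-5 stabilizer are illustrative assumptions): it requires building and eigendecomposing a d×d covariance matrix, roughly O(d³) work, and repeating this differentiably for every layer on every update is prohibitive.

```python
import numpy as np

# Toy activations: 256 samples, 64 features (illustrative shapes)
X = np.random.randn(256, 64)
Xc = X - X.mean(axis=0)                  # center: zero mean per feature
cov = Xc.T @ Xc / Xc.shape[0]            # d x d covariance matrix
eigval, eigvec = np.linalg.eigh(cov)     # O(d^3) eigendecomposition
W_zca = eigvec @ np.diag(1.0 / np.sqrt(eigval + 1e-5)) @ eigvec.T
X_white = Xc @ W_zca                     # decorrelated, unit-variance features
```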
This is where the paper's contribution, Batch Normalization (BN), comes in: a simplification of whitening, applied to the input of every layer to keep its distribution stable.
The first simplification is to normalize each feature dimension independently, with no decorrelation across features; the paper notes that "such normalization speeds up convergence, even when the features are not decorrelated":

$$\hat{x}^{(k)} = \frac{x^{(k)} - E[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}$$
The second simplification is to estimate $E[x^{(k)}]$ and $\mathrm{Var}[x^{(k)}]$ not over the whole training set but from the activations within a mini-batch of size m:

$$\mu_{\mathcal{B}} = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_{\mathcal{B}}^{2} = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu_{\mathcal{B}}\right)^{2}, \qquad \hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^{2} + \epsilon}}$$
Normalization alone is problematic, though. As the paper puts it, "simply normalizing each input of a layer may change what the layer can represent": naive normalization reduces the network's nonlinear expressive power. With a sigmoid activation, for instance, normalization confines the activations to the nearly linear region, roughly the interval [-2, 2], and under a standard normal distribution a value lands in [-2, 2] with probability about 95%.
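That probability is quick to verify from the standard normal CDF:

```python
import math

# P(-2 <= Z <= 2) for Z ~ N(0, 1) equals erf(2 / sqrt(2))
print(round(math.erf(2 / math.sqrt(2)), 4))  # 0.9545
```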
This is precisely the second question raised earlier. The authors' approach is to "make sure that the transformation inserted in the network can represent the identity transform". To make this identity transform possible, two learnable parameters $\gamma^{(k)}$ and $\beta^{(k)}$ are introduced:

$$y^{(k)} = \gamma^{(k)}\,\hat{x}^{(k)} + \beta^{(k)}$$

In the extreme case, learning $\gamma^{(k)} = \sqrt{\mathrm{Var}[x^{(k)}]}$ and $\beta^{(k)} = E[x^{(k)}]$ recovers the original activations, i.e., restores the representation power of the network.
The overall Batch Normalization procedure is given as Algorithm 1 in the paper.
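Since the algorithm figure is not reproduced here, the following is a minimal NumPy sketch of the training-time transform (per-feature batch statistics, then scale and shift; the eps value is an assumption following common practice):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """x: (m, d) mini-batch; gamma, beta: (d,) learnable parameters."""
    mu = x.mean(axis=0)                    # mini-batch mean per feature
    var = x.var(axis=0)                    # mini-batch (biased) variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    return gamma * x_hat + beta            # scale and shift: y = gamma * x_hat + beta

x = np.random.randn(32, 8) * 3.0 + 1.0     # toy batch with shifted, scaled features
y = batch_norm_train(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # ~0 and ~1 per feature
```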
BN estimates the mean and variance from a mini-batch, which is fine during training; but at inference (including online inference), inputs arrive one instance at a time and there is no mini-batch, so the statistics BN needs are unavailable. The fix is to use the batch statistics collected during training to estimate the population mean and variance, which are then used for normalization at inference:

$$E[x] = E_{\mathcal{B}}[\mu_{\mathcal{B}}], \qquad \mathrm{Var}[x] = \frac{m}{m-1}\,E_{\mathcal{B}}[\sigma_{\mathcal{B}}^{2}], \qquad \hat{x} = \frac{x - E[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}}$$
Note that the PyTorch implementation estimates these population statistics with an exponential moving average (EMA): per the PyTorch documentation, each BatchNorm layer by default keeps running estimates of its computed mean and variance during training, updated with a default momentum of 0.1, and uses them for normalization during evaluation.
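A minimal sketch of that update rule (0.1 is PyTorch's default momentum; note that PyTorch uses the unbiased batch variance for the running estimate):

```python
def update_running_stats(running_mean, running_var, batch_mean, batch_var,
                         momentum=0.1):
    # Exponential moving average, as in torch.nn.BatchNorm*d:
    #   running <- (1 - momentum) * running + momentum * batch statistic
    running_mean = (1 - momentum) * running_mean + momentum * batch_mean
    running_var = (1 - momentum) * running_var + momentum * batch_var
    return running_mean, running_var
```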
Higher Learning Rates
The abstract claims that "Batch Normalization allows us to use much higher learning rates and be less careful about initialization." Before reading the body, the natural question is: why does BN allow a larger learning rate? The authors devote a short section to exactly this.
When training deep networks one usually avoids large learning rates, since they easily lead to exploding gradients, vanishing gradients, or poor local minima. As discussed above, BN largely keeps inputs out of the activation function's saturated region, alleviating vanishing gradients. Furthermore, because every layer's input is normalized, ICS is reduced and each layer's input distribution stays stable, so the parameter updates (backpropagated gradients) also stay stable instead of being amplified with depth, which goes a long way toward preventing exploding gradients.
The paper states that "Batch Normalization also makes training more resilient to the parameter scale" and backs this stability of the updates (gradients) with a short analysis. A large learning rate will typically increase the scale of layer parameters; suppose the weights are scaled by a factor of a. Because of BN, backpropagation is unaffected by the scaling, so no gradient explosion follows:

$$\mathrm{BN}(Wu) = \mathrm{BN}((aW)u)$$

$$\frac{\partial\,\mathrm{BN}((aW)u)}{\partial u} = \frac{\partial\,\mathrm{BN}(Wu)}{\partial u}$$

$$\frac{\partial\,\mathrm{BN}((aW)u)}{\partial (aW)} = \frac{1}{a}\cdot\frac{\partial\,\mathrm{BN}(Wu)}{\partial W}$$
The third equation also shows that larger weights produce smaller gradients, confirming the conclusion above: with BN, parameter updates stay stable, so a larger learning rate can be used.
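This invariance is easy to check numerically. A small PyTorch sketch (the scale a = 10 and the tensor shapes are arbitrary choices) shows that BN(Wu) and BN((aW)u) coincide, while the gradient w.r.t. the scaled weights shrinks by a factor of a:

```python
import torch

torch.manual_seed(0)
u = torch.randn(64, 16)
W = torch.randn(16, 8, requires_grad=True)
a = 10.0
Wa = (a * W.detach()).requires_grad_()    # same weights, scaled by a

def bn(z):  # plain per-feature normalization over the batch dimension
    return (z - z.mean(0)) / torch.sqrt(z.var(0, unbiased=False) + 1e-5)

y1, y2 = bn(u @ W), bn(u @ Wa)
print(torch.allclose(y1, y2, atol=1e-4))  # True: BN(Wu) == BN((aW)u)

t = torch.randn(64, 8)                    # arbitrary linear loss weights
(y1 * t).sum().backward()
(y2 * t).sum().backward()
print(W.grad.norm() / Wa.grad.norm())     # ~a: grad w.r.t. aW is 1/a as large
```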
Regularization
The abstract mentions that with BN, Dropout can be removed or the drop probability reduced. In other words, BN provides some of Dropout's ability to improve generalization (i.e. to mitigate overfitting). The main reason is that the mean and variance are estimated from the activations inside a batch, so the model is never optimized against a single sample in isolation, which can be viewed as a mild form of constraint.
Three points are worth noting here:
- The paper discusses whether BN is better placed before or after the activation: "but since u is likely the output of another nonlinearity, the shape of its distribution is likely to change during training, and constraining its first and second moments would not eliminate the covariate shift." This passage is not entirely clear to me; [1] claims that "quite a few studies show that placing BN after the activation works better", but cites no reference, so this still needs verification.
- For convolutional layers, the set of activations used to estimate the mean and variance is not per-scalar-feature but spans a whole feature map: "we jointly normalize all the activations in a mini-batch over all locations. In Alg. 1, we let B be the set of all values in a feature map across both the elements of a mini-batch and spatial locations" (illustrated in the sketch after this list).
- In the linear transform Wu + b preceding the Normalization, the bias term b is usually omitted, since its effect is cancelled by BN's subsequent mean subtraction (its role is taken over by β); see the sketch below.
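A small PyTorch sketch of the last two points (the shapes are illustrative): BatchNorm2d computes one mean/variance per channel jointly over the batch and spatial dimensions, and a convolution feeding into BN can drop its bias. (The BasicConv2d in the implementation below keeps the default bias=True, which is redundant but harmless.)

```python
import torch
import torch.nn as nn

x = torch.randn(8, 16, 14, 14)                   # N x C x H x W
bn = nn.BatchNorm2d(16).train()                  # fresh init: gamma=1, beta=0

# Per-channel statistics over N, H and W jointly, as in the paper:
mu = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
manual = (x - mu) / torch.sqrt(var + bn.eps)
print(torch.allclose(bn(x), manual, atol=1e-5))  # True

# The conv bias would be cancelled by BN's mean subtraction, so omit it:
conv_bn_relu = nn.Sequential(
    nn.Conv2d(16, 32, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(32),
    nn.ReLU(inplace=True),
)
```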
Experiments
The authors use a very simple experiment to check whether BN alleviates ICS: on the MNIST dataset, a simple three-layer network with BN inserted before each layer, observing the convergence speed and how each layer's input distribution evolves. The results are shown in Figure 1 of the paper (figure omitted here).
The other experiment is on the ImageNet classification task. The model modifies Inception-v1 following the small-kernel idea of VGG: each 5×5 kernel is replaced by two stacked 3×3 kernels; the network is also widened and BN is introduced. It is also worth noting that in the Inception-v2 design, stride-2 layers replace the max pooling that Inception-v1 used to shrink the feature maps.
To further speed up the convergence of the BN networks, the authors additionally tuned the network and the training hyperparameters:
- Increase learning rate: as noted above, BN tolerates a larger learning rate, so it is raised to speed up training;
- Remove dropout: as also noted above, BN mitigates overfitting to some extent, so dropout is removed for speed;
- Reduce the L2 weight regularization: the weight of the L2 penalty is reduced;
- Accelerate the learning rate decay: since BN speeds up training, the learning rate must also decay faster; exponential decay is used (see the sketch after this list);
- Remove Local Response Normalization: LRN is dropped; some earlier work had already stopped using it;
- Shuffle training examples more thoroughly;
- Reduce the photometric distortions: since BN networks train faster and observe each example fewer times, training focuses on more "real" images.
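As a sketch of the faster, exponential learning-rate decay (the optimizer, gamma value and dummy parameter here are illustrative assumptions, not the paper's exact schedule):

```python
import torch
import torch.optim as optim

params = [torch.nn.Parameter(torch.randn(10))]
optimizer = optim.SGD(params, lr=0.0075)  # e.g. the BN-x5 learning rate
# Multiply the learning rate by gamma after every epoch:
scheduler = optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.94)

for epoch in range(3):
    # ... one epoch of training would go here ...
    optimizer.step()
    scheduler.step()
    print(epoch, scheduler.get_last_lr())  # lr shrinks by 6% per epoch
```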
In addition, the authors set up several model variants:
- BN-Baseline: Inception-v1 with BN added;
- BN-x5: learning rate increased 5x, to 0.0075;
- BN-x30: learning rate increased 30x, to 0.045;
- BN-x5-Sigmoid: like BN-x5, but with sigmoid as the activation function.
The experimental results show that the BN-only Baseline already converges faster than Inception-v1, BN-x5 reaches Inception's accuracy 14 times faster, and BN-x30, though somewhat slower than BN-x5, attains the highest accuracy. Amusingly, the paper also plays a small word game: "we apply Batch Normalization to the best-performing ImageNet classification network, and show that we can match its performance using only 7% of the training steps, and can further exceed its accuracy by a substantial margin." The 7% figure is delivered by BN-x5, while BN-x30 delivers the accuracy gain.
PyTorch Implementation
Below, Batch Normalization is added to GoogLeNet, and each 5x5 convolution kernel is replaced by two 3x3 kernels.
Run the following code to split the dataset into a training set and a validation set.
split_data.py
import os
from shutil import copy
import random

def mkfile(file):
    if not os.path.exists(file):
        os.makedirs(file)

file = 'flower_data/flower_photos'
flower_class = [cla for cla in os.listdir(file) if ".txt" not in cla]
mkfile('flower_data/train')
for cla in flower_class:
    mkfile('flower_data/train/' + cla)

mkfile('flower_data/val')
for cla in flower_class:
    mkfile('flower_data/val/' + cla)

split_rate = 0.1
for cla in flower_class:
    cla_path = file + '/' + cla + '/'
    images = os.listdir(cla_path)
    num = len(images)
    eval_index = random.sample(images, k=int(num * split_rate))
    for index, image in enumerate(images):
        if image in eval_index:
            image_path = cla_path + image
            new_path = 'flower_data/val/' + cla
            copy(image_path, new_path)
        else:
            image_path = cla_path + image
            new_path = 'flower_data/train/' + cla
            copy(image_path, new_path)
        print("\r[{}] processing [{}/{}]".format(cla, index + 1, num), end="")  # progress bar
    print()

print("processing done!")
model.py
import torch.nn as nn
import torch
import torch.nn.functional as F

class GoogLeNet(nn.Module):
    def __init__(self, num_classes=1000, aux_logits=True, init_weights=False):
        super(GoogLeNet, self).__init__()
        self.aux_logits = aux_logits
        self.conv1 = BasicConv2d(3, 64, kernel_size=7, stride=2, padding=3)
        self.maxpool1 = nn.MaxPool2d(3, stride=2, ceil_mode=True)
        self.conv2 = BasicConv2d(64, 64, kernel_size=1)
        self.conv3 = BasicConv2d(64, 192, kernel_size=3, padding=1)
        self.maxpool2 = nn.MaxPool2d(3, stride=2, ceil_mode=True)
        self.inception3a = Inception(192, 64, 96, 128, 16, 32, 32)
        self.inception3b = Inception(256, 128, 128, 192, 32, 96, 64)
        self.maxpool3 = nn.MaxPool2d(3, stride=2, ceil_mode=True)
        self.inception4a = Inception(480, 192, 96, 208, 16, 48, 64)
        self.inception4b = Inception(512, 160, 112, 224, 24, 64, 64)
        self.inception4c = Inception(512, 128, 128, 256, 24, 64, 64)
        self.inception4d = Inception(512, 112, 144, 288, 32, 64, 64)
        self.inception4e = Inception(528, 256, 160, 320, 32, 128, 128)
        self.maxpool4 = nn.MaxPool2d(3, stride=2, ceil_mode=True)
        self.inception5a = Inception(832, 256, 160, 320, 32, 128, 128)
        self.inception5b = Inception(832, 384, 192, 384, 48, 128, 128)
        if self.aux_logits:
            self.aux1 = InceptionAux(512, num_classes)
            self.aux2 = InceptionAux(528, num_classes)
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.dropout = nn.Dropout(0.4)
        self.fc = nn.Linear(1024, num_classes)
        if init_weights:
            self._initialize_weights()

    def forward(self, x):
        # N x 3 x 224 x 224
        x = self.conv1(x)
        # N x 64 x 112 x 112
        x = self.maxpool1(x)
        # N x 64 x 56 x 56
        x = self.conv2(x)
        # N x 64 x 56 x 56
        x = self.conv3(x)
        # N x 192 x 56 x 56
        x = self.maxpool2(x)
        # N x 192 x 28 x 28
        x = self.inception3a(x)
        # N x 256 x 28 x 28
        x = self.inception3b(x)
        # N x 480 x 28 x 28
        x = self.maxpool3(x)
        # N x 480 x 14 x 14
        x = self.inception4a(x)
        # N x 512 x 14 x 14
        if self.training and self.aux_logits:  # the aux classifiers are active only in training mode
            aux1 = self.aux1(x)
        x = self.inception4b(x)
        # N x 512 x 14 x 14
        x = self.inception4c(x)
        # N x 512 x 14 x 14
        x = self.inception4d(x)
        # N x 528 x 14 x 14
        if self.training and self.aux_logits:  # the aux classifiers are active only in training mode
            aux2 = self.aux2(x)
        x = self.inception4e(x)
        # N x 832 x 14 x 14
        x = self.maxpool4(x)
        # N x 832 x 7 x 7
        x = self.inception5a(x)
        # N x 832 x 7 x 7
        x = self.inception5b(x)
        # N x 1024 x 7 x 7
        x = self.avgpool(x)
        # N x 1024 x 1 x 1
        x = torch.flatten(x, 1)
        # N x 1024
        x = self.dropout(x)
        x = self.fc(x)
        # N x 1000 (num_classes)
        if self.training and self.aux_logits:  # in eval mode only the main output is returned
            return x, aux2, aux1
        return x

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.constant_(m.bias, 0)

# Inception block
class Inception(nn.Module):
    def __init__(self, in_channels, ch1x1, ch3x3red, ch3x3, ch5x5red, ch5x5, pool_proj):
        super(Inception, self).__init__()
        self.branch1 = BasicConv2d(in_channels, ch1x1, kernel_size=1)
        self.branch2 = nn.Sequential(
            BasicConv2d(in_channels, ch3x3red, kernel_size=1),
            BasicConv2d(ch3x3red, ch3x3, kernel_size=3, padding=1)  # padding keeps the output size equal to the input size
        )
        # the original 5x5 conv is replaced by two stacked 3x3 convs
        self.branch3 = nn.Sequential(
            BasicConv2d(in_channels, ch5x5red, kernel_size=1),
            BasicConv2d(ch5x5red, ch5x5, kernel_size=3, padding=1),
            BasicConv2d(ch5x5, ch5x5, kernel_size=3, padding=1)  # padding keeps the output size equal to the input size
        )
        self.branch4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            BasicConv2d(in_channels, pool_proj, kernel_size=1)
        )

    def forward(self, x):
        branch1 = self.branch1(x)
        branch2 = self.branch2(x)
        branch3 = self.branch3(x)
        branch4 = self.branch4(x)
        outputs = [branch1, branch2, branch3, branch4]
        return torch.cat(outputs, 1)

# auxiliary classifier
class InceptionAux(nn.Module):
    def __init__(self, in_channels, num_classes):
        super(InceptionAux, self).__init__()
        self.averagePool = nn.AvgPool2d(kernel_size=5, stride=3)
        self.conv = BasicConv2d(in_channels, 128, kernel_size=1)  # output[batch, 128, 4, 4]
        self.fc1 = nn.Linear(2048, 1024)
        self.fc2 = nn.Linear(1024, num_classes)

    def forward(self, x):
        # aux1: N x 512 x 14 x 14, aux2: N x 528 x 14 x 14
        x = self.averagePool(x)
        # aux1: N x 512 x 4 x 4, aux2: N x 528 x 4 x 4
        x = self.conv(x)
        # N x 128 x 4 x 4
        x = torch.flatten(x, 1)
        x = F.dropout(x, 0.5, training=self.training)
        # N x 2048
        x = F.relu(self.fc1(x), inplace=True)
        x = F.dropout(x, 0.5, training=self.training)
        # N x 1024
        x = self.fc2(x)
        # N x num_classes
        return x

# conv + BN + ReLU building block
class BasicConv2d(nn.Module):
    def __init__(self, in_channels, out_channels, **kwargs):
        super(BasicConv2d, self).__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, **kwargs)
        self.bn = nn.BatchNorm2d(out_channels)  # BN inserted between conv and ReLU
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.conv(x)
        x = self.bn(x)
        x = self.relu(x)
        return x
train.py
import torch
import torch.nn as nn
from torchvision import transforms, datasets
import torchvision
import json
import matplotlib.pyplot as plt
import os
import torch.optim as optim
from model import GoogLeNet

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

data_transform = {
    "train": transforms.Compose([transforms.RandomResizedCrop(224),
                                 transforms.RandomHorizontalFlip(),
                                 transforms.ToTensor(),
                                 transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]),
    "val": transforms.Compose([transforms.Resize((224, 224)),
                               transforms.ToTensor(),
                               transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])}

# data_root = os.path.abspath(os.path.join(os.getcwd(), "../.."))  # get data root path
data_root = os.getcwd()
image_path = data_root + "/flower_data/"  # flower data set path
train_dataset = datasets.ImageFolder(root=image_path + "train",
                                     transform=data_transform["train"])
train_num = len(train_dataset)

# {'daisy':0, 'dandelion':1, 'roses':2, 'sunflower':3, 'tulips':4}
flower_list = train_dataset.class_to_idx
cla_dict = dict((val, key) for key, val in flower_list.items())
# write dict into json file
json_str = json.dumps(cla_dict, indent=4)
with open('class_indices.json', 'w') as json_file:
    json_file.write(json_str)

batch_size = 32
train_loader = torch.utils.data.DataLoader(train_dataset,
                                           batch_size=batch_size, shuffle=True,
                                           num_workers=0)
validate_dataset = datasets.ImageFolder(root=image_path + "val",
                                        transform=data_transform["val"])
val_num = len(validate_dataset)
validate_loader = torch.utils.data.DataLoader(validate_dataset,
                                              batch_size=batch_size, shuffle=False,
                                              num_workers=0)

# test_data_iter = iter(validate_loader)
# test_image, test_label = next(test_data_iter)

# net = torchvision.models.googlenet(num_classes=5)
# model_dict = net.state_dict()
# pretrain_model = torch.load("googlenet.pth")
# del_list = ["aux1.fc2.weight", "aux1.fc2.bias",
#             "aux2.fc2.weight", "aux2.fc2.bias",
#             "fc.weight", "fc.bias"]
# pretrain_dict = {k: v for k, v in pretrain_model.items() if k not in del_list}
# model_dict.update(pretrain_dict)
# net.load_state_dict(model_dict)

net = GoogLeNet(num_classes=5, aux_logits=True, init_weights=True)
net.to(device)
loss_function = nn.CrossEntropyLoss()
optimizer = optim.Adam(net.parameters(), lr=0.0015)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=1500, gamma=0.1)  # learning rate decay schedule (stepped per iteration)

best_acc = 0.0
save_path = './googleNet.pth'
for epoch in range(30):
    # train
    net.train()
    running_loss = 0.0
    for step, data in enumerate(train_loader, start=0):
        images, labels = data
        optimizer.zero_grad()
        logits, aux_logits2, aux_logits1 = net(images.to(device))
        loss0 = loss_function(logits, labels.to(device))
        loss1 = loss_function(aux_logits1, labels.to(device))
        loss2 = loss_function(aux_logits2, labels.to(device))
        loss = loss0 + loss1 * 0.3 + loss2 * 0.3
        loss.backward()
        optimizer.step()
        scheduler.step()
        # print statistics
        running_loss += loss.item()
        # print train progress
        rate = (step + 1) / len(train_loader)
        a = "*" * int(rate * 50)
        b = "." * int((1 - rate) * 50)
        print("\rtrain loss: {:^3.0f}%[{}->{}]{:.3f}".format(int(rate * 100), a, b, loss), end="")
    print()

    # validate
    net.eval()
    acc = 0.0  # accumulate the number of correct predictions per epoch
    with torch.no_grad():
        for val_data in validate_loader:
            val_images, val_labels = val_data
            outputs = net(val_images.to(device))  # in eval mode only the main output is returned
            predict_y = torch.max(outputs, dim=1)[1]
            acc += (predict_y == val_labels.to(device)).sum().item()
        val_accurate = acc / val_num
        if val_accurate > best_acc:
            best_acc = val_accurate
            torch.save(net.state_dict(), save_path)
        print('[epoch %d] train_loss: %.3f  test_accuracy: %.3f' %
              (epoch + 1, running_loss / (step + 1), val_accurate))

print('Finished Training')
Output:
predict.py
import torch
from model import GoogLeNet
from PIL import Image
from torchvision import transforms
import matplotlib.pyplot as plt
import json
import os

os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

data_transform = transforms.Compose(
    [transforms.Resize((224, 224)),
     transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

# load image
img = Image.open("./sunflower.jpg")
plt.imshow(img)
# [N, C, H, W]
img = data_transform(img)
# expand batch dimension
img = torch.unsqueeze(img, dim=0)

# read class_indict
try:
    json_file = open('./class_indices.json', 'r')
    class_indict = json.load(json_file)
except Exception as e:
    print(e)
    exit(-1)

# create model
model = GoogLeNet(num_classes=5, aux_logits=False)
# load model weights; strict=False because the aux-classifier weights are skipped
model_weight_path = "./googleNet.pth"
missing_keys, unexpected_keys = model.load_state_dict(torch.load(model_weight_path), strict=False)
model.eval()
with torch.no_grad():
    # predict class
    output = torch.squeeze(model(img))
    predict = torch.softmax(output, dim=0)
    predict_cla = torch.argmax(predict).numpy()
print(class_indict[str(predict_cla)])
plt.show()
Output: