Which PyTorch version does automatic mixed precision require? Mixed precision training in PyTorch

Reposted

mob6454cc79ab13 2024-08-29 21:09:14

Contents

PyTorch Automatic Mixed Precision (AMP) Training Handbook

Autocasting
Gradient Scaling
Notes
Autocast Op Look-up-table
Reference

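PyTorch Automatic Mixed Precision (AMP) Training Handbook

Automatic Mixed Precision, AMP

Mixed precision training means that, during training, some operations run in single precision (float32) while others (such as linear and convolution layers) run in half precision (float16). Automatic mixed precision training means that a suitable data type (precision) is chosen automatically for each operation.

In general, combining torch.cuda.amp.autocast with torch.cuda.amp.GradScaler is enough to get automatic mixed precision training in PyTorch. torch.cuda.amp.autocast enables autocasting for a chosen region; autocasting automatically picks single or half precision for each CUDA operation to improve performance while maintaining accuracy (in practice it may occasionally even yield a slight accuracy gain). torch.cuda.amp.GradScaler carries out the gradient scaling steps; gradient scaling improves the convergence of networks with half-precision (float16) gradients by minimizing gradient underflow.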

Autocasting

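PyTorch implements autocasting mainly through the torch.cuda.amp.autocast API. An autocast instance serves as a context manager or a decorator that lets selected regions of user code run in mixed precision. Inside these regions, each CUDA operation runs in an op-specific data type chosen by autocast, which improves performance while maintaining accuracy.

autocast should wrap only the forward pass(es) of the network and the loss computation(s). Running the backward pass inside an autocast region is not recommended; backward ops automatically run in the same data type that autocast used for the corresponding forward ops.

A code example follows: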

import torch
from torch import optim
from torch.cuda import amp

......
CUDA = torch.cuda.is_available()
device = torch.device("cuda" if CUDA else "cpu")
model = model.to(device)
optimizer = optim.SGD(model.parameters(), ...)
......

# Creates a GradScaler once at the beginning of training.
# With enabled=False (no CUDA) the scaler's methods become pass-throughs.
scaler = amp.GradScaler(enabled=CUDA)

for inputs, targets in data:
    optimizer.zero_grad()
    ......
    
    # Autocast
    # Enables autocasting for the forward pass (model + loss)
    with amp.autocast(enabled=CUDA):
        # Forward
        preds = model(inputs)
        loss = loss_fn(preds, targets)
        
    # Exits the context manager before backward()
    # Scales loss. Calls backward() on scaled loss to create scaled gradients.
    scaler.scale(loss).backward()
    
    # scaler.step() first unscales the gradients of the optimizer's assigned params.
    # If these gradients do not contain infs or NaNs, optimizer.step() is then called.
    # Otherwise, optimizer.step() is skipped.
    scaler.step(optimizer)
    
    # Updates the scale for the next iteration.
    scaler.update()

Gradient Scaling

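If the forward pass of a particular operation has float16 inputs, the backward pass of that operation produces float16 gradients. Gradient values with small magnitudes may not be representable in half precision. Such values are flushed to zero (underflow), so the update for the corresponding parameters is lost. (The effect resembles vanishing gradients: in half precision, a gradient whose magnitude is too small leaves the weight essentially unchanged.)

To avoid underflow, **gradient scaling** multiplies the network's loss by a scale factor and runs the backward pass on the scaled loss. The gradients flowing backward through the network are then scaled by the same factor. In other words, the scaled gradients have a larger magnitude, so they are not flushed to zero.

Each parameter's gradient should be unscaled before the optimizer updates the parameter, so that the scale factor does not interfere with the learning rate.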
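The underflow and its rescue by scaling can be reproduced without a GPU. The sketch below is only an illustration: it uses the half-precision format (`"e"`) of Python's `struct` module as a stand-in for float16 storage, and the helper `to_float16` and the sample gradient `1e-8` are chosen for this example and are not part of PyTorch.

```python
import struct


def to_float16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


grad = 1e-8       # hypothetical small gradient, below float16's smallest subnormal
scale = 65536.0   # 2**16, the default init_scale of GradScaler

unscaled_fp16 = to_float16(grad)        # stored directly in half precision
scaled_fp16 = to_float16(grad * scale)  # stored after multiplying by the scale
recovered = scaled_fp16 / scale         # unscaled again in higher precision

print("stored without scaling:", unscaled_fp16)  # 0.0, the gradient underflowed
print("stored with scaling:   ", scaled_fp16)
print("after unscaling:       ", recovered)
```

Without scaling the value is lost entirely; with scaling it survives the trip through half precision and, once divided by the same factor, comes back within about 0.1% of the original.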

The code example is in the Autocasting section above; the API is described in detail below.

scaler = torch.cuda.amp.GradScaler(init_scale=65536.0, growth_factor=2.0, backoff_factor=0.5, growth_interval=2000, enabled=True)

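# Params
# init_scale (float, optional, default=2.**16): Initial scale factor.
# growth_factor (float, optional, default=2.0): Factor by which scaler.update() multiplies the scale (growth_factor*scale) when no inf/NaN gradients have occurred for growth_interval consecutive iterations. It increases the scale, so it should be greater than 1.0.
# backoff_factor (float, optional, default=0.5): Factor by which scaler.update() multiplies the scale (backoff_factor*scale) when inf/NaN gradients occur in an iteration. It decreases the scale, so it should be between 0.0 and 1.0.
# growth_interval (int, optional, default=2000): Number of consecutive iterations without inf/NaN gradients that must occur before update() increases the scale (growth_factor*scale).
# enabled (bool, optional, default=True): If False, disables gradient scaling. 'step' op simply invokes the underlying 'optimizer.step()', and other methods become no-ops.

# Methods
# scale() multiplies a tensor, or a list of tensors, by the scale factor.
scaler.scale(outputs) -> scaled outputs

# step() carries out two operations:
# 1. Internally calls unscale_(optimizer), unless unscale_() was already called explicitly for this optimizer. As part of unscale_(), the gradients are checked for infs/NaNs.
# 2. If no inf/NaN gradients are found, calls optimizer.step(*args, **kwargs) using the unscaled gradients. Otherwise, optimizer.step() is skipped to avoid corrupting the params.
scaler.step(optimizer, *args, **kwargs) -> the return value of optimizer.step(*args, **kwargs)

# unscale_() divides ("unscales") the optimizer's gradient tensors by the scale factor.
# unscale_() is optional. It is meant for cases where gradients must be modified or inspected between the backward pass and step(). If it is not called explicitly, scaler.step() calls it internally.
scaler.unscale_(optimizer)

# update() updates the scale factor. If new_scale is passed, the scale is set to that value directly.
# If any optimizer step was skipped, the scale is multiplied by backoff_factor to reduce it; if growth_interval consecutive iterations ran without a skipped step, the scale is multiplied by growth_factor to increase it.
# update() should only be called at the end of an iteration, after scaler.step(optimizer) has been called for all optimizers used in that iteration.
scaler.update(new_scale=None)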
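
The interplay of growth_factor, backoff_factor and growth_interval in update() can be sketched in plain Python. `ToyScaleSchedule` below is a hypothetical class written only for illustration, assuming the bookkeeping works as the parameter descriptions above state; it is not GradScaler's actual implementation, and `growth_interval=3` is chosen just to keep the trace short.

```python
class ToyScaleSchedule:
    """Toy model of the dynamic loss-scale bookkeeping (not PyTorch code)."""

    def __init__(self, init_scale=65536.0, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.good_steps = 0  # consecutive iterations without inf/NaN gradients

    def update(self, found_inf: bool) -> float:
        if found_inf:
            # The optimizer step was skipped: shrink the scale, restart the count.
            self.scale *= self.backoff_factor
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps == self.growth_interval:
                # Enough clean iterations in a row: grow the scale.
                self.scale *= self.growth_factor
                self.good_steps = 0
        return self.scale


schedule = ToyScaleSchedule(growth_interval=3)
found_inf_per_step = [False, False, False, True, False]
trace = [schedule.update(flag) for flag in found_inf_per_step]
print(trace)  # [65536.0, 65536.0, 131072.0, 65536.0, 65536.0]
```

Three clean iterations double the scale, the single inf/NaN iteration halves it again, and the clean-iteration count starts over afterwards.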

Notes

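Only CUDA ops are eligible for autocasting. Ops that run in double precision (float64) or in non-floating-point dtypes are not eligible: they are not autocast and keep running in their original dtype. Only out-of-place ops (that is, not in-place ones) and Tensor methods are eligible. Ops called with an explicit dtype argument, op(…, dtype=xx), are not eligible either and produce output of the specified dtype.
Running the backward pass inside an autocast region is not recommended.
Floating-point tensors produced inside an autocast region may be half precision (float16). After returning to a non-autocast region, using them together with floating-point tensors of a different dtype may cause type mismatch errors. In that case, manually cast the tensors produced in the autocast region to single precision (or whatever dtype is needed), for example with tensor.float().
Autocast regions can contain nested autocast or non-autocast regions, controlled through the enabled argument of autocast(); a simple example follows. Nesting is useful when a subregion must run in a particular dtype, but care is still needed to avoid tensor dtype mismatches.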

from torch.cuda.amp import autocast

with autocast(enabled=True):
    ...  # op1: runs with autocast

    with autocast(enabled=False):
        ...  # op2: needs no autocast, runs in the dtype of its inputs

    ...  # other ops: autocast again
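
A concrete version of the nested regions, including the manual `.float()` cast and the type mismatch mentioned in the notes, might look like the sketch below. It rests on some assumptions: it uses the device-generic `torch.autocast` entry point rather than `torch.cuda.amp.autocast`, and on a machine without CUDA it falls back to CPU autocast with bfloat16 purely so that the example runs anywhere. The tensor names are illustrative.

```python
import torch

device_type = "cuda" if torch.cuda.is_available() else "cpu"
# Assumption: float16 on CUDA as in the text; bfloat16 on CPU only as a fallback.
low_dtype = torch.float16 if device_type == "cuda" else torch.bfloat16

a = torch.rand(4, 4, device=device_type)  # float32 inputs
b = torch.rand(4, 4, device=device_type)
c = torch.rand(4, 4, device=device_type)

with torch.autocast(device_type=device_type, dtype=low_dtype):
    d = torch.mm(a, b)  # autocast runs mm in the low-precision dtype

    with torch.autocast(device_type=device_type, enabled=False):
        # Autocast is off here, so d must be cast manually to match c.
        e = torch.mm(d.float(), c)  # float32

    f = torch.mm(d, e)  # autocast is active again and handles the mixed dtypes

# Outside any autocast region, mixing the dtypes directly is an error.
mismatch_raised = False
try:
    torch.mm(d, c)
except RuntimeError:
    mismatch_raised = True

print(d.dtype, e.dtype, f.dtype, mismatch_raised)
```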

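The autocast state is thread-local: for autocast to take effect in a new thread, the context manager or decorator must be invoked in that thread. With the basic usage, where the autocasting region is set only in the main thread of the training script, autocasting has no effect on forward passes that nn.DataParallel, or DistributedDataParallel with multiple GPUs per process, runs in its own threads. Put simply: these modes spawn a thread per device to run the forward pass, while autocast was entered in the main thread. Because the state is thread-local, forward ops running in those side threads are not autocast even though the call sits inside the autocasting region. DistributedDataParallel with one GPU per process does not spawn threads internally and is not affected. Recent PyTorch documentation states that the autocast state is propagated into DataParallel's threads, so this limitation applies to older releases; check the documentation for your version.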

model = MyModel()
dp_model = nn.DataParallel(model)

# Sets autocast in the main thread
with autocast():
    # dp_model's internal threads won't autocast.  The main thread's autocast state has no effect.
    output = dp_model(input)
    # loss_fn still autocasts, but it's too late...
    loss = loss_fn(output)

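The fix is simple: keep the setup above and additionally apply autocast inside the model's forward method, either by using autocast() as a decorator on forward() or by opening an autocasting region inside forward(). (With either change shown below, the code above works as intended.)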

class MyModel(nn.Module):
    ...
    @autocast()
    def forward(self, input):
        ...

# Alternatively
class MyModel(nn.Module):
    ...
    def forward(self, input):
        with autocast():
            ...
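
Why entering autocast inside forward() helps can be illustrated without PyTorch. The sketch below is a toy analogy, not PyTorch code: `toy_autocast` and `is_enabled` are hypothetical stand-ins that keep an "enabled" flag in `threading.local()`, mimicking a thread-local autocast state.

```python
import threading
from contextlib import contextmanager

_state = threading.local()  # hypothetical stand-in for the per-thread autocast flag


def is_enabled() -> bool:
    return getattr(_state, "enabled", False)


@contextmanager
def toy_autocast(enabled: bool = True):
    previous = is_enabled()
    _state.enabled = enabled
    try:
        yield
    finally:
        _state.enabled = previous


results = {}


def forward_plain(name: str) -> None:
    # Relies on the flag set by the main thread, which this thread cannot see.
    results[name] = is_enabled()


def forward_wrapped(name: str) -> None:
    # Enters the context inside the worker thread, as the fix above does.
    with toy_autocast():
        results[name] = is_enabled()


with toy_autocast():
    results["main"] = is_enabled()
    workers = [
        threading.Thread(target=forward_plain, args=("worker_plain",)),
        threading.Thread(target=forward_wrapped, args=("worker_wrapped",)),
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()

print(results)
```

The main thread and the worker that enters the context itself both see the flag as enabled; the worker that relies on the main thread's context does not.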