PyTorch automatic mixed precision training

Author: 究极可爱怪 (51CTO blog), 2022-05-31 10:02:33

1 Mixed precision training with torch.cuda.amp
2 Autocasting

2.1 torch.autocast
2.2 torch.cuda.amp.autocast

3 Gradient Scaling

3.1 Usage example

1 Mixed precision training with torch.cuda.amp

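Mixed precision training runs some operations in float32 (single precision) and others in float16 (half precision), with the dtype chosen automatically per operation. Automatic mixed precision training normally uses torch.autocast and torch.cuda.amp.GradScaler together. However, torch.autocast and GradScaler are modular, so either one can be used on its own when needed.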
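As an illustration of using autocast without GradScaler, here is a minimal inference-only sketch. It assumes a hypothetical toy nn.Linear model and runs on the CPU with bfloat16 so that it works without a GPU; on a CUDA device you would typically pass device_type="cuda" and get float16 instead.

```python
import torch
from torch import nn

# Hypothetical toy model; its parameters are created in float32 as usual.
model = nn.Linear(16, 4)
x = torch.randn(8, 16)

# Inference only: there is no backward pass, so there are no gradients
# to scale and GradScaler is not needed.
with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)

print(y.dtype)             # dtype autocast chose for the linear op
print(model.weight.dtype)  # the parameters themselves are left untouched
```

Because only the ops inside the region are cast, the model keeps its float32 weights and can still be trained or evaluated in full precision elsewhere.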

2 Autocasting

2.1 torch.autocast

torch.autocast(device_type, enabled=True, **kwargs)

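Instances of autocast serve as context managers or decorators that let regions of your script run in mixed precision. In these regions, ops run in an op-specific dtype chosen by autocast to improve performance while maintaining accuracy. See the Autocast Op Reference for details. autocast should wrap only the forward pass of the network, including the loss computation. Running the backward pass under autocast is not recommended. Backward ops run in the same dtype that autocast used for the corresponding forward ops.

autocast can be applied in two ways:

Context manager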

import torch.optim as optim
from torch.cuda.amp import autocast

# Net, data and loss_fn are assumed to be defined elsewhere
# Creates model and optimizer in default precision
model = Net().cuda()
optimizer = optim.SGD(model.parameters(), ...)

for input, target in data:
    optimizer.zero_grad()

    # Enables autocasting for the forward pass (model + loss)
    with autocast():
        output = model(input)
        loss = loss_fn(output, target)

    # Exits the context manager before backward()
    loss.backward()
    optimizer.step()

Decorator

from torch import nn
from torch.cuda.amp import autocast

class AutocastModel(nn.Module):
    ...
    @autocast()
    def forward(self, input):
        ...

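Floating-point tensors produced inside an autocast region may be float16. After leaving the region, using them together with tensors of a different floating-point dtype can raise a type mismatch error. If that happens, cast the tensors produced in the autocast region back to float32 (or another dtype you need) before computing with them.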

# Creates some tensors in default dtype (here assumed to be float32)
a_float32 = torch.rand((8, 8), device="cuda")
b_float32 = torch.rand((8, 8), device="cuda")
c_float32 = torch.rand((8, 8), device="cuda")
d_float32 = torch.rand((8, 8), device="cuda")

with autocast():
    # torch.mm is on autocast's list of ops that should run in float16.
    # Inputs are float32, but the op runs in float16 and produces float16 output.
    # No manual casts are required.
    e_float16 = torch.mm(a_float32, b_float32)
    # Also handles mixed input types
    f_float16 = torch.mm(d_float32, e_float16)

# After exiting autocast, calls f_float16.float() to use with d_float32
g_float32 = torch.mm(d_float32, f_float16.float())

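autocast(enabled=False) subregions can be nested inside autocast-enabled regions. Locally disabling autocast is useful, for example, when you want to force a subregion to run in a particular dtype: it gives you explicit control over the execution type. Inside the subregion, inputs coming from the surrounding region should be cast to the desired dtype before use.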

# Creates some tensors in default dtype (here assumed to be float32)
a_float32 = torch.rand((8, 8), device="cuda")
b_float32 = torch.rand((8, 8), device="cuda")
c_float32 = torch.rand((8, 8), device="cuda")
d_float32 = torch.rand((8, 8), device="cuda")

with autocast():
    e_float16 = torch.mm(a_float32, b_float32)
    with autocast(enabled=False):
        # Calls e_float16.float() to ensure float32 execution
        # (necessary because e_float16 was created in an autocasted region)
        f_float32 = torch.mm(c_float32, e_float16.float())

    # No manual casts are required when re-entering the autocast-enabled region.
    # torch.mm again runs in float16 and produces float16 output, regardless of input types.
    g_float16 = torch.mm(d_float32, f_float32)

Parameters

device_type (string, required) -- whether to use the "cuda" or "cpu" device.
enabled (bool, optional, default=True) -- whether autocasting should be enabled in the region.
dtype (torch.dtype, optional) -- whether to use torch.float16 or torch.bfloat16.
cache_enabled (bool, optional, default=True) -- whether the weight cache inside autocast should be enabled.
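The effect of device_type, dtype and enabled can be checked with a short sketch. It assumes CPU execution with bfloat16 so that it runs without a GPU; the variable use_amp is a hypothetical flag of the kind often used to switch mixed precision on and off from a config.

```python
import torch

a = torch.rand(4, 4)
b = torch.rand(4, 4)

# device_type="cpu" with dtype=torch.bfloat16: mm runs in bfloat16.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    low = torch.mm(a, b)

# enabled=False turns the region into a no-op, so mm stays in float32.
use_amp = False  # hypothetical switch, e.g. read from a config
with torch.autocast(device_type="cpu", dtype=torch.bfloat16, enabled=use_amp):
    full = torch.mm(a, b)

print(low.dtype, full.dtype)
```

Passing enabled as a variable keeps a single code path for both the mixed precision and the full precision run.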

2.2 torch.cuda.amp.autocast

Equivalent to torch.autocast("cuda", args...).

3 Gradient Scaling

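If the forward pass of a particular op has float16 inputs, the backward pass of that op produces float16 gradients. Gradient values with small magnitudes may not be representable in float16. These values flush to zero ("underflow"), so the update for the corresponding parameters is lost. To prevent underflow, "gradient scaling" multiplies the network's loss by a scale factor and invokes the backward pass on the scaled loss. Gradients flowing backward through the network are then scaled by the same factor. In other words, gradient values have a larger magnitude, so they do not flush to zero. Each parameter's gradient (its .grad attribute) should be unscaled before the optimizer updates the parameters, so that the scale factor does not interfere with the learning rate.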
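The underflow described above can be reproduced without PyTorch. The sketch below uses the standard library's struct module, whose "e" format is IEEE 754 half precision; the gradient value 1e-8 is a made-up number chosen to fall below the float16 range, and 65536.0 matches GradScaler's default init_scale.

```python
import struct


def to_float16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


grad = 1e-8      # hypothetical tiny gradient value
scale = 65536.0  # GradScaler's default init_scale (2**16)

unscaled = to_float16(grad)        # too small for float16: flushes to zero
scaled = to_float16(grad * scale)  # large enough to survive in float16
recovered = scaled / scale         # unscaling done in higher precision

print(unscaled)   # 0.0
print(recovered)  # close to 1e-08
```

The scaled value survives the round trip through float16, and dividing by the same factor afterwards recovers the original gradient to within float16 rounding error.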

torch.cuda.amp.GradScaler(init_scale=65536.0, growth_factor=2.0, backoff_factor=0.5, growth_interval=2000, enabled=True)

GradScaler has two key methods:

GradScaler.step(optimizer)

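Internally invokes unscale_(optimizer), unless unscale_() was explicitly called for that optimizer earlier in the iteration. As part of unscale_(), the gradients are checked for infs/NaNs.
If no inf/NaN gradients are found, optimizer.step() is called with the unscaled gradients. Otherwise, optimizer.step() is skipped to avoid corrupting the parameters.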

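GradScaler.update(new_scale=None)
Updates the scale factor. If any optimizer step was skipped, the scale is multiplied by backoff_factor to reduce it. If growth_interval unskipped iterations occur consecutively, the scale is multiplied by growth_factor to increase it.
Passing new_scale sets the scale value manually. (new_scale is not used directly; it is used to fill GradScaler's internal scale tensor. So if new_scale is a tensor, later in-place changes to that tensor do not further affect the scale GradScaler uses internally.)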
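The backoff and growth rule can be sketched in plain Python. ToyScaler below is a hypothetical, simplified stand-in that only tracks the scale value; it is not GradScaler's actual implementation, and the found_inf flags are made-up inputs standing in for the inf/NaN check done during unscaling.

```python
class ToyScaler:
    """Hypothetical, simplified model of GradScaler's scale bookkeeping."""

    def __init__(self, init_scale=65536.0, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.good_steps = 0  # consecutive iterations without inf/NaN

    def update(self, found_inf: bool) -> None:
        if found_inf:
            # The optimizer step was skipped: shrink the scale, restart the count.
            self.scale *= self.backoff_factor
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps == self.growth_interval:
                # Enough clean iterations in a row: grow the scale.
                self.scale *= self.growth_factor
                self.good_steps = 0


scaler = ToyScaler(growth_interval=3)  # small interval so growth is visible
history = []
for found_inf in [True, False, False, False, False, True]:
    scaler.update(found_inf)
    history.append(scaler.scale)

print(history)
# [32768.0, 32768.0, 32768.0, 65536.0, 65536.0, 32768.0]
```

The scale halves whenever a step is skipped and doubles only after three clean iterations in a row, so it settles near the largest value that does not overflow.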

3.1 Usage example

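import torch.optim as optim
from torch.cuda import amp
from torch.cuda.amp import autocast

# Net, data and loss_fn are assumed to be defined elsewhere
model = Net().cuda()
optimizer = optim.SGD(model.parameters(), ...)
scaler = amp.GradScaler(enabled=True)

for input, target in data:
    optimizer.zero_grad()

    # Enables autocasting for the forward pass (model + loss)
    with autocast():
        output = model(input)
        loss = loss_fn(output, target)

    # Exits the context manager before backward()
    # Scales the loss and backpropagates, producing scaled gradients
    scaler.scale(loss).backward()
    # Unscales the gradients, then calls optimizer.step() unless infs/NaNs were found
    scaler.step(optimizer)
    # Updates the scale factor for the next iteration
    scaler.update()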
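The note on step() says unscale_() may be called explicitly earlier in the iteration. A common reason is gradient clipping, which should see the true, unscaled gradients. The sketch below is a self-contained variant of the loop under stated assumptions: the toy nn.Linear model, the random data, the learning rate and max_norm=1.0 are all made up for illustration, and mixed precision is switched off automatically when no GPU is available so that the same code still runs.

```python
import torch
from torch import nn, optim

use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"
amp_dtype = torch.float16 if use_cuda else torch.bfloat16

model = nn.Linear(16, 1).to(device)  # hypothetical toy model
optimizer = optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
# AMP is disabled when no GPU is present; the scaler then becomes a pass-through.
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

# Hypothetical random data standing in for a real DataLoader.
data = [(torch.randn(8, 16), torch.randn(8, 1)) for _ in range(3)]

for inputs, targets in data:
    inputs, targets = inputs.to(device), targets.to(device)
    optimizer.zero_grad()

    with torch.autocast(device_type=device, dtype=amp_dtype, enabled=use_cuda):
        loss = loss_fn(model(inputs), targets)

    scaler.scale(loss).backward()

    # Unscale first so the clipping threshold applies to the true gradients.
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    # step() sees that unscale_() was already called and does not unscale again.
    scaler.step(optimizer)
    scaler.update()
```

Clipping before unscale_() would compare max_norm against gradients that are still multiplied by the scale factor, so the threshold would no longer mean what it says.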