1. Single machine, multiple GPUs

1.1 torch.nn.DataParallel

When training a neural network on multiple GPUs, first check which GPUs are currently free with the nvidia-smi command, then add the following at the top of the script:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = "1, 2, 3"

This makes only the listed GPUs visible to the current process; the visible GPUs are then renumbered starting from 0.
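A quick sketch of the renumbering (this assumes the machine really has GPUs 1, 2 and 3; the environment variable must be set before the first CUDA call):

import os
import torch

os.environ['CUDA_VISIBLE_DEVICES'] = "1, 2, 3"
print(torch.cuda.device_count())   # -> 3
# torch.device('cuda:0') now refers to physical GPU 1, 'cuda:1' to GPU 2, and so on.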

How do we load the model and the data onto multiple GPUs?

import torch
from torch import nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# data: move each batch to the (first visible) GPU
inputs = inputs.to(device)
targets = targets.to(device)

# model: wrap it so that each batch is split across all visible GPUs
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

model.to(device)
  • Data
        Data is moved to the GPU simply via tensor.to(device); DataParallel then splits each batch across the visible GPUs.
  • Model
        The model is wrapped with nn.DataParallel(model), which replicates it onto each GPU; no other extra steps are needed (a minimal end-to-end sketch follows below).
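A minimal end-to-end sketch, assuming at least one GPU is present (the Linear layer and the tensor shapes are placeholders used only for illustration):

import torch
from torch import nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = nn.Linear(16, 4)                # any nn.Module works here
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)      # replicas on cuda:0 .. cuda:N-1
model.to(device)

x = torch.randn(32, 16).to(device)      # a batch of 32 samples
y = model(x)                            # the batch is split across the replicas,
                                        # the outputs are gathered back on device 0
print(y.shape)                          # torch.Size([32, 4])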

1.2 torch.nn.parallel.DistributedDataParallel()

Reference: Distributed communication package - torch.distributed — PyTorch master documentation

The torch.distributed package supports multi-process parallel computation and communication between compute nodes running on one or more machines, so it can be used to train a network on multiple GPUs of a single machine or across several machines.
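Whether the package, and a particular backend, is available in the current PyTorch build can be checked at runtime, for example:

import torch.distributed as dist

print(dist.is_available())        # True if PyTorch was built with distributed support
print(dist.is_nccl_available())   # True if the NCCL (GPU) backend can be used
print(dist.is_mpi_available())    # True if the MPI backend can be used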

1.2.1 Initializing the torch.distributed package

torch.distributed.init_process_group(backend, init_method=None,
 timeout=datetime.timedelta(0, 1800), world_size=-1, rank=-1, store=None, group_name='')

Calling this function initializes the default distributed process group and the distributed package. Its main arguments are listed below; a minimal call is sketched after the list.

  • backend: the communication backend to use (e.g. nccl for GPUs, gloo for CPUs)
  • init_method: how the processes find each other (e.g. a TCP address or a shared file)
  • rank: the rank of the current process, used to tell the master (rank 0) apart from the workers
  • world_size: the total number of processes taking part in the job
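A minimal single-process call might look like this (the backend and TCP address below are simply the defaults used by the example script in section 2):

import torch.distributed as dist

dist.init_process_group(
    backend='nccl',                        # GPU-aware communication backend
    init_method='tcp://localhost:23456',   # rendezvous address shared by all processes
    world_size=1,                          # total number of processes
    rank=0,                                # rank of this process
)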

1.2.2 Data loading

Unlike the DataParallel approach, a distributed data sampler has to be obtained as follows, so that the data is sampled evenly across the participating processes.

import torch.distributed as dist
from torch.utils import data

def distributed_is_initialized():
    """Checks whether a distributed process group has been initialized."""
    if dist.is_available():
        if dist.is_initialized():
            return True
    return False

# use a DistributedSampler only when a process group is up
sampler = None
if distributed_is_initialized():
    sampler = data.DistributedSampler(train_dataset)
# shuffle must be False whenever an explicit sampler is passed
train_loader = data.DataLoader(train_dataset, batch_size=args.batch_size,
                               shuffle=(sampler is None), sampler=sampler)
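One caveat: DistributedSampler derives its shuffling order from the epoch number, so the training loop should call, at the start of every epoch,

if sampler is not None:
    sampler.set_epoch(epoch)

otherwise every epoch iterates over the data in the same order. Also note that shuffle must stay False whenever an explicit sampler is passed, which is exactly what shuffle=(sampler is None) does above; the training loop in section 2 includes the set_epoch call.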

1.2.3 Model loading

The difference from the DataParallel approach is that the model must first be moved to the GPU before it can be wrapped with DistributedDataParallel for distribution.

if distributed_is_initialized():
    print("[Info] distributed training has been initialized")
    net.to(device)                                   # move the model to the GPU first
    net = nn.parallel.DistributedDataParallel(net)   # then wrap it for distributed training
else:
    net = nn.DataParallel(net)                       # fall back to single-process DataParallel
    net.to(device)
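When the job is launched with one process per GPU (the usual DistributedDataParallel setup), each process is normally pinned to its own device instead; a sketch, reusing the names from the snippet above and assuming the launcher passes a local_rank argument:

torch.cuda.set_device(args.local_rank)   # each process drives exactly one GPU
net.to(args.local_rank)
net = nn.parallel.DistributedDataParallel(net, device_ids=[args.local_rank])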

Note that code using DistributedDataParallel for multi-GPU training cannot be debugged and run directly from VSCode or other editors/IDEs; it has to be launched from a shell, for example:

python -m torch.distributed.launch train.py
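To start one process per GPU, torch.distributed.launch also accepts an --nproc_per_node option; the script then has to read the --local_rank argument that the launcher passes to every process and set world_size to the total number of processes (the example script in section 2 is written to run as a single process):

python -m torch.distributed.launch --nproc_per_node=4 train.py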

2. Single-machine multi-GPU training example

# train.py

import torch
from torch import nn
from torch import optim
from torch.utils import data
import torch.distributed as dist

from torchvision import transforms as T
from torchvision import datasets

from torchvision import models

import os
import argparse
from utils import progress_bar   # console progress-bar helper from the local utils module

print("=> set useful GPU sources...")
os.environ["CUDA_VISIBLE_DEVICES"] = "1, 2, 3, 4, 5"
device = torch.device('cuda' if torch.cuda.is_available()  else 'cpu')

def parser():
    parser = argparse.ArgumentParser(description='PyTorch CIFAR10 Training')
    parser.add_argument('--data', default='./data/fruit/train', metavar='DIR',
                    help='path to dataset')

    parser.add_argument('--outf', default='./output',
                    help='folder to output model checkpoints')
    parser.add_argument('-j', '--workers', default=4, type=int, metavar='N',
                    help='number of data loading workers (default: 4)')
    parser.add_argument('--max_epochs', default=90, type=int, metavar='N',
                    help='number of total epochs to run')
    parser.add_argument('--start_epoch', default=0, type=int, metavar='N',
                        help='manual epoch number (useful on restarts)')
    parser.add_argument('-b', '--batch-size', default=512, type=int,
                        metavar='N', help='mini-batch size (default: 512)')
    parser.add_argument('--lr', '--learning-rate', default=0.1, type=float,
                        metavar='LR', help='initial learning rate')
    parser.add_argument('--momentum', default=0.9, type=float, metavar='M',
                        help='momentum')
    parser.add_argument('--weight-decay', '--wd', default=1e-4, type=float,
                        metavar='W', help='weight decay (default: 1e-4)')
    
    parser.add_argument('--resume', '-r', action='store_true',
                        help='resume from checkpoint')
    # dataparallel
    parser.add_argument('--world_size', default=1, type=int,
                    help='number of nodes for distributed training')
    parser.add_argument('--local_rank', default=0, type=int,
                        help='node rank for distributed training')
    parser.add_argument('--dist_url', default='tcp://localhost:23456', type=str,
                        help='url used to set up distributed training')
    parser.add_argument('--dist_backend', default='nccl', type=str,
                        help='distributed backend')
    parser.add_argument('--multiprocessing_distributed', action='store_true', default=True,
                    help='Use multi-processing distributed training to launch '
                         'N processes per node, which has N GPUs. This is the '
                         'fastest way to use PyTorch for either single node or '
                         'multi node data parallel training')
    args = parser.parse_args()
    return args

args = parser()

def distributed_is_initialized():
    """ Checks if a distributed cluster has been initialized """
    if dist.is_available():
        if dist.is_initialized():
            return True
    return False
if args.multiprocessing_distributed:
    # initialize the default process group
    dist.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
                            world_size=args.world_size, rank=args.local_rank)

print("==> prepare data..")
transform_x = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])  # ImageNet mean / std statistics
])
# train_dataset = datasets.CIFAR10(root="./data", train=True, transform=transform_x, download=True)
# custom dataset
train_dataset = datasets.ImageFolder(root=args.data, transform=transform_x)
# get number of labels
labels = len(train_dataset.classes)

# num_works=0
sampler = None
if distributed_is_initialized():
    sampler = data.DistributedSampler(train_dataset)
train_loader = data.DataLoader(train_dataset, batch_size=args.batch_size, 
                                shuffle=(sampler is None), sampler=sampler)

print("==> Building model ..")

net = models.__dict__["resnet18"](num_classes=labels)

# TODO: multi-gpu training
if distributed_is_initialized():
    print("[Info] distributed training has been initialized")
    net.to(device)
    net = nn.parallel.DistributedDataParallel(net)
else:
    net = nn.DataParallel(net)
    net.to(device)

print("==> loss function and optimizer...")
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=args.lr, momentum=args.momentum, weight_decay=args.weight_decay)

scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)

# Train
print("==> Performing training ...")
def train(args):
    net.train()
    for epoch in range(args.start_epoch, args.max_epochs):
        print("==> Epoch: %d" % (epoch))

        # let DistributedSampler reshuffle with a different seed each epoch
        if sampler is not None:
            sampler.set_epoch(epoch)

        train_loss = 0
        correct_num = 0
        total_num = 0
        for batch_idx, (inputs, targets) in enumerate(train_loader):
            inputs, targets = inputs.to(device), targets.to(device)

            optimizer.zero_grad()

            outputs = net(inputs)

            loss = criterion(outputs, targets)
            
            loss.backward()

            optimizer.step()
            # accumulate running loss and accuracy statistics
            train_loss += loss.item()
            _, predicted = outputs.max(1)
            total_num += targets.size(0)
            correct_num += predicted.eq(targets).sum().item()

            progress_bar(batch_idx, len(train_loader), 'Loss: %.3f | Acc: %.3f%% (%d/%d)'
                     % (train_loss/(batch_idx+1), 100.*correct_num/total_num, correct_num, total_num))
        scheduler.step()   # advance the cosine-annealing learning-rate schedule
        # note: the saved keys carry the "module." prefix added by the parallel wrapper (see section 4)
        os.makedirs(args.outf, exist_ok=True)
        torch.save(net.state_dict(), os.path.join(args.outf, "checkpoint.pth"))

if __name__ == "__main__":

    train(args)

With the code above, multiple GPUs on a single machine can be used for training, which shortens the training time; the GPU utilization during training is shown in the figure below. This example only trains a simple classification network, but the code can be adapted to other tasks in the same way.

[Figure: GPU utilization during multi-GPU training]

3. Multiple machines, multiple GPUs

Too hard, I gave up!!!

4. Training on multiple GPUs, testing on a single GPU

Multi-GPU training greatly shortens the experiment cycle, but at test time a single GPU is usually enough. However, when the saved weights are loaded, the keys in the weight file no longer match the model, so testing fails.

[Figure: key-mismatch error when loading multi-GPU weights into a single-GPU model]

The problem arises because multi-GPU training wraps the network; concretely, a module level is added on top of the original network structure, so every parameter name gains a "module." prefix, as the following sketch shows.
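The difference can be reproduced without any GPU; resnet18 is used below purely as an illustration:

from torch import nn
from torchvision import models

net = models.resnet18()
print(list(net.state_dict().keys())[0])                   # 'conv1.weight'
print(list(nn.DataParallel(net).state_dict().keys())[0])  # 'module.conv1.weight'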

  • Network structure on a single GPU

[Figure: parameter names of the network on a single GPU]

  • Network structure on multiple GPUs

[Figure: parameter names of the network on multiple GPUs, with the "module." prefix]

The fix is to rebuild the state dict, stripping the leading "module." from every key:

from collections import OrderedDict

def newCheckpoint(state_dict):
    '''
    state_dict: the weights saved from a multi-GPU (DataParallel / DDP) model
    '''
    new_state_dict = OrderedDict()
    for key in state_dict.keys():
        # drop the leading "module." added by the parallel wrapper
        new_key = key[len("module."):] if key.startswith("module.") else key
        new_state_dict[new_key] = state_dict[key]

    return new_state_dict
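A usage sketch for single-GPU testing, continuing from the helper above (the checkpoint path is the one written by the training script in section 2; net must be built without the DataParallel wrapper):

import torch

state_dict = torch.load("./output/checkpoint.pth", map_location="cpu")
net.load_state_dict(newCheckpoint(state_dict))
net.to(device)
net.eval()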