Our lab's latest analysis and evaluation of distributed deep learning algorithms is now available, covering PS vs. All-to-All, TCP/IP vs. RDMA, and Ethernet vs. InfiniBand. Comments, corrections, and sharing are welcome.

 

ddl-benchmarks: Benchmarks for Distributed Deep Learning.

[Figure: screenshots of selected content from the paper]

Introduction

This repository contains a set of benchmarking scripts for evaluating the training performance of popular distributed deep learning methods. It mainly focuses on system-level optimization algorithms for synchronous stochastic gradient descent (S-SGD) with data parallelism. Currently, it covers:

system architectures

  • Parameter server with BytePS[1].
  • All-to-all with Horovod[2].
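
For reference, a minimal sketch of a Horovod data-parallel training step in PyTorch is shown below. It only illustrates the all-to-all (allreduce) path and is not the repository's benchmark script; the model, batch size, and learning-rate scaling are placeholders.

```python
# Minimal Horovod data-parallel step (illustrative sketch, not the benchmark script).
import torch
import torch.nn.functional as F
import torchvision.models as models
import horovod.torch as hvd

hvd.init()                                   # one process per GPU, launched via horovodrun/mpirun
torch.cuda.set_device(hvd.local_rank())

model = models.resnet50().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged with allreduce during backward,
# and make sure all workers start from the same weights.
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

x = torch.randn(64, 3, 224, 224).cuda()      # synthetic batch
y = torch.randint(0, 1000, (64,)).cuda()

optimizer.zero_grad()
loss = F.cross_entropy(model(x), y)
loss.backward()                              # gradient allreduce overlaps with backward
optimizer.step()
```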

optimization algorithms

  • Wait-free backpropagation (WFBP), i.e., pipelining the backward computation with gradient communication; it is a default feature in current distributed deep learning frameworks (a short sketch of the idea follows this list).
  • Tensor fusion, which is integrated in Horovod with a hand-crafted threshold that decides when tensors are fused; MG-WFBP[3] instead determines the fusion dynamically.
  • Tensor partitioning and priority scheduling, as proposed in ByteScheduler[4].
  • Gradient compression with quantization (e.g., signSGD[5]) and sparsification (e.g., TopK-SGD[6]). These methods are included in the code but excluded from our paper, which focuses on system-level optimization methods (a generic compression sketch also follows this list).
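
To make the WFBP item above concrete, the sketch below launches an asynchronous all-reduce from a per-parameter hook as soon as that parameter's gradient is ready, so communication for later layers overlaps with backward computation of earlier layers. It is only an illustration with raw torch.distributed, not the repository's implementation, and it relies on register_post_accumulate_grad_hook, which requires a newer PyTorch than the 1.4 listed in the prerequisites.

```python
# Simplified wait-free backpropagation (WFBP): per-tensor async all-reduce
# started inside backward, synchronized only once before the update.
import torch
import torch.distributed as dist

_handles = []

def _wfbp_hook(param):
    # Fires right after param.grad is accumulated during backward; the
    # all-reduce for this tensor runs while earlier layers are still
    # being back-propagated.
    _handles.append(dist.all_reduce(param.grad, async_op=True))

def register_wfbp(model):
    # Register once per model (requires PyTorch >= 2.1 for this hook).
    for p in model.parameters():
        if p.requires_grad:
            p.register_post_accumulate_grad_hook(_wfbp_hook)

def wfbp_step(model, optimizer, loss):
    _handles.clear()
    loss.backward()                       # per-tensor all-reduces launch during backward
    for h in _handles:                    # wait once, just before the parameter update
        h.wait()
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            p.grad.div_(world_size)       # all_reduce sums, so divide to average
    optimizer.step()
    optimizer.zero_grad()
```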
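
The gradient-compression item can likewise be illustrated with a bare-bones Top-K sparsifier: keep only the k largest-magnitude entries of each gradient tensor together with their indices, and rebuild a dense tensor on the receiving side. This is a generic sketch (error feedback and the actual communication are omitted), not the TopK-SGD code referenced above.

```python
# Generic Top-K gradient sparsification sketch.
import torch

def topk_compress(grad: torch.Tensor, ratio: float = 0.01):
    # Select the k entries with the largest magnitude; the dropped values are
    # normally kept in a local residual (error feedback), omitted here.
    flat = grad.reshape(-1)
    k = max(1, int(flat.numel() * ratio))
    _, indices = torch.topk(flat.abs(), k)
    return flat[indices], indices, grad.shape

def topk_decompress(values, indices, shape):
    # Scatter the transmitted values back into a dense zero tensor.
    flat = torch.zeros(torch.Size(shape).numel(), dtype=values.dtype, device=values.device)
    flat[indices] = values
    return flat.reshape(shape)
```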

deep neural networks

  • Convolutional neural networks (CNNs)[7] on a synthetic ImageNet data set (i.e., randomly generated 224x224x3 input images; see the sketch after this list)
  • Transformers[8]: BERT-Base and BERT-Large pretraining models.
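
The synthetic ImageNet input mentioned above can be generated on the fly so that the benchmark measures computation and communication without touching disk; a sketch of the idea (not necessarily the repository's exact data pipeline) is:

```python
# Synthetic ImageNet-like batch: random 224x224 RGB images and random labels.
import torch
import torch.nn.functional as F
import torchvision.models as models

batch_size = 64
images = torch.randn(batch_size, 3, 224, 224)       # NCHW layout in PyTorch
labels = torch.randint(0, 1000, (batch_size,))      # 1000 ImageNet classes

model = models.resnet50()
loss = F.cross_entropy(model(images), labels)
loss.backward()
```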

Installation

Prerequisites

  • Python 3.6+
  • CUDA-10.+
  • NCCL-2.4.+
  • PyTorch-1.4.+[9]
  • OpenMPI-4.0.+[10]
  • Horovod-0.19.+[11]
  • BytePS-0.2.+[12]
  • ByteScheduler[13]
  • bit2byte[14]: optional; only required for running signSGD.

Get the code

$git clone https://github.com/HKBU-HPML/ddl-benchmarks.git
$cd ddl-benchmarks
$pip install -r requirements.txt

Configure the cluster settings

Before running the scripts, please carefully edit the configuration files in the configs directory.

  • configs/cluster*: host files for MPI (an illustrative example follows this list)
  • configs/envs.conf: cluster environment settings
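
For illustration, Open MPI host files list one host per line with a slot (process) count, along the lines of the hypothetical example below; the exact file names, host names, and variables expected by configs/cluster* and configs/envs.conf should be taken from the sample files shipped in the repository.

```
gpu-node1 slots=4
gpu-node2 slots=4
```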

Create a log folder, e.g.,

$mkdir -p logs/pcie

Run benchmarks

  • The batch mode:

$python benchmarks.py

  • The individual mode, e.g.,

$cd horovod
$dnn=resnet50 bs=64 nworkers=64 ./horovod_mpi_cj.sh

Paper

If you use this repository for your paper, please cite our work:

@article{shi2020ddlsurvey,
  author  = {Shi, Shaohuai and Tang, Zhenheng and Chu, Xiaowen and Liu, Chengjian and Wang, Wei and Li, Bo},
  title   = {Communication-Efficient Distributed Deep Learning: Survey, Evaluation, and Challenges},
  journal = {arXiv},
  year    = {2020}
}

References

[1] BytePS: https://github.com/bytedance/byteps
[2] Horovod: https://github.com/horovod/horovod
[3] MG-WFBP: https://github.com/HKBU-HPML/MG-WFBP
[4] ByteScheduler: https://github.com/bytedance/byteps/tree/bytescheduler/bytescheduler
[5] signSGD: https://github.com/jiaweizzhao/signSGD-with-Majority-Vote
[6] TopK-SGD: https://github.com/hclhkbu/gtopkssgd
[7] Convolutional neural networks (CNNs): https://pytorch.org/docs/stable/torchvision/models.html
[8] Transformers: https://github.com/huggingface/transformers
[9] PyTorch-1.4.+: https://download.pytorch.org/whl/torch_stable.html
[10] OpenMPI-4.0.+: https://www.open-mpi.org/software/ompi/v4.0/
[11] Horovod-0.19.+: https://github.com/horovod/horovod
[12] BytePS-0.2.+: https://github.com/bytedance/byteps
[13] ByteScheduler: https://github.com/bytedance/byteps/tree/bytescheduler/bytescheduler
[14] bit2byte: https://github.com/jiaweizzhao/signSGD-with-Majority-Vote/tree/master/main/bit2byte-extension