Fresh from our lab: an analysis and evaluation of distributed deep learning algorithms, covering PS vs. All-to-All, TCP/IP vs. RDMA, and Ethernet vs. InfiniBand. Comments, corrections, and shares are welcome.
ddl-benchmarks: Benchmarks for Distributed Deep Learning.
[Screenshots of selected paper content]
Introduction
This repository contains a set of benchmarking scripts for evaluating the training performance of popular distributed deep learning methods. It mainly focuses on system-level optimization algorithms for synchronized stochastic gradient descent (S-SGD) with data parallelism. Currently, it covers:
System architectures
- Parameter server with BytePS[1].
- All-to-all with Horovod[2].
Optimization algorithms
- Wait-free backpropagation (WFBP), i.e., pipelining backward computation with gradient communication; this is a default feature in current distributed deep learning frameworks (a minimal sketch of the pipelining idea follows this list).
- Tensor fusion, which is integrated in Horovod with a hand-crafted threshold that determines when to fuse tensors; the fusion decision can also be made dynamically, as in MG-WFBP[3].
- Tensor partitioning and priority scheduling, as proposed in ByteScheduler[4].
- Gradient compression with quantization (i.e., signSGD[5]) and sparsification (i.e., TopK-SGD[6]). These methods are included in the code but excluded from our paper, which focuses on system-level optimization methods (a Top-k sparsification sketch also follows this list).
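The pipelining idea behind WFBP can be illustrated with PyTorch gradient hooks. The sketch below is only an illustration of the technique, not this repository's implementation: it assumes torch.distributed has already been initialized (e.g., via init_process_group), and attach_wfbp_hooks is a made-up helper name.

import torch.distributed as dist

def attach_wfbp_hooks(model):
    # Hypothetical helper: launch an asynchronous all-reduce for each
    # layer's gradient as soon as it is produced, overlapping gradient
    # communication with the remaining backward computation.
    handles = []
    def make_hook():
        def hook(grad):
            # all_reduce works in place on grad; async_op=True returns
            # immediately with a handle that must be waited on later.
            handles.append(dist.all_reduce(grad, async_op=True))
            return grad
        return hook
    for p in model.parameters():
        if p.requires_grad:
            p.register_hook(make_hook())
    return handles

After loss.backward(), one would wait on all handles and divide each gradient by the world size before the optimizer step. Tensor fusion extends this idea by buffering small gradients into larger buckets before reduction, amortizing per-message latency.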
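Similarly, Top-k sparsification can be sketched in a few lines. This is a hedged illustration of the general technique (the function names and the 1% density are assumptions), not the interface used in this repository:

import torch

def topk_compress(grad, density=0.01):
    # Keep only the k largest-magnitude entries and communicate
    # (values, indices) instead of the dense gradient.
    flat = grad.flatten()
    k = max(1, int(flat.numel() * density))
    _, indices = torch.topk(flat.abs(), k)
    return flat[indices], indices, flat.numel(), grad.shape

def topk_decompress(values, indices, numel, shape):
    # Scatter the received values back into a dense zero tensor.
    flat = torch.zeros(numel, device=values.device, dtype=values.dtype)
    flat[indices] = values
    return flat.view(shape)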
Deep neural networks
- Convolutional neural networks (CNNs)[7] on a synthetic ImageNet dataset (i.e., randomly generated 224×224×3 input images; see the sketch after this list).
- Transformers[8]: BERT-Base and BERT-Large pretraining models.
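The synthetic-data setup can be reproduced in a few lines. The sketch below (batch size and model choice are illustrative assumptions) shows the idea: benchmark with randomly generated inputs so that measurements reflect computation and communication rather than data loading.

import torch
import torchvision.models as models

model = models.resnet50()
criterion = torch.nn.CrossEntropyLoss()
data = torch.randn(64, 3, 224, 224)     # random 224x224x3 "images"
target = torch.randint(0, 1000, (64,))  # random labels for 1000 classes
loss = criterion(model(data), target)
loss.backward()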
Installation
Prerequisites
- Python 3.6+
- CUDA-10.+
- NCCL-2.4.+
- PyTorch-1.4.+[9]
- OpenMPI-4.0.+[10]
- Horovod-0.19.+[11]
- BytePS-0.2.+[12]
- ByteScheduler[13]
- bit2byte[14]: optional; only required for running signSGD.
Get the code
$ git clone https://github.com/HKBU-HPML/ddl-benchmarks.git
$ cd ddl-benchmarks
$ pip install -r requirements.txt
Configure the cluster settings
Before running the scripts, please carefully edit the configuration files in the configs directory.
- configs/cluster*: configure the host files for MPI (an illustrative example follows this list)
- configs/envs.conf: configure the cluster environments
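The exact contents depend on your cluster. As an illustration only (the hostnames and slot counts below are made up), an OpenMPI host file lists one node per line with its slot count:

gpu-node1 slots=4
gpu-node2 slots=4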
Create a log folder, e.g.,
$ mkdir -p logs/pcie
Run benchmarks
- The batch mode
$ python benchmarks.py
- The individual mode, e.g.,
$ cd horovod
$ dnn=resnet50 bs=64 nworkers=64 ./horovod_mpi_cj.sh
Paper
If you are using this repository for your paper, please cite our work:
@article{shi2020ddlsurvey,
  author  = {Shi, Shaohuai and Tang, Zhenheng and Chu, Xiaowen and Liu, Chengjian and Wang, Wei and Li, Bo},
  title   = {Communication-Efficient Distributed Deep Learning: Survey, Evaluation, and Challenges},
  journal = {arXiv},
  year    = {2020}
}
References
[1] BytePS: https://github.com/bytedance/byteps
[2] Horovod: https://github.com/horovod/horovod
[3] MG-WFBP: https://github.com/HKBU-HPML/MG-WFBP
[4] ByteScheduler: https://github.com/bytedance/byteps/tree/bytescheduler/bytescheduler
[5] signSGD: https://github.com/jiaweizzhao/signSGD-with-Majority-Vote
[6] TopK-SGD: https://github.com/hclhkbu/gtopkssgd
[7] Convolutional neural networks (CNNs): https://pytorch.org/docs/stable/torchvision/models.html
[8] Transformers: https://github.com/huggingface/transformers
[9] PyTorch-1.4.+: https://download.pytorch.org/whl/torch_stable.html
[10] OpenMPI-4.0.+: https://www.open-mpi.org/software/ompi/v4.0/
[11] Horovod-0.19.+: https://github.com/horovod/horovod
[12] BytePS-0.2.+: https://github.com/bytedance/byteps
[13] ByteScheduler: https://github.com/bytedance/byteps/tree/bytescheduler/bytescheduler
[14] bit2byte: https://github.com/jiaweizzhao/signSGD-with-Majority-Vote/tree/master/main/bit2byte-extension