Fresh from our lab: an analysis and evaluation of distributed deep learning algorithms, covering PS vs. All-to-All, TCP/IP vs. RDMA, and Ethernet vs. InfiniBand. Comments, corrections, and shares are welcome.
ddl-benchmarks: Benchmarks for Distributed Deep Learning.
[Screenshots of selected paper content]
Introduction
This repository contains a set of benchmarking scripts for evaluating the training performance of popular distributed deep learning methods. It mainly focuses on system-level optimization algorithms for synchronized stochastic gradient descent (S-SGD) with data parallelism. Currently, it covers:
System architectures
- Parameter server with BytePS[1].
- All-to-all with Horovod[2].
Optimization algorithms
- Wait-free backpropagation (WFBP), i.e., pipelining backward computation with gradient communication; this is a default feature in current distributed deep learning frameworks (a minimal sketch of the pipelining idea follows this list).
- Tensor fusion, which is integrated in Horovod with a hand-crafted threshold that determines when to fuse tensors; the fusion decision can also be made dynamically, as in MG-WFBP[3].
- Tensor partitioning and priority scheduling, as proposed in ByteScheduler[4].
- Gradient compression with quantization (i.e., signSGD[5]) and sparsification (i.e., TopK-SGD[6]). These methods are included in the code but excluded from our paper, which focuses on system-level optimization methods (a Top-k sparsification sketch also follows this list).
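The pipelining idea behind WFBP can be illustrated with PyTorch gradient hooks. The sketch below is only an illustration of the technique, not this repository's implementation: it assumes torch.distributed has already been initialized (e.g., via init_process_group), and attach_wfbp_hooks is a made-up helper name.

import torch.distributed as dist

def attach_wfbp_hooks(model):
    # Hypothetical helper: launch an asynchronous all-reduce for each
    # layer's gradient as soon as it is produced, overlapping gradient
    # communication with the remaining backward computation.
    handles = []
    def make_hook():
        def hook(grad):
            # all_reduce works in place on grad; async_op=True returns
            # immediately with a handle that must be waited on later.
            handles.append(dist.all_reduce(grad, async_op=True))
            return grad
        return hook
    for p in model.parameters():
        if p.requires_grad:
            p.register_hook(make_hook())
    return handles

After loss.backward(), one would wait on all handles and divide each gradient by the world size before the optimizer step. Tensor fusion extends this idea by buffering small gradients into larger buckets before reduction, amortizing per-message latency.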
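Similarly, Top-k sparsification can be sketched in a few lines. This is a hedged illustration of the general technique (the function names and the 1% density are assumptions), not the interface used in this repository:

import torch

def topk_compress(grad, density=0.01):
    # Keep only the k largest-magnitude entries and communicate
    # (values, indices) instead of the dense gradient.
    flat = grad.flatten()
    k = max(1, int(flat.numel() * density))
    _, indices = torch.topk(flat.abs(), k)
    return flat[indices], indices, flat.numel(), grad.shape

def topk_decompress(values, indices, numel, shape):
    # Scatter the received values back into a dense zero tensor.
    flat = torch.zeros(numel, device=values.device, dtype=values.dtype)
    flat[indices] = values
    return flat.view(shape)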
Deep neural networks
- Convolutional neural networks (CNNs)[7] on a synthetic ImageNet dataset (i.e., randomly generated 224×224×3 input images; see the sketch after this list).
- Transformers[8]: BERT-Base and BERT-Large pretraining models.
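The synthetic-data setup can be reproduced in a few lines. The sketch below (batch size and model choice are illustrative assumptions) shows the idea: benchmark with randomly generated inputs so that measurements reflect computation and communication rather than data loading.

import torch
import torchvision.models as models

model = models.resnet50()
criterion = torch.nn.CrossEntropyLoss()
data = torch.randn(64, 3, 224, 224)     # random 224x224x3 "images"
target = torch.randint(0, 1000, (64,))  # random labels for 1000 classes
loss = criterion(model(data), target)
loss.backward()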
Installation
Prerequisites
- Python 3.6+
- CUDA-10.+
- NCCL-2.4.+
- PyTorch-1.4.+[9]
- OpenMPI-4.0.+[10]
- Horovod-0.19.+[11]
- BytePS-0.2.+[12]
- ByteScheduler[13]
- bit2byte[14]: optional; only required for running signSGD.
Get the code
$ git clone https://github.com/HKBU-HPML/ddl-benchmarks.git
$ cd ddl-benchmarks
$ pip install -r requirements.txt
Configure the cluster settings
Before running the scripts, please carefully edit the configuration files in the configs directory.
- configs/cluster*: configure the host files for MPI (an illustrative example follows this list)
- configs/envs.conf: configure the cluster environments
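The exact contents depend on your cluster. As an illustration only (the hostnames and slot counts below are made up), an OpenMPI host file lists one node per line with its slot count:

gpu-node1 slots=4
gpu-node2 slots=4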
Create a log folder, e.g.,
$ mkdir -p logs/pcie
Run benchmarks
- The batch mode
$ python benchmarks.py
- The individual mode, e.g.,
$ cd horovod
$ dnn=resnet50 bs=64 nworkers=64 ./horovod_mpi_cj.sh
Paper
If you are using this repository for your paper, please cite our work:
@article{shi2020ddlsurvey,
  author  = {Shi, Shaohuai and Tang, Zhenheng and Chu, Xiaowen and Liu, Chengjian and Wang, Wei and Li, Bo},
  title   = {Communication-Efficient Distributed Deep Learning: Survey, Evaluation, and Challenges},
  journal = {arXiv},
  year    = {2020}
}
References
[1] BytePS: https://github.com/bytedance/byteps
[2] Horovod: https://github.com/horovod/horovod
[3] MG-WFBP: https://github.com/HKBU-HPML/MG-WFBP
[4] ByteScheduler: https://github.com/bytedance/byteps/tree/bytescheduler/bytescheduler
[5] signSGD: https://github.com/jiaweizzhao/signSGD-with-Majority-Vote
[6] TopK-SGD: https://github.com/hclhkbu/gtopkssgd
[7] Convolutional neural networks (CNNs): https://pytorch.org/docs/stable/torchvision/models.html
[8] Transformers: https://github.com/huggingface/transformers
[9] PyTorch-1.4.+: https://download.pytorch.org/whl/torch_stable.html
[10] OpenMPI-4.0.+: https://www.open-mpi.org/software/ompi/v4.0/
[11] Horovod-0.19.+: https://github.com/horovod/horovod
[12] BytePS-0.2.+: https://github.com/bytedance/byteps
[13] ByteScheduler: https://github.com/bytedance/byteps/tree/bytescheduler/bytescheduler
[14] bit2byte: https://github.com/jiaweizzhao/signSGD-with-Majority-Vote/tree/master/main/bit2byte-extension