
Co-designing Communication Middleware and Deep Learning Frameworks for High-Performance DNN Training on HPC Systems


2020, Doctor of Philosophy, Ohio State University, Computer Science and Engineering.
Recent advances in Machine Learning (ML) and Deep Learning (DL) techniques have triggered key success stories in many application domains like Computer Vision, Speech Comprehension and Recognition, and Natural Language Processing. Large-scale Deep Neural Networks (DNNs) are primary drivers of these success stories. However, training complex DNN architectures that consist of millions of trainable parameters is compute-intensive. Training is done using a large number of examples (training data set) and can take from weeks to months to achieve state-of-the-art prediction capabilities (accuracy). To achieve higher accuracy, making the DNN deeper and larger has become a common strategy but it also leads to a significantly bigger memory footprint. Thus, DNN training is not only compute-intensive but also a memory-hungry process requiring gigabytes of memory. To accelerate the process of large-scale DNN training, this dissertation is focused on designing high-performance systems that can exploit thousands of CPUs and GPUs for faster training. The novel approach presented in this work is called the co-design of high-perfo-rmance communication middleware and DL frameworks. Co-design is necessary because of the complexity of the overall execution stack for modern DL frameworks. Broadly, this stack consists of many layers, which start from the application layer followed by the DL framework layer (e.g. TensorFlow). The next layer in the stack is the distributed training middleware layer (e.g. Horovod) that connects a DL framework to an underlying communication middleware (e.g. a Message Passing Interface (MPI) library). Finally, the communication middleware layer sits directly on top of the parallel hardware that consists of multiple CPU/GPU nodes connected with a high-performance network. \textit{The complexity of this stack coupled with inefficient existing approaches to utilize it has led to several problems}: First, there is a lack of credible and systematic performance studies of Machine Learning (ML) and Deep Learning (DL) frameworks. This is partly because frameworks are being designed by scientists who are experts in AI and ML but not in high-performance systems. Second, state-of-the-art ML/DL frameworks are built for productivity but not necessarily performance and scalability. Frameworks either support training models on a single CPU/GPU only, which is not enough for training in a reasonable time, or they only offer rudimentary support for distributed training, which is not sufficiently scalable on modern High Performance Computing (HPC) systems. Third, larger and deeper models being proposed do not scale well even though they offer better prediction capabilities (or accuracy). This is because these models have a much higher number of parameters (order of millions) that lead to large communication buffers and eventually to excessive synchronization overheads for distributed training. Fourth, the established communication middleware (e.g. MPI) that we wish to use for distributed training of DNNs has only been optimized for scientific applications. Thus, fundamental primitives in the communication middleware like Broadcast, Reduce, and Allreduce need to be redesigned to better support the emerging DL workloads. Fifth and final, because of the fundamental memory limit, emerging DNNs that are larger than the CPU's or the GPU's memory cannot be trained without explicit memory management schemes and novel model partitioning strategies. 
The proposed co-design approach alleviates these performance inefficiencies, leads to a white-box strategy for performance optimization, ensures scalability of DL frameworks to thousands of CPUs and GPUs, and acts as a path forward for investigating next-generation DNNs. The key idea is that one should not develop DL frameworks in isolation, nor without matching the requirements of the framework to the capabilities of the underlying software and hardware stack. Instead, a comprehensive co-design across multiple layers should be employed so that the entire stack is optimized in a connected and coherent fashion. To address the problems in existing approaches, we investigate how HPC techniques and optimizations can be brought to DL systems. Broadly, an iterative approach is followed: we start with performance evaluation, followed by the design and development of an actual system. Once a particular system is functional, we develop new techniques and/or apply HPC optimizations like faster collective communication and overlap of computation and communication to further improve its performance. The proposed systems that have been developed by following this approach are: 1) S-Caffe, 2) OC-DNN, and 3) HyPar-Flow. All these systems have been fully developed, scaled to hundreds of CPUs and GPUs on various HPC systems like ORNL Summit and TACC Frontera, and made publicly available through software releases and published papers.
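One of the HPC optimizations mentioned above, overlap of computation and communication, can be illustrated with a small, hypothetical mpi4py sketch (not code from S-Caffe, OC-DNN, or HyPar-Flow): per-layer gradient reductions are started with the non-blocking Iallreduce and only waited on at the end, so the network transfers proceed while the remaining computation runs.

```python
# Hypothetical sketch of computation/communication overlap using non-blocking
# MPI Allreduce (mpi4py). Run with e.g. `mpirun -np 4 python overlap.py`.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
world_size = comm.Get_size()

# Stand-ins for per-layer gradients produced during a backward pass.
layer_grads = [np.random.rand(1 << 20).astype(np.float32) for _ in range(4)]
averaged = [np.empty_like(g) for g in layer_grads]

requests = []
for grad, out in zip(layer_grads, averaged):
    # Start the reduction for this layer as soon as its gradient is available...
    requests.append(comm.Iallreduce(grad, out, op=MPI.SUM))
    # ...and keep computing (a stand-in for the backward pass of earlier layers)
    # while the reduction progresses in the background.
    _ = np.random.rand(256, 256) @ np.random.rand(256, 256)

# Complete all outstanding reductions, then average the summed gradients.
MPI.Request.Waitall(requests)
averaged = [out / world_size for out in averaged]
```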
Dhabaleswar Kumar Panda (Advisor)
Srinivasan Parthasarathy (Committee Member)
Radu Teodorescu (Committee Member)
Hari Subramoni (Committee Member)
240 p.

Recommended Citations

  • Awan, A. A. (2020). Co-designing Communication Middleware and Deep Learning Frameworks for High-Performance DNN Training on HPC Systems [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1587433770960088

    APA Style (7th edition)

  • Awan, Ammar Ahmad. Co-designing Communication Middleware and Deep Learning Frameworks for High-Performance DNN Training on HPC Systems. 2020. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1587433770960088.

    MLA Style (8th edition)

  • Awan, Ammar Ahmad. "Co-designing Communication Middleware and Deep Learning Frameworks for High-Performance DNN Training on HPC Systems." Doctoral dissertation, Ohio State University, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=osu1587433770960088

    Chicago Manual of Style (17th edition)