Files
Dissertation_ChingHsiangChu_OSU.pdf (6.75 MB)
Accelerator-enabled Communication Middleware for Large-scale Heterogeneous HPC Systems with Modern Interconnects
Author Info
Chu, Ching-Hsiang
ORCID® Identifier
http://orcid.org/0000-0002-6752-3135
Permalink:
http://rave.ohiolink.edu/etdc/view?acc_num=osu1595451131152
Abstract Details
Year and Degree
2020, Doctor of Philosophy, Ohio State University, Computer Science and Engineering.
Abstract
In the post-Moore's-law era, the traditional general-purpose CPU can no longer keep pace with the computing power demanded by modern compute-intensive and highly parallelizable applications. In this context, various accelerator architectures, such as the tensor processing unit (TPU), field-programmable gate array (FPGA), and graphics processing unit (GPU), are being designed to meet these high computational demands. Notably, the GPU has been widely adopted in high-performance computing (HPC) and cloud systems to significantly accelerate numerous scientific and emerging machine/deep learning (ML/DL) applications. To obtain more computing power, researchers and engineers are building large-scale GPU clusters, i.e., scaling out. Moreover, the recent advent of high-speed interconnect technologies such as NVIDIA NVLink and AMD Infinity Fabric enables the deployment of dense GPU systems, i.e., scaling up. As a result, six of the top 10 supercomputers, as of July 2020, are powered by thousands of NVIDIA GPUs with NVLink and InfiniBand networks.
Driven by these ever-larger GPU systems, GPU-aware Message Passing Interface (MPI) has become the standard programming model for developing GPU-enabled parallel applications. However, state-of-the-art GPU-aware MPI libraries are predominantly optimized by leveraging advanced networking technology such as Remote Direct Memory Access (RDMA), not by exploiting the GPUs' computational power. There is a dearth of research on designing GPU-enabled communication middleware that both handles end-to-end networking efficiently and harnesses the computational power provided by accelerators. In this thesis, we take the GPU as an example to demonstrate how to design accelerator-enabled communication middleware that harnesses hardware computational resources and cutting-edge interconnects for high-performance, scalable communication on modern and next-generation heterogeneous HPC systems.
Specifically, this thesis addresses three primary communication patterns: 1) scalable one-to-all broadcast operations that leverage low-level hardware multicast and GPUDirect RDMA features; 2) topology-aware, link-efficient, and cooperative GPU-driven schemes that significantly accelerate all-to-one and all-to-all reduction operations, i.e., Allreduce, for ML/DL applications; and 3) adaptive CPU-GPU hybrid packing/unpacking with dynamic kernel fusion and zero-copy schemes for non-contiguous data transfer. The proposed scalable broadcast schemes yield a 64% performance improvement for streaming workloads on 88 GPUs. The link-efficient Allreduce designs help ML/DL frameworks such as TensorFlow, PyTorch, and Horovod scale distributed training to 1,536 GPUs on the Summit system; moreover, they outperform the state-of-the-art NCCL library by up to 1.5X when training the ResNet-50 model on image data with PyTorch. The adaptive MPI derived-datatype processing eliminates expensive packing/unpacking and data-movement operations on dense-GPU systems; moreover, it performs up to three orders of magnitude faster than production libraries for 3D domain decomposition, a critical method powering scientific applications such as weather forecasting and molecular dynamics simulations. Finally, the proposed designs are made publicly available in the MVAPICH2-GDR library for the HPC community.
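To make "link-efficient Allreduce" concrete: a common link-efficient scheme (used, for example, by NCCL's ring algorithm) splits each rank's buffer into N chunks and circulates them around a ring twice — a reduce-scatter pass followed by an allgather pass — so per-link traffic stays roughly constant as the number of ranks grows. The sketch below is an illustrative pure-Python simulation of that pattern under stated assumptions, not the dissertation's actual GPU-driven implementation; the function name `ring_allreduce` and the sequential loop standing in for simultaneous sends are illustrative choices.

```python
def ring_allreduce(buffers):
    """Simulate a ring Allreduce (sum) over `buffers`, one list per rank.

    Each rank's buffer is split into N equal chunks. In the reduce-scatter
    phase, chunk c circulates the ring accumulating each rank's
    contribution; in the allgather phase, the fully reduced chunks
    circulate once more so every rank ends with the complete result.
    """
    n = len(buffers)                    # number of ranks in the ring
    length = len(buffers[0])
    assert length % n == 0, "buffer must split evenly into N chunks"
    chunk = length // n

    def seg(c):                         # slice covering chunk index c
        return slice(c * chunk, (c + 1) * chunk)

    # Reduce-scatter: at step s, rank r sends chunk (r - s) % n to its
    # right neighbour, which adds it into its own copy of that chunk.
    for s in range(n - 1):
        for r in range(n):
            c = (r - s) % n
            dst = (r + 1) % n
            sl = seg(c)
            buffers[dst][sl] = [a + b for a, b in
                                zip(buffers[dst][sl], buffers[r][sl])]
    # Rank r now holds the fully reduced chunk (r + 1) % n.

    # Allgather: circulate the reduced chunks around the ring once more;
    # receivers overwrite rather than accumulate.
    for s in range(n - 1):
        for r in range(n):
            c = (r + 1 - s) % n
            buffers[(r + 1) % n][seg(c)] = buffers[r][seg(c)]
    return buffers
```

Because every rank sends only 2(N-1)/N buffer-sizes of data in total regardless of N, this pattern is bandwidth-optimal on ring topologies; topology-aware designs like those in the thesis extend the idea to multi-link NVLink/InfiniBand hierarchies.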
Committee
Dhabaleswar K. Panda (Advisor)
Radu Teodorescu (Committee Member)
Feng Qin (Committee Member)
Hari Subramoni (Committee Member)
Pages
205 p.
Subject Headings
Computer Engineering; Computer Science
Keywords
Accelerator, Communication Middleware, GPU, HPC, Heterogeneous, Interconnect, MPI, Network, Non-contiguous Data, RDMA
Recommended Citations
APA Style (7th edition)
Chu, C.-H. (2020). Accelerator-enabled Communication Middleware for Large-scale Heterogeneous HPC Systems with Modern Interconnects [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1595451131152
MLA Style (8th edition)
Chu, Ching-Hsiang. Accelerator-enabled Communication Middleware for Large-scale Heterogeneous HPC Systems with Modern Interconnects. 2020. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1595451131152.
Chicago Manual of Style (17th edition)
Chu, Ching-Hsiang. "Accelerator-enabled Communication Middleware for Large-scale Heterogeneous HPC Systems with Modern Interconnects." Doctoral dissertation, Ohio State University, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=osu1595451131152
Document number:
osu1595451131152
Download Count:
479
Copyright Info
© 2020, all rights reserved.
This open access ETD is published by The Ohio State University and OhioLINK.