Designing High Performance Shared-Address-Space and Adaptive Communication Middlewares for Next-Generation HPC Systems

Hashmi, Jahanzeb Maqbool

2020, Doctor of Philosophy, Ohio State University, Computer Science and Engineering.
Modern High-Performance Computing (HPC) systems are enabling scientists from research domains such as astrophysics, climate simulation, computational fluid dynamics, drug discovery, and others to model and simulate computation-heavy problems at different scales. In recent years, the resurgence of Artificial Intelligence (AI), particularly Deep Learning (DL) algorithms, has been made possible by the evolution of these HPC systems. The diversity of applications, ranging from traditional scientific computing to the training and inference of neural networks, is driving the evolution of processor and interconnect technologies as well as communication middlewares. Today's multi-petaflop HPC systems are powered by dense multi-/many-core architectures, and this trend is expected to continue in next-generation systems. The rapid adoption of high core-density architectures by current- and next-generation HPC systems, driven by emerging application trends, places greater emphasis on middleware designers to optimize various communication primitives to meet the diverse needs of applications. While these novel processor architectures have increased on-chip parallelism, they come at the cost of higher intra-node communication overheads for the traditional designs employed by communication middlewares. Tackling the computation and communication challenges that accompany these dense multi-/many-core systems requires special design considerations.

Scientific and AI applications that rely on such large-scale HPC systems for performance and scalability often use the Message Passing Interface (MPI), Partitioned Global Address Space (PGAS) models, or a hybrid of both as the underlying communication substrate. These applications use various communication primitives (e.g., point-to-point, collectives, RMA), often with custom data layouts (e.g., derived datatypes), and spend a significant fraction of their execution time in communication and synchronization. The performance of the primitives offered by communication middlewares running on multi-/many-core systems dictates the overall time to completion of these applications. Optimizing the communication performance of MPI and PGAS primitives for current and next-generation architectures is therefore of vital importance for scientific and AI applications.

The intra-node communication protocols in MPI have thus far relied on POSIX shared-memory and kernel-assisted memory-mapping mechanisms. However, these approaches often pose bottlenecks on high core-density systems. This is further exacerbated by the multi-process model that production MPI (and the majority of PGAS) libraries follow (e.g., ranks in MPI are OS-level processes instead of threads). On the other hand, while high-throughput networks such as InfiniBand and Omni-Path are capable of driving multiple concurrent communications, communication-heavy MPI primitives such as collectives have not fully exploited this capability to achieve higher message rates. To address the performance challenges posed by a diverse range of applications and the lack of support in state-of-the-art communication libraries for exploiting high-concurrency architectures, we propose a "shared-address-space"-based communication framework that allows the multi-process model of MPI to have thread-like load/store access across different address spaces.
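As a point of reference for the kernel-assisted path mentioned above, the following minimal sketch (not drawn from the dissertation) shows an intra-node copy using Linux Cross Memory Attach via the process_vm_readv system call. The peer PID, the remote buffer address, and the helper name cma_read are assumptions for illustration; in practice these would be exchanged out of band, e.g., through a POSIX shared-memory segment.

    /* Minimal sketch of a kernel-assisted intra-node copy using Linux
     * Cross Memory Attach (CMA), one of the mechanisms the abstract
     * refers to. Hypothetical helper: pull 'len' bytes from
     * 'remote_addr' in process 'peer' into 'local_buf'. */
    #define _GNU_SOURCE
    #include <sys/types.h>
    #include <sys/uio.h>

    static ssize_t cma_read(pid_t peer, void *local_buf,
                            void *remote_addr, size_t len)
    {
        struct iovec local  = { .iov_base = local_buf,   .iov_len = len };
        struct iovec remote = { .iov_base = remote_addr, .iov_len = len };
        /* The kernel copies directly between the two address spaces,
         * avoiding an intermediate shared buffer, but each transfer
         * pays a system-call overhead. */
        return process_vm_readv(peer, &local, 1, &remote, 1, 0);
    }

Under a shared-address-space mechanism of the kind proposed here (XPMEM is one such publicly available mechanism, named here only as an example), the same transfer reduces to a plain load/store copy over an attached mapping, with no per-operation system call.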
Atop this framework, we re-design MPI point-to-point protocols (e.g., user-space zero-copy rendezvous communication), collective communication primitives (e.g., load/store-based collectives, truly zero-copy and partitioning-based reduction algorithms), and MPI derived-datatype processing (e.g., memoization-based "packing-free" communication) to exploit the high concurrency made available by modern multi-/many-core architectures and high-throughput interconnects. Furthermore, to address the lack of association between the dynamic communication patterns of emerging application workloads and dense many-core architectures, we propose a set of low-level benchmarking-based approaches and MPI-level designs to infer vendor-specific machine characteristics (e.g., physical-to-virtual machine topologies) and the dynamic communication patterns of applications. Utilizing this information, we propose two novel algorithms to construct efficient MPI mappings for any given architecture and application communication pattern. The proposed designs are implemented in the MVAPICH2 MPI library and are evaluated on three different architectures using various micro-benchmarks and application kernels.
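To make the derived-datatype discussion concrete, here is a minimal sketch (again, not taken from the dissertation) of the kind of non-contiguous transfer such designs target: sending one column of a row-major matrix described with MPI_Type_vector. The matrix size, tag, and peer ranks are illustrative.

    /* Sending one column of a row-major N x N matrix. A conventional
     * MPI implementation packs the strided elements into a contiguous
     * staging buffer before transmission; a "packing-free" design moves
     * them directly between sender and receiver address spaces and can
     * memoize the processed layout for reuse. */
    #include <mpi.h>

    #define N 1024

    int main(int argc, char **argv)
    {
        static double matrix[N][N];
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* One column: N blocks of 1 double, each N doubles apart. */
        MPI_Datatype column;
        MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);

        if (rank == 0)
            MPI_Send(&matrix[0][0], 1, column, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(&matrix[0][0], 1, column, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        MPI_Type_free(&column);
        MPI_Finalize();
        return 0;
    }

Run with two ranks (e.g., mpirun -np 2). For large N, the pack/unpack copies on both sides dominate the transfer cost, which is the overhead that memoization-based packing-free communication is designed to eliminate.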
Dhabaleswar K. (DK) Panda (Advisor)
Radu Teodorescu (Committee Member)
Feng Qin (Committee Member)
Hari Subramoni (Committee Member)
186 p.

Recommended Citations

  • Hashmi, J. M. (2020). Designing High Performance Shared-Address-Space and Adaptive Communication Middlewares for Next-Generation HPC Systems [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1588038721555713

    APA Style (7th edition)

  • Hashmi, Jahanzeb Maqbool. Designing High Performance Shared-Address-Space and Adaptive Communication Middlewares for Next-Generation HPC Systems. 2020. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1588038721555713.

    MLA Style (8th edition)

  • Hashmi, Jahanzeb Maqbool. "Designing High Performance Shared-Address-Space and Adaptive Communication Middlewares for Next-Generation HPC Systems." Doctoral dissertation, Ohio State University, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=osu1588038721555713

    Chicago Manual of Style (17th edition)