Modern High-Performance Computing (HPC) systems are enabling scientists from research
domains such as astrophysics, climate simulation, computational fluid dynamics, and
drug discovery to model and simulate computation-heavy problems at different scales.
In recent years, the resurgence of Artificial Intelligence (AI), particularly
Deep Learning (DL) algorithms, has been made possible by the evolution of these
HPC systems. The diversity of applications, ranging from traditional scientific computing
to the training and inference of neural networks, is driving the evolution of processor and
interconnect technologies as well as communication middlewares.
Today’s multi-petaflop HPC systems are powered by dense multi-/many-core architectures,
and this trend is expected to continue in next-generation systems. The rapid adoption of these
high core-density architectures by current- and next-generation HPC systems, driven
by emerging application trends, places greater demands on middleware designers
to optimize various communication primitives to meet the diverse needs of applications.
While these novelties in processor architectures have led to increased on-chip
parallelism, they render the traditional designs employed by communication middlewares
prone to higher intra-node communication costs. Tackling the computation and communication
challenges that accompany these dense multi-/many-core architectures requires special
design considerations.
Scientific and AI applications that rely on such large-scale HPC systems to achieve higher
performance and scalability often use the Message Passing Interface (MPI), Partitioned Global
Address Space (PGAS) models, or a hybrid of both as the underlying communication substrate.
These applications use various communication primitives (e.g., point-to-point, collectives, RMA)
and often use custom data layouts (e.g., derived datatypes), spending a significant portion of
their execution time in communication and synchronization. The performance of these primitives,
as offered by communication middlewares running on multi-/many-core systems, dictates the overall
time to completion of the applications. Thus, optimizing MPI and PGAS communication primitives
for current and next-generation architectures is of vital importance to the performance of
scientific and AI applications.
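To make these primitives concrete, the following minimal MPI sketch combines a derived datatype with point-to-point and collective communication. The matrix dimensions and the specific calls are illustrative assumptions, not taken from any particular application discussed above.

```c
/* Illustrative only: a non-contiguous column of a row-major matrix is
 * described with a derived datatype, sent point-to-point, and a row is
 * reduced with a collective.  Assumes at least two MPI ranks. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int rows = 4, cols = 8;
    double *matrix = malloc(rows * cols * sizeof(double));
    for (int i = 0; i < rows * cols; i++) matrix[i] = (double)rank;

    /* Derived datatype: one column (stride = cols) of the row-major matrix. */
    MPI_Datatype col_type;
    MPI_Type_vector(rows, 1, cols, MPI_DOUBLE, &col_type);
    MPI_Type_commit(&col_type);

    /* Point-to-point: send column 0 from rank 0 to rank 1 without manual packing. */
    double column[4];
    if (rank == 0)
        MPI_Send(matrix, 1, col_type, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(column, rows, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Collective: element-wise sum of the first row across all ranks. */
    double row_sum[8];
    MPI_Allreduce(matrix, row_sum, cols, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    MPI_Type_free(&col_type);
    free(matrix);
    MPI_Finalize();
    return 0;
}
```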
The communication protocols in MPI for intra-node communication have thus far relied
on POSIX shared-memory and kernel-assisted memory-mapping mechanisms. However,
these approaches often pose bottlenecks on high core-density systems. This is further
exacerbated by the multi-process model that production MPI (and the majority of PGAS)
libraries follow (e.g., ranks in MPI are OS-level processes instead of threads). On the other
hand, while high-throughput networks such as InfiniBand and Omni-Path are capable of
driving multiple concurrent communication operations, communication-heavy MPI primitives
such as collectives have not fully exploited this capability to achieve higher message rates.
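As an illustration of the traditional POSIX shared-memory path and the copy-in/copy-out overhead it implies, the sketch below moves a message between two processes through a shared segment. The segment name, chunk size, and helper functions are hypothetical, and error checking is omitted; real MPI libraries manage such segments internally.

```c
/* Illustrative only: classic copy-in/copy-out over a POSIX shared-memory
 * segment.  Note the two copies: sender -> shared buffer, then shared
 * buffer -> receiver. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define SHM_NAME "/illustrative_mpi_chunk"   /* hypothetical segment name */
#define CHUNK    (64 * 1024)

/* Sender side: copy #1, user buffer into the shared segment. */
void copy_in(const char *src, size_t len)
{
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
    ftruncate(fd, CHUNK);
    char *shm = mmap(NULL, CHUNK, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    memcpy(shm, src, len);            /* first copy */
    munmap(shm, CHUNK);
    close(fd);
}

/* Receiver side: copy #2, shared segment into the destination buffer. */
void copy_out(char *dst, size_t len)
{
    int fd = shm_open(SHM_NAME, O_RDWR, 0600);
    char *shm = mmap(NULL, CHUNK, PROT_READ, MAP_SHARED, fd, 0);
    memcpy(dst, shm, len);            /* second copy */
    munmap(shm, CHUNK);
    close(fd);
    shm_unlink(SHM_NAME);
}
```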
To address the performance challenges posed by a diverse range of applications and the
lack of support in state-of-the-art communication libraries for exploiting high-concurrency
architectures, we propose a “shared-address-spaces”-based communication framework that
allows the multi-process model of MPI to have thread-like load/store access between different
address spaces. Atop this framework, we re-design MPI point-to-point protocols (e.g.,
user-space zero-copy rendezvous communication), collective communication primitives (e.g.,
load/store-based collectives, truly zero-copy and partitioning-based reduction algorithms),
and efficient MPI derived-datatype processing (e.g., memoization-based “packing-free”
communication) to exploit the high concurrency made available by modern multi-/many-core
architectures and high-throughput interconnects. Furthermore, to address the lack of
association between the dynamic communication patterns of emerging application workloads
and dense many-core architectures, we propose a set of low-level benchmarking-based
approaches and MPI-level designs to infer vendor-specific machine characteristics
(e.g., physical-to-virtual topology mappings) and the dynamic communication patterns of
applications. Utilizing this information, we propose two novel algorithms to construct
efficient MPI mappings for any given architecture and application communication pattern.
The proposed designs are implemented in the MVAPICH2 MPI library and are evaluated
on three different architectures using various micro-benchmarks and application kernels.
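For concreteness, the thread-like load/store access across process boundaries that the proposed framework relies on can be provided by kernel modules such as XPMEM. The sketch below follows the public xpmem.h interface and is purely illustrative of that mechanism, not of the actual implementation; the exchange of the segment identifier between the two processes is omitted.

```c
/* Minimal sketch of cross-address-space load/store access via the XPMEM
 * kernel module (an assumption for illustration).  How the segment id
 * reaches the peer process is not shown. */
#include <xpmem.h>
#include <string.h>

/* Owner process: expose a region of its virtual address space. */
xpmem_segid_t expose_region(void *buf, size_t len)
{
    return xpmem_make(buf, len, XPMEM_PERMIT_MODE, (void *)0666);
}

/* Peer process: attach the exposed region and access it with plain loads. */
void read_peer_region(xpmem_segid_t segid, size_t len, void *dst)
{
    xpmem_apid_t apid = xpmem_get(segid, XPMEM_RDWR, XPMEM_PERMIT_MODE,
                                  (void *)0666);
    struct xpmem_addr addr = { .apid = apid, .offset = 0 };
    void *peer = xpmem_attach(addr, len, NULL);   /* maps peer memory locally */

    memcpy(dst, peer, len);                       /* direct load/store copy */

    xpmem_detach(peer);
    xpmem_release(apid);
}
```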