Designing High Performance Hardware-assisted Communication Middlewares for Next-Generation HPC Systems

Bayatpour, Mohammadreza

Abstract Details

2021, Doctor of Philosophy, Ohio State University, Computer Science and Engineering.
In the past decade, there has been a sharp spike in the computational complexity of applications in the artificial intelligence, scientific, and commercial domains. High-Performance Computing (HPC) systems are able to address many of the challenges that come with these applications. The increasing demand of this compute-heavy software for more data-processing power is creating fundamentally new challenges in terms of performance and scalability. A dramatic growth in the size and scale of HPC systems can be observed in the TOP500 list in recent years, and emerging applications that use these systems transfer unprecedented volumes of data between their various components. Based on these trends, reaching Exascale performance for end applications makes it desirable to bring compute capabilities to the data, instead of moving the data to the compute elements, and to support in-network computing. At the same time, HPC systems are powered by dense multi-/many-core architectures, and this complexity grows for next-generation systems. Such high core-density architectures in current- and next-generation HPC systems, together with the emergence of high-performance in-network computing in state-of-the-art RDMA-enabled interconnects such as InfiniBand, require middleware designers to optimize various communication primitives to deliver peak system performance to the end application.

The runtime of scientific and AI applications on RDMA-enabled multi-/many-core systems depends heavily on the performance of the primitives offered by the communication middlewares. The Message Passing Interface (MPI) is a very popular parallel programming model used by a wide range of applications, and it offers several classes of communication primitives such as point-to-point, collective, and synchronization operations. These applications rely heavily on the performance and portability offered by the MPI library, so it is vital to optimize the communication performance of MPI primitives to achieve better application performance.

Many applications rely on the direct use of point-to-point primitives. Two major aspects affect the performance of these primitives: the "tag-matching" operation, which lies on the critical path and places the incoming data correctly into the application buffer, and the amount of communication latency that can be overlapped with application computation. Decreasing the cost of tag-matching and increasing communication/computation overlap is therefore of vital importance. In this dissertation, we take up this challenge and propose adaptive and hardware-assisted designs that dynamically adapt to the communication load at each individual process at runtime.

Due to its enormous usage in scientific applications and deep learning frameworks, MPI_Allreduce is the most widely used collective operation. Existing designs for MPI_Allreduce do not take advantage of the vast parallelism available in modern multi-/many-core processors, the increases in communication throughput, or recent advances in the in-network computing features of modern interconnects like InfiniBand. In this dissertation, we take up this challenge and propose scalable, adaptive, and hardware-assisted designs for MPI_Allreduce that guarantee its peak performance across a wide range of message sizes, from small to very large.
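As an illustration of the point-to-point and collective primitives discussed above, the following minimal C/MPI sketch posts tagged nonblocking send and receive operations, overlaps them with independent computation, and finishes with an MPI_Allreduce. It is a hypothetical example, not code from the dissertation; the buffer size, ring-exchange pattern, and dummy compute loop are placeholders.

    /* Illustrative sketch only: tagged nonblocking point-to-point transfers
     * overlapped with computation, followed by an MPI_Allreduce. */
    #include <mpi.h>
    #include <stdio.h>

    #define N 1024

    int main(int argc, char **argv) {
        int rank, size;
        double sendbuf[N], recvbuf[N], local = 0.0, global = 0.0;
        MPI_Request reqs[2];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        for (int i = 0; i < N; i++) sendbuf[i] = rank + i;

        int peer = (rank + 1) % size;  /* simple ring exchange (placeholder pattern) */
        int tag  = 42;                 /* the tag drives tag-matching on the receiver */

        /* Post nonblocking receive and send; the library's tag-matching logic
         * places the incoming message into recvbuf when it arrives. */
        MPI_Irecv(recvbuf, N, MPI_DOUBLE, MPI_ANY_SOURCE, tag, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sendbuf, N, MPI_DOUBLE, peer, tag, MPI_COMM_WORLD, &reqs[1]);

        /* Independent computation overlapped with the in-flight communication. */
        for (int i = 0; i < N; i++) local += sendbuf[i] * 0.5;

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

        /* Collective reduction across all processes (sum of local values). */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0) printf("global sum = %f\n", global);
        MPI_Finalize();
        return 0;
    }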
Network offload mechanisms are gaining traction, as they have the potential to completely offload the communication of MPI primitives into the network and thereby maximize the overlap of communication and computation. However, network offloading of MPI primitives is still a nascent area and cannot be used as a universal solution. Modern smart NICs such as BlueField bring additional compute resources into the network, and high-performance middleware such as MPI must take advantage of these resources to address the limitations of other in-network technologies. In this dissertation, we take up this challenge and propose high-performance, scalable smart NIC-based designs that efficiently offload dense collective communication from the host CPU to the compute resources available on the BlueField smart NIC. The proposed designs have been integrated into the MVAPICH2 MPI library, which is publicly available to the scientific community.
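A minimal sketch of the overlap pattern that offloaded collectives enable is shown below, using the standard nonblocking MPI_Iallreduce: once the collective is started, an offload-capable library can progress the reduction in the network (for example, on a smart NIC) while the host CPU keeps computing. The buffer size and the dummy work loop are placeholders, not the dissertation's actual design or benchmark.

    /* Illustrative sketch only: nonblocking allreduce overlapped with host work. */
    #include <mpi.h>
    #include <stdio.h>

    #define COUNT 4096

    int main(int argc, char **argv) {
        double in[COUNT], out[COUNT], acc = 0.0;
        MPI_Request req;
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int i = 0; i < COUNT; i++) in[i] = (double)(rank + i);

        /* Start the collective; an offload-capable library can hand the
         * reduction to in-network hardware at this point. */
        MPI_Iallreduce(in, out, COUNT, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);

        /* Host-side computation proceeds while the reduction is in flight. */
        for (int i = 0; i < COUNT; i++) acc += in[i] * in[i];

        /* Completion point: the result in 'out' is valid only after the wait. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        if (rank == 0) printf("out[0] = %f, local acc = %f\n", out[0], acc);
        MPI_Finalize();
        return 0;
    }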
Dhabaleswar K. Panda (Advisor)
Radu Teodorescu (Committee Member)
Feng Qin (Committee Member)
Hari Subramoni (Committee Member)
187 p.

Recommended Citations


  • Bayatpour, M. (2021). Designing High Performance Hardware-assisted Communication Middlewares for Next-Generation HPC Systems [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1619140614941402

    APA Style (7th edition)

  • Bayatpour, Mohammadreza. Designing High Performance Hardware-assisted Communication Middlewares for Next-Generation HPC Systems. 2021. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1619140614941402.

    MLA Style (8th edition)

  • Bayatpour, Mohammadreza. "Designing High Performance Hardware-assisted Communication Middlewares for Next-Generation HPC Systems." Doctoral dissertation, Ohio State University, 2021. http://rave.ohiolink.edu/etdc/view?acc_num=osu1619140614941402

    Chicago Manual of Style (17th edition)