High Performance and Scalable Cooperative Communication Middleware for Next Generation Architectures

2019, Doctor of Philosophy, Ohio State University, Computer Science and Engineering.
Modern high-performance computing (HPC) systems are enabling scientists to tackle grand challenge problems in diverse domains, including cosmology and astrophysics, earthquake and weather analysis, molecular dynamics and physics modeling, biological computations, and computational fluid dynamics, among others. Along with the increasing demand for computing power, these applications are creating fundamental new challenges in terms of communication complexity, scalability, and reliability. At the same time, remote and virtualized clouds are rapidly gaining popularity over on-premises clusters due to lower initial cost and greater flexibility. These requirements are driving the evolution of modern HPC processors, interconnects, and storage systems, as well as middleware and runtimes. However, a large number of scientific applications have irregular and/or dynamic computation and communication patterns that require different approaches to extract the best performance. The increasing scale of HPC systems, coupled with the diversity of emerging architectures, including the advent of multi-/many-core processors and Remote Direct Memory Access (RDMA)-aware networks, has exacerbated this problem by making a "one-size-fits-all" policy non-viable. Thus, a fundamental shift is required in how HPC middleware interacts with the application and reacts to its computation and communication requirements. Furthermore, current-generation middleware consists of many independent components, such as the communication runtime, the resource manager, and the job launcher. However, the lack of cooperation among these components often limits the performance and scalability of the end application. To address these challenges, we propose a high-performance and scalable "Cooperative Communication Middleware" for HPC systems.
The middleware supports MPI (Message Passing Interface), PGAS (Partitioned Global Address Space), and hybrid MPI+PGAS programming models and provides improved point-to-point communication, contention-aware and kernel-assisted collectives, fast job startup, and scalable fault-tolerance primitives. The major contribution of this new middleware is to leverage cooperation within the same component as well as across different components in order to provide high performance, scalability, and reliability for the end user. For example, the sender and the receiver processes can cooperate with each other to determine the best way to realize a particular point-to-point communication operation. Similarly, multiple processes can cooperate to reduce the contention in a collective communication operation. We can further extend this approach through cooperation between different components of the middleware, such as the communication runtime and the resource manager. This cooperation also enables the middleware to dynamically adapt to the application's computation and communication requirements. Compared to the state of the art, the proposed middleware shows up to 2 times improvement in large-message bandwidth and latency, up to 50 times improvement in the performance of MPI collectives, and up to 19% reduction in the runtime of applications from different domains. It also shows significant improvement in scalability by reducing the recovery time by up to 4 times on 4,096 processes and improving the job startup time by up to 8.8 times for 231,936 MPI processes on 3,624 compute nodes.
Dhabaleswar K Panda (Advisor)
Gagan Agrawal (Committee Member)
Ponnuswamy Sadayappan (Committee Member)
Hari Subramoni (Committee Member)
202 p.

Recommended Citations

  • Chakraborty, S. (2019). High Performance and Scalable Cooperative Communication Middleware for Next Generation Architectures [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1563484522149971

    APA Style (7th edition)

  • Chakraborty, Sourav. High Performance and Scalable Cooperative Communication Middleware for Next Generation Architectures. 2019. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1563484522149971.

    MLA Style (8th edition)

  • Chakraborty, Sourav. "High Performance and Scalable Cooperative Communication Middleware for Next Generation Architectures." Doctoral dissertation, Ohio State University, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=osu1563484522149971

    Chicago Manual of Style (17th edition)