Files
thesis.pdf (2.37 MB)
High Performance and Scalable Cooperative Communication Middleware for Next Generation Architectures
Author Info
Chakraborty, Sourav
ORCID® Identifier
http://orcid.org/0000-0002-7244-5397
Permalink:
http://rave.ohiolink.edu/etdc/view?acc_num=osu1563484522149971
Year and Degree
2019, Doctor of Philosophy, Ohio State University, Computer Science and Engineering.
Abstract
Modern high-performance computing (HPC) systems are enabling scientists to tackle grand challenge problems in diverse domains, including cosmology and astrophysics, earthquake and weather analysis, molecular dynamics and physics modeling, biological computations, and computational fluid dynamics. Along with the increasing demand for computing power, these applications are creating fundamentally new challenges in terms of communication complexity, scalability, and reliability. At the same time, remote and virtualized clouds are rapidly gaining popularity over on-premise clusters due to lower initial cost and greater flexibility. These requirements are driving the evolution of modern HPC processors, interconnects, and storage systems, as well as middleware and runtimes. However, a large number of scientific applications have irregular and/or dynamic computation and communication patterns that require different approaches to extract the best performance. The increasing scale of HPC systems, coupled with the diversity of emerging architectures, including the advent of multi-/many-core processors and Remote Direct Memory Access (RDMA)-aware networks, has exacerbated this problem by making a "one-size-fits-all" policy non-viable. Thus, a fundamental shift is required in how HPC middleware interacts with the application and reacts to its computation and communication requirements. Furthermore, current-generation middleware consists of many independent components, such as the communication runtime, resource manager, and job launcher. The lack of cooperation among these components often limits the performance and scalability of the end application. To address these challenges, we propose a high-performance and scalable "Cooperative Communication Middleware" for HPC systems.
The middleware supports MPI (Message Passing Interface), PGAS (Partitioned Global Address Space), and hybrid MPI+PGAS programming models and provides improved point-to-point communication, contention-aware and kernel-assisted collectives, fast job startup, and scalable fault-tolerance primitives. The major contribution of this new middleware is to leverage cooperation within the same component as well as across different components in order to provide high performance, scalability, and reliability for the end user. For example, the sender and the receiver process can cooperate with each other to determine the best way to realize a particular point-to-point communication operation. Similarly, multiple processes can cooperate to reduce contention in a collective communication operation. We can further extend this approach through cooperation between different components of the middleware, such as the communication runtime and the resource manager. This cooperation also enables the middleware to dynamically adapt to the application's computation and communication requirements. Compared to the state of the art, the proposed middleware shows up to 2 times improvement in large-message bandwidth and latency, up to 50 times improvement in the performance of MPI collectives, and up to 19% reduction in the runtime of applications from different domains. It also shows significant improvement in scalability, reducing recovery time by up to 4 times on 4,096 processes and improving job startup time by up to 8.8 times for 231,936 MPI processes on 3,624 compute nodes.
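The sender-receiver cooperation described above is reminiscent of the eager/rendezvous protocol choice common in MPI runtimes: small messages are pushed immediately, while large ones negotiate a handshake first. A minimal sketch of such a decision, with an illustrative threshold and function names that are assumptions and not taken from the dissertation:

```python
# Illustrative sketch of cooperative point-to-point protocol selection,
# loosely modeled on the eager/rendezvous choice in MPI runtimes.
# EAGER_THRESHOLD and choose_protocol are hypothetical names, not the
# dissertation's actual implementation.

EAGER_THRESHOLD = 16 * 1024  # bytes; small messages are sent eagerly

def choose_protocol(message_size: int, receiver_buffer_posted: bool) -> str:
    """Sender and receiver 'cooperate': eager transfer for small messages
    when a matching receive buffer is already posted; otherwise fall back
    to a rendezvous handshake (ready-to-send / clear-to-send) to avoid
    unexpected-message copies for large payloads."""
    if message_size <= EAGER_THRESHOLD and receiver_buffer_posted:
        return "eager"       # copy payload immediately to the receiver
    return "rendezvous"      # negotiate before transferring the payload

print(choose_protocol(1024, True))     # small message, buffer posted
print(choose_protocol(1 << 20, True))  # 1 MiB message
```

Real runtimes tune such thresholds per interconnect and also weigh factors like intra-node shared memory versus RDMA transports; this sketch shows only the shape of the decision.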
Committee
Dhabaleswar K Panda (Advisor)
Gagan Agrawal (Committee Member)
Ponnuswamy Sadayappan (Committee Member)
Hari Subramoni (Committee Member)
Pages
202 p.
Subject Headings
Computer Engineering; Computer Science
Keywords
Point-to-Point, Collective Communication, Kernel Assisted Communication, Shared Memory, Job Startup, Cooperative Communication Middleware, Scalable Runtime, Message Passing, MPI, OpenSHMEM, HPC
Recommended Citations
APA Style (7th edition)
Chakraborty, S. (2019). High Performance and Scalable Cooperative Communication Middleware for Next Generation Architectures [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1563484522149971
MLA Style (8th edition)
Chakraborty, Sourav. High Performance and Scalable Cooperative Communication Middleware for Next Generation Architectures. 2019. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1563484522149971.
Chicago Manual of Style (17th edition)
Chakraborty, Sourav. "High Performance and Scalable Cooperative Communication Middleware for Next Generation Architectures." Doctoral dissertation, Ohio State University, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=osu1563484522149971
Document number:
osu1563484522149971
Download Count:
823
Copyright Info
© 2019, all rights reserved.
This open access ETD is published by The Ohio State University and OhioLINK.