High Performance Computing is enabling rapid innovations across several key areas, ranging from science, technology and manufacturing to entertainment and financial markets. One computing paradigm contributing significantly to the reach of such capabilities is Cluster Computing. Cluster computing involves the use of multiple commodity PCs interconnected by a network to provide the required computational resources in a cost-effective manner. Recently, commodity clusters have been rapidly transforming into capability-class machines, with several of them featuring in the Top 10 list of supercomputers. The two primary drivers for this trend are: a) the advent of multicore technology and b) the performance and scalability of InfiniBand, an open-standard interconnection network. These two factors are ushering in an era of ultra-scale InfiniBand multicore clusters comprising tens of thousands of compute cores.
The Message Passing Interface (MPI) is the most popular method of programming parallel applications. In this model, communication occurs via explicit exchange of data messages. MPI provides a plethora of communication primitives, of which collective primitives are especially significant. These are extensively used in a variety of scientific and engineering applications (for example, to compute Fast Fourier Transforms and to multiply large matrices). It is imperative that these collectives be designed efficiently to ensure good performance and scalability. MPI collectives pose several challenges and requirements in terms of guaranteeing data reliability, enabling efficient and scalable data transfers, and providing process-skew tolerance mechanisms. Moreover, the characteristics of the underlying network and multicore systems directly impact the behavior of the collective operations and need to be taken into consideration when optimizing performance and resource usage.
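To illustrate the kind of algorithmic structure underlying such collectives, the following is a minimal Python simulation of recursive doubling, one of the classic algorithms used to implement MPI_Allreduce-style operations. The function name and the simulation of ranks as list entries are illustrative only; a real MPI implementation would run separate processes exchanging messages.

```python
def recursive_doubling_allreduce(values):
    """Combine (sum) one value per simulated process so that every
    process ends up with the global sum, in log2(P) exchange rounds."""
    p = len(values)
    # This sketch assumes a power-of-two process count for simplicity;
    # production libraries handle arbitrary counts with extra steps.
    assert p & (p - 1) == 0, "sketch assumes a power-of-two process count"
    vals = list(values)
    step = 1
    while step < p:
        # In each round, rank r pairs with rank r XOR step; both sides
        # exchange their partial results and combine them.
        vals = [vals[r] + vals[r ^ step] for r in range(p)]
        step *= 2
    return vals

print(recursive_doubling_allreduce([1, 2, 3, 4]))  # every rank holds 10
```

With P processes, the message count per rank is log2(P) rather than P-1, which is what makes the collective scalable to large process counts.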
In this dissertation, we take on these challenges to design a scalable and high-performance collective communication subsystem for MPI over InfiniBand multicore clusters. The central theme of our approach is to develop an in-depth understanding of the capabilities of the underlying network and system architecture and to leverage these capabilities in providing optimal design alternatives. Specifically, the dissertation describes novel communication protocols and algorithms utilizing a) InfiniBand's hardware Multicast and RDMA capabilities and b) the system's shared memory to meet the stated requirements and challenges. The collective optimizations discussed in the dissertation also take into account the different transport methods of InfiniBand and the architectural attributes of multicore systems. The designs proposed in the dissertation have been incorporated into the open source MVAPICH software, which is used by more than 680 organizations worldwide, is deployed in several cluster installations, and currently runs on the world's third fastest supercomputer.
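The combination of network-level multicast and node-level shared memory naturally suggests a two-level broadcast structure: one leader core per node receives the data over the network, then fans it out to its node-local peers through shared memory. The following Python sketch simulates that structure; all names and the topology representation are hypothetical and not taken from MVAPICH.

```python
def hierarchical_bcast(data, nodes):
    """Simulate a two-level broadcast.

    nodes: list of lists of rank ids, where nodes[i][0] is node i's leader.
    Returns a {rank: data} map showing every rank received the message.
    """
    received = {}
    # Stage 1: network-level delivery to one leader per node. In hardware
    # this could be a single InfiniBand multicast send to all leaders.
    for node in nodes:
        received[node[0]] = data
    # Stage 2: node-local fan-out from each leader via shared memory,
    # which avoids any further network traffic.
    for node in nodes:
        for rank in node[1:]:
            received[rank] = received[node[0]]
    return received

topology = [[0, 1, 2, 3], [4, 5, 6, 7]]  # 2 nodes x 4 cores (hypothetical)
print(hierarchical_bcast("payload", topology))
```

The design point this sketch captures is that only one network operation is needed per node regardless of the number of cores, so the network cost stays flat as core counts per node grow.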