Over the past decade, rapid advances in computer and network design have made it possible to connect thousands of computers into high-performance clusters. These clusters are used to solve computationally challenging scientific problems. The Message Passing Interface (MPI) is a popular programming model for these clusters, and a vast array of scientific applications use MPI on them. As applications operate on larger and more complex data, the size of compute clusters continues to grow. The scalability and performance of the MPI library are therefore critical to end-application performance.
InfiniBand is an open-standards cluster interconnect that is gaining rapid acceptance. This dissertation explores the different transports provided by InfiniBand to determine the scalability and performance characteristics of each. Further, new MPI designs have been proposed and implemented for transports that have never before been used for MPI. These designs significantly decrease resource consumption, increase performance, and improve the reliability of ultra-scale InfiniBand clusters.
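To make the transport distinction concrete, the sketch below shows how the InfiniBand verbs API exposes the choice of transport when a queue pair (QP) is created. This is a minimal illustration rather than MVAPICH code; the helper name and queue depths are assumptions. RC (Reliable Connection) QPs provide hardware reliability but require one connection per peer, whereas UD (Unreliable Datagram) QPs are connectionless, so a single QP can reach every peer.

    #include <infiniband/verbs.h>

    /* Sketch: transport is fixed at QP creation via qp_type.
     * Queue depths here are illustrative, not MVAPICH's. */
    struct ibv_qp *create_qp(struct ibv_pd *pd, struct ibv_cq *cq,
                             enum ibv_qp_type type)
    {
        struct ibv_qp_init_attr attr = {
            .send_cq = cq,
            .recv_cq = cq,
            .qp_type = type,          /* IBV_QPT_RC, IBV_QPT_UC, or IBV_QPT_UD */
            .cap = {
                .max_send_wr  = 128,
                .max_recv_wr  = 128,
                .max_send_sge = 1,
                .max_recv_sge = 1,
            },
        };
        return ibv_create_qp(pd, &attr);
    }

The connectionless property of UD is what underlies the memory savings reported below: per-peer connection state need not be allocated for every process in the job.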
A framework to use multiple InfiniBand transports simultaneously and to change transfer protocols dynamically has been designed and evaluated. Evaluations show that memory usage can be reduced from over 1 GB per MPI process to 40 MB per MPI process. In addition, this design improves performance by up to 30% over earlier designs.
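The following sketch illustrates one way such dynamic protocol selection can be structured, assuming a simple size-based policy; the identifiers and threshold logic are illustrative, not MVAPICH's actual implementation.

    #include <stddef.h>

    /* Hypothetical names: PROTO_*, select_protocol, and
     * establish_rc_connection are illustrative only. */
    enum protocol { PROTO_EAGER_UD, PROTO_RNDV_RC };

    static void establish_rc_connection(void)
    {
        /* On-demand RC setup (QP creation and handshake) would go
         * here; stubbed out in this sketch. */
    }

    static enum protocol select_protocol(size_t msg_len,
                                         size_t eager_threshold,
                                         int rc_conn_established)
    {
        if (msg_len <= eager_threshold)
            return PROTO_EAGER_UD;     /* small: copy into pre-posted UD buffers */
        if (!rc_conn_established)
            establish_rc_connection(); /* large: set up RC only when needed */
        return PROTO_RNDV_RC;          /* zero-copy rendezvous over RC */
    }

Creating RC connections only for peers that exchange large messages is one way memory can stay bounded while performance-critical paths still use the fastest transport.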
Investigations into reliability show that the MPI library can be designed to withstand many network faults, and that reliability implemented in software can provide higher message rates than reliability implemented in hardware.
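As a minimal sketch of how reliability can be layered in software over an unreliable transport, the fragment below assumes a sliding-window scheme with sequence numbers and sender-side retransmission; all names are hypothetical, not MVAPICH identifiers.

    #include <stdint.h>

    #define WINDOW_SIZE 64              /* illustrative window depth */

    struct reliability_state {
        uint32_t next_seq;              /* next sequence number to assign */
        uint32_t last_acked;            /* highest cumulative ACK received */
        void    *inflight[WINDOW_SIZE]; /* unacked sends, kept for retransmit */
    };

    /* Sender stalls when the window is full, bounding buffer usage. */
    static int window_open(const struct reliability_state *s)
    {
        return s->next_seq - s->last_acked < WINDOW_SIZE;
    }

    /* Receiver accepts only the expected in-order packet; anything else
     * is dropped and recovered by the sender's retransmission timer. */
    static int accept_packet(uint32_t expected_seq, uint32_t pkt_seq)
    {
        return pkt_seq == expected_seq;
    }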
Software developed as part of this dissertation is available in MVAPICH, a popular open-source implementation of MPI over InfiniBand used by several hundred top computing sites around the world.