Many applications in fields such as the life sciences, weather forecasting, and financial services require massive amounts of computational power.
Supercomputers built from commodity components, called clusters, are a very
cost-effective way of providing such computational power.
Recently, the supercomputing arena has witnessed
phenomenal growth in commodity clusters built using InfiniBand interconnects
and multi-core systems. InfiniBand is a high-performance interconnect providing
low latency and high bandwidth.
The Message Passing Interface (MPI) is a popular programming model for writing
applications for such machines. It is therefore important to optimize MPI for
these emerging clusters.
The InfiniBand architecture allows for varying implementations of the network
protocol stack. For example, the protocol can be entirely on-loaded to a host
processing core, off-loaded onto the NIC processor, or implemented as a
combination of the two. Understanding the
characteristics of these different implementations is critical to optimizing
MPI. In this thesis, we systematically study several commercially available
implementations of these architectures. Based on their
characteristics, we propose communication algorithms for one of the most
extensively used collective operations, MPI_Alltoall. We also redesign the point-to-point
rendezvous protocol for offload network interfaces to allow for overlap of
communication and computation.
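
To make the two operations above concrete, the sketch below (not code from the thesis) shows the application-level pattern they serve: a call to MPI_Alltoall, followed by a nonblocking point-to-point exchange whose transfer an offload-capable rendezvous design can progress while the host computes. The BLOCK size, the rank-pairing scheme, and do_independent_work() are hypothetical placeholders.

#include <mpi.h>
#include <stdlib.h>

/* Per-peer message size; chosen large so that typical MPI libraries use the
   rendezvous protocol for the exchange (the threshold is
   implementation-dependent). */
#define BLOCK 131072

/* Stand-in for application computation that does not depend on the data
   being transferred. */
static void do_independent_work(void)
{
    volatile double x = 0.0;
    for (int i = 0; i < 1000000; i++)
        x += i * 0.5;
    (void)x;
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *sendbuf = malloc((size_t)size * BLOCK * sizeof(double));
    double *recvbuf = malloc((size_t)size * BLOCK * sizeof(double));
    for (int i = 0; i < size * BLOCK; i++)
        sendbuf[i] = rank;

    /* Personalized all-to-all exchange: every process sends a distinct
       BLOCK-element slice to every other process. */
    MPI_Alltoall(sendbuf, BLOCK, MPI_DOUBLE,
                 recvbuf, BLOCK, MPI_DOUBLE, MPI_COMM_WORLD);

    /* Overlap pattern: start a large nonblocking exchange with a partner,
       compute while the rendezvous transfer progresses, then wait. */
    int peer = rank ^ 1;  /* pair ranks 0-1, 2-3, ...; last rank idles if
                             the communicator size is odd */
    if (peer < size) {
        MPI_Request reqs[2];
        MPI_Irecv(recvbuf, BLOCK, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sendbuf, BLOCK, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

        do_independent_work();  /* computation overlapped with the transfer */

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}

How much of do_independent_work() genuinely overlaps the transfer depends on how much of the rendezvous handshake and data movement the underlying design can progress without host involvement, which is precisely what the redesigned protocol addresses.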
The designs developed as part of this thesis are available in MVAPICH,
a popular open-source implementation of MPI over InfiniBand that is
used by several hundred top computing sites around the world.