Developing high-performance parallel applications that use linked
data structures on distributed-memory clusters is challenging. Many
scientific applications use algorithms based on linked data
structures like trees and graphs. These structures are especially
useful in representing relationships between data which may not be
known until runtime or may otherwise evolve during the course of a
computation. Methods such as n-body simulation, Fast Multipole
Methods (FMM), and multiresolution analysis all use trees to represent
a fixed space populated by a dynamic distribution of elements. Other
problem domains, such as data mining, use both trees and graphs to
summarize large input datasets into a set of relationships that
capture the information in a form that lends itself to efficient
mining.
This dissertation first describes a runtime system that provides a
programming interface to a global address space representation of
generalized distributed linked data structures, while providing
scalable performance on distributed memory computing systems. This
system, the Global Chunk Layer (GCL), provides data access
primitives at the element level, but takes advantage of coarse-grained
data movement to enhance locality and improve communication
efficiency. The key benefits of using the GCL system include efficient
shared-memory style programming of distributed dynamic, linked data
structures, the abstraction and optimization of structural elements
common to linked data, and the ability to customize many aspects of
the runtime to tune application performance.
Additionally, this dissertation presents the design and implementation
of a tree-specific system for efficient parallel global address space
computing. The Global Trees (GT) library provides a global view of
distributed linked tree structures and a set of routines that operate
on these structures. GT is built on top of the generalized data
structure support provided by the GCL runtime and can inter-operate
with other parallel programming models such as MPI, or along with
existing global view approaches such as Global Arrays. This approach
is based on two key insights: First, tree-based algorithms are easily
expressed in a fine-grained manner, but data movement must be done at
a much coarser level of granularity for good performance. Second,
since GT is focused on a single data abstraction, attributes unique to
tree structures can be exploited to provide optimized routines for
common operations.
This dissertation also describes techniques for improving program
performance and program understanding using these frameworks. Data
locality has a significant impact on the communication properties of
parallel algorithms. Techniques are presented that use profile-driven
data reference traces to perform two types of data layout in Global
Trees programs. Lastly, approaches for understanding and analyzing
program performance data are presented along with tools for
visualizing GT structures. These tools enable GT developers to better
understand and optimize program performance.