Because increases in clock frequency have hit a bottleneck, multi-core processors emerged as
a way of improving the overall performance of CPUs. In the past decade, many-core processors
have begun to play an increasingly important role in scientific computing. The high cost-
effectiveness of many-cores makes them extremely suitable for data-intensive computa-
tions. Specifically, many-cores take the form of GPUs (e.g., NVIDIA or AMD GPUs)
and, more recently, coprocessors (e.g., the Intel MIC). Even though these highly parallel architec-
tures offer a significant amount of computation power, they are very hard to program, and
harder still to fully exploit. Combining the power of multi-cores
and many-cores, i.e., making use of heterogeneous cores, is extremely complicated.
Our efforts focus on optimizing important classes of applications on such parallel
systems. We address this issue from the perspective of communication patterns. Scientific
applications can be classified by their communication patterns, as specified in the
Berkeley Dwarfs taxonomy many years ago. By investigating the characteristics of each
class, we are able to derive efficient execution strategies across different levels of
parallelism. We design a high-level programming API and implement an efficient runtime
system with pattern-specific optimizations, taking the characteristics of the hardware
platform into account. Thus, instead of providing a general programming model, we
provide a separate API for each communication pattern.
We have worked on a selected subset of the communication patterns, including MapReduce,
generalized reductions, irregular reductions, stencil computations, and graph processing.
Our target platforms are single GPUs, coupled CPU-GPUs, heterogeneous clusters,
and Intel Xeon Phis. Our work not only focuses on efficiently executing a communication
pattern on a single multi-core or many-core, but also considers inter-device and inter-node
task scheduling. While implementing a specific communication pattern, we consider as-
pects including lock reduction, data locality, and load balancing.
Our work starts with the optimization of MapReduce on a single GPU, specifically
aiming to utilize the shared memory efficiently. We design a reduction-based approach,
which keeps memory consumption low by avoiding the storage of intermediate
key-value pairs. To support this approach, we design a general data structure, referred
to as the reduction object, which is placed in the GPU's memory hierarchy. The limited
memory requirement of the reduction object allows us to make extensive use of the small but
fast shared memory. Our approach performs well on a popular set of MapReduce appli-
cations, especially the reduction-intensive ones. A comparison with prior state-of-the-art
accelerator-based approaches shows that our approach is much more efficient at utilizing
the shared memory.
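The following sketch illustrates the reduction-object idea in a simplified form: instead of emitting and storing intermediate key-value pairs for later grouping, each map output is immediately folded into a small per-key accumulator. The class and function names here are illustrative assumptions; the actual implementation places this structure in the GPU's shared memory.

```python
# Hypothetical Python rendering of the reduction-object idea: map outputs
# are reduced on the spot, so no intermediate key-value list is ever
# materialized and the structure's footprint stays small.

class ReductionObject:
    def __init__(self, reduce_fn):
        self.slots = {}            # key -> running reduced value
        self.reduce_fn = reduce_fn

    def insert(self, key, value):
        # Fold the new value into the existing slot instead of appending.
        if key in self.slots:
            self.slots[key] = self.reduce_fn(self.slots[key], value)
        else:
            self.slots[key] = value

def word_count(words):
    # Reduction-intensive example: word count needs one slot per distinct
    # word, not one record per input word.
    robj = ReductionObject(lambda a, b: a + b)
    for w in words:            # the "map" phase emits (w, 1) ...
        robj.insert(w, 1)      # ... which is reduced immediately
    return robj.slots

print(word_count(["a", "b", "a", "a"]))   # {'a': 3, 'b': 1}
```

For reduction-intensive workloads, the number of distinct keys is far smaller than the number of inputs, which is what makes the small shared memory sufficient.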
Even though MapReduce significantly reduces the complexity of parallel programming,
achieving efficient execution is not easy for complicated applications on heterogeneous
clusters with a multi-core CPU and multiple GPUs in each node. In view of this, we design
a programming framework that aims to reduce the programming difficulty, as well as
provide automatic optimizations to applications. Our approach is to classify applications
based on communication patterns. The patterns we study include generalized reductions,
irregular reductions, and stencil computations, which are important patterns frequently
used in scientific and data-intensive computations. For each pattern, we design a simple
API, as well as a runtime with pattern-specific optimizations at different parallelism levels.
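To convey the flavor of such a pattern-specific API, the sketch below shows one possible shape for a generalized-reduction interface: the user supplies only the per-element reduction and the combination function, while the runtime owns partitioning and combining. All names and parameters here are assumptions for illustration, not the framework's actual API, and the sequential partitioning stands in for the real scheduling across CPU cores and GPUs.

```python
# Hypothetical pattern-specific API for generalized reductions.
# The runtime splits the input into partitions, reduces each partition
# locally, and then combines the partial results globally.

def generalized_reduction(data, local_reduce, combine, identity, n_parts=4):
    step = max(1, (len(data) + n_parts - 1) // n_parts)
    partials = []
    for i in range(0, len(data), step):
        acc = identity
        for x in data[i:i + step]:       # per-partition local reduction
            acc = local_reduce(acc, x)
        partials.append(acc)
    result = identity
    for p in partials:                   # global combination step
        result = combine(result, p)
    return result

# Example: a sum expressed as a generalized reduction.
total = generalized_reduction(list(range(10)),
                              local_reduce=lambda a, x: a + x,
                              combine=lambda a, b: a + b,
                              identity=0)
print(total)  # 45
```

Because the user code is restricted to this pattern, the runtime is free to choose partition sizes and placement per device without any change to the application.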
We also investigate graph applications. We design a graph processing system
for the Intel Xeon Phi and the CPU. We design a vertex-centric programming API, and a
novel condensed static message buffer that reduces memory consumption and supports SIMD
message reduction. We also use a pipelining scheme to avoid frequent locking. A hybrid
graph partitioning scheme achieves load balance between the CPU and the Xeon Phi, as well
as reducing the communication overhead.
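The condensed-buffer idea can be sketched as follows: instead of queueing one message record per edge, all messages destined for the same vertex are reduced into a single fixed slot, so the buffer holds one entry per vertex and the combining step maps naturally onto SIMD lanes. The function below is an illustrative simplification, not the system's actual implementation.

```python
# Sketch of one superstep with a condensed message buffer: messages to
# the same destination are combined on insertion, keeping the buffer at
# O(|V|) slots regardless of the number of edges.

def one_superstep(edges, values, combine, identity):
    n = len(values)
    buffer = [identity] * n        # one condensed slot per vertex
    for src, dst in edges:         # "send" phase: reduce into the slot
        buffer[dst] = combine(buffer[dst], values[src])
    return buffer                  # each vertex then reads one slot

# Example: each vertex gathers the sum of its in-neighbors' values.
edges = [(0, 2), (1, 2), (0, 1)]
values = [1.0, 2.0, 0.0]
print(one_superstep(edges, values, lambda a, b: a + b, 0.0))  # [0.0, 1.0, 3.0]
```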
Executing irregular applications on SIMD architectures is always challenging. The ir-
regularity leads to problems including poor data access locality, data dependencies, and
inefficient utilization of SIMD lanes. We propose a general optimization methodology
for irregular applications, including irregular reductions, graph algorithms, and sparse
matrix-matrix multiplication. The key observation behind our approach is that the major data
structures accessed by irregular applications can be treated as sparse matrices. The steps of
our methodology are matrix tiling, data access pattern identification, and conflict re-
moval. As a consequence, our approach is able to efficiently utilize both SIMD and MIMD
parallelism on the Intel Xeon Phi.
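The tiling and conflict-removal steps can be sketched as follows: the irregular accesses are viewed as a sparse matrix whose rows are output elements, and the nonzeros are partitioned by output-row range, so each tile writes a disjoint slice of the output. Tiles can then run in parallel without atomics, and the regular per-tile structure suits SIMD processing. This is an illustrative sketch, not the actual Xeon Phi implementation; sparse matrix-vector multiplication stands in for the broader class of irregular reductions.

```python
# Tiling by output row: each tile's updates touch a disjoint range of
# the output vector, so tiles are conflict-free with respect to writes.

def tile_by_output(nonzeros, n_rows, tile_rows):
    # nonzeros: list of (row, col, value) triples; group them by row tile.
    n_tiles = (n_rows + tile_rows - 1) // tile_rows
    tiles = [[] for _ in range(n_tiles)]
    for (r, c, v) in nonzeros:
        tiles[r // tile_rows].append((r, c, v))
    return tiles

def spmv_tiled(nonzeros, x, n_rows, tile_rows=2):
    y = [0.0] * n_rows
    for tile in tile_by_output(nonzeros, n_rows, tile_rows):
        # Each tile updates a disjoint slice of y, so in a parallel
        # version no atomics or locks would be needed across tiles.
        for (r, c, v) in tile:
            y[r] += v * x[c]
    return y

nz = [(0, 1, 2.0), (1, 0, 3.0), (3, 2, 1.0)]
print(spmv_tiled(nz, [1.0, 1.0, 1.0], n_rows=4))  # [2.0, 3.0, 0.0, 1.0]
```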