Multicore processors have now become mainstream. The difficulty of programming these architectures to effectively tap the potential of
multiple processing units is well known. Among the several ways of
addressing this issue, one promising yet simultaneously hard
approach is automatic parallelization, which requires no effort on
the part of the programmer to parallelize a program.
The polyhedral model for compiler optimization is a powerful
mathematical framework based on parametric linear algebra and
integer linear programming. It provides an abstraction that
represents nested loop computations and their data dependences as integer
points in polyhedra. Complex execution reordering, which can improve
performance through parallelization as well as locality enhancement,
is captured by affine transformations in the polyhedral model.
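As a concrete illustration, consider the following minimal C sketch; the loop bounds and the particular skewing are illustrative choices, not an example from the dissertation. The iterations of the first nest are exactly the integer points of the polyhedron {(i, j) : 0 <= i < N, 0 <= j < N}, and the second nest enumerates the same points under the affine transformation (i, j) -> (i + j, j).

    /* A minimal, illustrative sketch (not from the dissertation):
       the iterations of a loop nest are the integer points of a
       polyhedron, and an affine transformation reorders them. */
    #include <stdio.h>
    #define N 4

    int main(void) {
        /* Iteration domain: {(i, j) : 0 <= i < N, 0 <= j < N},
           a polyhedron given by affine inequalities in i, j, N. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                printf("original (i=%d, j=%d)\n", i, j);

        /* The same points under the affine skewing
           (i, j) -> (t1, t2) = (i + j, j): the execution order
           changes, but no iteration is skipped or duplicated. */
        for (int t1 = 0; t1 <= 2 * (N - 1); t1++)
            for (int t2 = (t1 > N - 1 ? t1 - (N - 1) : 0);
                 t2 <= (t1 < N - 1 ? t1 : N - 1); t2++)
                printf("skewed   (i=%d, j=%d)\n", t1 - t2, t2);
        return 0;
    }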
With several recent advances, the polyhedral model has reached a
level of maturity in various aspects, in particular as a powerful
intermediate representation for performing transformations and for
generating code after those transformations. However, an approach to
automatically find good transformations for communication-optimized
coarse-grained parallelization together with locality optimization
has been a key missing link. This dissertation presents a new
automatic transformation framework that solves the above problem.
Our approach finds good affine transformations through a powerful
and practical linear cost function, sketched below, that enables
efficient tiling and fusion of sequences of arbitrarily nested
loops. This in
turn allows simultaneous optimization for coarse-grained parallelism
and locality. Synchronization-free parallelism and pipelined
parallelism at various levels can be extracted. The framework can be
targeted to different parallel architectures, such as general-purpose
multicores, the Cell processor, GPUs, and embedded multiprocessor
SoCs.
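The flavor of the linear cost function mentioned above can be sketched as follows; the notation is a simplified, one-statement rendering of the bounding idea, with symbols chosen here for illustration rather than taken verbatim from the dissertation. For a dependence e, a dependent iteration pair (s, t), a one-dimensional affine transformation phi, and the vector of program parameters n:

    \delta_e(\vec{s}, \vec{t}) \;=\; \phi(\vec{t}) - \phi(\vec{s}) \;\ge\; 0
    \delta_e(\vec{s}, \vec{t}) \;\le\; \vec{u} \cdot \vec{n} + w, \qquad \vec{u} \ge \vec{0},\; w \ge 0
    \text{minimize}_{\prec}\; (\vec{u}, w)

Minimizing (u, w) lexicographically with integer linear programming keeps dependent iterations close together in the transformed space, which is what makes the resulting transformations suitable both for tiling (locality) and for coarse-grained parallelism with low communication.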
The proposed framework has been implemented in a new end-to-end
transformation tool, PLUTO, which can automatically generate
parallel code from regular C program sections. Experimental results
from the
implemented system show significant performance improvements for
single-core and multicore execution over state-of-the-art research
compiler frameworks as well as the best native production compilers.
For several dense linear algebra kernels, code generated by Pluto
outperforms, by a significant margin, the same kernels implemented
as sequences of calls to highly tuned vendor-supplied libraries.
The system also allows empirical optimization to be performed in a
much wider context than has been attempted previously. In addition,
Pluto can serve as the parallel code generation backend for several
high-level domain-specific languages.
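As an illustration of the kind of code such a system emits, the sketch below shows a trivially parallel C kernel after tiling and OpenMP annotation. It is hand-written for exposition, with an illustrative tile size, and is not actual Pluto output.

    /* Hand-written illustration of tiled, coarse-grained parallel
       code of the kind a polyhedral parallelizer can emit; not
       actual Pluto output. Compile with OpenMP enabled, e.g.
       cc -fopenmp. */
    #define N 1024
    #define T 32  /* tile size: an illustrative choice */

    double A[N][N];

    void scale(double c) {
        /* The tile loops it/jt carry no dependences here, so the
           outermost loop runs in parallel without synchronization;
           each tile touches a small block of A, improving locality. */
        #pragma omp parallel for
        for (int it = 0; it < N; it += T)
            for (int jt = 0; jt < N; jt += T)
                for (int i = it; i < it + T && i < N; i++)
                    for (int j = jt; j < jt + T && j < N; j++)
                        A[i][j] *= c;
    }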