Graphics Processing Units (GPUs) are massively parallel, many-core processors with tremendous computational power and very high memory
bandwidth. GPUs are primarily designed for accelerating 3D graphics
applications on modern computer systems and are therefore specialized
for highly data-parallel, compute-intensive problems, unlike
general-purpose CPUs. In recent times, there has been significant
interest in finding ways to accelerate general purpose (non-graphics),
data parallel computations using the high processing power of GPUs.
General-purpose Programming on GPUs (GPGPU) was earlier considered
difficult because the only available techniques to program the GPUs were
graphics-specific programming models such as OpenGL and DirectX.
However, with the advent of GPGPU programming models such as NVIDIA's
CUDA and the new standard OpenCL, GPGPU has become mainstream.
Optimizations performed by the compiler play a very important role in
improving the performance of computer programs. While compiler
optimizations for CPUs have been researched for many decades now, the
arrival of GPGPU, with its differences in architecture and programming
model, has brought along with it many new opportunities for compiler
optimizations. One such classical optimization is 'Loop Unrolling'. Loop
unrolling has proven to be a relatively inexpensive and beneficial
optimization for CPU programs. However, current GPGPU compilers perform
little to no loop unrolling.
In this thesis, we attempt to understand the impact of loop unrolling on
GPGPU programs and using this understanding, we develop a
semi-automatic, compile-time approach for identifying optimal unroll
factors for suitable loops in GPGPU programs. In addition, we also
propose techniques for reducing the number of unroll factors evaluated,
based on the characteristics of the program being compiled and the
device being compiled to. We use these techniques to evaluate the
effect of loop unrolling on a range of GPGPU programs and show that we
correctly identify the optimal unroll factors, and that these optimized
versions achieve speedups of up to nearly 1.5x, relative to the
unoptimized versions.