Modern accelerators and multi-core architectures offer significant
computing power at a very modest cost.
With this trend, an important research issue on the software side
is how to make the best use of these computing devices, and how to achieve
high performance without requiring users to invest substantial effort in learning the
architecture and the programming model.
Our goal is to address the above problem by developing automatic
code generation systems, particularly for GPUs and
GPU clusters. We believe that by focusing on specific application
classes, the task of automatic code generation can be significantly
simplified. Thus, we made efforts in providing code generation and optimization systems for two classes of applications: data-intensive applications with generalized reductions, and tensor contraction functions. First, we focused on a class
of data-intensive applications, whose processing structure is
of generalized reductions.
In the code generation systems we have built, the user input is an algorithm written in a high-level
language, specifically C or MATLAB. Program analysis and code generation are
performed to produce code for a single GPU or a GPU cluster.
The three specific systems we have built are
GREENRIDE, a code
generation system that provides GPU support for C programs; GMAT-DM, which
translates MATLAB code into a GPU-executable program; and AUTO-GC, which provides GPU support
for clusters by incorporating code generation for FREERIDE,
a middleware supporting parallel execution of data mining applications.
For tensor contractions, we evaluated the automatically generated code on different GPUs and investigated algorithm optimizations for each card. This led to an auto-tuning framework that selects algorithms and parameters based on a cost model and thresholds extracted from simple micro-benchmarks. We also developed a loop transformation system targeting multi-level memory hierarchies. By focusing on the dominant factors of the computation, we were able to eliminate a large portion of the redundant data movement between levels of the memory hierarchy.
In the future, we plan to extend our work in the following directions. The code generation system for data-intensive applications with reduction patterns could be applied and optimized for other classes of applications. The integer programming model could also be used for other architectures, including future accelerators. We would like to consider heterogeneous systems for the loop transformation approach. The auto-tuning framework will be extended to include more parameters, enabling larger performance gains.