Because increases in clock frequency have hit a bottleneck, multi-core processors emerged as
a way of improving the overall performance of CPUs. In the past decade, many-core processors
have begun to play an increasingly important role in scientific computing. The high cost-
effectiveness of many-cores makes them extremely suitable for data-intensive computa-
tions. Specifically, many-cores take the form of GPUs (e.g., NVIDIA or AMD GPUs)
and, more recently, coprocessors (e.g., the Intel MIC). Even though these highly parallel architec-
tures offer a significant amount of computation power, they are very hard to program, and
harder still to fully exploit. Combining the power of multi-cores
and many-cores, i.e., making use of heterogeneous cores, is extremely complicated.
Our efforts focus on optimizing important classes of applications on such parallel
systems. We address this issue from the perspective of communication patterns. Scientific
applications can be classified by their communication patterns, as specified in the
Berkeley Dwarfs taxonomy many years ago. By investigating the characteristics of each
class, we are able to derive efficient execution strategies across different levels of
parallelism. We design a high-level programming API and implement an efficient runtime
system with pattern-specific optimizations, taking the characteristics of the hardware
platform into account. Thus, instead of providing a general programming model, we
provide a separate API for each communication pattern.
We have worked on a selected subset of the communication patterns, including MapReduce,
generalized reductions, irregular reductions, stencil computations, and graph processing.
Our target platforms are single GPUs, coupled CPU-GPUs, heterogeneous clusters,
and Intel Xeon Phis. Our work not only focuses on efficiently executing a communication
pattern on a single multi-core or many-core, but also considers inter-device and inter-node
task scheduling. While implementing a specific communication pattern, we consider as-
pects including lock reduction, data locality, and load balancing.
Our work starts with the optimization of MapReduce on a single GPU, specifically
aiming to utilize the shared memory efficiently. We design a reduction-based approach,
which keeps memory consumption low by avoiding the storage of intermediate
key-value pairs. To support this approach, we design a general data structure, referred
to as the reduction object, which is placed in the GPU's memory hierarchy. The limited
memory requirement of the reduction object allows us to make extensive use of the small but
fast shared memory. Our approach performs well on a popular set of MapReduce appli-
cations, especially the reduction-intensive ones. A comparison with prior state-of-the-art
accelerator-based approaches shows that our approach is much more efficient at utilizing
the shared memory.
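The following sketch illustrates the reduction-object idea in a simplified form: instead of emitting and storing intermediate key-value pairs for later grouping, each map output is immediately folded into a small per-key accumulator. The class and function names here are illustrative assumptions; the actual implementation places this structure in the GPU's shared memory.

```python
# Hypothetical Python rendering of the reduction-object idea: map outputs
# are reduced on the spot, so no intermediate key-value list is ever
# materialized and the structure's footprint stays small.

class ReductionObject:
    def __init__(self, reduce_fn):
        self.slots = {}            # key -> running reduced value
        self.reduce_fn = reduce_fn

    def insert(self, key, value):
        # Fold the new value into the existing slot instead of appending.
        if key in self.slots:
            self.slots[key] = self.reduce_fn(self.slots[key], value)
        else:
            self.slots[key] = value

def word_count(words):
    # Reduction-intensive example: word count needs one slot per distinct
    # word, not one record per input word.
    robj = ReductionObject(lambda a, b: a + b)
    for w in words:            # the "map" phase emits (w, 1) ...
        robj.insert(w, 1)      # ... which is reduced immediately
    return robj.slots

print(word_count(["a", "b", "a", "a"]))   # {'a': 3, 'b': 1}
```

For reduction-intensive workloads, the number of distinct keys is far smaller than the number of inputs, which is what makes the small shared memory sufficient.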
Even though MapReduce significantly reduces the complexity of parallel programming,
achieving efficient execution is not easy for complicated applications on heterogeneous
clusters with a multi-core CPU and multiple GPUs in each node. In view of this, we design
a programming framework that aims to reduce the programming difficulty, as well as
provide automatic optimizations to applications. Our approach is to classify applications
based on communication patterns. The patterns we study include generalized reductions,
irregular reductions, and stencil computations, which are important patterns frequently
used in scientific and data-intensive computations. For each pattern, we design a simple
API, as well as a runtime with pattern-specific optimizations at different parallelism levels.
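To convey the flavor of such a pattern-specific API, the sketch below shows one possible shape for a generalized-reduction interface: the user supplies only the per-element reduction and the combination function, while the runtime owns partitioning and combining. All names and parameters here are assumptions for illustration, not the framework's actual API, and the sequential partitioning stands in for the real scheduling across CPU cores and GPUs.

```python
# Hypothetical pattern-specific API for generalized reductions.
# The runtime splits the input into partitions, reduces each partition
# locally, and then combines the partial results globally.

def generalized_reduction(data, local_reduce, combine, identity, n_parts=4):
    step = max(1, (len(data) + n_parts - 1) // n_parts)
    partials = []
    for i in range(0, len(data), step):
        acc = identity
        for x in data[i:i + step]:       # per-partition local reduction
            acc = local_reduce(acc, x)
        partials.append(acc)
    result = identity
    for p in partials:                   # global combination step
        result = combine(result, p)
    return result

# Example: a sum expressed as a generalized reduction.
total = generalized_reduction(list(range(10)),
                              local_reduce=lambda a, x: a + x,
                              combine=lambda a, b: a + b,
                              identity=0)
print(total)  # 45
```

Because the user code is restricted to this pattern, the runtime is free to choose partition sizes and placement per device without any change to the application.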
We also investigate graph applications. We design a graph processing system
for the Intel Xeon Phi and the CPU. We design a vertex-centric programming API, and a
novel condensed static message buffer that reduces memory consumption and supports SIMD
message reduction. We also use a pipelining scheme to avoid frequent locking. A hybrid
graph partitioning scheme achieves load balance between the CPU and the Xeon Phi, as well
as reducing the communication overhead.
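The condensed-buffer idea can be sketched as follows: instead of queueing one message record per edge, all messages destined for the same vertex are reduced into a single fixed slot, so the buffer holds one entry per vertex and the combining step maps naturally onto SIMD lanes. The function below is an illustrative simplification, not the system's actual implementation.

```python
# Sketch of one superstep with a condensed message buffer: messages to
# the same destination are combined on insertion, keeping the buffer at
# O(|V|) slots regardless of the number of edges.

def one_superstep(edges, values, combine, identity):
    n = len(values)
    buffer = [identity] * n        # one condensed slot per vertex
    for src, dst in edges:         # "send" phase: reduce into the slot
        buffer[dst] = combine(buffer[dst], values[src])
    return buffer                  # each vertex then reads one slot

# Example: each vertex gathers the sum of its in-neighbors' values.
edges = [(0, 2), (1, 2), (0, 1)]
values = [1.0, 2.0, 0.0]
print(one_superstep(edges, values, lambda a, b: a + b, 0.0))  # [0.0, 1.0, 3.0]
```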
Executing irregular applications on SIMD architectures is always challenging. The ir-
regularity leads to problems including poor data access locality, data dependencies, and
inefficient utilization of SIMD lanes. We propose a general optimization methodology
for irregular applications, including irregular reductions, graph algorithms, and sparse
matrix-matrix multiplication. The key observation behind our approach is that the major data
structures accessed by irregular applications can be treated as sparse matrices. The steps of
our methodology are matrix tiling, data access pattern identification, and conflict re-
moval. As a consequence, our approach is able to efficiently utilize both SIMD and MIMD
parallelism on the Intel Xeon Phi.
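The tiling and conflict-removal steps can be sketched as follows: the irregular accesses are viewed as a sparse matrix whose rows are output elements, and the nonzeros are partitioned by output-row range, so each tile writes a disjoint slice of the output. Tiles can then run in parallel without atomics, and the regular per-tile structure suits SIMD processing. This is an illustrative sketch, not the actual Xeon Phi implementation; sparse matrix-vector multiplication stands in for the broader class of irregular reductions.

```python
# Tiling by output row: each tile's updates touch a disjoint range of
# the output vector, so tiles are conflict-free with respect to writes.

def tile_by_output(nonzeros, n_rows, tile_rows):
    # nonzeros: list of (row, col, value) triples; group them by row tile.
    n_tiles = (n_rows + tile_rows - 1) // tile_rows
    tiles = [[] for _ in range(n_tiles)]
    for (r, c, v) in nonzeros:
        tiles[r // tile_rows].append((r, c, v))
    return tiles

def spmv_tiled(nonzeros, x, n_rows, tile_rows=2):
    y = [0.0] * n_rows
    for tile in tile_by_output(nonzeros, n_rows, tile_rows):
        # Each tile updates a disjoint slice of y, so in a parallel
        # version no atomics or locks would be needed across tiles.
        for (r, c, v) in tile:
            y[r] += v * x[c]
    return y

nz = [(0, 1, 2.0), (1, 0, 3.0), (3, 2, 1.0)]
print(spmv_tiled(nz, [1.0, 1.0, 1.0], n_rows=4))  # [2.0, 3.0, 0.0, 1.0]
```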