Enabling Efficient Parallelism for Applications with Dependences and Irregular Memory Accesses


2019, Doctor of Philosophy, Ohio State University, Computer Science and Engineering.
As Moore's law slows down and processor frequencies plateau, parallelism has become central to improving the performance of computer systems. Although more and more parallel units are being incorporated into computer architectures, in the form of both SIMD and MIMD, exploiting this massive parallelism to accelerate applications is usually nontrivial. In this work, we focus on the challenges of achieving efficient parallelism for applications with dependences and irregular memory accesses, which have traditionally been considered difficult to parallelize.

The first problem that impedes efficient parallelization of certain applications is inherent dependences in the computation. The Finite State Machine (FSM) is one such example: its computation is intrinsically sequential because of tight dependences across iterations. We present a technique called enumerative speculation to break the dependences on the state variable. By combining enumeration and speculation, we achieve high speculation success rates and high utilization of the hardware's SIMD units. For more general loops with cross-iteration dependences, we observe that many can be parallelized as scans or reductions if they are modeled as associative combinations of recurrence functions. We present a function reconstruction method that reveals the hidden scan or reduction patterns in such loops and effectively parallelizes them across iterations.

Another problem that prevents many applications from exploiting fine-grained parallelism is irregular memory accesses. Particle simulations, unstructured grid computations, sparse matrix multiplication, and iterative graph algorithms are examples of such applications. Previous work relies mainly on tiling and data reorganization to improve data locality and resolve data conflicts in SIMD processing of these applications. However, the overhead of data reorganization can be so high that it negates the benefit of SIMD processing itself. We provide a technique that reuses the data reorganization information to amortize its overhead for a class of dynamic or adaptive irregular applications. We also present a vectorization method called bucketized hashing with offsets to accelerate hash-based aggregation, which suffers from data conflicts but cannot afford any data reorganization because the computation runs for only one iteration. For emerging platforms, we further present a technique called in-vector reduction that uses the conflict-detection feature of the Intel AVX-512 instruction set to resolve data conflicts in associative irregular applications, achieving high SIMD utilization with little or no data reorganization overhead.

Sparse matrix multiplication represents another important class of irregular applications. A particular type of sparse matrix computation, multiplying a sparse matrix by one or more dense matrices (e.g., SpMM and SDDMM), has grown in importance in recent years due to its role in machine learning and data mining applications. We observe that current techniques for accelerating SpMM and SDDMM deliver good performance only when the sparse matrix has a clustered structure. Based on this observation, we propose a clustering-based row-reordering technique to further improve the performance of SpMM and SDDMM on GPUs.

Communication overhead is another critical issue that impedes efficient parallelization. Data-parallel SGD, a widely used parallel algorithm for training machine learning models, is one example. We study the convergence properties of SGD with sparse and quantized communication. Based on a theoretical analysis, we show how to configure the parameters of gradient sparsification and quantization so that the asymptotic convergence rate of SGD is preserved. We also propose two practical algorithms that reduce communication overhead without impairing the convergence rate relative to full-communication SGD, for distributed systems with both slow and relatively fast interconnects.
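To make two of the techniques above concrete, consider the following sketches in C. First, the conflict-detection idea behind in-vector reduction: AVX-512's VPCONFLICTD instruction reports, for each vector lane, which earlier lanes hold the same value. The sketch below uses it to vectorize a histogram update, handling non-duplicated lanes with a masked gather-add-scatter and falling back to scalar additions for duplicated lanes. This is a simplified illustration, not the dissertation's exact algorithm; the function name hist_add and the scalar-fallback strategy are our assumptions.

/* Sketch: SIMD histogram update using AVX-512 conflict detection.
   Compile with: gcc -O2 -mavx512f -mavx512cd. Illustrative only. */
#include <immintrin.h>
#include <stdint.h>

void hist_add(int32_t *hist, const int32_t *idx, const int32_t *val, int n)
{
    int i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512i ix = _mm512_loadu_si512(idx + i);
        __m512i v  = _mm512_loadu_si512(val + i);
        /* VPCONFLICTD: bit j of lane l is set iff lane j (j < l)
           holds the same index as lane l. */
        __m512i conf = _mm512_conflict_epi32(ix);
        /* Lanes with a zero conflict word are the first occurrence of
           their index, so their indices are pairwise distinct and a
           gather-add-scatter on them is race-free. */
        __mmask16 first =
            _mm512_cmpeq_epi32_mask(conf, _mm512_setzero_si512());
        __m512i old = _mm512_mask_i32gather_epi32(
            _mm512_setzero_si512(), first, ix, hist, 4);
        _mm512_mask_i32scatter_epi32(hist, first, ix,
                                     _mm512_add_epi32(old, v), 4);
        /* Scalar fallback for lanes that duplicate an earlier index. */
        unsigned rest = (unsigned)(~first) & 0xFFFFu;
        while (rest) {
            int l = __builtin_ctz(rest);
            hist[idx[i + l]] += val[i + l];
            rest &= rest - 1;
        }
    }
    for (; i < n; i++)          /* remainder loop */
        hist[idx[i]] += val[i];
}

Second, gradient sparsification for communication-efficient data-parallel SGD. One common scheme, assumed here for illustration, is top-k sparsification: each worker keeps only the k largest-magnitude gradient entries and exchanges them as (index, value) pairs. The helper name sparsify_topk is hypothetical; the dissertation's contribution is the analysis of how to set such parameters (together with quantization levels) so that the asymptotic convergence rate is preserved.

/* Sketch: top-k gradient sparsification for communication-reduced SGD
   (one possible scheme, shown for illustration). */
#include <stdlib.h>
#include <math.h>

static const float *g_grad;        /* gradient being ranked, for qsort */

static int by_magnitude_desc(const void *a, const void *b)
{
    float ga = fabsf(g_grad[*(const int *)a]);
    float gb = fabsf(g_grad[*(const int *)b]);
    return (ga < gb) - (ga > gb);   /* larger magnitudes first */
}

/* Writes the k largest-magnitude entries of grad[0..n) into
   out_idx/out_val; only these pairs would be communicated. */
void sparsify_topk(const float *grad, int n, int k,
                   int *out_idx, float *out_val)
{
    int *order = malloc((size_t)n * sizeof *order);
    for (int i = 0; i < n; i++) order[i] = i;
    g_grad = grad;
    qsort(order, (size_t)n, sizeof *order, by_magnitude_desc);
    for (int i = 0; i < k && i < n; i++) {
        out_idx[i] = order[i];
        out_val[i] = grad[order[i]];
    }
    free(order);
}

In practice, a partial-selection algorithm would typically replace the full sort, and the dropped residual is often accumulated locally and added back in later rounds; both refinements are omitted here for brevity.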
Gagan Agrawal (Advisor)
Feng Qin (Committee Member)
Radu Teodorescu (Committee Member)
344 p.

Recommended Citations


  • Jiang, P. (2019). Enabling Efficient Parallelism for Applications with Dependences and Irregular Memory Accesses [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1563461701172592

    APA Style (7th edition)

  • Jiang, Peng. Enabling Efficient Parallelism for Applications with Dependences and Irregular Memory Accesses. 2019. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1563461701172592.

    MLA Style (8th edition)

  • Jiang, Peng. "Enabling Efficient Parallelism for Applications with Dependences and Irregular Memory Accesses." Doctoral dissertation, Ohio State University, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=osu1563461701172592

    Chicago Manual of Style (17th edition)