Benchmarking and Accelerating TensorFlow-based Deep Learning on Modern HPC Systems

Abstract Details

2018, Master of Science, Ohio State University, Computer Science and Engineering.
Google’s TensorFlow is one of the most popular Deep Learning (DL) frameworks in the community. gRPC, a Remote Procedure Call (RPC) framework also developed by Google, is the main communication engine for distributed TensorFlow. TensorFlow primarily uses gRPC to exchange tensors and to communicate administrative tasks among processes across nodes. Tensor updates during the training phase are communication intensive, so TensorFlow’s performance depends heavily on the underlying network and the efficiency of the communication engine. Apart from the default gRPC channel, TensorFlow supports high-performance channels for transferring tensors efficiently, such as gRPC+Verbs and gRPC+MPI. However, at present, the community lacks a thorough characterization of these available distributed TensorFlow communication channels. This understanding is critical because high-performance Deep Learning with TensorFlow on modern HPC systems needs an efficient communication runtime. In this work, we first conduct a meticulous analysis of the communication characteristics of distributed TensorFlow over all available channels. Based on these characteristics, we propose the TF-gRPC-Bench micro-benchmark suite, which enables system researchers to quickly understand the impact of the underlying network and communication runtime on DL workloads. We propose three micro-benchmarks that take into account TensorFlow’s DL workload characteristics over gRPC. Furthermore, our characterization shows that none of the existing channels in TensorFlow can support adaptive and efficient communication for DL workloads with different message sizes. Moreover, the community must maintain these different channels, while users are expected to tune them to get the desired performance. Therefore, this work proposes a unified approach: a single gRPC runtime (i.e., AR-gRPC) in TensorFlow with Adaptive and efficient RDMA protocols.
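The measurement idea behind a point-to-point micro-benchmark suite like TF-gRPC-Bench can be sketched in plain Python. The sketch below is an illustration only: it uses raw loopback sockets rather than gRPC, and the payload sizes and message counts are assumptions standing in for the small/medium/large tensor payloads a DL workload pushes through the communication runtime, not values from the thesis.

```python
# Hedged sketch of a point-to-point bandwidth probe: a sender pushes a fixed
# number of fixed-size payloads to a receiver and reports achieved throughput.
# Transport (raw TCP sockets) and sizes are illustrative assumptions.
import socket
import threading
import time

def _drain(srv, total):
    """Receiver side: accept one connection and consume `total` bytes."""
    conn, _ = srv.accept()
    remaining = total
    while remaining > 0:
        chunk = conn.recv(min(65536, remaining))
        if not chunk:
            break
        remaining -= len(chunk)
    conn.close()
    srv.close()

def measure(n_msgs, size):
    """Send n_msgs payloads of `size` bytes over loopback; return bytes/sec."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))          # let the OS pick a free port
    srv.listen(1)
    port = srv.getsockname()[1]
    t = threading.Thread(target=_drain, args=(srv, n_msgs * size))
    t.start()

    cli = socket.create_connection(("127.0.0.1", port))
    payload = b"x" * size
    start = time.perf_counter()
    for _ in range(n_msgs):
        cli.sendall(payload)
    t.join()                            # wait until the receiver has drained
    elapsed = time.perf_counter() - start
    cli.close()
    return n_msgs * size / elapsed

# Illustrative small / medium / large payload sizes (not from the thesis).
for size in (1024, 64 * 1024, 1024 * 1024):
    bw = measure(32, size)
    print(f"{size:>8} B payloads: {bw / 1e6:.1f} MB/s")
```

Sweeping the payload size like this is what exposes the behavior the abstract describes: a channel that performs well for large tensors may have poor small-message latency, and vice versa.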
In AR-gRPC, we propose designs such as hybrid communication protocols, message pipelining and coalescing, and zero-copy transmission to make our runtime adaptive to different message sizes for DL workloads. Our evaluations show that AR-gRPC can speed up gRPC by up to 4.1x over the default gRPC design on IPoIB and by up to 2.3x over another RDMA-based gRPC design in the community. By integrating AR-gRPC with TensorFlow, we can achieve up to 3x distributed training performance improvement over default gRPC-IPoIB-based TensorFlow.
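The "adaptive to different message sizes" idea can be illustrated with a small sketch. Everything below is an assumption for illustration, not AR-gRPC's actual design: the class name, the callable transport interface, and the 8 KB threshold are all hypothetical. The policy shown is the general one the abstract hints at: coalesce small messages to amortize per-message overhead, and send large tensors eagerly, flushing first to preserve ordering.

```python
# Hedged sketch of size-adaptive sending: small messages are coalesced into a
# buffer; large ones bypass it. Threshold and names are illustrative only.
EAGER_THRESHOLD = 8 * 1024  # assumed cutoff between "small" and "large"

class AdaptiveSender:
    def __init__(self, transport, threshold=EAGER_THRESHOLD):
        self.transport = transport      # callable taking one bytes payload
        self.threshold = threshold
        self.buffer = []
        self.buffered_bytes = 0

    def send(self, msg: bytes):
        if len(msg) >= self.threshold:
            self.flush()                # preserve ordering before eager send
            self.transport(msg)         # large tensor: send immediately
        else:
            self.buffer.append(msg)     # small message: coalesce
            self.buffered_bytes += len(msg)
            if self.buffered_bytes >= self.threshold:
                self.flush()            # buffer full enough: ship the batch

    def flush(self):
        if self.buffer:
            self.transport(b"".join(self.buffer))
            self.buffer = []
            self.buffered_bytes = 0

sent = []
s = AdaptiveSender(sent.append)
for _ in range(4):
    s.send(b"a" * 1024)        # four small messages: buffered, not yet sent
s.send(b"b" * 16384)           # one large tensor: flush batch, then eager send
s.flush()
print([len(p) for p in sent])  # → [4096, 16384]
```

A single runtime applying a policy like this is what lets one channel serve both the many small administrative messages and the large tensor payloads of a training run, instead of maintaining separate tuned channels.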
Dhabaleswar K. Panda (Advisor)
Christopher Stewart (Committee Member)
Xiaoyi Lu (Committee Member)
84 p.

Recommended Citations


  • Biswas, R. (2018). Benchmarking and Accelerating TensorFlow-based Deep Learning on Modern HPC Systems [Master's thesis, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1531827968620294

    APA Style (7th edition)

  • Biswas, Rajarshi. Benchmarking and Accelerating TensorFlow-based Deep Learning on Modern HPC Systems. 2018. Ohio State University, Master's thesis. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1531827968620294.

    MLA Style (8th edition)

  • Biswas, Rajarshi. "Benchmarking and Accelerating TensorFlow-based Deep Learning on Modern HPC Systems." Master's thesis, Ohio State University, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=osu1531827968620294

    Chicago Manual of Style (17th edition)