Designing Scalable and Efficient I/O Middleware for Fault-Resilient High-Performance Computing Clusters

Raja Chandrasekar, Raghunath

2014, Doctor of Philosophy, Ohio State University, Computer Science and Engineering.
In high-performance computing (HPC), tightly coupled parallel applications run in lock-step over thousands to millions of processor cores. These applications simulate a wide range of scientific phenomena, such as hurricanes and earthquakes, or the functioning of a human heart. The results of these simulations are important and time-critical, e.g., we want to know the path of a hurricane before it makes landfall. Thus, these applications are run on the fastest supercomputers in the world at the largest scales possible. However, due to the increased component count, large-scale executions are more prone to faults, with Mean Times Between Failures (MTBF) on the order of hours or days due to hardware breakdowns and soft errors. The vast majority of current-generation HPC systems and application codes work around system failures using rollback-recovery schemes, also known as Checkpoint-Restart (CR), wherein the parallel processes of an application frequently save a mutually agreed-upon state of their execution as checkpoints in a globally shared storage medium. In the face of failures, applications roll back their execution to a fault-free state using these periodically saved snapshots. Over the years, checkpointing mechanisms have gained notoriety for their colossal I/O demands. While state-of-the-art parallel file systems are optimized for concurrent accesses from millions of processes, checkpointing overheads continue to dominate application run times, with a single checkpoint taking on the order of tens of minutes to hours to write. On future systems, checkpointing activities are predicted to dominate compute time and overwhelm file system resources. On supercomputing systems geared for Exascale, parallel applications will have a wider range of storage media to choose from: on-chip/off-chip caches, node-level RAM, Non-Volatile Memory (NVM), distributed RAM, flash storage (SSDs), HDDs, parallel file systems, and archival storage.
Current-generation checkpointing middleware and frameworks are oblivious to this storage hierarchy, in which each medium has unique performance and data-persistence characteristics. This thesis proposes a cross-layer framework that leverages this hierarchy of storage media to design scalable and low-overhead fault-tolerance mechanisms that are inherently I/O bound. The key components of the framework are: CRUISE, a highly scalable in-memory checkpointing system that leverages both volatile and Non-Volatile Memory technologies; Stage-FS, a lightweight data-staging system that leverages burst buffers and SSDs to asynchronously move application snapshots to a remote file system; Stage-QoS, a file-system-agnostic Quality-of-Service mechanism for data-staging systems that minimizes network contention; MIC-Check, a distributed checkpoint-restart system for coprocessor-based supercomputing systems; Power-Check, an energy-efficient checkpointing framework for transparent and application-aware HPC checkpointing systems; and FTB-IPMI, an out-of-band fault-prediction mechanism that proactively monitors for failures. The components of this framework have been evaluated up to a scale of three million compute processes, have reduced the checkpointing overhead on scientific applications by a factor of 30, and have reduced the amount of energy consumed by checkpointing systems by up to 48%.
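
As a rough illustration of the rollback-recovery idea described in the abstract, the following is a minimal single-process sketch in Python. It is not taken from the thesis and does not represent CRUISE, Stage-FS, or any other component named above; the function names (`save_checkpoint`, `load_checkpoint`, `run`), the `fail_at` test hook, and the toy accumulation workload are all assumptions made for illustration. It shows only the basic pattern: save state periodically, and after a failure resume from the most recent saved state instead of from the beginning.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push the data to stable storage before renaming
    os.replace(tmp, path)  # a reader sees either the old or the new checkpoint


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, interval, fail_at=None):
    """Toy iterative computation that checkpoints every `interval` steps.

    `fail_at` is a hypothetical hook that simulates a crash at a given step.
    """
    # Roll back to the last checkpoint if one exists, else start fresh.
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["acc"] += state["step"]  # stand-in for real computation
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

Calling `run` a second time with the same `path` after a simulated failure resumes from the last checkpoint, so only the steps since that checkpoint are recomputed. A real HPC checkpointing system additionally has to coordinate many processes and choose among storage tiers, which this sketch does not attempt.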
Dhabaleswar Panda (Advisor)
Ponnuswamy Sadayappan (Committee Member)
Radu Teodorescu (Committee Member)
Kathryn Mohror (Committee Member)
198 p.

Recommended Citations

  • Raja Chandrasekar, R. (2014). Designing Scalable and Efficient I/O Middleware for Fault-Resilient High-Performance Computing Clusters [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1417733721

    APA Style (7th edition)

  • Raja Chandrasekar, Raghunath. Designing Scalable and Efficient I/O Middleware for Fault-Resilient High-Performance Computing Clusters. 2014. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1417733721.

    MLA Style (8th edition)

  • Raja Chandrasekar, Raghunath. "Designing Scalable and Efficient I/O Middleware for Fault-Resilient High-Performance Computing Clusters." Doctoral dissertation, Ohio State University, 2014. http://rave.ohiolink.edu/etdc/view?acc_num=osu1417733721

    Chicago Manual of Style (17th edition)