Handling Soft and Hard Errors for Scientific Applications

2017, Doctor of Philosophy, Ohio State University, Computer Science and Engineering.
Due to the rapid decrease in Mean Time Between Failures (MTBF) in High Performance Computing (HPC), fault tolerance has emerged as a critical topic for improving overall performance in the HPC community. In recent decades, with the decrease in hardware size and the extensive use of near-threshold computation for energy saving, the community has been facing more frequent soft errors than ever. Because soft errors are particularly difficult to detect, a general solution for them is urgently needed.
Our work provides efficient and effective solutions for handling soft and hard errors in parallel systems. We start by addressing the write bottleneck of traditional checkpoint and restart. We exploit the communication structure to find locally finalized data, as well as each process's contribution to globally finalized data. Using this information, we allow each node to take independent checkpoints and thereby achieve uncoordinated checkpointing. We checkpoint asynchronously, overlapping the checkpointing workload with computation so that the system avoids write congestion. We discovered that the impact of soft errors on the output of convergent iterative applications follows a pattern. We developed signature-analysis-based detection with checkpointing-based recovery, driven by the observation that high-order bit flips can severely harm execution but can also be easily detected. Specifically, we have developed signatures for this class of applications. For non-monotonically convergent applications, we observed that the signature of silent data corruption is specific to an application but independent of the size of its input dataset. Based on this observation, we explored an approach that uses machine learning techniques to detect soft errors.
We use an offline machine learning training framework, construct classifiers with representative inputs, and periodically invoke the classifiers during execution to verify the execution status. Our work not only optimizes existing fault tolerance solutions for the general case of faults, but also explores new algorithms that detect and recover from soft errors. We proposed an algorithm-level fault tolerance solution for molecular dynamics applications that detects soft errors and recovers from them. We also developed an algorithm-level recovery strategy, so that applications do not need traditional checkpoints to back up the computation state. Finally, we added fault resilience to the in-situ analysis paradigm. We explored a Map-Reduce-like platform for in-situ analysis and discovered that the runtime execution state can be obtained by utilizing the redundant properties of reduction objects during computation. With the state stored in locations shared among the nodes, we can maintain a checkpoint-restart-like mechanism, and the system can restart from any previous backup if a node fails. We were able to apply the approach both time-wise and space-wise for Smart with reasonable extra overhead.
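The signature-based detection with checkpoint-based recovery mentioned in the abstract can be illustrated with a minimal sketch. This is not the dissertation's implementation: the Jacobi solver, the example matrix, the tolerance `tol`, the checkpoint interval, and every function name below are assumptions chosen for illustration. The assumed signature is that the residual of a monotonically convergent iteration should never grow, so a high-order bit flip shows up as a sudden jump and triggers a rollback to the last in-memory checkpoint.

```python
import struct


def flip_bit(value, bit):
    """Flip one bit of the IEEE-754 binary64 encoding of `value` (hypothetical fault injector)."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def jacobi_step(A, b, x):
    """One Jacobi sweep for the linear system A x = b."""
    n = len(b)
    return [
        (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
        for i in range(n)
    ]


def residual(A, b, x):
    """Infinity norm of b - A x."""
    n = len(b)
    return max(abs(b[i] - sum(A[i][j] * x[j] for j in range(n))) for i in range(n))


def solve(A, b, iters=60, ckpt_every=5, inject_at=None, bit=62, tol=1e-9):
    """Run Jacobi with a monotonic-residual check; return (solution, number of detections)."""
    x = [0.0] * len(b)
    prev_res = residual(A, b, x)
    ckpt_it, ckpt_x, ckpt_res = 0, list(x), prev_res  # in-memory checkpoint
    detections = 0
    it = 0
    while it < iters:
        x = jacobi_step(A, b, x)
        if it == inject_at:
            x[0] = flip_bit(x[0], bit)  # simulate a single soft error
            inject_at = None            # inject only once
        res = residual(A, b, x)
        # Assumed signature: the residual must not grow. The negated
        # comparison also flags NaN and infinity.
        if not (res <= prev_res + tol):
            detections += 1
            it, x, prev_res = ckpt_it, list(ckpt_x), ckpt_res  # roll back
            continue
        prev_res = res
        it += 1
        if it % ckpt_every == 0:
            ckpt_it, ckpt_x, ckpt_res = it, list(x), res
    return x, detections


# Example system (an assumption): symmetric and strictly diagonally dominant
# with a constant diagonal, so the residual shrinks at every Jacobi step.
A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
B = [1.0, 2.0, 3.0]

if __name__ == "__main__":
    print(solve(A, B))               # fault-free run
    print(solve(A, B, inject_at=7))  # run with one injected high-order bit flip
```

In this sketch a flip of bit 62 (the top exponent bit) inflates the corrupted value by many orders of magnitude, which makes the residual check trivial. A low-order mantissa flip would change the residual by less than `tol` and pass unnoticed, which is consistent with the abstract's focus on high-order bit flips.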
Gagan Agrawal (Advisor)
Mircea-Radu Teodorescu (Committee Member)
P. Sadayappan (Committee Member)
175 p.

Recommended Citations

  • Liu, J. (2017). Handling Soft and Hard Errors for Scientific Applications [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1483632126075067

    APA Style (7th edition)

  • Liu, Jiaqi. Handling Soft and Hard Errors for Scientific Applications. 2017. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1483632126075067.

    MLA Style (8th edition)

  • Liu, Jiaqi. "Handling Soft and Hard Errors for Scientific Applications." Doctoral dissertation, Ohio State University, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=osu1483632126075067

    Chicago Manual of Style (17th edition)