Files
Dissertation.pdf (8.4 MB)
Handling Soft and Hard Errors for Scientific Applications
Author Info
Liu, Jiaqi
ORCID® Identifier
http://orcid.org/0000-0003-4617-6960
Permalink:
http://rave.ohiolink.edu/etdc/view?acc_num=osu1483632126075067
Abstract Details
Year and Degree
2017, Doctor of Philosophy, Ohio State University, Computer Science and Engineering.
Abstract
Due to the rapid decrease in Mean Time Between Failures (MTBF) in High Performance Computing (HPC), fault tolerance has emerged as a critical topic for improving overall performance in the HPC community. In recent decades, with the decrease in hardware size and the extensive use of near-threshold computation for energy saving, the community has come to face more frequent soft errors than ever. Because soft errors are particularly difficult to detect, a general solution for them is urgently needed.
Our work provides efficient and effective solutions for handling soft and hard errors in parallel systems. We start by addressing the write bottleneck of traditional checkpoint and restart. We exploit the communication structure to find locally finalized data, as well as each process's contribution to globally finalized data. Using this information, each node can take independent checkpoints, which achieves uncoordinated checkpointing. We checkpoint asynchronously, overlapping the checkpointing workload with computation, so that the system avoids write congestion.
We discovered that the impact of soft errors on the output of convergent iterative applications follows a pattern. We developed signature-analysis-based detection with checkpointing-based recovery, driven by the observation that high-order bit flips can severely harm execution but can also be easily detected. Specifically, we have developed signatures for this class of applications. For non-monotonically convergent applications, we observed that the signature of silent data corruption is specific to an application but independent of the application's input dataset size. Based on this observation, we explored an approach that uses machine learning techniques to detect soft errors.
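To make the observation about high-order bit flips concrete, the following is a minimal sketch and not the dissertation's method. It assumes a toy convergent iteration (Newton's method for the square root of 2) and uses the residual as a stand-in "signature". The names `flip_bit`, `run`, and the `jump` threshold are hypothetical and chosen only for illustration.

```python
import struct


def flip_bit(x: float, bit: int) -> float:
    """Return x with one bit of its IEEE-754 double encoding flipped.

    Bit 0 is the least significant mantissa bit; bits 52-62 hold the exponent.
    """
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return y


def run(inject_at=None, bit=61, steps=40, jump=10.0):
    """Iterate x <- (x + 2/x) / 2, which converges to sqrt(2).

    Assumed signature: the residual |x*x - 2| should shrink from step to
    step. A step whose residual exceeds `jump` times the previous residual
    (with a small floor to ignore rounding noise) is flagged as a suspected
    soft error. Returns (final_x, flagged_steps).
    """
    x, prev_res, flagged = 1.0, None, []
    for step in range(steps):
        x = 0.5 * (x + 2.0 / x)
        if step == inject_at:
            x = flip_bit(x, bit)  # simulated soft error
        res = abs(x * x - 2.0)
        if prev_res is not None and res > jump * max(prev_res, 1e-12):
            flagged.append(step)
        prev_res = res
    return x, flagged
```

In this toy setting, calling `run(inject_at=5, bit=61)` flips a high exponent bit and produces a residual jump that the check flags at the injection step, while `run(inject_at=5, bit=0)` flips the lowest mantissa bit, goes unflagged, and the iteration still converges. A real application would pair such a check with rollback to a checkpoint.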
We use an off-line machine learning training framework: we construct classifiers with representative inputs and periodically invoke them during execution to verify the application's status. Our work not only optimizes existing fault tolerance solutions to handle the general case of faults, but also explores new algorithms that detect and recover from soft errors. We proposed an algorithm-level fault tolerance solution for molecular dynamics applications that detects soft errors and recovers from them. We also developed an algorithm-level recovery strategy, so that applications do not need traditional checkpoints to back up the computation state.
Finally, we added fault resilience to the in-situ analysis paradigm. We explored a MapReduce-like platform for in-situ analysis and found that it is possible to obtain the runtime execution state by utilizing the redundant properties of reduction objects during computation. With the state stored in locations shared among the nodes, we can maintain a checkpoint-restart-like mechanism, and the system can restart from any previous backup if a node fails. We were able to apply the approach both time-wise and space-wise for Smart with reasonable extra overhead.
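The idea of backing up a reduction object and restarting from it can be sketched as follows. This is an illustrative toy under stated assumptions, not Smart's API: the reduction object is a histogram dict, `backup_store` is a plain dict standing in for storage shared among nodes, and `process`, `run_with_recovery`, `every`, and `fail_at` are hypothetical names.

```python
import copy


def process(chunks, backup_store, every=2, fail_at=None, start=None):
    """Fold data chunks into a reduction object (here, a histogram dict).

    After every `every` chunks, a copy of the reduction object and the index
    of the next unprocessed chunk are written to `backup_store`. `fail_at`
    simulates a node failure just before that chunk is processed.
    """
    if start is None:
        next_chunk, robj = 0, {}
    else:
        next_chunk, robj = start["next_chunk"], copy.deepcopy(start["robj"])
    for i in range(next_chunk, len(chunks)):
        if i == fail_at:
            raise RuntimeError(f"simulated node failure at chunk {i}")
        for value in chunks[i]:
            robj[value] = robj.get(value, 0) + 1
        if (i + 1) % every == 0:
            backup_store["latest"] = {
                "next_chunk": i + 1,
                "robj": copy.deepcopy(robj),
            }
    return robj


def run_with_recovery(chunks, fail_at):
    """Run once with a simulated failure, then resume from the last backup."""
    store = {}
    try:
        return process(chunks, store, fail_at=fail_at)
    except RuntimeError:
        # Restart from the latest backup, or from scratch if none exists.
        return process(chunks, store, start=store.get("latest"))
```

Because the reduction object is small compared with the input data, copying it is cheap in this sketch, and a restarted run only reprocesses the chunks seen since the last backup. For example, `run_with_recovery(chunks, fail_at=3)` yields the same histogram as a failure-free `process(chunks, {})`.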
Committee
Gagan Agrawal (Advisor)
Mircea-Radu Teodorescu (Committee Member)
P. Sadayappan (Committee Member)
Pages
175 p.
Subject Headings
Computer Engineering; Computer Science
Keywords
hard error; soft error; scientific application; fault tolerance; resilience
Recommended Citations
Liu, J. (2017). Handling Soft and Hard Errors for Scientific Applications [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1483632126075067
APA Style (7th edition)
Liu, Jiaqi. Handling Soft and Hard Errors for Scientific Applications. 2017. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1483632126075067.
MLA Style (8th edition)
Liu, Jiaqi. "Handling Soft and Hard Errors for Scientific Applications." Doctoral dissertation, Ohio State University, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=osu1483632126075067
Chicago Manual of Style (17th edition)
Document number:
osu1483632126075067
Download Count:
580
Copyright Info
© 2017, all rights reserved.
This open access ETD is published by The Ohio State University and OhioLINK.