Combating Fault Tolerance Bugs in Cloud Systems

2021, Doctor of Philosophy, Ohio State University, Computer Science and Engineering.
Cloud systems play an increasingly important role in daily life, so their dependability matters more than ever. At the scale of cloud systems, both in hardware size and in software complexity, faults (e.g., network partitions) are inevitable. Hence, a dependable cloud system strives to tolerate faults correctly. Despite developers' efforts in designing and implementing fault-tolerant cloud systems, faults can still trigger software bugs in the fault tolerance routines and lead to cloud system failures. This dissertation uses fault tolerance bugs, or FTBugs, to denote the bugs in the fault tolerance routines. To address FTBugs in cloud systems, this dissertation proposes approaches to expose and detect FTBugs based on a comprehensive study of FTBugs in real-world cloud systems.
The first contribution of this dissertation is a comprehensive study of an important type of FTBug in real-world cloud systems. Specifically, the study focuses on eBugs, i.e., bugs in using the exception mechanism for fault tolerance. The study analyzes the triggering conditions, the root causes, and the failure symptoms of 210 eBugs selected from six popular cloud systems. More importantly, this is the first study that analyzes the relation between the triggering conditions and the root causes of FTBugs. The study shows that eBugs are severe in cloud systems. More crucially, the study also reveals findings that can help effectively expose eBugs. Finally, the study finds the triggering conditions useful for detecting eBugs in cloud systems.
The second contribution of this dissertation consists of two techniques for detecting an important type of FTBug. When performing fault tolerance through the exception mechanism, it is crucial that the propagated exception accurately represents the triggering fault. An inaccurate exception eBug occurs when an exception inaccurately represents its triggering fault. Inaccurate exceptions can affect cloud system dependability by sabotaging the fault tolerance routines. To detect inaccurate exceptions, this dissertation proposes two techniques called DIET and DECAF. DIET employs a supervised approach: it detects inaccurate exceptions by checking whether the class and the error message of an exception imply different types of faults. In contrast, DECAF employs an unsupervised approach: it detects inaccurate exceptions by checking whether the class, the error message, and the program context of an exception rarely appear together. Experiments with popular cloud systems show that both DIET and DECAF are effective in detecting inaccurate exceptions, with different trade-offs.
The third contribution of this dissertation is a technique for exposing an important type of FTBug. Network partitions are inevitable in cloud systems. Although cloud systems strive to tolerate network partitions, network partitions can still trigger FTBugs. To address this problem, this dissertation proposes a fault injection technique called CoFI. Based on the observation that bugs triggered by network partitions, i.e., partition bugs, are more likely to occur in inconsistent system states, CoFI controls the timing of the network partition to systematically test a cloud system in inconsistent states. Experiments with popular cloud systems show that CoFI is effective in exposing partition bugs.
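As a rough illustration of the two detection ideas described in the abstract, the following Python sketch shows a class-versus-message mismatch check and a rarity check over (class, message, context) combinations. It is a toy under stated assumptions, not the dissertation's DIET or DECAF implementation: the lookup tables, the fault labels, the function names, and the `max_count` threshold are all hypothetical, and hand-written keyword tables stand in for whatever models the actual techniques use.

```python
from collections import Counter

# Hypothetical lookup tables (assumptions for illustration only).
# A real supervised approach would learn these mappings from labeled data.
CLASS_TO_FAULT = {
    "FileNotFoundException": "missing_file",
    "SocketTimeoutException": "network",
    "ConnectException": "network",
}
MESSAGE_KEYWORDS = {
    "timed out": "network",
    "connection refused": "network",
    "no such file": "missing_file",
}


def fault_from_message(message):
    """Return the fault type implied by the message, or None if unknown."""
    text = message.lower()
    for keyword, fault in MESSAGE_KEYWORDS.items():
        if keyword in text:
            return fault
    return None


def class_message_mismatch(exc_class, message):
    """Flag an exception whose class and message imply different faults.

    Returns False when either side is unknown, since no conflict can be shown.
    """
    class_fault = CLASS_TO_FAULT.get(exc_class)
    message_fault = fault_from_message(message)
    if class_fault is None or message_fault is None:
        return False
    return class_fault != message_fault


def rare_combinations(observations, max_count=1):
    """Return (class, message, context) tuples seen at most max_count times.

    Rare combinations are only candidates for inspection, not confirmed bugs.
    """
    counts = Counter(observations)
    return [combo for combo, n in counts.items() if n <= max_count]
```

For example, a `FileNotFoundException` carrying the message "Connection refused by host" would be flagged by `class_message_mismatch`, because the class suggests a missing file while the message suggests a network fault. The rarity check needs no labels, which mirrors the supervised versus unsupervised trade-off the abstract mentions.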
Feng Qin (Advisor)
Radu Teodorescu (Committee Member)
Yang Wang (Committee Member)
145 p.

Recommended Citations

  • Chen, H. (2021). Combating Fault Tolerance Bugs in Cloud Systems [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1608576410307739

    APA Style (7th edition)

  • Chen, Haicheng. Combating Fault Tolerance Bugs in Cloud Systems. 2021. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1608576410307739.

    MLA Style (8th edition)

  • Chen, Haicheng. "Combating Fault Tolerance Bugs in Cloud Systems." Doctoral dissertation, Ohio State University, 2021. http://rave.ohiolink.edu/etdc/view?acc_num=osu1608576410307739

    Chicago Manual of Style (17th edition)