Mitigating Distributed Configuration Errors in Cloud Systems

2022, Doctor of Philosophy, Ohio State University, Computer Science and Engineering.
While many techniques have been proposed to find configuration errors in software systems, most of them focus on misconfigurations that occur on a single node. Unfortunately, the nature of distributed systems raises a more complex problem: some failures occur only when a system is configured inappropriately across multiple nodes, even though the configuration of each node is considered correct individually. To distinguish these errors from local configuration errors, which have been widely studied, we call them distributed configuration errors. In this dissertation, we combat distributed configuration errors in two ways: 1) we redesign the system to reduce the chance that an administrator introduces an inappropriate distributed configuration; 2) we use traditional software testing to identify which distributed configurations are unsafe. In the first direction, we focus on timeouts, an important parameter that is hard to configure correctly. We propose SafeTimer, a mechanism that enhances existing timeout-based failure detection protocols to tolerate long delays in the OS and the application: at the heartbeat receiver, SafeTimer checks whether there are any pending heartbeats before reporting a failure; at the heartbeat sender, SafeTimer blocks the sender if it cannot send out heartbeats in time. As a result, as long as network delays are bounded, SafeTimer can guarantee the correctness of failure detection. We applied SafeTimer to HDFS and Ceph with little modification and found that the performance overhead is small. In the second direction, we propose ZebraConf, a testing framework that reuses existing unit tests and integration tests to test whether a parameter can be configured in a heterogeneous manner. To address the challenge of assigning different configurations to different nodes in unit tests, ZebraConf incorporates several heuristics to accurately map configuration objects to nodes. To reduce the large number of tests, ZebraConf profiles unit test suites to generate only effective tests and groups multiple tests into a single one. We applied ZebraConf to five cloud systems and found 47 heterogeneous-unsafe configuration parameters.
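
The two SafeTimer checks described in the abstract can be illustrated with a minimal sketch. This is not the dissertation's implementation: the class names `Receiver` and `Sender`, the explicit `now` argument standing in for a clock, and the assumption of zero network delay are all hypothetical choices made to keep the example small.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Receiver:
    """Timeout-based failure detector with a pending-heartbeat check."""
    timeout: float
    last_seen: float = 0.0
    # Heartbeats that have arrived but are not yet processed, for example
    # because the receiving process was stalled (hypothetical queue).
    pending: deque = field(default_factory=deque)

    def deliver(self, arrived_at: float) -> None:
        """Record the arrival of a heartbeat without processing it."""
        self.pending.append(arrived_at)

    def sender_failed(self, now: float) -> bool:
        """Drain pending heartbeats first, then apply the timeout."""
        while self.pending:
            self.last_seen = max(self.last_seen, self.pending.popleft())
        return now - self.last_seen > self.timeout


@dataclass
class Sender:
    """Heartbeat sender that blocks itself after missing its deadline."""
    timeout: float  # assumed to match the receiver's timeout
    last_sent: float = 0.0
    blocked: bool = False

    def try_send(self, now: float, receiver: Receiver) -> bool:
        """Send a heartbeat, or block if the previous one is too old."""
        if self.blocked or now - self.last_sent > self.timeout:
            # The receiver may already consider this node dead, so the
            # sender stops instead of continuing to act as alive.
            self.blocked = True
            return False
        self.last_sent = now
        receiver.deliver(now)  # assumption: zero network delay
        return True


receiver = Receiver(timeout=5.0)
sender = Sender(timeout=5.0)

sender.try_send(1.0, receiver)
sender.try_send(4.0, receiver)
# The receiver was stalled until t=8.0; the heartbeat from t=4.0 is
# still pending, so draining it first avoids a false failure report.
stalled_report = receiver.sender_failed(8.0)

# The sender was stalled until t=20.0, so it blocks itself.
late_send = sender.try_send(20.0, receiver)
late_report = receiver.sender_failed(20.0)
```

In this sketch a detector that skipped the drain step would compare `8.0` against the heartbeat processed at `1.0` and wrongly report a failure, which is the kind of false report the receiver-side check is meant to prevent.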
Yang Wang, Dr (Advisor)
Michael Bond, Dr (Committee Member)
Xiaoyi Lu, Dr (Committee Member)
Kannan Srinivasan, Dr (Committee Member)
Feng Qin, Dr (Committee Member)

Recommended Citations

  • Ma, S. (2022). Mitigating Distributed Configuration Errors in Cloud Systems [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu164912259816919

    APA Style (7th edition)

  • Ma, Sixiang. Mitigating Distributed Configuration Errors in Cloud Systems. 2022. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu164912259816919.

    MLA Style (8th edition)

  • Ma, Sixiang. "Mitigating Distributed Configuration Errors in Cloud Systems." Doctoral dissertation, Ohio State University, 2022. http://rave.ohiolink.edu/etdc/view?acc_num=osu164912259816919

    Chicago Manual of Style (17th edition)