Addressing Configuration Errors / Yuanyuan Zhou
Failures are a fact of life in today’s large-scale, rapidly changing cloud and data center systems. To mitigate their impact, tolerance and recovery mechanisms have been widely adopted, such as data and node redundancy and support for fast rebooting and rollback. While these mechanisms are successful in handling individual machine failures (e.g., hardware faults and memory bugs), they are less effective in handling configuration errors, especially errors in the configurations that control failure handling itself. Moreover, the same configuration error is very often deployed onto thousands of nodes and resides in persistent files on each node, making it hard to tolerate through redundancy or server rebooting. As a result, configuration errors have become one of the major causes of failures in large-scale cloud and Internet systems, as reported by many system vendors and service providers. For similar reasons, configuration errors have also introduced many security vulnerabilities, especially in data centers, where the layers of access control are hard for system administrators to set up correctly. We have conducted almost six years of research on this topic, ranging from understanding data center configuration errors to see what types of mistakes system administrators typically make [SOSP’11, FSE’15, CHI’17], to designing innovative approaches in software testing and configuration design that reduce configuration errors [SOSP’13, OSDI’16 best paper, EuroSys’15], to diagnosing such errors and recommending good configuration practices [ASPLOS’13].