• Starts: 9:15 am on Friday, July 23, 2021
  • Ends: 11:15 am on Friday, July 23, 2021

Title: Automated analytics systems to navigate the complexity of performance unpredictability in cloud applications

Presenter: Mert Toslali

Advisor: Professor Ayse Coskun (ECE)

Chair: Professor Alan Liu (ECE)

Committee: Professor Orran Krieger (ECE), Fabian Oliveira (IBM Research)

Abstract: Performance unpredictability is a major roadblock to cloud adoption and has both cost and revenue ramifications. As a result, engineers spend much of their time (1) diagnosing performance problems using application monitoring data and (2) frequently updating their applications to remediate performance issues. State-of-the-art systems for (1) propose automated techniques to choose the monitoring data (e.g., logs) needed to reduce time-to-solution in diagnosis efforts. However, these systems either focus on correctness problems rather than performance, are designed for single machines rather than distributed applications, or ignore requests' workflows and thus may enable logs in areas of the application that are not performance-sensitive. Automation solutions for (2) aim to prevent a malfunctioning version from being fully deployed by rolling it back if needed. However, they lack the statistical rigor necessary to accurately assess and compare application versions, and they ignore optimization for business-oriented metrics (e.g., click-through rate) under operational constraints (i.e., service level objectives [SLOs]). They are therefore prone to incorrect rollout decisions.

This thesis seeks to design automated analytics methods and systems for cloud applications that minimize dependence on expert knowledge, reduce time-to-solution, and help make applications more resilient. The thesis is divided into two research thrusts covering two significant aspects of overcoming cloud performance unpredictability. In the first thrust, we aim to demonstrate that dynamically adjusting instrumentation (i.e., automatically enabling distributed traces) using statistically driven techniques helps localize performance problems and provides the discriminative context (i.e., request workflows) needed for effective diagnosis. In the second, we aim to show that learning-based methods that consider both SLOs and business concerns, unlike existing solutions, improve the accuracy of online experimentation and, in turn, push application performance beyond current limits. This thesis aims to make the following contributions: (1) a distributed tracing framework for cloud applications that dynamically adjusts instrumentation for effective diagnosis of performance problems, and (2) a learning-based, statistically robust online experimentation system for version rollouts in cloud applications.
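To give a flavor of the first thrust, the sketch below shows one simple, hypothetical way that instrumentation could be adjusted dynamically: detailed tracing is enabled only for requests whose latency exceeds a percentile threshold estimated from recent observations. This is an illustrative simplification, not the statistical machinery of the thesis's actual framework; the class and parameter names are invented for this example.

```python
class AdaptiveTracer:
    """Enable detailed tracing only for requests that look anomalous.

    Keeps a sliding window of recent latencies and flags a request for
    tracing when its latency exceeds the estimated tail percentile.
    """

    def __init__(self, window=1000, percentile=0.99, warmup=100):
        self.window = window          # number of recent samples to keep
        self.percentile = percentile  # tail threshold, e.g., p99
        self.warmup = warmup          # minimum samples before triggering
        self.latencies = []

    def observe(self, latency_ms):
        """Record a completed request's latency."""
        self.latencies.append(latency_ms)
        if len(self.latencies) > self.window:
            self.latencies.pop(0)

    def should_trace(self, latency_ms):
        """Decide whether this request warrants full instrumentation."""
        if len(self.latencies) < self.warmup:
            return False  # not enough data to estimate the tail yet
        ranked = sorted(self.latencies)
        threshold = ranked[int(self.percentile * (len(ranked) - 1))]
        return latency_ms > threshold
```

In a real system the trigger would be integrated with a distributed tracing backend and a more rigorous statistical test, but the core idea — spend instrumentation budget only where performance deviates — is the same.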
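For the second thrust, a rollout decision that respects SLOs while optimizing a business metric can be caricatured as a constrained bandit: versions violating the latency SLO are excluded, and traffic is routed to the remaining version with the best business metric, with occasional exploration. The function below is a minimal epsilon-greedy sketch under invented names ("baseline", "reward", "p99_ms"); the thesis's actual system uses statistically rigorous assessment, which this example omits.

```python
import random

def choose_version(stats, slo_ms, epsilon=0.1):
    """Pick the version to route the next request to.

    stats maps version -> {'reward': mean business metric (e.g.,
    click-through rate), 'p99_ms': observed tail latency}. Versions
    violating the latency SLO are excluded; among the rest, explore
    with probability epsilon, otherwise exploit the best reward.
    """
    feasible = {v: s for v, s in stats.items() if s["p99_ms"] <= slo_ms}
    if not feasible:
        return "baseline"  # all candidates violate the SLO: roll back
    if random.random() < epsilon:
        return random.choice(list(feasible))  # explore
    return max(feasible, key=lambda v: feasible[v]["reward"])  # exploit
```

Note how the SLO acts as a hard constraint (feasibility filter) while the business metric drives the optimization — the combination the abstract argues existing rollout tools lack.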