Conf42: Chaos Engineering 2021


Taming the spatio-temporal-causal uncertainty in Chaos Engineering and Observability

Mahesh Venkataraman
Managing Director @ Accenture


Observability poses two challenges: uncertainty in prognosis decisions (false positives and false negatives in failure predictions) and the discovery of causal connections during diagnosis. We address the first by modeling spatio-temporal uncertainty for prognosis, and the second by using knowledge representation and a graph database for causal diagnosis.

Distributed systems are complex and prone to failure. As more enterprises migrate their on-prem systems to the cloud, the risk of failure increases. These failures often stem from the unpredictability of production workloads and usage patterns, and from the resulting emergent behavior of distributed systems, which cannot easily be envisaged during design and implementation. The principles of observability, built on three pillars (logging, monitoring and metrics), attempt to observe the internal state of the system in order to perform prognosis of failure modes and post-event failure diagnosis. Observability and chaos engineering are usually combined: planned, thoughtful experiments inject chaos, and analysis of the system's runtime data (observability) uncovers its weaknesses.

There are multiple challenges in proactively discovering failure modes during these experiments. First, the logging and monitoring data from observability tools, and its visualization, is often too voluminous to be ‘actionable’. This data deluge leads to ‘data fatigue’. Second, there is significant uncertainty in deciding whether an observed response is normal or anomalous. Both false positives and false negatives have an impact. Over the course of a series of experiments, this uncertainty manifests in both temporal and spatial dimensions. The earlier the decision point (the further ahead of the expected failure), the greater the chance of false alarms and of ‘alert fatigue’; the later the decision point (i.e. the closer in time to the expected failure), the less useful the decision, since it is most likely too late to take action. Moreover, the longer the causal chain (the spatial separation between an original cause and its later effect), the greater the uncertainty. We propose to use spatio-temporal models to address this uncertainty in prognosis.
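
The trade-off between false positives and false negatives described above can be made concrete with a small sketch. It is not the spatio-temporal model proposed in the talk; it only assumes a hypothetical anomaly score per observation and a single alert threshold, and every name and data point below is made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    score: float   # hypothetical anomaly score in [0, 1] (assumption)
    failed: bool   # whether a failure actually followed this observation


def confusion_counts(observations, threshold):
    """Return (false_positives, false_negatives) for one alert threshold.

    An alert fires when score >= threshold.
    """
    fp = sum(1 for o in observations if o.score >= threshold and not o.failed)
    fn = sum(1 for o in observations if o.score < threshold and o.failed)
    return fp, fn


def sweep(observations, thresholds):
    """Map each candidate threshold to its (false_positives, false_negatives)."""
    return {t: confusion_counts(observations, t) for t in thresholds}


# Illustrative, made-up data: not taken from any real experiment.
OBSERVATIONS = [
    Observation(0.10, False),
    Observation(0.35, False),
    Observation(0.40, True),
    Observation(0.55, False),
    Observation(0.70, True),
    Observation(0.90, True),
]

if __name__ == "__main__":
    for threshold, (fp, fn) in sweep(OBSERVATIONS, [0.3, 0.5, 0.8]).items():
        print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```

On this toy data, an aggressive (low) threshold raises false alarms, while a conservative (high) threshold misses real failures, which loosely mirrors the early-versus-late decision point tension.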

Another challenge in prognosis and diagnosis is determining causality and the connections between events. With a huge amount of observability data, the causal connections between related events are often unclear. We propose to use knowledge representation and graph databases to automate the discovery of such causal connections.
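
As a rough illustration of what discovering causal connections could look like, the sketch below walks upstream through a small cause-and-effect graph to find candidate root causes of a symptom. It is an assumption-laden stand-in: an in-memory dictionary replaces the graph database, and the event names and edges are hypothetical, not taken from the talk.

```python
from collections import deque

# Hypothetical causal edges: cause -> list of direct effects (assumption).
CAUSES = {
    "disk_latency_spike": ["db_slow_queries"],
    "cache_eviction_storm": ["db_slow_queries"],
    "db_slow_queries": ["api_timeouts"],
    "api_timeouts": ["checkout_errors"],
}


def invert(edges):
    """Build an effect -> direct causes mapping from cause -> effects."""
    parents = {}
    for cause, effects in edges.items():
        for effect in effects:
            parents.setdefault(effect, []).append(cause)
    return parents


def root_causes(edges, symptom):
    """Breadth-first walk upstream from a symptom.

    Returns a sorted list of (event, hops) pairs for events that have no
    known cause of their own. `hops` is the length of the causal chain
    between that event and the symptom.
    """
    parents = invert(edges)
    seen = {symptom}
    queue = deque([(symptom, 0)])
    roots = []
    while queue:
        event, hops = queue.popleft()
        upstream = parents.get(event, [])
        if not upstream and event != symptom:
            roots.append((event, hops))
        for cause in upstream:
            if cause not in seen:
                seen.add(cause)
                queue.append((cause, hops + 1))
    return sorted(roots)


if __name__ == "__main__":
    for event, hops in root_causes(CAUSES, "checkout_errors"):
        print(f"{event} ({hops} hops upstream)")
```

The hop count is reported because longer causal chains carry more uncertainty, so a real system might use it to rank or discount candidates. A graph database would replace the dictionary with a persistent store and a traversal query.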
