Conf42 Chaos Engineering 2021 - Online

Taming the spatio-temporal-causal uncertainty in Chaos Engineering and Observability


Abstract

There are two challenges in observability: uncertainty in prognosis decisions (false positives and false negatives in failure predictions), and discovering causal connections during diagnosis. We address these by modeling spatio-temporal uncertainty for prognosis, and by using knowledge representation and graph databases for causal diagnosis.

Distributed systems are complex and prone to failures. With more and more enterprises migrating their on-prem systems to the cloud, the risk of failures also increases. These failures often happen owing to the unpredictability of production workloads and usage patterns, and the consequent emergent response of distributed systems, which cannot be easily envisaged during design and implementation. Observability, built on the three pillars of logging, monitoring, and metrics, attempts to observe the internal state of the system in order to perform prognosis of failure modes and post-event failure diagnosis. Observability and chaos engineering techniques are usually combined by conducting planned and thoughtful experiments (injecting chaos) and uncovering weaknesses in the system by analyzing its runtime data (observability).

There are multiple challenges in proactively discovering failure modes during these experiments. First, the logging and monitoring data, and their visualization in observability tools, are often too overwhelming and voluminous to be ‘actionable’; the data deluge leads to ‘data fatigue’. Second, there is significant uncertainty in the decision to classify an observed response behavior as either normal or anomalous, and both false positives and false negatives have impact. Over a series of experiments, this uncertainty manifests itself in both temporal and spatial dimensions. The earlier the decision point in time (before the actual expected failure), the greater the possibility of false alarms and of ‘alert fatigue’; the later the decision point (i.e., closer to the expected failure in time), the less useful the decision, since it is most likely too late to take action. Moreover, the longer the causal chain (the spatial separation between the original cause and its later effect), the greater the uncertainty. We propose to use spatio-temporal models to address this uncertainty in prognosis.
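As an illustration of the temporal side of this trade-off, here is a minimal Python sketch (our own illustrative simulation, not from the talk; all numbers are invented). It simulates a noisy metric on healthy runs and shows that the lower (earlier-firing) the alert threshold relative to a hypothetical failure level of ~100, the higher the false alarm rate:

```python
import random

random.seed(42)

RUNS = 500          # simulated healthy (non-failing) runs
STEPS = 200         # time steps per run

def healthy_run_peak() -> float:
    """Peak value of a noisy metric over one run where no real failure is brewing."""
    level, peak = 20.0, 20.0
    for _ in range(STEPS):
        level += random.gauss(0.0, 2.0)   # pure noise, no upward drift
        peak = max(peak, level)
    return peak

peaks = [healthy_run_peak() for _ in range(RUNS)]

# Lower thresholds would alert earlier before a failure level of ~100,
# but they fire more often on runs where nothing is actually wrong.
for threshold in (40, 60, 80):
    rate = sum(p >= threshold for p in peaks) / RUNS
    print(f"alert threshold {threshold}: false alarm rate = {rate:.1%}")
```

Running this shows the false alarm rate falling as the threshold moves later (closer to failure), which is exactly the tension described above: early warnings are noisy, late warnings are useless.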

Another challenge in prognosis and diagnosis is determining causality and the connections between events. With huge amounts of observability data, the causal connections between various connected events are often unclear. We propose to use knowledge representation and graph databases to automate the discovery of such causal connections.
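A minimal sketch of what such a representation could look like, using the networkx library and hypothetical event names (a production system would likely use a graph database such as Neo4j, but the traversal idea is the same):

```python
import networkx as nx

g = nx.DiGraph()
# Edges point from suspected cause to observed effect (hypothetical events).
g.add_edges_from([
    ("config_push",      "cpu_spike"),
    ("cpu_spike",        "gc_pauses"),
    ("gc_pauses",        "request_timeouts"),
    ("request_timeouts", "checkout_service_failure"),
    ("disk_pressure",    "request_timeouts"),  # failures form a network, not a single chain
])

failure = "checkout_service_failure"

# All events upstream of the failure are candidate root causes.
print("candidate root causes:", nx.ancestors(g, failure))

# Enumerate the causal chains from one suspected origin to the failure.
for path in nx.all_simple_paths(g, "config_push", failure):
    print("causal chain:", " -> ".join(path))
```

The point of the graph form is that questions like “what could have caused this failure?” become cheap graph traversals rather than manual log correlation.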

Summary

  • Distributed systems are complex and prone to failures. The challenge is to digest massive quantities of data to find patterns and correlate seemingly unrelated events for performing prognosis and diagnosis. There are multiple challenges in proactively discovering failure modes and performing diagnosis.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
The title of my talk is taming the spatio-temporal uncertainty in observability. Distributed systems are complex and are prone to failures. These failures often occur due to the unpredictability of production system workload and usage patterns, and the consequent emergent runtime response of distributed systems, which cannot be easily predicted during design. Observability and chaos engineering techniques are usually combined by conducting planned and thoughtful experiments to uncover weaknesses in the system by analyzing the runtime data of the system.

Outcomes of observability. Observability is a measure of how well the internal states of a system can be inferred from knowledge of its external outputs. In practice, however, observability must enable us to accomplish two objectives. First is failure diagnosis, which is essentially inferring the cause from effects, and second is failure prognosis, which means enabling early warnings of any impending failure based on current observed behavior and projected failure pathways. So the objectives of observability are prognosis and diagnosis, and the data must be interpretable and actionable. Every tool promises more observability, but often we see that more data and interactive dashboards lead to data deluge and decision dilemma. The challenge is to digest massive quantities of data to find patterns and correlate seemingly unrelated events for performing prognosis and diagnosis.

The challenges. There are multiple challenges in proactively discovering failure modes and performing diagnosis during these experiments. Logging and monitoring data and their visualization from observability tools are often too overwhelming and voluminous to be actionable. There is a very low signal-to-noise ratio since many systems interact, and semantic reconciliation and correlation of all that data is very difficult. The large volume of events, nondeterminism, and the reuse of third-party components aggravate the challenge. While many may argue that the more data the merrier, in reality the more data you have, the fewer insights you can discover, due to noise and nondeterminism.

This diagram gives an idea of how one can take an integrated view of prognosis and diagnosis. As you can see, there is a fault or an anomaly at a particular instant. The subsequent event pathways after this event could take three directions. First, it could lead to failure. Second, it is a transient fault and does not lead to any serious failure. Third, it is actually a false alarm and not really a fault as originally believed. As you can see, there is significant uncertainty here: significant uncertainty in deciding to classify an observed response behavior as either normal or anomalous. Both false positives and false negatives have impact during a series of experiments. This uncertainty manifests itself in both temporal and spatial dimensions. The earlier the decision point in time before the actual expected failure, the more the possibility of false alarms and possible alert fatigue; the later the decision point, that is, the closer to the expected failure in time, the less the usefulness of the decision, since it is most likely too late to take action. Moreover, the longer the causal chain, that is, the spatial separation of the original cause and its later effect, the more the uncertainty.
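A toy sketch of the three post-anomaly pathways just described (assumed data and event names, not from the talk): given labeled outcomes from past chaos experiments, one can estimate how often an anomaly of a given type actually escalates to failure versus being transient or a false alarm:

```python
from collections import Counter

# (anomaly_type, outcome) pairs from past experiments; outcomes follow
# the three pathways: escalates to failure, transient, or false alarm.
observations = [
    ("cpu_spike", "failure"),     ("cpu_spike", "transient"),
    ("cpu_spike", "transient"),   ("cpu_spike", "false_alarm"),
    ("mem_leak",  "failure"),     ("mem_leak",  "failure"),
    ("mem_leak",  "transient"),
    ("net_jitter", "false_alarm"), ("net_jitter", "transient"),
]

by_type: dict[str, Counter] = {}
for anomaly, outcome in observations:
    by_type.setdefault(anomaly, Counter())[outcome] += 1

# Empirical pathway probabilities per anomaly type, e.g. P(failure | mem_leak).
for anomaly, counts in by_type.items():
    total = sum(counts.values())
    probs = {o: f"{n / total:.0%}" for o, n in counts.items()}
    print(anomaly, probs)
```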
In causal analysis, there is always confusion and uncertainty between what is cause, what is correlation, what is consequence, what is a confounder, what is coincidence, and what is association. There is always confusion between these various aspects and concepts. Often, with a huge amount of observability data, the causal connections between various connected events are not very clear. What chain of events led to the failure event? Often a failure is never the result of a single chain of events; it is a network, and multiple conditions lead to a failure event. The causality challenge has been a classical one from Aristotelian times to now. It cannot be easily solved, and this problem is often underestimated. There is an assumption that data analysis can determine causality; this is far from true.

In any prognosis and diagnosis process there is uncertainty, as we saw earlier. In prognosis, the uncertainty is: given the current state of the system, what pathways will the system state traverse in time? Will it lead to failure or to normal behavior? If it is going to fail, where will it fail, which layer will fail, what component, and when is it likely to fail? So there is uncertainty in space, the space here being the architecture layer space, and there is uncertainty in time as to when the failure will occur. For example, if there is a spike in CPU and memory usage accompanied by other events, will that likely cause a failure of a service through a long causal chain of connected events? If so, when and where will that occur? Note that this problem can be modeled as a spatio-temporal-causal problem.

We can draw inspiration from other disciplines like city traffic modeling, weather modeling, epidemiology, cancer treatment, and social networks. For example, in cancer prognosis, one could predict where the source of the cancer is and how the metastatic pathways will unfold in the patient; that too is spatio-temporal-causal modeling. In other disciplines like social networks, the problem is very similar to observability. Spatio-temporal-causal data analysis is an emerging research area, thanks to the development and application of novel computational techniques allowing for the analysis of large spatio-temporal databases. Spatio-temporal models arise when data are collected across time as well as space and have at least one spatial and one temporal property. This is true for observability data: every data point has a property of space and a property of time.

Here is a very high-level approach we suggest for taming the spatio-temporal-causal uncertainty. Data from the system under normal conditions is ingested into a spatio-temporal-causal database. The data under fault-injection conditions is also ingested into this database. The spatio-temporal-causal model consists of multiple techniques: statistical techniques, time series analysis, association rules, data-driven prediction techniques, Bayesian networks, pattern recognition, Markov models, cluster analysis, and so on. All this analysis is done in both time and space. In addition, there is also a knowledge graph built through discovery of semantic connections, what we call qualitative reasoning, derived from text outputs like logs, which are very important for certain kinds of reasoning and causal chain analysis. It is also to be emphasized that it is important not to aim at 100% automation of prognosis and diagnosis, but to use the prognosis and diagnosis engine as complementary to human expertise.
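A high-level sketch of the ingest-and-compare step just described, with hypothetical component and metric names and a deliberately simple z-score model standing in for the richer statistical techniques listed above. Each observation carries a spatial key (component) and a timestamp, and data collected under fault injection is scored against the normal-conditions baseline:

```python
import statistics

# (component, metric, timestamp, value) tuples under normal conditions.
baseline = [("api-gw", "latency_ms", t, 50 + (t % 5)) for t in range(100)]

# Same shape, collected during a fault-injection experiment; latency
# degrades late in the run (a synthetic injected fault for illustration).
fault_run = [
    ("api-gw", "latency_ms", t, 50 + (t % 5) + (3 * t if t > 80 else 0))
    for t in range(100)
]

# Build the normal baseline for this (component, metric) pair.
values = [v for _, _, _, v in baseline]
mu, sigma = statistics.mean(values), statistics.stdev(values)

# Score the fault run: each flagged point is localized in both
# space (component/metric) and time (timestamp).
for comp, metric, t, v in fault_run:
    z = (v - mu) / sigma
    if abs(z) > 3:
        print(f"t={t} {comp}/{metric}: value={v} z={z:.1f}")
```

In a real pipeline the per-metric z-score would be replaced by the time series, Bayesian network, and Markov model techniques the talk lists, but the shape of the data, one spatial and one temporal property per observation, stays the same.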
The engine provides various recommendations to the human user, like most likely fault pathways, anomalous behavior alerts, imminent likely failure events, time to failure, hierarchical failure alarms, failure recovery recommendations, and event correlation. As you can see, all these are recommendations that the human expert could interact with, dig deeper into, and explore further to enable better prognosis or better diagnosis. In summary: data deluge is a huge challenge in observability. Integrated prognosis and diagnosis should be the outcome of observability. The problem here is modeled as spatio-temporal-causal uncertainty in predictions and in causal analysis. And last, machines should complement human expertise; they cannot and should not replace human expertise in conducting intelligent prognosis and diagnosis.
...

Mahesh Venkataraman

Managing Director @ Accenture



