Systems fail, but the real failures are the ones from which we learn nothing. This talk is a tale of a few such failures that happened right under our noses and what we did to prevent them. It covers heterogeneous systems, unordered events, missing correlations, and human errors.
Every time there is a failure, there is a root cause analysis (RCA) and a vow not to repeat the mistake. I will take some curious failures that I have dealt with over the past decade of working with infrastructure systems and walk through the steps we had to take to:
2. Limit the spread
3. Prevent it from happening again
## Failure 1 ##
An unreplicated Consul configuration resulted in data loss 25 hours before a countrywide launch. It took a staggering five engineers and 20 hours to find a single changed line.
## Failure 2 ##
A failed etcd lock forced us to rewrite the whole storage layer on Consul and spend hours on migration, only to find out later that it was a clock issue.
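The abstract does not say how the clock issue broke the lock, so the following is only a minimal sketch of one kind of guard that can surface such a problem early: comparing a node's clock against a trusted reference before relying on a time-bounded lease. The function names, the `safety_fraction` parameter, and the 10% threshold are all hypothetical and not taken from etcd or Consul.

```python
from datetime import datetime, timedelta, timezone


def clock_skew(local_now: datetime, reference_now: datetime) -> timedelta:
    """Return the absolute difference between two clock readings."""
    return abs(local_now - reference_now)


def is_lease_safe(local_now: datetime,
                  reference_now: datetime,
                  lease_ttl: timedelta,
                  safety_fraction: float = 0.1) -> bool:
    """Assumed policy: a lease is unsafe once skew exceeds a fraction of its TTL."""
    return clock_skew(local_now, reference_now) <= lease_ttl * safety_fraction


# Illustrative readings: the local node is 7 seconds ahead of the reference.
reference = datetime(2020, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
drifted = reference + timedelta(seconds=7)

# 7 s of skew against a 30 s lease exceeds the 3 s budget, so this prints False.
print(is_lease_safe(drifted, reference, lease_ttl=timedelta(seconds=30)))
```

In practice the reference reading would come from a time source the team trusts; the point of the sketch is that the skew check runs before the lock is used, not after it fails.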
The isolation and immediate fixes above were painfully long yet doable.
The real ambition was to prevent *similar* incidents from repeating. I will share a sample of some of our RCAs and what was missing in each of those versions. This section touches briefly upon blameless RCA, but the real focus is the actionability of an RCA.
In this section, I will showcase some of the in-house frameworks and technologies (easy to replicate) that we built to turn the prevention/alert section of RCAs into lines of code rather than a blurb of text. This section aims to advocate the need to build or adopt toolchains that promise early detection and not just faster resolution.
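To make the idea of "prevention as lines of code" concrete, here is a minimal sketch and not the in-house framework itself. It assumes each RCA action item can be expressed as a predicate over a snapshot of system state; `PreventionCheck`, `run_checks`, the RCA identifiers, the state keys, and the thresholds are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PreventionCheck:
    rca_id: str                      # which RCA this check came from (hypothetical IDs)
    description: str                 # the prevention item, as it would read in the RCA
    check: Callable[[Dict], bool]    # returns True when the state is healthy


def run_checks(checks: List[PreventionCheck], state: Dict) -> List[PreventionCheck]:
    """Return the checks whose predicate fails for the given state."""
    return [c for c in checks if not c.check(state)]


# Assumed examples loosely modelled on the two failures described above.
checks = [
    PreventionCheck("RCA-1", "Config must be replicated to at least two nodes",
                    lambda s: s["config_replicas"] >= 2),
    PreventionCheck("RCA-2", "Clock skew must stay under 1 second",
                    lambda s: s["clock_skew_seconds"] < 1.0),
]

# Illustrative state snapshot: only one replica of the configuration exists.
state = {"config_replicas": 1, "clock_skew_seconds": 0.2}
failed = run_checks(checks, state)
for c in failed:
    print(f"ALERT [{c.rca_id}]: {c.description}")
```

Run on a schedule against real state, a list like this turns every RCA's prevention section into an alert that fires before the incident repeats, which is the early-detection goal described above.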