Incidents are expensive to the business, especially if customers leave because they perceive us as unreliable. But failures will happen; it’s not a question of if, but of when. So how can we reduce the impact on our users? In this talk, I will review the production incident cycle, the period in which we are not reliable and our users are not happy, which includes the time to detect, the time to repair, and the time between failures. I’ll share a few methods for tackling each of those parts in order to minimize incident impact, from both the technical and the people side, expanding on incident response and postmortems. We also want to know what matters most to us, and to be data-driven in those decisions.