Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone.
Thank you for joining me today.
Today I'm going to talk about an interesting topic: from dashboards to defenses, building autonomous resilience at scale.
I want to start with something we have all experienced.
Imagine it's two in the morning. Your phone buzzes.
You have just been paged.
You pull yourself out of bed, look at the dashboards, try to make sense of graphs, and scramble to contain the problem.
And in that moment, how many of you have thought: why can't the system just fix this itself?
That's exactly the journey I want to share with you today.
How we can move beyond watching problems and start building systems that defend themselves in production environments, especially at scale.
The 2:00 AM pager is not sustainable.
Dashboards show problems, but they don't act.
And while we have all seen heroic firefighting from engineers,
this model simply doesn't scale.
When you are serving billions of
requests, the mandate is clear.
We must engineer reliability into the system itself, not leave it to
the humans to keep everything alive.
Let's start with observability.
The old way was to collect everything, build endless dashboards, and hope someone noticed the right spike at the right time.
But let's be honest: humans can't parse millions of data points in real time.
The new way is SLO-driven signals tied directly to user experience.
We don't just measure CPU or memory; we measure whether users really succeeded.
That means tracking P95 request latency, overall availability, and checkout success rate.
We then set SLOs, and we use error budgets to decide when to release quickly and when to slow down.
This turns reliability into something measurable and actionable.
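To make that concrete, here is a minimal sketch of an error-budget release gate in Python; the SLO target, window, and the 25% remaining-budget threshold are illustrative numbers, not our actual configuration.

```python
# A minimal sketch of an error-budget release gate (illustrative numbers).

def error_budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent over the window."""
    allowed_failures = (1 - slo_target) * total   # budget expressed in requests
    actual_failures = total - good
    if allowed_failures == 0:
        return 0.0
    return max(0.0, 1 - actual_failures / allowed_failures)

def release_gate(slo_target: float, good: int, total: int, threshold: float = 0.25) -> str:
    """Ship quickly while the budget is healthy; slow down when it runs low."""
    remaining = error_budget_remaining(slo_target, good, total)
    return "ship" if remaining > threshold else "freeze-and-stabilize"

# Example: 99.9% availability SLO over a 30-day window of requests.
print(release_gate(0.999, good=99_950_000, total=100_000_000))  # -> ship
```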
To make observability truly useful, we use proven patterns.
We use SLOs and error budgets as the release gatekeepers.
We follow golden signals and the RED and USE frameworks to ensure consistency across every service.
Each service has a single golden dashboard, not ten, so engineers
know exactly where to look.
We take an OpenTelemetry-first approach, so metrics, traces, and logs all
flow through the standard pipeline.
And we enrich metrics with trace IDs, so an engineer can jump directly from a latency spike to the trace that explains it.
This makes observability more than just data.
It makes it the foundation of automation.
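As a rough illustration of that OpenTelemetry-first approach, here is a sketch in Python: one tracer and one meter per service, with latency recorded inside an active span so the measurement can be tied back to the trace that explains it. The service and span names are made up, and exporter and pipeline setup are assumed to happen elsewhere.

```python
# A minimal sketch of "OpenTelemetry-first" instrumentation; names are illustrative.
import time
from opentelemetry import trace, metrics

tracer = trace.get_tracer("checkout-service")
meter = metrics.get_meter("checkout-service")

request_latency = meter.create_histogram(
    "http.server.duration", unit="ms", description="Server-side request latency"
)

def handle_checkout(cart_id: str) -> None:
    with tracer.start_as_current_span("POST /checkout") as span:
        start = time.monotonic()
        # ... business logic would run here ...
        elapsed_ms = (time.monotonic() - start) * 1000
        # Recording inside an active span lets the SDK correlate the
        # measurement with the trace (e.g. via exemplars, where supported).
        request_latency.record(elapsed_ms, attributes={"route": "/checkout"})
        span.set_attribute("cart.id", cart_id)
```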
Metrics, traces, and logs are only valuable if they scale.
For tracing, we propagate W3C trace context across every service, and we use tail-based sampling, so we always keep the interesting data: the errors and the outliers.
For logs, we use structured JSON, we redact sensitive data at the collector, and we route logs by severity, so we keep what is necessary and don't log unnecessary noise.
When these three pillars work together, you get correlated
telemetry that you can trust.
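Here is a small sketch of the logging side of that pipeline, using the Python standard library plus the OpenTelemetry API: structured JSON, a trace ID for correlation, and redaction of sensitive fields. In our setup the redaction actually happens at the collector; the field names here are illustrative.

```python
# A minimal sketch of structured JSON logs with trace correlation and redaction.
import json
import logging
from opentelemetry import trace

SENSITIVE_KEYS = {"password", "card_number", "ssn"}  # illustrative list

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        ctx = trace.get_current_span().get_span_context()
        payload = {
            "severity": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
            "trace_id": format(ctx.trace_id, "032x") if ctx.is_valid else None,
        }
        extra = getattr(record, "fields", {})
        payload.update(
            {k: ("[REDACTED]" if k in SENSITIVE_KEYS else v) for k, v in extra.items()}
        )
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("payment attempt", extra={"fields": {"card_number": "redacted-at-source", "amount": 42.5}})
```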
Observability also has to extend beyond the backend.
We need change awareness: every deployment and every feature flag toggle shows up as annotations on the dashboards.
That way, when latency jumps or errors spike, we can immediately see if it lines up with a new release.
We need client visibility: real user monitoring tells us about load times, layout shifts, browser crashes, and broken experiences.
We also need edge visibility.
CDN metrics show us cache hit ratio, edge error rates, et cetera.
And finally, we track capacity and cost, and catch saturation before it becomes an outage.
And we tag telemetry with team and service ownership to make the cost visible.
This way, we aren't just monitoring the backend; we are monitoring the end-to-end experience and the business impact.
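One hedged sketch of how that tagging can look with OpenTelemetry resource attributes; everything beyond the standard `service.*` and `deployment.environment` keys is a custom, illustrative choice.

```python
# A minimal sketch of tagging all telemetry with ownership and release metadata.
from opentelemetry.sdk.resources import Resource

resource = Resource.create({
    "service.name": "checkout-service",
    "service.version": "2024.06.1",      # surfaces as a deploy annotation
    "deployment.environment": "production",
    "team": "payments",                  # custom attribute for ownership (hypothetical)
    "cost.center": "cc-1234",            # custom attribute for cost attribution (hypothetical)
})
# This resource would be passed to the TracerProvider / MeterProvider /
# LoggerProvider so every metric, trace, and log carries the same tags.
```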
Dashboards alone are still reactive unless you add prediction.
Static thresholds always fail because what is normal is always changing.
So we built capacity anomaly detectors that learn usage patterns and alert before we hit hard limits.
We then added AI models that forecast latency spikes and correlate logs, metrics, and traces to find the root cause faster.
We also have models that detect error spikes based on capacity and incoming request rate, and address the issues early.
One of the most powerful innovations was using browser JavaScript errors as an anomaly signal.
If a rollout suddenly increases the JavaScript error rate, the system automatically rolls itself back before users even notice.
That is observability turning into practical defense.
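A minimal sketch of that last idea, treating the browser JavaScript error rate as a rollout guard: compare the post-release rate to a rolling baseline and roll back on a large deviation. The simple z-score check and the rollback hook stand in for the real models and deploy integration.

```python
# A minimal sketch of a JS-error-rate rollout guard (statistics are illustrative).
from statistics import mean, pstdev

def is_anomalous(baseline_rates: list[float], current_rate: float, z_threshold: float = 4.0) -> bool:
    """Flag the current JS error rate if it sits far above the rolling baseline."""
    mu, sigma = mean(baseline_rates), pstdev(baseline_rates)
    if sigma == 0:
        return current_rate > mu * 2  # degenerate flat baseline
    return (current_rate - mu) / sigma > z_threshold

def guard_rollout(baseline_rates: list[float], current_rate: float) -> str:
    if is_anomalous(baseline_rates, current_rate):
        return "rollback"  # in the real pipeline this would call the deploy system
    return "continue"

print(guard_rollout([0.4, 0.5, 0.45, 0.5, 0.42], current_rate=2.1))  # -> rollback
```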
Observability only matters if the alerts lead to an action.
The old way was: an alert fires, a human wakes up, the human looks at the dashboard and mitigates the issue.
That is very slow.
The new way is that alerts are used as control signals, so the system identifies the issue and auto-corrects.
If a dependency is slow, a circuit breaker trips immediately.
If the error budget burns too fast, a new deployment rolls back automatically.
We use multi-burn-rate alerts: some detect fast spikes, and others detect slow leaks.
And we only page humans when there is real user impact; everything else routes to chat or tickets, which are handled offline.
This keeps humans focused on the problems that really matter, not on everything, not on noise.
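A sketch of how that multi-burn-rate routing can work; the window sizes and the 14.4/6/1 burn-rate thresholds follow common SRE guidance and are illustrative, not our exact configuration.

```python
# A minimal sketch of multi-window, multi-burn-rate alert routing.

def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' we are burning."""
    return error_ratio / (1 - slo_target)

def route_alert(slo_target: float, fast_error_ratio: float, slow_error_ratio: float) -> str:
    fast = burn_rate(fast_error_ratio, slo_target)   # e.g. last 5 minutes
    slow = burn_rate(slow_error_ratio, slo_target)   # e.g. last 6 hours
    if fast >= 14.4 and slow >= 14.4:
        return "page"          # sharp spike with real user impact
    if slow >= 6:
        return "page"          # sustained fast burn
    if slow >= 1:
        return "ticket"        # slow leak, nobody needs to wake up
    return "ignore"

print(route_alert(0.999, fast_error_ratio=0.02, slow_error_ratio=0.016))  # -> page
```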
To build trust, we use a tiered strategy for automation.
Tier one covers safe and reversible actions like
throttles and circuit breakers.
These always run automatically.
Tier two covers progressive changes, like shifting regional traffic.
These can start automatically, but allow for human oversight.
Tier three covers complex or new incidents.
The system notifies a human.
But it also provides runbooks and context to accelerate the fix.
This approach ensures that automation handles the easy 80% while humans focus on the hard 20%.
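A minimal sketch of that tiering, with the action names and the execute/notify behavior as illustrative placeholders.

```python
# A minimal sketch of tiered automation dispatch (action names are illustrative).
from enum import Enum

class Tier(Enum):
    SAFE_REVERSIBLE = 1      # throttles, circuit breakers: always automatic
    PROGRESSIVE = 2          # e.g. shifting regional traffic: automatic with oversight
    HUMAN_LED = 3            # complex or novel incidents: humans decide

TIER_BY_ACTION = {
    "enable_throttle": Tier.SAFE_REVERSIBLE,
    "trip_circuit_breaker": Tier.SAFE_REVERSIBLE,
    "shift_regional_traffic": Tier.PROGRESSIVE,
}

def handle(action: str, incident: str) -> str:
    tier = TIER_BY_ACTION.get(action, Tier.HUMAN_LED)
    if tier is Tier.SAFE_REVERSIBLE:
        return f"auto-executing {action} for {incident}"
    if tier is Tier.PROGRESSIVE:
        return f"starting {action} for {incident}; notifying on-call for oversight"
    return f"paging on-call for {incident} with runbook and context attached"

print(handle("trip_circuit_breaker", "checkout latency spike"))
```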
Our automation also relies on proven resilience patterns.
There are multiple patterns described here.
The circuit breaker pattern is used to stop cascading failures.
Bulkheads isolate failures, so one service can't take down everything.
Retry with exponential backoff ensures we don't overwhelm services that are recovering.
Failover and load-balancing strategies automatically route traffic around unhealthy regions.
And CQRS and saga patterns help us scale while maintaining consistency across distributed transactions.
These patterns form our defense playbook.
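As one example from that playbook, here is a short sketch of retry with exponential backoff and full jitter; the attempt limits and delays are illustrative.

```python
# A minimal sketch of retry with exponential backoff and full jitter,
# so a recovering dependency is not hit by synchronized retries.
import random
import time

def call_with_backoff(call, max_attempts: int = 5, base_delay: float = 0.1, max_delay: float = 5.0):
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Full jitter: sleep a random amount up to the exponential cap.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```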
Delivery pipelines are not just about speed, they're about safety.
Manual deployment gates slow us down and add risk: humans misclick, humans approve without context. So we built fully automated pipelines.
Every release goes through a canary deployment, progressive rollouts, and automatic rollback when error spikes happen or SLOs are breached.
We shifted human review from pre-deployment approvals to post-deployment analysis, when real system behavior is available. The result: faster velocity and safer production. Pipelines have become resilience engines.
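A rough sketch of such a progressive rollout loop; the step sizes, bake time, and health check are illustrative, and the traffic-shifting, observation, and rollback calls are placeholders for real deploy tooling.

```python
# A minimal sketch of a canary-style progressive rollout with automatic rollback.
import time

ROLLOUT_STEPS = [1, 5, 25, 50, 100]     # percent of traffic, illustrative

def healthy(canary_error_rate: float, baseline_error_rate: float, slo_breached: bool) -> bool:
    # Roll forward only if the SLO holds and the canary is not clearly worse.
    return not slo_breached and canary_error_rate <= baseline_error_rate * 1.5

def progressive_rollout(set_traffic, observe, rollback) -> str:
    for percent in ROLLOUT_STEPS:
        set_traffic(percent)             # shift this share of traffic to the canary
        time.sleep(300)                  # bake time per step
        canary_err, baseline_err, slo_breached = observe()
        if not healthy(canary_err, baseline_err, slo_breached):
            rollback()
            return f"rolled back at {percent}%"
    return "fully rolled out"
```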
Another critical safeguard is dynamic rate limiting.
Many of our incidents were not caused by infrastructure failure,
but by workload anomalies.
A buggy client spams requests, or an internal batch job floods our APIs. Static limits always fail because they don't adapt.
So we built adaptive throttling that learns normal traffic per client and per endpoint, detects sudden deviations, and contains runaway workloads automatically.
This mechanism alone has prevented nearly 30% of our major incidents.
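A minimal sketch of the adaptive throttling idea: learn a per-client, per-endpoint baseline with an exponentially weighted moving average and reject traffic that suddenly exceeds a multiple of it. The smoothing factor, burst multiplier, and floor are illustrative.

```python
# A minimal sketch of adaptive throttling with a learned per-client baseline.
from collections import defaultdict

class AdaptiveThrottle:
    def __init__(self, alpha: float = 0.2, burst_multiplier: float = 3.0, floor: float = 50.0):
        self.alpha = alpha                     # EWMA smoothing factor
        self.burst_multiplier = burst_multiplier
        self.floor = floor                     # minimum allowed requests/min
        self.baseline = defaultdict(lambda: None)

    def allow(self, client: str, endpoint: str, requests_this_minute: float) -> bool:
        key = (client, endpoint)
        baseline = self.baseline[key]
        if baseline is None:
            self.baseline[key] = requests_this_minute   # first observation seeds the baseline
            return True
        limit = max(self.floor, baseline * self.burst_multiplier)
        allowed = requests_this_minute <= limit
        if allowed:
            # Only learn from accepted traffic, so a runaway client
            # cannot drag its own baseline upward.
            self.baseline[key] = (1 - self.alpha) * baseline + self.alpha * requests_this_minute
        return allowed

throttle = AdaptiveThrottle()
throttle.allow("client-a", "/search", 120)           # learns the baseline
print(throttle.allow("client-a", "/search", 5000))   # -> False, runaway contained
```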
Every safeguard we have today was born from scar tissue.
A cascading outage gave us circuit breakers.
A bad config rollout gave us progressive deployments.
A runaway client gave us adaptive throttling.
So we made it cultural.
Every postmortem asks two things.
Could this have been auto detected?
Could it have been auto mitigated?
If the answer is yes, we build it.
That's how you turn pain into progress and reduce engineering toil.
The hardest part of all this is trust.
Engineers worry automation might make things worse, so we built it progressively.
In observe mode, automation only logs what it would have done.
In suggest mode, it recommends actions but waits for human approval.
Only in autonomous mode does it act independently, always within guardrails.
Every action is transparent, reversible, and explainable.
Over time, engineers saw that the automation was actually more reliable than a tired human at 2:00 AM.
Trust wasn't declared; it was earned.
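A small sketch of that progressive-trust gating; the mode names and the approval hook are illustrative, not our exact framework.

```python
# A minimal sketch of mode-gated automation: observe, suggest, autonomous.
from enum import Enum

class Mode(Enum):
    OBSERVE = "observe"          # log-only: record what would have been done
    SUGGEST = "suggest"          # propose, wait for human approval
    AUTONOMOUS = "autonomous"    # act independently, within guardrails

def run_remediation(mode: Mode, action, describe: str, request_approval) -> str:
    if mode is Mode.OBSERVE:
        return f"[dry-run] would have executed: {describe}"
    if mode is Mode.SUGGEST:
        if request_approval(describe):
            action()
            return f"executed after approval: {describe}"
        return f"suggestion declined: {describe}"
    action()                     # autonomous, still logged and reversible
    return f"auto-executed: {describe}"
```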
Resilience has to go beyond technical metrics.
It must tie back to business outcomes.
We run UI keep-alives to continuously check frontend flows.
We track JavaScript errors in the browser as a frontend anomaly signal.
Most importantly, we watch business KPIs like cart abandonment, revenue impact, et cetera.
If a release looks technically fine but causes abandonment to spike, the system rolls it back automatically.
Reliability isn't just keeping services green; it is protecting both users and the business.
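A minimal sketch of gating a release on a business KPI alongside technical health; the abandonment tolerance and the example numbers are illustrative.

```python
# A minimal sketch of a business-KPI release gate (thresholds are illustrative).

def release_verdict(error_rate_ok: bool, abandonment_rate: float,
                    abandonment_baseline: float, tolerance: float = 0.10) -> str:
    # The KPI is healthy only if abandonment stays within tolerance of its baseline.
    kpi_ok = abandonment_rate <= abandonment_baseline * (1 + tolerance)
    if error_rate_ok and kpi_ok:
        return "keep"
    return "rollback"

# Technically green, but abandonment jumped from 22% to 31% -> rollback.
print(release_verdict(error_rate_ok=True, abandonment_rate=0.31, abandonment_baseline=0.22))
```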
When you connect it all together, the blueprint is clear.
Metrics become actionable signals.
Alerts become closed-loop mitigations.
Deployments evolve into autonomous pipelines.
Limits become adaptive safeguards.
Firefighting turns into proactive automation.
And fear of automation becomes trust.
That's how you evolve from dashboards to defenses.
So let me leave you with this.
Reliability is not PagerDuty.
Reliability is not dashboards. Reliability is autonomous resilience.
Every incident is a chance to automate.
Every scar is a chance to build a defense.
Step by step, you can build systems that don't just monitor themselves, but systems that defend themselves.
Thank you.