Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello.
Good morning or good afternoon everyone.
My name is LaMi.
I'm a senior DevOps engineer and cloud lead. My experience spans both traditional middleware administration and modern cloud native technologies, making me a versatile technologist who consistently delivers scalable cloud infrastructure solutions.
Let me begin with a scenario. Imagine a hurricane making landfall: 911 call volumes surge, dispatch systems are under stress, and every second of downtime could cost lives.
When we think about emergency response, whether it's dispatching
ambulances, coordinating firefighters, or managing disaster relief,
downtime is simply not an option.
Today, in this talk, I'll share how platform engineering, a practice often associated with enterprise IT, can fundamentally transform emergency response systems into cloud native platforms that save lives.
Here is the agenda. First, we'll begin with the foundations of platform engineering and why they matter for emergency systems. Then we'll dive into self-service infrastructure and developer experience. Third, operational excellence patterns, from orchestration to observability. And finally, the organizational strategies that make the platform sustainable and adoptable.
The challenge.
Unlike traditional enterprise applications, emergency systems can't
afford even seconds of downtime.
They must scale instantly during a crisis, integrate with legacy and government systems, and still comply with strict regulations.
Think of what happens when a hurricane hits: call volumes spike, dispatch systems must respond in milliseconds, and any failure could cost lives. Traditional deployment models, with long cycles and manual processes, simply can't keep up. We need a new paradigm.
The paradigm shift: before platform engineering, deployments took two weeks. Errors were frequent and redundancy was limited. Operations teams were isolated, infrastructure was inconsistent across geographies, and specialized knowledge was required.
After platform engineering, we achieved three-minute self-service provisioning, golden paths with built-in best practices, and a 99% reduction in configuration errors. There are consistent patterns across 200-plus microservices, multi-region deployment by default, and complexity is hidden from application teams.
This shift creates a multiplier effect. Emergency teams can focus on their mission of saving lives instead of firefighting infrastructure.
Coming to the developer portal.
The portal is the heart of our self-service platform. It acts as the primary interface for emergency response teams, a place where they request infrastructure, deploy services, and monitor their systems. Instead of waiting weeks for manual approvals or writing low-level scripts, teams now have a simple gateway that abstracts complexity and enforces consistency.
Here we'll discuss golden path templates, the self-service catalog, and service scorecards. We'll dive deep into all three, starting with golden path templates.
Golden paths are pre-approved templates that developers follow to provision applications and services. They're embedded with security, observability, and compliance best practices by default. Why does this matter for emergency response? During a disaster, developers don't have time to figure out how to configure networking, encryption, and logging. Golden paths ensure all services meet baseline requirements automatically.
Let's take an example. A new emergency microservice can be deployed in under three minutes using a golden path template. It comes pre-wired with centralized logging, TLS security, and monitoring dashboards. No manual work is needed.
One such example is Backstage, Spotify's open source developer portal, which is often used for golden path implementations. Golden paths might also include Helm chart templates, Terraform modules, or Kubernetes YAML blueprints with prebuilt guardrails.
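To make the golden path idea concrete, here is a minimal sketch, assuming hypothetical defaults and service names rather than the platform described in the talk. Developers supply only a name, owner, and image; everything else (TLS, centralized logging, dashboards, replicas) is stamped in automatically. A real implementation would live in Backstage software templates, Helm charts, or Terraform modules.

```python
import json

# Hypothetical sketch of a golden path template: every service a team
# requests gets the same security, logging, and observability defaults
# baked in, so nothing has to be configured by hand during an incident.

GUARDRAIL_DEFAULTS = {
    "tls": {"enabled": True, "minVersion": "TLS1.2"},       # encryption on by default
    "logging": {"sink": "centralized", "format": "json"},   # structured, centralized logs
    "dashboards": ["latency", "error-rate", "saturation"],  # monitoring pre-wired
    "replicas": 3,                                          # baseline redundancy
}

def render_golden_path_service(name: str, team: str, image: str) -> dict:
    """Return a deployment descriptor with guardrails the developer cannot forget."""
    return {
        "service": name,
        "owner": team,
        "image": image,
        # Only name, owner, and image come from the developer; the rest
        # comes from the pre-approved defaults above.
        **GUARDRAIL_DEFAULTS,
    }

if __name__ == "__main__":
    manifest = render_golden_path_service(
        name="dispatch-intake",
        team="emergency-dispatch",
        image="registry.example/dispatch:1.0",
    )
    print(json.dumps(manifest, indent=2))
```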
Coming to the service catalog: the service catalog is like a menu where teams can pick the infrastructure and services they need without opening tickets or relying on ops teams. It offers self-service provisioning of databases, message queues, caches, object storage, and other cloud native components. Each entry in the catalog comes with built-in compliance with standards like HIPAA, NIST, and local regulations. Coming to an emergency response scenario.
For example, suppose an incident occurs that requires scaling up real-time event queues for incoming calls. Instead of days of provisioning, an engineer requests a Kafka cluster from the service catalog, and it comes pre-configured for resilience.
Some example tools are HashiCorp Terraform Cloud, Crossplane, and AWS Service Catalog, which often power the backend.
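As a rough illustration of the catalog pattern, assuming invented catalog entries and defaults rather than the actual backend, a request picks a pre-approved item and inherits its resilience and compliance settings:

```python
# Illustrative sketch of a self-service catalog: each entry is pre-approved
# and carries its compliance posture, so provisioning never needs a ticket.

CATALOG = {
    "kafka-cluster": {
        "compliance": ["HIPAA", "NIST"],
        "defaults": {"brokers": 3, "replication_factor": 3, "multi_az": True},
    },
    "postgres": {
        "compliance": ["HIPAA"],
        "defaults": {"instances": 2, "backups": "continuous"},
    },
    "object-storage": {
        "compliance": ["NIST"],
        "defaults": {"versioning": True, "encryption": "aes-256"},
    },
}

def request_from_catalog(item: str, team: str, overrides: dict | None = None) -> dict:
    """Build a provisioning request from a pre-approved catalog entry."""
    if item not in CATALOG:
        raise ValueError(f"{item!r} is not in the service catalog")
    entry = CATALOG[item]
    spec = dict(entry["defaults"])   # resilience defaults come from the catalog
    spec.update(overrides or {})     # teams can tune size, not compliance
    return {"item": item, "team": team, "compliance": entry["compliance"], "spec": spec}

if __name__ == "__main__":
    # An engineer scales up event handling during an incident in one call,
    # instead of waiting days for a manually built Kafka cluster.
    print(request_from_catalog("kafka-cluster", "dispatch", {"brokers": 5}))
```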
Coming to service scorecards: scorecards provide real-time metrics on the health, reliability, and compliance of each service. Think of them as a report card for every microservice. What do they measure? Reliability: uptime, error rates, and mean time to recovery. They also cover security and compliance, and they measure performance: latency, throughput, and resource utilization. Some example tools are Prometheus and Grafana for metrics visualization and OpenTelemetry for traces.
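Here is a toy sketch of how a scorecard might be computed; the thresholds and grades are invented, and in practice the inputs would come from Prometheus metrics and OpenTelemetry traces:

```python
from dataclasses import dataclass

# Toy scorecard: turns raw reliability and performance numbers into a
# per-service grade. Thresholds and grading are illustrative only.

@dataclass
class ServiceMetrics:
    uptime_pct: float        # e.g. 99.95
    error_rate_pct: float    # share of failed requests
    mttr_minutes: float      # mean time to recovery
    p99_latency_ms: float

def score(metrics: ServiceMetrics) -> dict:
    checks = {
        "uptime": metrics.uptime_pct >= 99.9,
        "errors": metrics.error_rate_pct <= 0.1,
        "mttr": metrics.mttr_minutes <= 30,
        "latency": metrics.p99_latency_ms <= 200,
    }
    passed = sum(checks.values())
    grade = {4: "A", 3: "B", 2: "C"}.get(passed, "D")
    return {"grade": grade, "checks": checks}

if __name__ == "__main__":
    dispatch = ServiceMetrics(uptime_pct=99.97, error_rate_pct=0.05,
                              mttr_minutes=12, p99_latency_ms=140)
    print(score(dispatch))   # report card for the dispatch microservice
```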
The conclusion is that the developer portal is not just a dashboard, it's a productivity engine. Golden paths provide safe, faster deployments, the service catalog gives instant access to critical infrastructure, and service scorecards give real-time visibility. The result: what took two weeks now takes three minutes.
Coming to IaC and GitOps, which are core to this infrastructure: everything is defined as code and managed through GitOps. We have full version control, drift detection with 100 percent coverage, and 73% of deviations auto-remediated. Immutable infrastructure patterns mean no ad hoc changes.
Let's dive deep into infrastructure as code and GitOps, the backbone of reliable cloud native platforms. They transform how we build, deploy, and manage emergency response systems, where consistency and trust are non-negotiable. Let's break down these practices and why they are critical.
What is infrastructure as code? IaC means we define infrastructure in code, not through manual clicks or ad hoc scripts. Everything from Kubernetes clusters and load balancers to IAM rules is written as declarative configuration files stored in version control. This enables consistency, automation, and version control: every infrastructure change is tracked like code, with history, diffs, and the ability to roll back.
Coming to consistency: the same configuration is applied across environments like dev and production. For automation, deployments happen via CI/CD pipelines without human error. Some example tools are cloud-agnostic IaC tools like Terraform, which provision resources on AWS, Azure, and Google Cloud Platform. Another example is Ansible, often used for configuration management and small-scale provisioning.
Take a Terraform module as an example. A Terraform module provisions five Kubernetes clusters across different regions with identical settings, ensuring compliance and resilience. When a new node pool is needed to handle disaster load, it's added by updating code in Git, not manually.
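The module itself is Terraform, which the talk doesn't show; as a stand-in, here is a Python sketch of the same idea, with placeholder region names and settings: one canonical definition stamped out identically per region, so adding a surge node pool is a reviewed code change.

```python
# Stand-in for the Terraform module described above: one canonical cluster
# definition is applied identically across regions, so compliance and
# resilience settings can never drift between them. Regions are placeholders.

REGIONS = ["us-east", "us-west", "eu-central", "ap-south", "sa-east"]

BASE_CLUSTER = {
    "kubernetes_version": "1.29",
    "node_pools": [{"name": "default", "size": 6, "machine_type": "standard-8"}],
    "encryption_at_rest": True,
    "audit_logging": True,
}

def render_clusters(extra_node_pool: dict | None = None) -> dict:
    """Return the desired state for every regional cluster.

    Adding a node pool for disaster load is a code change here (reviewed in
    Git), never an ad hoc console edit.
    """
    clusters = {}
    for region in REGIONS:
        spec = {**BASE_CLUSTER, "region": region}
        if extra_node_pool:
            spec["node_pools"] = BASE_CLUSTER["node_pools"] + [extra_node_pool]
        clusters[f"emergency-{region}"] = spec
    return clusters

if __name__ == "__main__":
    surge_pool = {"name": "surge", "size": 12, "machine_type": "standard-16"}
    for name, spec in render_clusters(surge_pool).items():
        print(name, spec["region"], [pool["name"] for pool in spec["node_pools"]])
```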
What is GitOps? GitOps extends infrastructure as code by using Git as the single source of truth for both infrastructure and application deployments. Any change to infrastructure or application manifests must be committed to Git, and a GitOps controller like Argo CD or Flux continuously reconciles live cluster state with the desired state in Git. This provides auditability, automated drift detection, and safer deployments. What is auditability? Every change has an audited commit history and approval workflow. Coming to automated drift detection: if someone changes a resource directly in the cluster, GitOps reverts it back. And deployments are also safer, because changes are tested, reviewed, and rolled out consistently. Some example tools are Argo CD and Flux CD.
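To show drift detection and auto-remediation in miniature, here is a sketch of the reconcile loop such a controller performs; the desired and live state are plain dictionaries standing in for a Git checkout and the Kubernetes API:

```python
# Miniature version of the reconcile loop a GitOps controller runs: compare
# the desired state committed to Git with what is actually running, and
# revert any out-of-band change. State is modeled as plain dicts here.

def desired_state_from_git() -> dict:
    """Stands in for a checkout of the config repository."""
    return {"dispatch-api": {"replicas": 6, "image": "dispatch:1.4.2"}}

def live_state_from_cluster() -> dict:
    """Stands in for a read from the Kubernetes API."""
    # Someone scaled the deployment down by hand during the night shift.
    return {"dispatch-api": {"replicas": 2, "image": "dispatch:1.4.2"}}

def reconcile_once() -> list[str]:
    desired, live = desired_state_from_git(), live_state_from_cluster()
    actions = []
    for name, want in desired.items():
        have = live.get(name)
        if have != want:
            # Drift detected: Git is the source of truth, so the live
            # cluster is pushed back to the committed state.
            actions.append(f"revert {name}: {have} -> {want}")
    return actions

if __name__ == "__main__":
    for action in reconcile_once():
        print("drift detected,", action)
```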
Coming to the key benefits for emergency response systems: reliability, speed, resilience, and compliance.
In short, IaC ensures that our infrastructure is reproducible and predictable, while GitOps ensures that our operations are automated and auditable. Together, they make sure emergency systems stay online, secure, and ready to handle demand spikes when every second truly matters.
Coming to multi-cluster orchestration. Imagine a flood knocking out an entire data center. The system must continue running without missing a beat. Resilience is critical. Multi-cluster orchestration ensures that if one region goes down due to a flood, power outage, or cyber attack, services continue seamlessly. We distribute workloads across five regions, maintain active clusters, and route traffic to the nearest healthy endpoint.
Coming to the multi-cluster orchestration architecture, we have failover clusters. We operate multiple Kubernetes clusters per region. This design ensures that if one cluster goes down due to a hardware failure, a software bug, or even a natural disaster, another cluster seamlessly takes over. Automated failover mechanisms synchronize state and workloads across clusters. This means data is not lost and operations can resume with minimal disruption. Through regional isolation, we prevent cascading failures: for example, an issue in one region does not spill over and impact others, which is crucial in emergencies.
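Here is a simplified sketch of the failover decision described here, with placeholder endpoints and a faked health check; in the real platform this is handled by the orchestration and load balancing layers:

```python
# Simplified illustration of cluster failover: if the primary cluster in a
# region stops answering health checks, traffic is pointed at the standby.
# The health check and endpoints are placeholders.

CLUSTERS = {
    "us-east-primary": {"endpoint": "https://a.example", "healthy": False},  # hardware failure
    "us-east-standby": {"endpoint": "https://b.example", "healthy": True},
}

def is_healthy(name: str) -> bool:
    # A real check would hit a /healthz endpoint over the network with a timeout.
    return CLUSTERS[name]["healthy"]

def active_endpoint(primary: str, standby: str) -> str:
    """Return the endpoint traffic should use right now."""
    if is_healthy(primary):
        return CLUSTERS[primary]["endpoint"]
    # Primary is down: fail over without waiting for a human.
    return CLUSTERS[standby]["endpoint"]

if __name__ == "__main__":
    print("routing dispatch traffic to",
          active_endpoint("us-east-primary", "us-east-standby"))
```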
Multiple clusters per region: within each region, we maintain more than one cluster. This supports active-active configurations where both clusters are live, share load, and improve availability and fault tolerance. During peak load, for example a sudden 800% spike in emergency calls during a disaster, resources are automatically distributed across clusters. This balances performance and ensures the fastest possible response time for first responders.
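As a toy model of that load sharing, assuming invented capacity numbers, extra traffic can be split across live clusters in proportion to their spare headroom:

```python
# Toy model of active-active load sharing within a region: an incoming call
# spike is split across live clusters in proportion to their spare capacity.
# All numbers are invented for illustration.

clusters = [
    {"name": "region-a-1", "capacity_rps": 4000, "current_rps": 1000},
    {"name": "region-a-2", "capacity_rps": 4000, "current_rps": 2500},
]

def distribute(extra_rps: int, clusters: list[dict]) -> dict[str, int]:
    """Split extra load proportionally to each cluster's headroom."""
    headroom = {c["name"]: c["capacity_rps"] - c["current_rps"] for c in clusters}
    total = sum(headroom.values())
    return {name: round(extra_rps * room / total) for name, room in headroom.items()}

if __name__ == "__main__":
    # A sudden call spike translated into extra requests per second.
    print(distribute(extra_rps=3200, clusters=clusters))
```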
Coming to regional load balancers with redundancy: each region is fronted by regional load balancers, which distribute incoming traffic among the clusters. These load balancers themselves are redundant; if one fails, another instantly takes over, ensuring there's no single point of failure. They also support intelligent routing of requests within the region.
Emergency services with failover: each individual emergency service, such as dispatching, geolocation, or medical triage systems, is designed and built with failover. For example, if the primary instance of a dispatch service in cluster A becomes unavailable, the system automatically routes to a secondary instance in cluster B without manual intervention. Coming to global traffic management and failover.
This is one of the key components of the architecture. At the global level, we use global traffic managers such as DNS-based routing or cloud native global load balancing. These systems direct user requests to the nearest healthy region, ensuring edge-optimized routing with sub-50-millisecond latency. If an entire region becomes unavailable, for instance due to a natural disaster, global routing automatically fails over to another region.
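Here is a minimal sketch of the routing decision a global traffic manager makes, with hard-coded latencies standing in for real edge measurements: pick the lowest-latency healthy region and skip any region that has gone dark.

```python
# Minimal sketch of global traffic management: send each request to the
# lowest-latency healthy region, and never consider a region that is down.
# Latency figures are hard-coded stand-ins for real edge measurements.

REGIONS = {
    "us-east":    {"latency_ms": 18, "healthy": False},   # taken out by the storm
    "us-west":    {"latency_ms": 42, "healthy": True},
    "eu-central": {"latency_ms": 95, "healthy": True},
}

def pick_region(regions: dict) -> str:
    healthy = {name: r for name, r in regions.items() if r["healthy"]}
    if not healthy:
        raise RuntimeError("no healthy region available")
    # Nearest healthy region wins; failover falls out of the same rule.
    return min(healthy, key=lambda name: healthy[name]["latency_ms"])

if __name__ == "__main__":
    print("routing caller to", pick_region(REGIONS))   # -> us-west
```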
To summarize, this architecture combines failover at every level, service, cluster, regional, and global, with intelligent traffic management. The result is a platform that is resilient, compliant, and ready to weather unexpected failures. This architecture ensures sub-50-millisecond dispatch response times globally.
Coming to service mesh integration: a service mesh adds another layer of resilience. It automatically routes traffic to healthy services, uses circuit breaking to stop cascading failures, and enforces mutual TLS for zero-trust security. Even partial failures do not affect system availability, which is a must for life-critical operations.
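The mesh enforces this at the proxy layer; as a compact illustration of the circuit-breaking idea only, here is a toy breaker that fails fast after repeated errors instead of letting failures cascade:

```python
import time

# Toy circuit breaker: after too many consecutive failures, stop calling the
# unhealthy service for a cool-down period so failures cannot cascade.
# A service mesh applies the same idea in its sidecar proxies.

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open: failing fast instead of cascading")
            self.opened_at, self.failures = None, 0   # half-open: try again
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()     # trip the breaker
            raise
        self.failures = 0
        return result

if __name__ == "__main__":
    breaker = CircuitBreaker()

    def flaky_geolocation_lookup():
        raise TimeoutError("upstream not responding")

    for _ in range(5):
        try:
            breaker.call(flaky_geolocation_lookup)
        except Exception as exc:
            print(type(exc).__name__, "-", exc)
```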
Coming to observability: of course, you can't operate at this scale without visibility. You can't fix what you can't see. Our observability platform ingests 2.3 million metrics per minute and achieves 90% alert accuracy through machine learning and anomaly detection. This enables detection in less than 30 seconds.
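As a sketch of the anomaly-detection idea, not the production ML pipeline, here is a simple rolling-baseline check on a single metric stream with invented numbers:

```python
from statistics import mean, pstdev

# Sketch of anomaly detection on one metric stream: flag samples that sit far
# outside the trailing baseline. Production systems do this with ML models
# across millions of series; the values below are invented.

def detect_anomalies(samples: list[float], window: int = 10,
                     z_threshold: float = 3.0) -> list[int]:
    """Return indices of samples that deviate strongly from the trailing window."""
    anomalies = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu, sigma = mean(baseline), pstdev(baseline)
        if sigma == 0:
            continue
        if abs(samples[i] - mu) / sigma > z_threshold:
            anomalies.append(i)   # candidate alert, raised within seconds
    return anomalies

if __name__ == "__main__":
    call_volume = [100, 102, 98, 101, 99, 103, 100, 97, 102, 101,   # calm baseline
                   100, 99, 840, 990, 1020]                         # hurricane makes landfall
    print("anomalous samples at indices:", detect_anomalies(call_volume))
```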
Coming to policy as code: security and governance are non-negotiable, but governance should not slow teams down. By embedding policy as code, security and compliance are enforced automatically. We enforce runtime policies automatically, continuously validate against frameworks like HIPAA, and secure the supply chain with provenance checks. Take an example: a deployment that violates a security policy is blocked without requiring manual intervention. This approach gives governance without slowing teams down; security and compliance are built in by default.
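Here is a minimal illustration of the policy-as-code pattern, with example rules that are not the actual policy set: a deployment spec is evaluated against machine-readable rules and rejected automatically on violation. In practice this runs in an admission controller such as OPA Gatekeeper or Kyverno.

```python
# Minimal illustration of policy as code: a deployment request is checked
# against machine-readable rules before it reaches the cluster, and blocked
# automatically on violation. The rules below are examples only.

POLICIES = [
    ("tls-required",       lambda spec: spec.get("tls_enabled") is True),
    ("no-privileged-pods", lambda spec: not spec.get("privileged", False)),
    ("approved-registry",  lambda spec: spec.get("image", "").startswith("registry.internal/")),
]

def admit(spec: dict) -> tuple[bool, list[str]]:
    """Return (allowed, list of violated policy names)."""
    violations = [name for name, check in POLICIES if not check(spec)]
    return (len(violations) == 0, violations)

if __name__ == "__main__":
    deployment = {"image": "docker.io/someone/dispatch:latest",
                  "tls_enabled": False, "privileged": True}
    allowed, violations = admit(deployment)
    if not allowed:
        # Blocked automatically, no manual review needed.
        print("deployment rejected:", ", ".join(violations))
```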
Coming to the platform team organization: this is enabled by a 12-person platform team supporting more than 80 developers. The platform team gathers feedback and engineers alongside application teams. Platform engineering is not about doing things for teams; it's about enabling others to move safely.
Measuring platform success: how do we measure success? Developer experience has improved, with 67% faster time to production and 70% fewer support tickets. Operational metrics show improved availability and a 78% decrease in mean time to recovery. There is a 30% reduction in emergency response time, an increase in feature delivery velocity, and a 65% decrease. Ultimately, these are not just numbers. They represent faster responses, better outcomes, and lives saved.
To close, the key takeaways: self-service infrastructure unlocks agility, platform abstractions reduce cognitive load, and golden paths enforce best practices. The journey does not start with boiling the ocean. Begin with one golden path, measure rigorously, and expand.
When platform engineering meets emergency response, technology
directly translates to human impact.
That's why this work matters.
Thank you.