Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello.
Good morning or good afternoon everyone.
My name is LaMi.
I'm a senior DevOps engineer and cloud lead. My experience spans both traditional middleware administration and modern cloud native technologies, making me a versatile technologist who consistently delivers scalable cloud infrastructure solutions.
Let me begin with a scenario. Imagine a hurricane making landfall: 911 call volumes surge, dispatch systems are under stress, and every second of downtime could cost lives.
When we think about emergency response, whether it's dispatching
ambulances, coordinating firefighters, or managing disaster relief,
downtime is simply not an option.
Today, in this talk, I'll share how platform engineering, a practice often associated with enterprise IT, can fundamentally transform emergency response systems into cloud native platforms that save lives.
Here is the agenda. First, we'll begin with the foundations of platform engineering and why they matter for emergency systems. Then we'll dive into self-service infrastructure and developer experience. Third, operational excellence patterns, from orchestration to observability. And finally, the organizational strategies that make the platform sustainable and adoptable.
The challenge.
Unlike traditional enterprise applications, emergency systems can't
afford even seconds of downtime.
They must scale instantly during a crisis, integrate with legacy and government systems, and still comply with strict regulations.
Think of what happens when a hurricane hits: call volumes spike, dispatch systems must respond in milliseconds, and any failure could cost lives. Traditional deployment models, with long cycles and manual processes, simply can't keep up. We need a new paradigm.
The paradigm shift: before platform engineering, deployments took two weeks. Errors were frequent and redundancy was limited. Operations teams were isolated, infrastructure was inconsistent across geographies, and specialized knowledge was required.
After platform engineering, we achieved three-minute self-service provisioning, golden paths with built-in best practices, and a 99% reduction in configuration errors. There are consistent patterns across 200-plus microservices, multi-region deployment by default, and complexity is hidden from application teams.
This shift creates a multiplier effect. Emergency teams can focus on their mission of saving lives instead of firefighting infrastructure.
Coming to the developer portal.
The portal is the heart of our self-service platform. It acts as the primary interface for emergency response teams, a place where they request infrastructure, deploy services, and monitor their systems. Instead of waiting weeks for manual approvals or writing low-level scripts, teams now have a simple gateway that abstracts complexity and enforces consistency.
Here we'll discuss golden path templates, the self-service catalog, and service scorecards. We'll dive deep into all three, starting with golden path templates.
Golden paths are pre-approved templates that developers follow to provision applications and services. They're embedded with security, observability, and compliance best practices by default. Why does this matter for emergency response? During a disaster, developers don't have time to figure out how to configure networking, encryption, and logging. Golden paths ensure all services meet baseline requirements automatically.
Let's take an example. A new emergency microservice can be deployed in under three minutes using a golden path template. It comes pre-wired with centralized logging, TLS security, and monitoring dashboards. No manual work is needed.
One such example is Backstage, Spotify's open source developer portal, which is often used for golden path implementations. Golden paths might also include Helm chart templates, Terraform modules, or Kubernetes YAML blueprints with prebuilt guardrails.
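To make the golden path idea concrete, here is a minimal sketch, assuming hypothetical defaults and service names rather than the platform described in the talk. Developers supply only a name, owner, and image; everything else (TLS, centralized logging, dashboards, replicas) is stamped in automatically. A real implementation would live in Backstage software templates, Helm charts, or Terraform modules.

```python
import json

# Hypothetical sketch of a golden path template: every service a team
# requests gets the same security, logging, and observability defaults
# baked in, so nothing has to be configured by hand during an incident.

GUARDRAIL_DEFAULTS = {
    "tls": {"enabled": True, "minVersion": "TLS1.2"},       # encryption on by default
    "logging": {"sink": "centralized", "format": "json"},   # structured, centralized logs
    "dashboards": ["latency", "error-rate", "saturation"],  # monitoring pre-wired
    "replicas": 3,                                          # baseline redundancy
}

def render_golden_path_service(name: str, team: str, image: str) -> dict:
    """Return a deployment descriptor with guardrails the developer cannot forget."""
    return {
        "service": name,
        "owner": team,
        "image": image,
        # Only name, owner, and image come from the developer; the rest
        # comes from the pre-approved defaults above.
        **GUARDRAIL_DEFAULTS,
    }

if __name__ == "__main__":
    manifest = render_golden_path_service(
        name="dispatch-intake",
        team="emergency-dispatch",
        image="registry.example/dispatch:1.0",
    )
    print(json.dumps(manifest, indent=2))
```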
Coming to the service catalog: the service catalog is like a menu where teams can pick the infrastructure and services they need without opening tickets or relying on ops teams. It offers self-service provisioning of databases, message queues, caches, object storage, and other cloud native components. Each entry in the catalog comes with built-in compliance with standards like HIPAA, NIST, and local regulations. Coming to an emergency response scenario.
For example, suppose an incident occurs that requires scaling up real-time event queues for incoming calls. Instead of days of provisioning, an engineer requests a Kafka cluster from the service catalog, and it comes pre-configured for resilience.
Some example tools are HashiCorp Terraform Cloud, Crossplane, and AWS Service Catalog, which often power the backend.
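As a rough illustration of the catalog pattern, assuming invented catalog entries and defaults rather than the actual backend, a request picks a pre-approved item and inherits its resilience and compliance settings:

```python
# Illustrative sketch of a self-service catalog: each entry is pre-approved
# and carries its compliance posture, so provisioning never needs a ticket.

CATALOG = {
    "kafka-cluster": {
        "compliance": ["HIPAA", "NIST"],
        "defaults": {"brokers": 3, "replication_factor": 3, "multi_az": True},
    },
    "postgres": {
        "compliance": ["HIPAA"],
        "defaults": {"instances": 2, "backups": "continuous"},
    },
    "object-storage": {
        "compliance": ["NIST"],
        "defaults": {"versioning": True, "encryption": "aes-256"},
    },
}

def request_from_catalog(item: str, team: str, overrides: dict | None = None) -> dict:
    """Build a provisioning request from a pre-approved catalog entry."""
    if item not in CATALOG:
        raise ValueError(f"{item!r} is not in the service catalog")
    entry = CATALOG[item]
    spec = dict(entry["defaults"])   # resilience defaults come from the catalog
    spec.update(overrides or {})     # teams can tune size, not compliance
    return {"item": item, "team": team, "compliance": entry["compliance"], "spec": spec}

if __name__ == "__main__":
    # An engineer scales up event handling during an incident in one call,
    # instead of waiting days for a manually built Kafka cluster.
    print(request_from_catalog("kafka-cluster", "dispatch", {"brokers": 5}))
```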
Coming to service scorecards: scorecards provide real-time metrics on the health, reliability, and compliance of each service. Think of them as a report card for every microservice. What do they measure? Reliability: uptime, error rates, and mean time to recovery. They also cover security and compliance, and they measure performance: latency, throughput, and resource utilization. Some example tools are Prometheus and Grafana for metrics visualization and OpenTelemetry for traces.
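Here is a toy sketch of how a scorecard might be computed; the thresholds and grades are invented, and in practice the inputs would come from Prometheus metrics and OpenTelemetry traces:

```python
from dataclasses import dataclass

# Toy scorecard: turns raw reliability and performance numbers into a
# per-service grade. Thresholds and grading are illustrative only.

@dataclass
class ServiceMetrics:
    uptime_pct: float        # e.g. 99.95
    error_rate_pct: float    # share of failed requests
    mttr_minutes: float      # mean time to recovery
    p99_latency_ms: float

def score(metrics: ServiceMetrics) -> dict:
    checks = {
        "uptime": metrics.uptime_pct >= 99.9,
        "errors": metrics.error_rate_pct <= 0.1,
        "mttr": metrics.mttr_minutes <= 30,
        "latency": metrics.p99_latency_ms <= 200,
    }
    passed = sum(checks.values())
    grade = {4: "A", 3: "B", 2: "C"}.get(passed, "D")
    return {"grade": grade, "checks": checks}

if __name__ == "__main__":
    dispatch = ServiceMetrics(uptime_pct=99.97, error_rate_pct=0.05,
                              mttr_minutes=12, p99_latency_ms=140)
    print(score(dispatch))   # report card for the dispatch microservice
```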
The conclusion is that the developer portal is not just a dashboard, it's a productivity engine. Golden paths provide safe, faster deployments, the service catalog gives instant access to critical infrastructure, and service scorecards give real-time visibility. The result: what took two weeks now takes three minutes.
Coming to IaC and GitOps, which are core to this infrastructure: everything is defined as code and managed through GitOps. We have full version control, drift detection with 100 percent coverage, and 73% of deviations auto-remediated. Immutable infrastructure patterns mean no ad hoc changes.
Let's dive deep into infrastructure as code and GitOps, the backbone of reliable cloud native platforms. They transform how we build, deploy, and manage emergency response systems, where consistency and trust are non-negotiable. Let's break down these practices and why they are critical.
What is infrastructure as code? IaC means we define infrastructure in code, not through manual clicks or ad hoc scripts. Everything from Kubernetes clusters and load balancers to IAM rules is written as declarative configuration files stored in version control. This enables consistency, automation, and version control: every infrastructure change is tracked like code, with history, diffs, and the ability to roll back.
Coming to consistency: the same configuration is applied across environments like dev and production. For automation, deployments happen via CI/CD pipelines without human error. Some example tools are cloud-agnostic IaC tools like Terraform, which provision resources on AWS, Azure, and Google Cloud Platform. Another example is Ansible, often used for configuration management and small-scale provisioning.
Take a Terraform module as an example. A Terraform module provisions five Kubernetes clusters across different regions with identical settings, ensuring compliance and resilience. When a new node pool is needed to handle disaster load, it's added by updating code in Git, not manually.
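The module itself is Terraform, which the talk doesn't show; as a stand-in, here is a Python sketch of the same idea, with placeholder region names and settings: one canonical definition stamped out identically per region, so adding a surge node pool is a reviewed code change.

```python
# Stand-in for the Terraform module described above: one canonical cluster
# definition is applied identically across regions, so compliance and
# resilience settings can never drift between them. Regions are placeholders.

REGIONS = ["us-east", "us-west", "eu-central", "ap-south", "sa-east"]

BASE_CLUSTER = {
    "kubernetes_version": "1.29",
    "node_pools": [{"name": "default", "size": 6, "machine_type": "standard-8"}],
    "encryption_at_rest": True,
    "audit_logging": True,
}

def render_clusters(extra_node_pool: dict | None = None) -> dict:
    """Return the desired state for every regional cluster.

    Adding a node pool for disaster load is a code change here (reviewed in
    Git), never an ad hoc console edit.
    """
    clusters = {}
    for region in REGIONS:
        spec = {**BASE_CLUSTER, "region": region}
        if extra_node_pool:
            spec["node_pools"] = BASE_CLUSTER["node_pools"] + [extra_node_pool]
        clusters[f"emergency-{region}"] = spec
    return clusters

if __name__ == "__main__":
    surge_pool = {"name": "surge", "size": 12, "machine_type": "standard-16"}
    for name, spec in render_clusters(surge_pool).items():
        print(name, spec["region"], [pool["name"] for pool in spec["node_pools"]])
```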
What is GitOps? GitOps extends infrastructure as code by using Git as the single source of truth for both infrastructure and application deployments. Any change to infrastructure or application manifests must be committed to Git, and a GitOps controller like Argo CD or Flux continuously reconciles live cluster state with the desired state in Git. This provides auditability, automated drift detection, and safer deployments. What is auditability? Every change has an audited commit history and approval workflow. Coming to automated drift detection: if someone changes a resource directly in the cluster, GitOps reverts it back. And deployments are also safer, because changes are tested, reviewed, and rolled out consistently. Some example tools are Argo CD and Flux CD.
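To show drift detection and auto-remediation in miniature, here is a sketch of the reconcile loop such a controller performs; the desired and live state are plain dictionaries standing in for a Git checkout and the Kubernetes API:

```python
# Miniature version of the reconcile loop a GitOps controller runs: compare
# the desired state committed to Git with what is actually running, and
# revert any out-of-band change. State is modeled as plain dicts here.

def desired_state_from_git() -> dict:
    """Stands in for a checkout of the config repository."""
    return {"dispatch-api": {"replicas": 6, "image": "dispatch:1.4.2"}}

def live_state_from_cluster() -> dict:
    """Stands in for a read from the Kubernetes API."""
    # Someone scaled the deployment down by hand during the night shift.
    return {"dispatch-api": {"replicas": 2, "image": "dispatch:1.4.2"}}

def reconcile_once() -> list[str]:
    desired, live = desired_state_from_git(), live_state_from_cluster()
    actions = []
    for name, want in desired.items():
        have = live.get(name)
        if have != want:
            # Drift detected: Git is the source of truth, so the live
            # cluster is pushed back to the committed state.
            actions.append(f"revert {name}: {have} -> {want}")
    return actions

if __name__ == "__main__":
    for action in reconcile_once():
        print("drift detected,", action)
```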
Coming to the key benefits for emergency response systems: reliability, speed, resilience, and compliance.
In short, IaC ensures that our infrastructure is reproducible and predictable, while GitOps ensures that our operations are automated and auditable. Together, they make sure emergency systems stay online, secure, and ready to handle demand spikes when every second truly matters.
Coming to multi-cluster orchestration. Imagine a flood knocking out an entire data center. The system must continue running without missing a beat. Resilience is critical. Multi-cluster orchestration ensures that if one region goes down due to a flood, power outage, or cyber attack, services continue seamlessly. We distribute workloads across five regions, maintain active clusters, and route traffic to the nearest healthy endpoint.
Coming to the multi-cluster orchestration architecture, we have failover clusters. We operate multiple Kubernetes clusters per region. This design ensures that if one cluster goes down due to a hardware failure, a software bug, or even a natural disaster, another cluster seamlessly takes over. Automated failover mechanisms synchronize state and workloads across clusters. This means data is not lost and operations can resume with minimal disruption. Through regional isolation, we prevent cascading failures: for example, an issue in one region does not spill over and impact others, which is crucial in emergencies.
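Here is a simplified sketch of the failover decision described here, with placeholder endpoints and a faked health check; in the real platform this is handled by the orchestration and load balancing layers:

```python
# Simplified illustration of cluster failover: if the primary cluster in a
# region stops answering health checks, traffic is pointed at the standby.
# The health check and endpoints are placeholders.

CLUSTERS = {
    "us-east-primary": {"endpoint": "https://a.example", "healthy": False},  # hardware failure
    "us-east-standby": {"endpoint": "https://b.example", "healthy": True},
}

def is_healthy(name: str) -> bool:
    # A real check would hit a /healthz endpoint over the network with a timeout.
    return CLUSTERS[name]["healthy"]

def active_endpoint(primary: str, standby: str) -> str:
    """Return the endpoint traffic should use right now."""
    if is_healthy(primary):
        return CLUSTERS[primary]["endpoint"]
    # Primary is down: fail over without waiting for a human.
    return CLUSTERS[standby]["endpoint"]

if __name__ == "__main__":
    print("routing dispatch traffic to",
          active_endpoint("us-east-primary", "us-east-standby"))
```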
Multiple clusters per region: within each region, we maintain more than one cluster. This supports active-active configurations where both clusters are live, share load, and improve availability and fault tolerance. During peak load, for example a sudden 800% spike in emergency calls during a disaster, resources are automatically distributed across clusters. This balances performance and ensures the fastest possible response time for first responders.
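As a toy model of that load sharing, assuming invented capacity numbers, extra traffic can be split across live clusters in proportion to their spare headroom:

```python
# Toy model of active-active load sharing within a region: an incoming call
# spike is split across live clusters in proportion to their spare capacity.
# All numbers are invented for illustration.

clusters = [
    {"name": "region-a-1", "capacity_rps": 4000, "current_rps": 1000},
    {"name": "region-a-2", "capacity_rps": 4000, "current_rps": 2500},
]

def distribute(extra_rps: int, clusters: list[dict]) -> dict[str, int]:
    """Split extra load proportionally to each cluster's headroom."""
    headroom = {c["name"]: c["capacity_rps"] - c["current_rps"] for c in clusters}
    total = sum(headroom.values())
    return {name: round(extra_rps * room / total) for name, room in headroom.items()}

if __name__ == "__main__":
    # A sudden call spike translated into extra requests per second.
    print(distribute(extra_rps=3200, clusters=clusters))
```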
Coming to regional load balancers with redundancy: each region is fronted by regional load balancers, which distribute incoming traffic among the clusters. These load balancers themselves are redundant; if one fails, another instantly takes over, ensuring there's no single point of failure. They also support intelligent routing of requests within the region.
Emergency services with failover: each individual emergency service, such as dispatching, geolocation, or medical triage systems, is designed and built with failover. For example, if the primary instance of a dispatch service in cluster A becomes unavailable, the system automatically routes to a secondary instance in cluster B without manual intervention. Coming to global traffic management and failover.
This is one of the key components of the architecture. At the global level, we use global traffic managers such as DNS-based routing or cloud native global load balancing. These systems direct user requests to the nearest healthy region, ensuring edge-optimized routing with sub-50-millisecond latency. If an entire region becomes unavailable, for instance due to a natural disaster, global routing automatically fails over to another region.
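Here is a minimal sketch of the routing decision a global traffic manager makes, with hard-coded latencies standing in for real edge measurements: pick the lowest-latency healthy region and skip any region that has gone dark.

```python
# Minimal sketch of global traffic management: send each request to the
# lowest-latency healthy region, and never consider a region that is down.
# Latency figures are hard-coded stand-ins for real edge measurements.

REGIONS = {
    "us-east":    {"latency_ms": 18, "healthy": False},   # taken out by the storm
    "us-west":    {"latency_ms": 42, "healthy": True},
    "eu-central": {"latency_ms": 95, "healthy": True},
}

def pick_region(regions: dict) -> str:
    healthy = {name: r for name, r in regions.items() if r["healthy"]}
    if not healthy:
        raise RuntimeError("no healthy region available")
    # Nearest healthy region wins; failover falls out of the same rule.
    return min(healthy, key=lambda name: healthy[name]["latency_ms"])

if __name__ == "__main__":
    print("routing caller to", pick_region(REGIONS))   # -> us-west
```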
To summarize, this architecture combines failover at every level, service, cluster, regional, and global, with intelligent traffic management. The result is a platform that is resilient, compliant, and ready to weather unexpected failures. This architecture ensures sub-50-millisecond dispatch response times globally.
Coming to service mesh integration: a service mesh adds another layer of resilience. It automatically routes traffic to healthy services, uses circuit breaking to stop cascading failures, and enforces mutual TLS for zero-trust security. Even partial failures do not affect system availability, which is a must for life-critical operations.
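The mesh enforces this at the proxy layer; as a compact illustration of the circuit-breaking idea only, here is a toy breaker that fails fast after repeated errors instead of letting failures cascade:

```python
import time

# Toy circuit breaker: after too many consecutive failures, stop calling the
# unhealthy service for a cool-down period so failures cannot cascade.
# A service mesh applies the same idea in its sidecar proxies.

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open: failing fast instead of cascading")
            self.opened_at, self.failures = None, 0   # half-open: try again
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()     # trip the breaker
            raise
        self.failures = 0
        return result

if __name__ == "__main__":
    breaker = CircuitBreaker()

    def flaky_geolocation_lookup():
        raise TimeoutError("upstream not responding")

    for _ in range(5):
        try:
            breaker.call(flaky_geolocation_lookup)
        except Exception as exc:
            print(type(exc).__name__, "-", exc)
```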
Coming to observability: of course, you can't operate at this scale without visibility. You can't fix what you can't see. Our observability platform ingests 2.3 million metrics per minute and achieves 90% alert accuracy through machine learning and anomaly detection. This enables detection in less than 30 seconds.
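As a sketch of the anomaly-detection idea, not the production ML pipeline, here is a simple rolling-baseline check on a single metric stream with invented numbers:

```python
from statistics import mean, pstdev

# Sketch of anomaly detection on one metric stream: flag samples that sit far
# outside the trailing baseline. Production systems do this with ML models
# across millions of series; the values below are invented.

def detect_anomalies(samples: list[float], window: int = 10,
                     z_threshold: float = 3.0) -> list[int]:
    """Return indices of samples that deviate strongly from the trailing window."""
    anomalies = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu, sigma = mean(baseline), pstdev(baseline)
        if sigma == 0:
            continue
        if abs(samples[i] - mu) / sigma > z_threshold:
            anomalies.append(i)   # candidate alert, raised within seconds
    return anomalies

if __name__ == "__main__":
    call_volume = [100, 102, 98, 101, 99, 103, 100, 97, 102, 101,   # calm baseline
                   100, 99, 840, 990, 1020]                         # hurricane makes landfall
    print("anomalous samples at indices:", detect_anomalies(call_volume))
```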
Coming to policy as code: security and governance are non-negotiable, but governance should not slow teams down. By embedding policy as code, security and compliance are enforced automatically. We enforce runtime policies automatically, continuously validate against frameworks like HIPAA, and secure the supply chain with provenance checks. Take an example: a deployment that violates a security policy is blocked without requiring manual intervention. This approach gives governance without slowing teams down; security and compliance are built in by default.
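Here is a minimal illustration of the policy-as-code pattern, with example rules that are not the actual policy set: a deployment spec is evaluated against machine-readable rules and rejected automatically on violation. In practice this runs in an admission controller such as OPA Gatekeeper or Kyverno.

```python
# Minimal illustration of policy as code: a deployment request is checked
# against machine-readable rules before it reaches the cluster, and blocked
# automatically on violation. The rules below are examples only.

POLICIES = [
    ("tls-required",       lambda spec: spec.get("tls_enabled") is True),
    ("no-privileged-pods", lambda spec: not spec.get("privileged", False)),
    ("approved-registry",  lambda spec: spec.get("image", "").startswith("registry.internal/")),
]

def admit(spec: dict) -> tuple[bool, list[str]]:
    """Return (allowed, list of violated policy names)."""
    violations = [name for name, check in POLICIES if not check(spec)]
    return (len(violations) == 0, violations)

if __name__ == "__main__":
    deployment = {"image": "docker.io/someone/dispatch:latest",
                  "tls_enabled": False, "privileged": True}
    allowed, violations = admit(deployment)
    if not allowed:
        # Blocked automatically, no manual review needed.
        print("deployment rejected:", ", ".join(violations))
```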
Coming to the platform team organization: this is enabled by a 12-person platform team supporting more than 80 developers. The platform team gathers feedback and engineers alongside application teams. Platform engineering is not about doing things for teams; it's about enabling others to move safely.
Measuring platform success: how do we measure success? Developer experience has improved, with 67% faster time to production and 70% fewer support tickets. Operational metrics show improved availability and a 78% decrease in mean time to recovery. There is a 30% reduction in emergency response time, an increase in feature delivery velocity, and a 65% decrease. Ultimately, these are not just numbers. They represent faster responses, better outcomes, and lives saved.
To close, the key takeaways: self-service infrastructure unlocks agility, platform abstractions reduce cognitive load, and golden paths enforce best practices. The journey does not start with boiling the ocean. Begin with one golden path, measure rigorously, and expand.
When platform engineering meets emergency response, technology
directly translates to human impact.
That's why this work matters.
Thank you.