Transcript
This transcript was autogenerated. To make changes, submit a PR.
Good evening everyone.
This is Karthik Red, a senior site reliability engineer specializing in cloud native platform modernization and event driven architecture.
Over the past decade, my work has focused on building resilient, scalable, and self-healing systems on Azure Kubernetes Service, with an emphasis on automation, observability, and zero downtime operations.
My journey in SRE has been about transforming large traditional systems into high availability, real time platforms using open source technologies and modern DevOps practices.
Today I'll be sharing insights from my recent work on Kubernetes native platform modernization.
Before getting into the topic, I would like to share how the industry has been evolving.
Over the last decade, we have seen cloud architecture evolve from static VMs to dynamic container workloads that can respond to change in real time.
Kubernetes has become the backbone of this transformation, not just as a container orchestrator, but as an enabler of declarative automation, resilience, and portability.
What's exciting today is how teams are rethinking reliability, shifting from reactive firefighting to proactive engineering, where the platform itself anticipates and recovers from failure.
This evolution is not just technical, it's cultural.
It's about empowering developers, automating recovery, and bridging observability with operations.
So what are we going to learn today? Three things: battle tested design patterns, real world deployment insights, and a comprehensive pattern catalog.
With the battle tested design patterns, we discover proven patterns for building self-healing, cloud native streaming systems that ensure throughput, consistency, and availability at scale.
With the real world deployment insights, we gain critical lessons from high volume event stream deployments processing mission critical data in production environments.
And with the comprehensive pattern catalog, we acquire concrete tactics and a robust toolkit for designing streaming pipelines that withstand failure and recover gracefully.
So let's get into architecting resilient streaming foundations.
Resilient streaming systems demand an architecture built to withstand the unpredictable nature of distributed environments.
Such an architecture must simultaneously uphold three critical properties: throughput, consistency, and availability.
Throughput means sustaining high volume message processing without performance degradation; consistency means preserving data integrity and message order despite failures; and availability means ensuring continuous service operation even amid component failures.
These critical properties embody the CAP style trade offs inherent in designing any cloud native streaming system.
So let's discuss Redis Streams, which gives us a replayable event log, absorbs downstream failures, and coordinates consumer groups.
As a replayable event log, Redis Streams maintains an append only log, enabling deterministic event replay; messages persist with a unique ID and timestamp, so the sequence can be precisely reconstructed after a failure.
Absorbing downstream failures, this resilient buffer soaks up temporary downstream outages and prevents data loss, while automatic back pressure management safeguards against memory exhaustion during extended outages.
And for coordinating consumer groups, built in consumer group management ensures load distribution and automatic failover; failed consumers resume processing from their last acknowledged positions, maintaining continuity.
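To make that concrete, here is a minimal sketch of the consumer group pattern just described, using the redis-py client. The stream, group, and consumer names are illustrative placeholders, not values from the deployment discussed in the talk.

```python
import redis

def handle(fields: dict) -> None:
    """Placeholder for real processing logic."""
    print("processing", fields)

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

STREAM, GROUP, CONSUMER = "events", "billing-processors", "worker-1"  # illustrative names

# Create the consumer group once; the stream itself is the append-only log.
try:
    r.xgroup_create(STREAM, GROUP, id="0", mkstream=True)
except redis.ResponseError:
    pass  # group already exists

# Producer side: Redis assigns each entry a unique timestamp-sequence ID.
r.xadd(STREAM, {"type": "invoice.created", "amount": "42.50"})

# Consumer side: read messages not yet delivered to this group (">"),
# process them, then acknowledge, so a restarted consumer resumes from
# its last acknowledged position instead of losing or re-reading work.
while True:
    entries = r.xreadgroup(GROUP, CONSUMER, {STREAM: ">"}, count=10, block=5000)
    for _, messages in entries or []:
        for msg_id, fields in messages:
            handle(fields)
            r.xack(STREAM, GROUP, msg_id)
```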
On top of this, the platform itself has been modernized.
In large enterprises, especially those with decades of legacy systems, modernization is not just about containerization; it's about transforming the entire delivery model, moving from nightly batch jobs and static APIs to event driven, scalable services that can handle millions of requests per day.
The challenge often lies in maintaining service SLAs while ensuring compliance and security.
That's where reliability engineering principles intersect with Kubernetes native design.
So let's discuss circuit breakers, which mitigate cascading failures.
There are three states: the closed state, the open state, and the half open state.
In the closed state, requests flow normally while the system monitors error rates and response times for anomalies.
In the open state, when failure thresholds are exceeded, the circuit opens, causing requests to fail immediately. This prevents resource exhaustion and cascading failures.
In the half open state, periodic health checks are conducted to assess service recovery, and traffic is gradually restored as the downstream service stabilizes.
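As a minimal sketch of those three states, here is an illustrative circuit breaker in Python. The thresholds and timings are placeholder assumptions, not values from the talk.

```python
import time

class CircuitBreaker:
    """Illustrative three-state circuit breaker: CLOSED -> OPEN -> HALF_OPEN."""

    def __init__(self, failure_threshold=5, recovery_timeout=30.0):
        self.failure_threshold = failure_threshold   # failures before opening
        self.recovery_timeout = recovery_timeout     # seconds before a trial call
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            # Fail fast until the recovery timeout elapses, then allow one trial call.
            if time.monotonic() - self.opened_at < self.recovery_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.state = "HALF_OPEN"
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = time.monotonic()
            raise
        # Success: a half-open trial (or a normal call) closes the circuit again.
        self.failures = 0
        self.state = "CLOSED"
        return result
```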
So let's discuss reliable event replay for system consistency.
Event replay has three aspects: preserve message order, restore system state efficiently, and ensure idempotent processing.
To preserve message order, we employ partition keys and sequence numbers to ensure deterministic message ordering during replay. This is crucial for maintaining state consistency across distributed systems.
To restore system state efficiently, we harness event sourcing patterns combined with snapshots and incremental replay to quickly restore system state after failures, minimizing recovery time while preserving data integrity.
To ensure idempotent processing, we design message handlers to be idempotent, allowing safe replay of duplicate messages without side effects.
This guarantees exactly once processing semantics in distributed environments.
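Here is a minimal sketch of an idempotent handler along the lines described above, deduplicating on a message ID before applying side effects. The in-memory set stands in for whatever durable dedup store a real deployment would use, and the event fields are illustrative.

```python
processed_ids = set()  # in production this would be a durable store (e.g. a database table)

def apply_side_effect(event: dict) -> None:
    """Placeholder for the real effect, e.g. updating a balance or writing a record."""
    print("applying", event["type"], "seq", event.get("sequence"))

def handle_event(event: dict) -> None:
    """Apply an event at most once, so replaying duplicates is safe."""
    event_id = event["id"]            # unique ID assigned at publish time
    if event_id in processed_ids:
        return                        # duplicate from a replay: skip the side effect
    apply_side_effect(event)
    processed_ids.add(event_id)       # record only after the effect succeeds

# Replaying the same events (in partition/sequence order) leaves the state unchanged.
events = [{"id": "e1", "type": "invoice.created", "sequence": 1}]
for e in events + events:
    handle_event(e)
```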
So now let's discuss Kubernetes microservices: partitioning, event driven scaling, and service mesh integration.
With Event Hub partitioning, we use partitioning to distribute load among consumers while preserving message order.
When we talk about event driven architectures, one of the most overlooked critical components is Event Hub partitioning, which plays a major role in parallelism and ordering: each partition acts as an independent commit log, allowing multiple consumers to process events concurrently while preserving order within that partition.
The key is finding the right balance between partition count and consumer scalability.
In one of our implementations, we tuned partition counts dynamically based on throughput trends; for example, billing events spike during business hours, while asynchronous runs happen after nightly batches.
This approach helped maintain high throughput without hitting throttling limits or over provisioning resources.
From a site reliability engineering perspective, partitioning is not just about performance, it's about resilience: it isolates failure domains.
If one consumer group or partition experiences a failure, the others continue processing seamlessly.
In short, well planned partitioning gives you both scalability and fault isolation, the twin pillars of reliability in any streaming system.
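To illustrate the ordering guarantee, here is a small sketch of key based partition assignment: events that share a partition key always land on the same partition, so per key order is preserved while partitions are consumed in parallel. The partition count and key names are illustrative, not the values used in the deployment described.

```python
import hashlib
from collections import defaultdict

PARTITION_COUNT = 8  # illustrative; tune against throughput trends

def partition_for(key: str, partitions: int = PARTITION_COUNT) -> int:
    """Stable hash of the partition key -> partition index.

    All events for one key (e.g. one customer) map to one partition,
    which is what preserves their relative order.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partitions

# Events for the same customer stay ordered; different customers spread out.
assignments = defaultdict(list)
for seq, customer in enumerate(["cust-1", "cust-2", "cust-1", "cust-3"]):
    assignments[partition_for(customer)].append((customer, seq))

for partition, events in sorted(assignments.items()):
    print(f"partition {partition}: {events}")
```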
The second one is KEDA, Kubernetes event driven autoscaling.
Traditional autoscaling in Kubernetes is often metric driven, based on CPU, memory, or request rates, but in event driven systems the real indicator of load is queue depth or event lag. This is where KEDA shines.
KEDA allows you to scale workloads based on external metrics, in our case Event Hub consumer lag.
When the queue backlog increases, new pods are automatically spawned; as the queue drains, they gracefully scale back down.
This dynamic responsiveness eliminates idle compute and reduces cost while maintaining near real time processing.
What I like most about KEDA is its operational simplicity: it integrates with the HPA, the Horizontal Pod Autoscaler, and works natively with Prometheus metrics.
From an SRE perspective, it's a perfect balance of performance and cost efficiency: scaling not by guesswork, but by data.
It's also very observability friendly: you can trace autoscaling events alongside system metrics, which helps correlate scaling behavior with real user or batch workload patterns.
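KEDA itself is configured declaratively, with a ScaledObject pointing at a lag metric, but the scaling decision it feeds into the HPA boils down to a simple ratio. Here is a hedged Python sketch of that lag based calculation; the target lag per replica and the replica bounds are placeholder assumptions, not values from the talk.

```python
import math

def desired_replicas(current_lag: int,
                     target_lag_per_replica: int = 1000,
                     min_replicas: int = 1,
                     max_replicas: int = 30) -> int:
    """Lag-driven scaling decision, analogous to what an event-driven autoscaler computes.

    current_lag:            unprocessed events (e.g. Event Hub consumer lag)
    target_lag_per_replica: backlog one pod is expected to work through
    """
    wanted = math.ceil(current_lag / target_lag_per_replica)
    return max(min_replicas, min(max_replicas, wanted))

# Backlog grows during business hours -> scale out; queue drains -> scale back in.
for lag in (0, 1500, 25000, 400):
    print(f"lag={lag:>6} -> replicas={desired_replicas(lag)}")
```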
Next comes service mesh integration.
Service mesh integration is one of the most transformative shifts in Kubernetes networking.
As microservices grow, so does the complexity of managing communication: authentication, retries, encryption, and observability.
A service mesh like Linkerd abstracts these concerns out of the application and handles them at the infrastructure layer.
For example, by enabling mTLS, mutual TLS, we ensure secure service to service communication without changing a single line of code.
From a reliability standpoint, a service mesh gives you powerful traffic management: circuit breaking with backoff and canary routing are all policy driven.
Observability is another major advantage: with built in telemetry, we can trace latency, identify bottlenecks, and detect cascading failures before they impact end users.
So now let's look at automated failure detection: as we discussed, liveness probes, readiness probes, container restart policies, and HPA scaling responses.
Liveness probes verify application health via HTTP endpoints and trigger a container restart for unresponsive services; set appropriate timeouts and failure thresholds.
Readiness probes use startup checks to remove unhealthy pods from service discovery until they are ready to serve traffic, preventing requests from being routed to degraded instances.
Container restart policies utilize Kubernetes' built in restart mechanism with exponential backoff for automatic recovery from transient failures, eliminating manual intervention.
For the HPA scaling response, configure Horizontal Pod Autoscaler policies to swiftly scale replicas based on CPU, memory, or custom metrics, such as the Event Hub message processing lag we discussed.
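The probes themselves are declared in the pod spec, but they only work if the application exposes health endpoints. Here is a minimal application-side sketch using Python's standard library; the /healthz and /ready paths and the port are illustrative conventions, not values prescribed in the talk.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

ready = False  # flip to True once caches are warm, connections established, etc.

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # Liveness: the process is up and able to answer; repeated failures
            # here tell Kubernetes to restart the container.
            self.send_response(200)
        elif self.path == "/ready":
            # Readiness: only advertise the pod for traffic once it can serve;
            # a 503 removes it from service discovery without a restart.
            self.send_response(200 if ready else 503)
        else:
            self.send_response(404)
        self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep probe traffic out of the application logs

if __name__ == "__main__":
    ready = True  # startup work would complete before this point
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```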
Now, observability with Prometheus: establishing baselines for anomaly detection.
Establishing robust baselines of normal performance metrics is crucial for identifying deviations that signal potential issues.
Pay close attention to these critical indicators: one, message throughput; two, consumer lag; three, error rate thresholds; and four, resource utilization.
Message throughput monitors volume and processing speed; consumer lag tracks consumer delays and message backlog; error rate thresholds define acceptable error rates and alert on breaches; and resource utilization observes CPU, memory, network, and disk usage for anomalies.
Regularly update these baselines so that automated alerting and scaling decisions stay effective.
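As a sketch of the baseline idea, here is a simple rolling baseline in Python that flags consumer lag samples deviating sharply from recent history. The window size and sigma threshold are illustrative assumptions, and a real setup would typically express this as Prometheus recording and alerting rules instead.

```python
from collections import deque
from statistics import mean, pstdev

class Baseline:
    """Rolling baseline over recent samples; flags values far above normal."""

    def __init__(self, window: int = 60, sigma: float = 3.0):
        self.samples = deque(maxlen=window)  # e.g. one sample per scrape interval
        self.sigma = sigma                   # how many std deviations counts as anomalous

    def observe(self, value: float) -> bool:
        anomalous = False
        if len(self.samples) >= 10:          # need some history before judging
            mu, sd = mean(self.samples), pstdev(self.samples)
            anomalous = value > mu + self.sigma * max(sd, 1.0)
        self.samples.append(value)           # the baseline keeps adapting
        return anomalous

lag_baseline = Baseline()
for lag in [120, 130, 110, 125, 118, 122, 127, 115, 121, 119, 4000]:
    if lag_baseline.observe(lag):
        print(f"consumer lag anomaly detected: {lag}")
```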
When it comes to observability, one of the biggest breakthroughs for me personally has been operationalizing it: going beyond logs and metrics to actionable insight.
For example, distributed tracing combined with real time dashboards allows us to visualize transaction journeys end to end.
Instead of treating incidents as isolated problems, we now view them as signals to improve system design.
When observability is embedded into the platform from day one, mean time to detection and recovery drops dramatically and engineers gain confidence in continuous delivery.
For visual bottleneck identification, we use heat map analysis techniques: harness Grafana's advanced visualizations to create comprehensive heat maps pinpointing performance bottlenecks and capacity constraints across your streaming pipeline.
Visually identify processing hot spots, uneven load distribution, temporal patterns across microservices, and network and storage I/O bottlenecks.
The color coded intensity maps provide an intuitive understanding of system behavior, guiding optimization efforts based on actual performance impact.
So, architecting for resilience: this is the pattern catalog, where we modernize legacy ETL, build NextGen event platforms, and ensure SLA compliance.
When it comes to resilience, automation plays a pivotal role in achieving zero downtime: deployment pipelines and progressive delivery techniques like blue green and canary releases with Argo Rollouts control risk while maintaining velocity.
Automation is not only about deployment, it's also about recovery.
Self-healing systems that restart pods, rotate secrets, or fail over seamlessly have become the foundation of reliability.
The ultimate goal is to make failure predictable and recoverable so teams can focus on innovation instead of firefighting.
In modernizing legacy ETL, we transform traditional batch pipelines into resilient streaming architectures, leveraging event driven patterns and cloud native orchestration.
In building NextGen event platforms, we architect autoscaling, event driven systems with comprehensive observability built in from day one.
And to ensure SLA compliance, we implement proactive monitoring and automated recovery to meet stringent availability and performance requirements for mission critical environments.
So, to wrap up architecting resilient streaming systems: production ready patterns, automation with Kubernetes, and comprehensive observability.
For production ready patterns, leverage proven patterns such as Redis Streams, circuit breakers, and deterministic replay strategies to mitigate real world failure scenarios.
To automate with Kubernetes, utilize liveness probes, readiness checks, and HPA policies to create self-healing systems and ensure graceful recovery with minimal intervention.
And for comprehensive observability, implement Prometheus and Grafana monitoring with metric baselines, lag detection, and visual heat maps to uphold SLA compliance in critical environments.
As we look ahead, the future of SRE and Kubernetes is heading toward even greater abstraction: serverless orchestration, policy driven automation, and AI assisted operations.
The combination of AI based anomaly detection with event driven remediation will reshape how we approach reliability at scale.
The next phase is not just about automation; it's about system intelligence: platforms that learn from incidents, predict degradation, and adapt autonomously.
This is where the boundary between human insight and machine precision truly starts to blur.
And so, that's it.
Thank you all for joining today.
I hope the session gave you a practical view of how Kubernetes native design, automation, and observability can truly elevate reliability in modern cloud platforms.
It's been great sharing my experience, and I'd love to continue the conversation.
Feel free to contact me or reach out if you would like to discuss resilience engineering or cloud native modernization further.
Thank you again for your time.
Thank you.