Transcript
Hello everyone. I'm Akra. I have over 17 years of experience in performance engineering. Today I want to talk to you about how we can build reliable, high-performing cloud applications by applying SRE principles.
These days, cloud native applications are everywhere. They're powerful, but also complex and unpredictable, and that's where the challenge lies. How do we make these systems faster, more reliable, and ready to scale when required?
So over the years, I've helped several Fortune 500 companies tackle exactly that.
I'll be sharing a framework that goes beyond just tweaking code or infrastructure. It brings together architecture, engineering practices, and business goals to build truly resilient systems.
Let's dive in.
So now that we have seen the big picture, let's break down the framework we used to build resilient systems. I like to think of it as a layered pyramid: each layer builds on the one below it.
Coming to the implementation fundamentals: in architecture, we focus on clear boundaries, like separating, say, the order service from the billing service using events instead of direct API calls. This improves fault tolerance and reduces coupling. In code, we prevented N+1 database queries, used asynchronous operations where possible, and made sure APIs can handle retries without duplication.
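A minimal sketch of what retry-safe (idempotent) handling can look like, assuming a hypothetical in-memory idempotency store (in practice this would be Redis or a database table, and the names here are illustrative):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Duplicate retries with the same idempotency key return the original
// result instead of re-executing the operation.
public class PaymentService {
    // Hypothetical store; in production this would be Redis or a DB table.
    private final Map<String, String> idempotencyStore = new ConcurrentHashMap<>();

    public String charge(String idempotencyKey, String orderId, long amountCents) {
        // computeIfAbsent executes the charge at most once per key.
        return idempotencyStore.computeIfAbsent(idempotencyKey,
                key -> doCharge(orderId, amountCents));
    }

    private String doCharge(String orderId, long amountCents) {
        // Placeholder for the real payment call.
        return "charged:" + orderId + ":" + amountCents;
    }
}
```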
On the infrastructure side, we used auto-scaling in Kubernetes, infrastructure as code with Terraform, and added read replicas to ease database load.
In the next layer, once the foundation is strong, we embedded performance thinking into our delivery workflows. We integrated performance testing into CI/CD, automated observability, and enforced SLOs throughout the lifecycle. This makes performance everyone's shared responsibility, not just a post-production concern. In layer three, after the engineering practices, when we implemented these practices consistently, we started seeing the results: reduced response times and latencies, improved throughput, and better scaling. The system becomes measurable, predictable, and tunable.
For example, in one case we brought response times down from over 400 milliseconds to just under a hundred milliseconds by optimizing queries and using caching. I'll be talking more about this in the coming slides.
Ultimately, at the business value layer, this leads to real business outcomes: cost savings, happier users, better uptime, and faster releases. In fact, in one of the projects I worked on, we saved almost 40% on cloud cost while handling five times more users, simply by engineering smarter, better systems.
Now that we have covered the foundational framework, these are some of the outcomes we achieved in our production systems. First, we reduced API response times to 200 milliseconds by applying query tuning, optimizing caching layers, and reducing synchronous dependencies. We brought average response times down drastically, from over 400 to 500 milliseconds to 200 milliseconds. For example, we identified unnecessary joins in DB queries, added proper indexing, and introduced in-memory caching for frequently accessed data.
The next outcome was increased throughput under peak load. We achieved almost a 3x improvement in throughput by removing processing bottlenecks. We used techniques like connection pooling and asynchronous processing, and refined thread management to handle more requests per second. Even during peak traffic we achieved this 300% throughput improvement.
The next achievement was a 75% reduction in database table storage. We also addressed database storage efficiency, not just by tuning queries, but by optimizing schemas as well. This included archiving old data, removing unused columns, and normalizing overly de-normalized tables. In one case, we cut down the storage footprint of a key table, as I was saying earlier, by almost 70 to 75%. This also improved performance.
Another major win for us was concurrent users: we were able to scale almost 5x. Before these improvements we were handling about 1,000 concurrent users, but with the same infrastructure, the same number of pods and everything, we were able to scale to 5,000 concurrent users without any degradation in performance. And the best part: these weren't theoretical wins for us; we achieved them in real production systems. When we take a holistic, metric-driven approach to performance and resilience, improvements like this become very predictable and repeatable.
Next, I would like to talk more about the query optimization techniques we used. As systems scale, in my experience almost 60 to 70% of the performance bottlenecks I have seen are in the database layer. In this part of the journey, we focused on optimizing the way our services interact with the data, both at the application layer and at the database layer.
One of the major problems was excessive database load. We were seeing spikes in database CPU and I/O during peak load, slowing down critical APIs. Much of this came from poorly written queries: things like fetching more data than needed, doing expensive joins, and repeatedly querying in loops. One classic issue we tackled was the N+1 query pattern, where for a single API call we were doing N+1 queries, and that simply did not scale. What we did was combine that entire set of queries into one, which improved performance a lot.
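A hedged sketch of that N+1 fix, using plain JDBC and hypothetical table and column names rather than the project's actual data layer: replace per-row lookups in a loop with one set-based query.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class OrderItemLoader {

    // N+1 anti-pattern: one query per order, issued inside a loop.
    List<String> loadItemsNPlusOne(Connection conn, List<Long> orderIds) throws SQLException {
        List<String> items = new ArrayList<>();
        for (Long orderId : orderIds) {
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT item_name FROM order_items WHERE order_id = ?")) {
                ps.setLong(1, orderId);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) items.add(rs.getString(1));
                }
            }
        }
        return items;
    }

    // Single set-based query: one round trip instead of N+1.
    List<String> loadItemsBatched(Connection conn, List<Long> orderIds) throws SQLException {
        String placeholders = String.join(",", Collections.nCopies(orderIds.size(), "?"));
        String sql = "SELECT item_name FROM order_items WHERE order_id IN (" + placeholders + ")";
        List<String> items = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (int i = 0; i < orderIds.size(); i++) {
                ps.setLong(i + 1, orderIds.get(i));
            }
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) items.add(rs.getString(1));
            }
        }
        return items;
    }
}
```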
On the database side, we also added indexing and did query rewriting. We reviewed the most frequently executed and slowest queries using database logs and APM tools like New Relic and Oracle AWR reports. We added missing indexes, rewrote queries to reduce joins, and avoided unnecessary queries. We also used DB query hints. In one scenario, just by creating a composite index, we reduced the query response time from around two seconds to under 50 milliseconds.
Next we concentrated on data access patterns. We worked with the developers to analyze how data was being accessed. For example, if a UI page is showing only 10 records, do we really need to fetch a hundred records? By understanding these kinds of user journeys, we tuned APIs to retrieve only what was actually used and needed, which reduced payload size and DB load.
Another thing we did was strategic denormalization. Sometimes normalization creates too many joins for performance-critical paths, so we de-normalized selectively. For example, we embedded commonly joined fields directly into a reporting table used for dashboards. This removed three joins and made the query roughly 80% faster.
And as I was saying earlier, we focused on database-specific optimizations. For Oracle, we used optimizer hints and SQL plan baselines; for MySQL, we used EXPLAIN plans to tune complex queries. By combining application-level changes with deep per-database knowledge, we cut query volume by almost 85% in several user flows. Query optimization might not sound glamorous, but it's one of the highest-ROI efforts in performance engineering. It directly impacts speed, cost, and user satisfaction.
Next I would like to talk about concurrency, because handling more users is not just about throwing hardware at the problem; it's about using your resources efficiently, especially under peak load. In this phase we focused on how the system handled concurrent load across threads, connection pools, and different services.
The first thing we did was connection pool optimization. We noticed that under load, connection saturation was leading to request delays and even dropped transactions. So we implemented dynamic connection pools, not just the default settings, tuned and adapted to the actual workload characteristics. This alone reduced connection overhead by almost 40% in our busiest services. For example, instead of having one static max/min size, we configured separate connection pools for read-heavy and write-heavy services based on the traffic patterns we analyzed.
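A minimal sketch of that split, assuming HikariCP and illustrative pool sizes rather than the project's actual tuning values:

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class ConnectionPools {

    // Read-heavy traffic: larger pool, pointed at a read replica.
    static HikariDataSource readPool(String replicaJdbcUrl) {
        HikariConfig cfg = new HikariConfig();
        cfg.setJdbcUrl(replicaJdbcUrl);
        cfg.setMaximumPoolSize(40);      // sized from observed read concurrency (illustrative)
        cfg.setMinimumIdle(10);
        cfg.setConnectionTimeout(2_000); // fail fast instead of queueing forever
        cfg.setPoolName("read-pool");
        return new HikariDataSource(cfg);
    }

    // Write-heavy traffic: smaller pool against the primary to protect it.
    static HikariDataSource writePool(String primaryJdbcUrl) {
        HikariConfig cfg = new HikariConfig();
        cfg.setJdbcUrl(primaryJdbcUrl);
        cfg.setMaximumPoolSize(15);
        cfg.setMinimumIdle(5);
        cfg.setConnectionTimeout(2_000);
        cfg.setPoolName("write-pool");
        return new HikariDataSource(cfg);
    }
}
```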
Next, we moved away from blocking threads and shifted to non-blocking, asynchronous logic where possible. One service that processed incoming orders used to block on downstream inventory calls. We rewrote it using an event-driven architecture, decoupling the flow and improving response time even under the highest loads. In the code, we used promises and futures, and reactive libraries like RxJava or Spring WebFlux, to handle these asynchronous flows cleanly.
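A hedged sketch of that shift, using Spring WebFlux's WebClient with a hypothetical inventory endpoint (not the project's exact code): the inventory call becomes non-blocking and composes with the rest of the flow.

```java
import java.time.Duration;

import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Mono;

public class OrderProcessor {

    private final WebClient inventoryClient = WebClient.create("http://inventory-service");

    // Non-blocking call: the calling thread is released while the
    // downstream inventory service responds.
    public Mono<String> process(String orderId, String skuId) {
        return inventoryClient.get()
                .uri("/inventory/{sku}", skuId)
                .retrieve()
                .bodyToMono(String.class)
                .timeout(Duration.ofMillis(500))  // bound the wait on the downstream call
                .onErrorReturn("UNKNOWN")         // degrade gracefully instead of failing the order
                .map(stock -> "order " + orderId + " accepted, stock=" + stock);
    }
}
```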
Next, thread management. Thread contention is one of the hidden performance killers in most applications. What we did was implement custom thread pools, fine-tuned per workload, and we introduced work-stealing algorithms to redistribute idle threads dynamically. We also introduced back pressure in thread management, so when the system was under extreme load, instead of cascading failures we could shed some excess load gracefully.
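A minimal sketch of the bounded-pool and back-pressure idea; the pool sizes and rejection behavior here are illustrative assumptions, not the project's actual tuning:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.RejectedExecutionHandler;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedWorkers {

    public static ThreadPoolExecutor create() {
        // Bounded queue: the backlog cannot grow without limit.
        ArrayBlockingQueue<Runnable> queue = new ArrayBlockingQueue<>(200);

        // Shedding policy: reject excess work explicitly so callers can
        // return a fast "try again later" instead of cascading timeouts.
        RejectedExecutionHandler shedLoad = (task, pool) -> {
            throw new RejectedExecutionException("overloaded, shedding request");
        };

        return new ThreadPoolExecutor(
                8,                 // core threads, sized per workload (illustrative)
                32,                // max threads under burst
                30, TimeUnit.SECONDS,
                queue,
                shedLoad);
    }
}
```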
Another thing we concentrated on was timeout strategy. Timeouts, if you think about it, are like brakes: they prevent runaway resource usage. We designed cascading timeout strategies at every level, from the database to HTTP clients, and we also used circuit breakers to trip and recover services automatically if something went wrong. This approach prevented thread pools from being exhausted when a downstream service failed, and it maintained system stability as well.
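A minimal sketch of that circuit breaker pattern, assuming the Resilience4j library and hypothetical thresholds (the talk does not name the exact library or values):

```java
import java.time.Duration;

import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;

public class PaymentClient {

    private final CircuitBreaker breaker = CircuitBreaker.of("payments",
            CircuitBreakerConfig.custom()
                    .failureRateThreshold(50)                        // trip when half the calls fail
                    .waitDurationInOpenState(Duration.ofSeconds(10)) // then probe again after 10s
                    .build());

    public String charge(String orderId) {
        try {
            // The breaker short-circuits calls while the downstream service is unhealthy.
            return breaker.executeSupplier(() -> callDownstreamPayment(orderId));
        } catch (CallNotPermittedException open) {
            // Fallback keeps the thread pool from piling up on a dead dependency.
            return "PAYMENT_PENDING";
        }
    }

    private String callDownstreamPayment(String orderId) {
        // Placeholder for the real HTTP call, which would carry its own timeout.
        return "CHARGED:" + orderId;
    }
}
```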
By optimizing concurrency across all these layers, we ensured the system remained responsive even when user load spiked. This wasn't just a performance win for us; it was a resiliency enabler as well.
Next I would like to talk about the caching strategies we used. When we think about scaling systems and reducing latency, caching is one of the most powerful tools available if we use it correctly. In this approach, we built a multi-layered caching strategy targeting every layer of the stack, from the client side down to the database.
On the client side, we began with the front end by adding Cache-Control headers, ETags, and service workers. We allowed browsers and mobile apps to reuse previously fetched static data, be it images, CSS files, or JS files; we versioned them and stored them on the client devices. This reduced the number of hits to the server for static content and dropped network traffic by almost 65% for returning users. For example, for user profile images and settings, we cached them locally on the client device and only revalidated them periodically, so that we would not show any stale data. We also used versioning techniques here.
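A small sketch of those caching hints on the server side, assuming a Spring controller returning the profile image (the handler and header values are illustrative):

```java
import java.util.concurrent.TimeUnit;

import org.springframework.http.CacheControl;
import org.springframework.http.ResponseEntity;

public class ProfileImageHandler {

    // Serve a profile image with caching hints so browsers and mobile apps
    // can reuse it and revalidate cheaply instead of re-downloading.
    public ResponseEntity<byte[]> getProfileImage(byte[] imageBytes, String contentHash) {
        return ResponseEntity.ok()
                // Clients may reuse the response for an hour before revalidating.
                .cacheControl(CacheControl.maxAge(1, TimeUnit.HOURS).cachePublic())
                // ETag lets the client revalidate with If-None-Match and get a 304.
                .eTag(contentHash)
                .body(imageBytes);
    }
}
```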
Next, at the API gateway level, we added edge caching, especially for frequently accessed endpoints like product listings, configurations, and pricing-related data that doesn't change very frequently. We also built smart invalidation mechanisms here, so that when data changed in the backend, only the affected cache entries were evicted and rebuilt. This offloaded significant traffic from the app and DB layers and also improved API response times.
The next one was application-level caching. Inside the application we used a mix of in-memory caches and distributed caches like Redis for multi-node environments. We applied the cache-aside pattern, where the application first checks the cache and only queries the DB if the entry is not available in the cache. We also fine-tuned the TTL (time-to-live) values based on how volatile each dataset was; for example, static config was cached for hours, while user-session-related info had shorter TTLs.
Next we concentrated on database result caching. For expensive DB operations, large reports, and complex joins, we cached the result and invalidated it automatically when the relevant data changed. We also used write-through strategies to keep the cache and DB in sync, ensuring consistency without sacrificing query speed. This helped us reduce query volume by 30% during traffic spikes. Overall, caching is not just about speed, it's about control. By implementing the right strategy at each layer, we achieved better responsiveness and scalability without compromising data integrity.
Next I would like to talk about load balancing. Traditional load balancing, like round robin or sticky sessions, just doesn't cut it anymore in these complex, dynamic microservices architectures. In this phase we implemented intelligent load balancing, which adapts in real time to system conditions, traffic patterns, and service health.
First, we did request classification. Not all requests are equal, right? We started by classifying requests based on type, priority, and user type; if a VIP customer is logging in, they get the highest priority. We also considered what kind of resource demand an API call requires. For example, a get-profile call should not be treated the same as, say, a monthly report request. Get-profile should get higher priority so it completes faster and the user is not left waiting, whereas for a generate-monthly-report call, most of the time the user is willing to wait. So we tagged requests at the API gateway level, allowing us to prioritize and route intelligently. High-priority transactions like checkout or login were given fast lanes.
Next we concentrated on routing strategies. Routing was made dynamic: based on real-time health and capacity signals, traffic could be shifted away from overloaded or degraded nodes to healthy ones. We integrated service discovery with health checks, so that if one node slowed down or failed, traffic would automatically reroute with minimal impact. This strategy also helped during our rolling deployments and blue-green releases.
Next, instead of equal load distribution, we used weighted algorithms based on capacity, latency, and even geography. For example, if one node had twice the CPU capacity of another, it got a higher traffic weight. We even used geolocation-based routing to reduce cross-region network latency.
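A hedged sketch of capacity-weighted selection, as a generic weighted-random picker rather than the actual load balancer used in the project:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

public class WeightedBalancer {

    // Node name -> weight (e.g. proportional to CPU capacity).
    private final Map<String, Integer> weights = new LinkedHashMap<>();

    public void addNode(String node, int weight) {
        weights.put(node, weight);
    }

    // Pick a node with probability proportional to its weight, so a node
    // with twice the capacity receives roughly twice the traffic.
    public String pick() {
        int total = weights.values().stream().mapToInt(Integer::intValue).sum();
        int r = ThreadLocalRandom.current().nextInt(total);
        for (Map.Entry<String, Integer> e : weights.entrySet()) {
            r -= e.getValue();
            if (r < 0) {
                return e.getKey();
            }
        }
        throw new IllegalStateException("no nodes registered");
    }
}
```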
Finally, we added continuous health monitoring with graceful degradation built in. If a dependent service, say search or recommendations, became unhealthy, we returned fallbacks or cached data instead of failing outright. This improved the user experience overall and helped immensely during partial outages. Together, these techniques created a smarter, self-aware load-balancing layer that maintained performance and uptime, especially during failure scenarios and peak traffic events. And because it was all observability-driven, we could keep improving routing strategies based on the real telemetry we collected.
Next I would like to talk about observability. When systems grow in complexity, visibility becomes non-negotiable; without the right visibility in place, even small issues can become major outages. So we built a comprehensive observability platform, one that combined metrics, logs, traces, and alerts into a single cohesive feedback loop.
Starting with metrics: metrics gave us a high-level view of how our system was behaving. We used two proven frameworks. One is RED, that is Rate, Errors, Duration, which is great for service and API monitoring. The other framework we used was USE, that is Utilization, Saturation, Errors, which is ideal for infrastructure and resource monitoring. For example, by tracking request rates and latency across services, we spotted bottlenecks long before they caused downtime. We also tracked business KPIs like cart conversion rate or average transaction latency, so teams could correlate technical changes to user impact.
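A hedged sketch of the RED approach in code, assuming the Micrometer library and hypothetical metric names (the talk does not name the metrics stack):

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class CheckoutMetrics {

    private final MeterRegistry registry = new SimpleMeterRegistry();

    // RED: Rate (request counter), Errors (error counter), Duration (latency timer).
    private final Counter requests = Counter.builder("checkout.requests").register(registry);
    private final Counter errors = Counter.builder("checkout.errors").register(registry);
    private final Timer duration = Timer.builder("checkout.duration").register(registry);

    public void handleCheckout(Runnable checkout) {
        requests.increment();
        try {
            duration.record(checkout); // records how long the checkout took
        } catch (RuntimeException e) {
            errors.increment();
            throw e;
        }
    }
}
```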
Next, logs. Metrics tell us something is wrong, but logs tell us why it is wrong. We standardized on structured logging, including error codes, request IDs, user IDs, and the relevant payloads. With correlation IDs, we could trace a single user request across, say, dozens of microservices. We also enriched the logs with contextual data like region, environment, and feature flags to make debugging faster.
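A minimal sketch of correlation-ID propagation in structured logs, assuming SLF4J's MDC (the exact logging setup is not specified in the talk):

```java
import java.util.UUID;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class CorrelatedHandler {

    private static final Logger log = LoggerFactory.getLogger(CorrelatedHandler.class);

    public void handle(String incomingCorrelationId, String userId) {
        // Reuse the caller's correlation ID if present, otherwise mint one.
        String correlationId = incomingCorrelationId != null
                ? incomingCorrelationId
                : UUID.randomUUID().toString();
        MDC.put("correlationId", correlationId);
        MDC.put("userId", userId);
        try {
            // Every log line in this request now carries the same correlation ID,
            // so it can be traced across services that forward the header.
            log.info("processing request");
        } finally {
            MDC.clear();
        }
    }
}
```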
Next, traces. We implemented distributed tracing using tools like OpenTelemetry and Jaeger. This allowed us to visualize how a request flowed across multiple services, from front end to backend, and see exactly where delays occurred. For example, we identified that 60% of the latency in the checkout flow came from a downstream payment system. We would not have caught that with logs alone; here, traces helped us.
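A hedged sketch of manually instrumenting a span with the OpenTelemetry Java API; the span and attribute names are illustrative, not from the project:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class CheckoutTracing {

    private final Tracer tracer = GlobalOpenTelemetry.getTracer("checkout-service");

    public void chargePayment(String orderId) {
        // One span per downstream call makes the payment system's share
        // of checkout latency visible in the trace view.
        Span span = tracer.spanBuilder("payment.charge").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            span.setAttribute("order.id", orderId);
            callPaymentProvider(orderId);
        } catch (RuntimeException e) {
            span.recordException(e);
            span.setStatus(StatusCode.ERROR);
            throw e;
        } finally {
            span.end();
        }
    }

    private void callPaymentProvider(String orderId) {
        // Placeholder for the real downstream call.
    }
}
```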
Next, alerts. Instead of naive threshold-based alerts, we used SLA- and SLO-driven alerting, which only triggers when actual reliability goals are at risk. We also implemented alert correlation, so that a cascade of downstream error alerts would not overwhelm support engineers with redundant notifications. This led to fewer false alarms, a better on-call experience, and faster resolution times. The end result of this observability framework is that engineering teams get near real-time, end-to-end visibility into their services, making it easier to spot, investigate, and fix issues before users even notice them. And just as importantly, we tied all this observability back to business goals, so every alert or dashboard was rooted in real impact.
Next I would like to talk about data-driven capacity planning. One of the biggest challenges in any large-scale system is balancing cost with performance. If you overprovision, that leads to unnecessary cloud spend; but if you underprovision, you may face outages. Our answer to this is data-driven capacity planning. First, we started with a deep analysis of historical utilization, looking at CPU, memory, IOPS, and network usage across services. We profiled each component and identified seasonal traffic patterns, for example, usage surging on Monday mornings or traffic spikes during end-of-quarter events. We also built anomaly filters to distinguish a real spike from a one-off random spike, so we would not scale infrastructure based on outliers. This helped us establish accurate baselines for each environment and each service.
Next we did growth modeling. We applied statistical models and machine learning to forecast how usage would evolve based on product growth and business events. We factored in new feature rollouts, marketing campaigns, and customer onboarding timelines. For example, we modeled several scenarios: best case, worst case, and expected case, each associated with a confidence level. This gave product and infrastructure teams a shared, data-backed plan they could lean on.
Next, resource optimization. Based on those projections, we implemented dynamic scaling strategies across services and infrastructure. We used Kubernetes horizontal pod autoscalers and cloud-native scaling policies for database and messaging systems. We also defined resource utilization targets, so services would scale only when truly needed and scale back when idle. This approach allowed us to strike the right balance between performance headroom and cost efficiency. In one case, we reduced monthly cloud spend by almost 30% while maintaining 99.99% uptime, just by optimizing our provisioning strategy. The key takeaway is that capacity planning should not be a guessing game. When it's data-driven, predictive, and aligned with business growth, it becomes a strategic advantage.
Next I would like to talk about performance testing, which is crucial. Traditionally, performance testing has been seen as a one-time phase at the end of the product development life cycle, but in modern delivery pipelines that approach does not work anymore. We treated performance testing as a first-class citizen in CI/CD, from the smallest function to full-scale load tests.
First, unit tests. We started with unit-level performance tests: fast, lightweight benchmarks embedded directly into the build. If a core function, say a pricing calculation, suddenly slowed down by 2x, we could catch it immediately during a pull request. This helped developers fix regressions before they reached integration or staging environments.
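A minimal sketch of such a unit-level benchmark, assuming JMH (the talk does not name the benchmarking tool) and a hypothetical pricing function:

```java
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class PricingBenchmark {

    // Benchmarked in the build; a CI gate can fail the pull request
    // if the average time regresses past an agreed threshold.
    @Benchmark
    public double priceWithDiscount() {
        return calculatePrice(100.0, 0.15, 0.08);
    }

    // Hypothetical pricing calculation standing in for the real one.
    private double calculatePrice(double base, double discount, double taxRate) {
        double discounted = base * (1.0 - discount);
        return discounted * (1.0 + taxRate);
    }
}
```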
Next we did service-level isolated testing. Each microservice had its own performance benchmark suite, run in isolated environments, where we measured service-level latency, throughput, memory usage, and error rates under controlled conditions. We used mocks here to simulate downstream systems. This allowed teams to independently tune their services without needing to test the entire system every time. For example, a catalog service could be validated against a known data set or a known mock endpoint for consistent response times under different load scenarios.
Next we did integration performance testing, where we tested across multiple workflows to see how services behaved when chained together. This revealed issues like cumulative latency, serialization bottlenecks, and inefficient retry logic. In one case, a checkout flow that was fine in isolation showed almost a two-second delay when we tested it end to end. The culprit was a synchronous email service holding up the whole flow.
Finally, we did full-scale testing in a production-like pre-production environment. Before any major release, we ran production-like load tests in pre-prod, simulating real traffic patterns with realistic data and concurrency. We used tools like JMeter and Gatling to mimic user load across geographies. Those results mapped directly to SLOs: if any service missed its latency or error target, the deployment was paused automatically. By integrating performance testing throughout the pipeline, we shifted left and caught performance regressions before they reached production. This not only improved reliability, but also built confidence across teams that every release was truly ready for scale.
Up until now we have talked about optimization, scaling, and efficiency, but what happens when things go wrong? In real-world systems, failure is unavoidable. That's why we introduced chaos engineering, a discipline that lets us prepare for failure before it actually happens. For chaos engineering, the first thing we did was hypothesis formation. We started by forming clear, testable hypotheses about how the system might fail. For example: if the inventory service goes down, can checkout still succeed using cached inventory data? Or if database latency spikes for 30 seconds, will the user session expire gracefully or just degrade performance? We focused these tests on critical business flows: sign-up, login, checkout, and data ingestion.
Once the hypotheses were formed, we designed the experiments. We designed experiments to simulate real failure modes; that could mean killing a pod, injecting latency, or simulating a dependency failure. The key here was control: we scoped the blast radius, applied the experiments in isolated environments first, and monitored every step. In one case we introduced 30 seconds of latency into our payment provider integration and watched how our retry policies handled it. We executed these tests progressively and in a controlled manner, starting in dev, moving to staging, and then selectively into production using canary deployments. Each experiment had auto-termination criteria: if key metrics like error rate or latency crossed safe thresholds, the test stopped immediately. This ensured we did not cause actual harm, especially in production, while still uncovering real risks.
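A hedged, generic sketch of that latency-injection experiment (the talk does not name a chaos tool, so this is just an illustrative fault-injection wrapper around a downstream client):

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

// Illustrative fault-injection wrapper for a downstream client:
// a configurable fraction of calls is delayed to simulate a slow dependency.
public class LatencyInjectingPaymentClient {

    private final double injectionRate;   // e.g. 0.2 = 20% of calls are delayed
    private final Duration injectedDelay; // e.g. 30 seconds for the payment experiment

    public LatencyInjectingPaymentClient(double injectionRate, Duration injectedDelay) {
        this.injectionRate = injectionRate;
        this.injectedDelay = injectedDelay;
    }

    public String charge(String orderId) throws InterruptedException {
        if (ThreadLocalRandom.current().nextDouble() < injectionRate) {
            // Simulated slow payment provider; retry policies and timeouts
            // are observed against this injected delay.
            Thread.sleep(injectedDelay.toMillis());
        }
        return callRealPaymentProvider(orderId);
    }

    private String callRealPaymentProvider(String orderId) {
        return "CHARGED:" + orderId;
    }
}
```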
The next point is the most important: we did not stop at discovery, we converted findings into actions. We added circuit breakers where needed, improved retry logic, and made our fallbacks smarter. For example, one test revealed a slow downstream search API stalling the homepage; we implemented a timeout plus a cached fallback, cutting failover time from minutes to seconds. We also automated many of these recovery patterns, so systems could self-heal instead of waiting for human intervention. Overall, chaos engineering gave us a proactive resiliency mindset. We stopped fearing failure and started designing systems that could withstand and adapt to it.
Thank you, that brings me to the end of my session. Thank you all for being here and taking the time to explore this journey with me. We have gone far beyond just tuning response times or adding a few dashboards. What we have seen here is a complete, layered approach, one that brings together architecture, code, infrastructure, and automation. By engineering with reliability in mind right from the ground up, we can build cloud native systems that are not only fast and scalable, but resilient by design. My hope is that today's talk sparked new ideas for how your teams can shift left on performance, break silos, and develop systems that perform well. Thank you again for joining me.