Conf42 Site Reliability Engineering (SRE) 2025 - Online

- premiere 5PM GMT

Beyond Optimization: Engineering Resilient Cloud Microservices with SRE Principles at Scale


Abstract

Discover how SRE principles transform cloud microservices performance at scale. I’ll share battle-tested strategies that slashed response times, optimized resources, and strengthened resilience. Learn practical techniques that deliver both technical excellence and measurable business value.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. I'm Sudhakar, and I have over 17 years of experience in performance engineering. Today I want to talk about how we can build reliable, high-performing cloud applications by applying SRE principles. Cloud-native applications are everywhere these days. They're powerful, but also complex and unpredictable, and that's where the challenge lies: how do we make these systems faster, more reliable, and ready to scale when required? Over the years I've helped several Fortune 500 companies tackle exactly that, and I'll be sharing a framework that goes beyond just tweaking code or infrastructure. It brings together architecture, engineering practices, and business goals to build truly resilient systems. Let's dive in.

Now that we have seen the big picture, let's break down the framework we used to build resilient systems. I like to think of it as a layered pyramid, where each layer builds on the one below it. The first layer is implementation fundamentals. In architecture, we focused on clear boundaries, like separating an order service from a billing service and using events instead of direct API calls; this improves fault tolerance and reduces coupling. In code, we prevented N+1 database queries and made sure APIs could handle retries without duplication. On the infrastructure side, we used auto-scaling in Kubernetes, infrastructure as code with Terraform, and added read replicas to ease database read load.

In the next layer, once the foundation was strong, we embedded performance thinking into our delivery workflows: we integrated performance testing into CI/CD, automated observability, and enforced SLOs throughout the lifecycle. This makes performance everyone's responsibility, not just a post-production concern. In layer three, when we applied these engineering practices consistently, we started seeing the results: reduced response times and latencies, improved throughput, and better scaling. The system becomes measurable, predictable, and tunable. For example, in one case we brought response times down from over 400 milliseconds to just under 100 milliseconds by optimizing queries and using caching; I'll talk more about this in later slides. Ultimately, at the business-value layer, this leads to real business outcomes: cost savings, happier users, better uptime, and faster releases. In fact, in one project I worked on, we saved almost 40% on cloud cost while handling five times more users, simply by engineering smarter, better systems.

Now that we have covered the foundational framework, these are some of the outcomes we achieved in our production systems. First, we reduced API response times to under 200 milliseconds by applying query tuning, optimizing caching layers, and reducing synchronous dependencies; average response times dropped drastically from over 400 to 500 milliseconds down to under 200 milliseconds. To get there, for example, we identified unnecessary joins in DB queries, added proper indexing, and introduced in-memory caching for frequently accessed data. The next outcome was a throughput increase under peak load: we achieved almost a 3x improvement in throughput by removing processing bottlenecks.
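Before getting into those throughput techniques, here is a small illustration of the join and N+1 cleanup just mentioned. This is a minimal sketch, not code from the talk: it assumes plain JDBC and hypothetical orders and order_items tables.

```java
import java.sql.*;
import java.util.*;

public class OrderItemsDao {

    // Anti-pattern (N+1): one query for the orders, then one more query per order for its items.
    public Map<Long, List<String>> itemsPerOrderNPlusOne(Connection conn, long customerId) throws SQLException {
        Map<Long, List<String>> result = new HashMap<>();
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT id FROM orders WHERE customer_id = ?")) {
            ps.setLong(1, customerId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    long orderId = rs.getLong("id");
                    // Extra round trip for every order -- this is what hurts under load.
                    result.put(orderId, itemsForOrder(conn, orderId));
                }
            }
        }
        return result;
    }

    // Consolidated version: one joined query returns everything in a single round trip.
    public Map<Long, List<String>> itemsPerOrderSingleQuery(Connection conn, long customerId) throws SQLException {
        Map<Long, List<String>> result = new HashMap<>();
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT o.id AS order_id, i.sku " +
                "FROM orders o JOIN order_items i ON i.order_id = o.id " +
                "WHERE o.customer_id = ?")) {
            ps.setLong(1, customerId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    result.computeIfAbsent(rs.getLong("order_id"), k -> new ArrayList<>())
                          .add(rs.getString("sku"));
                }
            }
        }
        return result;
    }

    private List<String> itemsForOrder(Connection conn, long orderId) throws SQLException {
        List<String> items = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT sku FROM order_items WHERE order_id = ?")) {
            ps.setLong(1, orderId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) items.add(rs.getString("sku"));
            }
        }
        return items;
    }
}
```

Pairing the consolidated query with the right indexes on the joined columns is what keeps that single query cheap as the data grows.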
On the throughput side, we used techniques like connection pooling, asynchronous processing, and refined thread management to handle more requests per second; even during peak traffic we achieved that 300% throughput improvement. The next achievement was a 75% reduction in database table storage. We addressed storage efficiency not just by tuning queries but also by optimizing schemas. This included archiving old data, removing unused columns, and normalizing overly de-normalized tables. In one case we cut the storage footprint of a key table by almost 75%, which also improved performance. Another major win was concurrent users: we were able to scale almost 5x. Before these improvements we were handling about 1,000 concurrent users, but with the same infrastructure and the same number of pods we were able to scale to 5,000 concurrent users without any degradation in performance. And the best part: these weren't theoretical wins. We achieved them in real production systems, and when you take a holistic, metric-driven approach to performance and resilience, improvements like this become predictable and repeatable.

Next I would like to talk more about the query optimization techniques we used. As systems scale, in my experience almost 60 to 70% of the performance bottlenecks I have seen are in the database layer. In this part of the journey we focused on optimizing the way our services interact with data, both at the application layer and at the database layer. One of the major problems was excessive database load: we were seeing spikes in database CPU and I/O during peak load, slowing down critical APIs. Much of this came from poorly written queries, such as fetching more data than needed, doing expensive joins, and repeatedly querying in loops. One classic issue we tackled was the N+1 query pattern: for a single API call we were doing N+1 queries, which hurt performance as the data grew, so we consolidated those queries into one, and this improved performance a lot.

We also worked on indexing and query rewriting. We reviewed the most frequently executed and slowest queries using database logs and APM tools like New Relic, plus Oracle AWR reports. We added missing indexes, rewrote queries to reduce joins, avoided unnecessary queries, and used DB query hints where appropriate. In one scenario, just by creating composite indexes, we reduced a query's response time from around two seconds to under 50 milliseconds. Next we concentrated on data access patterns. We worked with the developers to analyze how data was being accessed: if a UI page shows only 10 records, do we really need to fetch a hundred? By understanding these user journeys, we tuned APIs to retrieve only what was actually used and needed, which reduced payload sizes and DB load. Another thing we did was strategic denormalization: normalization sometimes creates too many joins on performance-critical paths, so we denormalized, but only selectively.
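Before the denormalization example, here is a minimal sketch of the "fetch only what the page actually shows" tuning described above. It assumes plain JDBC, a hypothetical orders table, and MySQL/PostgreSQL-style LIMIT/OFFSET syntax; none of this is from the talk itself.

```java
import java.sql.*;
import java.util.*;

public class OrderPageDao {

    /**
     * Fetch exactly one page of recent orders for a customer instead of pulling
     * hundreds of rows and discarding most of them in the application layer.
     */
    public List<Long> recentOrderIds(Connection conn, long customerId, int pageSize, int pageNumber)
            throws SQLException {
        List<Long> ids = new ArrayList<>(pageSize);
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT id FROM orders " +
                "WHERE customer_id = ? " +
                "ORDER BY created_at DESC " +
                "LIMIT ? OFFSET ?")) {
            ps.setLong(1, customerId);
            ps.setInt(2, pageSize);              // e.g. the 10 rows the UI actually renders
            ps.setInt(3, pageSize * pageNumber); // skip earlier pages
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    ids.add(rs.getLong("id"));
                }
            }
        }
        return ids;
    }
}
```

For deep pages, keyset pagination (filtering on the last-seen created_at and id) avoids large OFFSET scans, but the main point is simply not to over-fetch.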
Coming back to denormalization: for example, we embedded commonly joined fields directly into a reporting table used for dashboards. This removed three joins, and the query became about 80% faster. And, as I was saying earlier, we also applied database-specific optimizations: for Oracle we used optimizer hints and SQL plan baselines, and for MySQL we used EXPLAIN plans to tune complex queries. By combining application-level changes with deep per-database knowledge, we cut query volume by almost 85% in several user flows. Query optimization might not sound glamorous, but it is one of the highest-ROI efforts in performance engineering: it directly impacts speed, cost, and user satisfaction.

Next I would like to talk about concurrency, because handling more users is not just about throwing hardware at the problem; it's about using your resources efficiently, especially under peak load. In this phase we focused on how the system handled concurrent load across threads, connection pools, and different services. The first thing we did was connection pool optimization. We noticed that under load, connection saturation was leading to request delays and even dropped transactions. So we implemented dynamic connection pools: rather than relying on default settings, we tuned them to the actual workload characteristics. This alone reduced connection overhead by almost 40% in our busiest services. For example, instead of one static maximum and minimum size, we configured separate connection pools for read-heavy and write-heavy services based on the traffic patterns we analyzed.

Next, we moved away from blocking threads and shifted to non-blocking, asynchronous logic where possible. One service that processed incoming orders used to block on downstream inventory calls; we rewrote it using an event-driven architecture, decoupling the flow and improving response times even under the highest loads. In the code, we used promises and futures, and reactive libraries like RxJava or Spring WebFlux, to handle these asynchronous flows cleanly. Then there was thread management; thread contention is one of the hidden performance killers in most applications. We implemented custom thread pools fine-tuned per workload and introduced work-stealing algorithms to redistribute idle threads dynamically. We also introduced back pressure, so that when the system was under extreme load, instead of cascading failures we could shed excess load gracefully.

Another thing we concentrated on was the timeout strategy. Timeouts are like brakes: they prevent runaway resource usage. We designed cascading timeout strategies at every level, from the database to HTTP clients, and we used circuit breakers to trip and recover services automatically when something goes wrong. This approach prevented thread pools from being exhausted when a downstream service failed, and it maintained system stability as well. By optimizing concurrency across all these layers, we ensured the system remained responsive even when user load spiked. This wasn't just a performance win for us; it was a resiliency enabler as well.
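To make the timeout-plus-circuit-breaker idea concrete, here is a minimal sketch using Resilience4j. The talk does not name a specific library, and the inventory call, thresholds, and fallback here are hypothetical.

```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;

import java.time.Duration;
import java.util.function.Supplier;

public class InventoryClient {

    private final CircuitBreaker breaker;

    public InventoryClient() {
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
                .failureRateThreshold(50)                        // open the breaker once 50% of calls fail
                .waitDurationInOpenState(Duration.ofSeconds(30)) // after 30s, allow probe calls again
                .build();
        this.breaker = CircuitBreaker.of("inventory", config);
    }

    public int availableStock(String sku) {
        Supplier<Integer> guarded =
                CircuitBreaker.decorateSupplier(breaker, () -> callInventoryService(sku));
        try {
            return guarded.get();
        } catch (Exception e) {
            // Breaker open or call failed: degrade gracefully with the last known value
            // instead of letting the failure cascade into the checkout flow.
            return cachedStock(sku);
        }
    }

    // Hypothetical downstream call; the real one would use an HTTP client with its own timeout.
    private int callInventoryService(String sku) {
        throw new UnsupportedOperationException("placeholder for the real inventory call");
    }

    // Hypothetical fallback source, e.g. a local cache refreshed asynchronously.
    private int cachedStock(String sku) {
        return 0;
    }
}
```

The underlying HTTP or database client still needs its own timeout; the breaker's job is to stop hammering a dependency that is already failing and to let it recover before traffic returns.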
Next I would like to talk about the caching strategies we used. When we think about scaling systems and reducing latency, caching is one of the most powerful tools available if we use it correctly. Here we built a multi-layered caching strategy targeting every layer of the stack, from the client side down to the database. On the client side, we began with the front end: by adding cache-control headers, ETags, and service workers, we allowed browsers and mobile apps to reuse previously fetched static data such as images, CSS files, and JS files. We versioned these assets and stored them on the client devices. This reduced the number of hits to the server for static content and dropped network traffic by almost 65% for returning users. For example, user profile images and settings were cached locally on the client device and only revalidated periodically, so we would not show stale data; we used versioning techniques here as well.

Next, at the API gateway level, we added edge caching, especially for frequently accessed endpoints like product listings, configurations, and pricing-related data that don't change very often. We also built smart invalidation mechanisms so that when data changed in the backend, only the affected cache entries were evicted and rebuilt. This offloaded significant traffic from the app and DB layers and also improved API response times. Then came application-level caching. Inside the application we used a mix of in-memory caches and a distributed cache like Redis for multi-node environments. We applied the cache-aside pattern, where the application first checks the cache and only queries the DB if the entry is not available. We also fine-tuned the TTL (time-to-live) values based on how volatile each dataset was: static configuration was cached for hours, while user-session-related information had much shorter TTLs.

Next we concentrated on database result caching. For expensive DB operations, large reports, and complex joins, we cached the result and invalidated it automatically when the relevant data changed. We used write-through strategies to keep the cache and the DB in sync, ensuring consistency without sacrificing query speed. This helped us reduce query volume by 30% during traffic spikes. Overall, caching is not just about speed, it's about control: by implementing the right strategy at each layer, we achieved better responsiveness and scalability without compromising data integrity.

Next I would like to talk about load balancing. Traditional load balancing, like round robin or sticky sessions, just doesn't cut it anymore in these complex, dynamic microservices architectures. In this phase we implemented intelligent load balancing, which adapts in real time to system conditions, traffic patterns, and service health. The first thing we did was request classification, because not all requests are equal. We started by classifying requests based on type, priority, and user type (for example, a VIP customer logging in gets higher priority) and on the resource demand of each API call.
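Here is a minimal sketch of that classification idea: tag each request with a lane and hand it to a separate pool so heavy work can never starve interactive traffic. The paths, tiers, and pool sizes are made up for illustration; in the setup described in the talk, the tagging happened at the API gateway rather than in application code.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class RequestClassifier {

    enum Lane { FAST, BATCH }

    // Interactive traffic gets a larger, dedicated pool; heavy reporting work is kept
    // in a small pool so it can never starve logins and checkouts.
    private final ExecutorService fastLane  = Executors.newFixedThreadPool(32);
    private final ExecutorService batchLane = Executors.newFixedThreadPool(4);

    Lane classify(String path, boolean vipUser) {
        if (path.startsWith("/reports/")) {
            return Lane.BATCH;  // e.g. monthly report generation: users expect to wait
        }
        if (vipUser || path.equals("/login") || path.equals("/checkout") || path.equals("/profile")) {
            return Lane.FAST;   // latency-sensitive, user-facing calls
        }
        return Lane.FAST;       // default to interactive treatment
    }

    public void submit(String path, boolean vipUser, Runnable work) {
        ExecutorService lane = (classify(path, vipUser) == Lane.FAST) ? fastLane : batchLane;
        lane.submit(work);
    }
}
```

The same tag can also travel with the request as a header, so downstream services and the load balancer can honor it too.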
For example, a get-profile call should not be treated the same as a generate-monthly-report request. Get-profile should get higher priority so that it completes quickly and the user is not left waiting, whereas for a monthly report the user is usually willing to wait. We tagged API requests at the API gateway level, which allowed us to prioritize and route intelligently, and high-priority transactions like checkout or login were given fast lanes.

Next we concentrated on routing strategies. Routing was made dynamic: based on real-time health and capacity signals, traffic could be shifted away from overloaded or degraded nodes to healthy ones. We integrated service discovery with health checks, so that if one node slowed down or failed, traffic would automatically reroute with minimal impact. This strategy also helped during our rolling deployments and blue-green releases. Then, instead of equal load distribution, we used weighted algorithms based on capacity, latency, and even geography. For example, if one node had twice the CPU capacity of another, it got a higher traffic weight, and we used geolocation-based routing to reduce cross-region network latency. Finally, we added continuous health monitoring with graceful degradation built in: if a dependent service, say search or recommendations, became unhealthy, we returned fallbacks or cached data instead of failing outright. This improved the user experience and helped immensely during partial outages. Together, these techniques created a smarter, self-aware load-balancing layer that maintained performance and uptime, especially during failure scenarios and peak traffic events. And because it was all observability-driven, we could keep improving routing strategies based on the real telemetry we collected.

Next I would like to talk about observability. When systems grow in complexity, visibility becomes non-negotiable; without the right visibility in place, even small issues can become major outages. So we built a comprehensive observability platform, one that combined metrics, logs, traces, and alerts into a single, cohesive feedback loop. Metrics gave us a high-level view of how the system was behaving. We used two proven frameworks: RED (rate, errors, duration), which is great for service and API monitoring, and USE (utilization, saturation, errors), which is ideal for infrastructure and resource monitoring. For example, by tracking request rates and latency across services, we spotted bottlenecks long before they caused downtime. We also tracked business KPIs like cart conversion rate and average transaction latency, so teams could correlate technical changes with user impact.

Next, logs. Metrics tell us that something is wrong, but logs tell us why it is wrong. We standardized our structured logging, including error codes, request IDs, user IDs, and the relevant payloads.
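Here is a minimal sketch of that kind of structured, correlated logging. It assumes SLF4J with a backend (such as Logback) whose log pattern includes MDC values; the filter and header name are illustrative, not from the talk.

```java
import jakarta.servlet.*;
import jakarta.servlet.http.HttpServletRequest;
import org.slf4j.MDC;

import java.io.IOException;
import java.util.UUID;

/** Puts a correlation ID into the logging context for every incoming request. */
public class CorrelationIdFilter implements Filter {

    private static final String HEADER = "X-Correlation-Id";

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        String incoming = ((HttpServletRequest) req).getHeader(HEADER);
        String correlationId = (incoming != null) ? incoming : UUID.randomUUID().toString();
        MDC.put("correlationId", correlationId); // every log line in this request now carries the ID
        try {
            chain.doFilter(req, res);
        } finally {
            MDC.remove("correlationId");          // avoid leaking IDs across pooled threads
        }
    }
}
```

Forwarding the same header on outbound calls, and including %X{correlationId} in the log pattern, is what lets a single user request be stitched together across services.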
With correlation IDs, we could trace a single user request across dozens of microservices, and we enriched the logs with contextual data like region, environment, and feature flags to make debugging faster. Next, traces. We implemented distributed tracing using tools like OpenTelemetry and Jaeger, which allowed us to visualize how a request flowed across multiple services, from front end to back end, and see exactly where delays occurred. For example, we identified that 60% of the latency in the checkout flow came from a downstream payment system; we would not have caught that with logs alone, but the traces surfaced it. Next, alerts. Instead of naive static threshold-based alerts, we used SLA- and SLO-driven alerting, which only triggers when actual reliability goals are at risk. We also implemented alert correlation, so that a cascade of downstream error alerts would not overwhelm on-call engineers with redundant notifications. This led to fewer false alarms, a better on-call experience, and faster resolution times. The end result of this observability framework is that engineering teams get near-real-time visibility into their services, making it easier to spot, investigate, and fix issues before users even notice them. And just as importantly, we tied all of this observability back to business goals, so every alert and dashboard was rooted in real impact.

Next I would like to talk about data-driven capacity planning. One of the biggest challenges in any large-scale system is balancing cost with performance: if you overprovision, you end up with unnecessary cloud spend, but if you underprovision, you may face outages. Our answer to this is data-driven capacity planning. First, we started with a deep analysis of historical utilization, looking at CPU, memory, IOPS, and network usage across services. We profiled each component and identified seasonal traffic patterns, for example that usage surges on Monday mornings or that traffic spikes occur during end-of-quarter events. We also built anomaly filters to distinguish between a real trend and a one-off random spike, so we would not scale infrastructure based on outliers. This helped us establish accurate baselines for each environment and each service.

Next we did growth modeling. We applied statistical models and machine learning to forecast how usage would evolve based on product growth and business events, factoring in new feature rollouts, marketing campaigns, and customer onboarding timelines. We modeled several scenarios (best case, worst case, and expected case), each associated with a confidence level. This gave product and infrastructure teams a shared, data-backed plan they could lean on. Then came resource optimization: based on those projections, we implemented dynamic scaling strategies across services and infrastructure. We used Kubernetes Horizontal Pod Autoscalers and cloud-native scaling policies for databases and messaging systems, and we defined resource utilization targets so that services would scale out only when truly needed and scale back when idle. This approach allowed us to strike the right balance between performance headroom and cost efficiency.
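As an illustration of scaling against a utilization target, here is the proportional rule in plain Java. It mirrors the calculation the Kubernetes Horizontal Pod Autoscaler documents; the utilization numbers are examples, not figures from the talk.

```java
public final class ScalingRule {

    /**
     * desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization)
     * Scale out when utilization is above target, scale back in when it is comfortably below.
     */
    public static int desiredReplicas(int currentReplicas, double currentUtilization, double targetUtilization) {
        double raw = currentReplicas * (currentUtilization / targetUtilization);
        return Math.max(1, (int) Math.ceil(raw));
    }

    public static void main(String[] args) {
        // 8 pods running at 90% CPU against a 60% target -> scale out to 12.
        System.out.println(desiredReplicas(8, 0.90, 0.60));
        // 8 pods idling at 20% against a 60% target -> scale back to 3.
        System.out.println(desiredReplicas(8, 0.20, 0.60));
    }
}
```

Real autoscalers add stabilization windows and min/max bounds so short spikes don't cause flapping, which is the same concern the anomaly filters above address for longer-term planning.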
In one case, we reduced monthly cloud spend by almost 30% while maintaining 99.99% uptime, just by optimizing our provisioning strategy. The key takeaway is that capacity planning should not be a guessing game: when it is data-driven, predictive, and aligned with business growth, it becomes a strategic advantage.

Next I would like to talk about performance testing, which is crucial. Traditionally, performance testing has been treated as a one-time phase at the end of the development life cycle, but in modern delivery pipelines that approach does not work anymore. We treated performance testing as a first-class citizen within CI/CD, from the smallest function to full-scale load tests. First, unit tests: we started with unit-level performance tests. These are fast, lightweight benchmarks embedded directly into the build. If a core function, say a pricing calculation, suddenly slowed down by 2x, we could catch it immediately during a pull request. This helped developers fix regressions before they reached the integration or staging environments.

Next we did service-level isolated testing. Each microservice had its own performance benchmark suite, which we ran in isolated environments, measuring service-level latency, throughput, memory usage, and error rates under controlled conditions. We used mocks here to simulate downstream systems, which allowed teams to tune their services independently without needing to test the entire system every time. For example, a catalog service could be validated against a known data set or a known mock endpoint for consistent response times under different load scenarios. Then we did integration performance testing across multiple workflows, which tells us how services behave when chained together. This revealed issues like cumulative latency, serialization bottlenecks, and inefficient retry logic. In one case, a checkout flow that was fine in isolation showed almost a two-second delay when we tested it end to end; the culprit was a synchronous email service call holding up the whole flow.

Finally, before any major release, we did full-scale testing in a production-like pre-production environment. We ran production-like load tests simulating real traffic patterns with realistic data and concurrency, using tools like JMeter and Gatling to mimic user load across geographies. Those results were mapped directly to SLOs, so if any service missed its latency or error target, the deployment was paused automatically. By integrating performance testing throughout the pipeline, we shifted left and caught performance regressions before they reached production. This not only improved reliability but also built confidence across teams that every release was truly ready for scale.

Up until now we have talked about optimization, scaling, and efficiency, but what happens when things go wrong? In real-world systems, failure is unavoidable. That's why we introduced chaos engineering, a discipline that lets us prepare for failure before it actually happens.
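Before moving on to chaos engineering, here is a minimal sketch of the unit-level performance check described above: a naive JUnit 5 timing assertion around a hypothetical PricingCalculator. The talk does not name a tool; a real pipeline might use JMH or a proper benchmark harness, but the idea of failing a pull request on a gross regression is the same.

```java
import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.api.Test;

class PricingPerformanceTest {

    // Budget chosen for illustration; in CI this would come from the service's SLO.
    private static final double MAX_AVG_MICROS = 200.0;
    private static final int WARMUP = 1_000;
    private static final int ITERATIONS = 10_000;

    @Test
    void pricingCalculationStaysWithinBudget() {
        PricingCalculator calc = new PricingCalculator();

        for (int i = 0; i < WARMUP; i++) {          // let the JIT warm up before measuring
            calc.priceFor("SKU-" + i, 3);
        }

        long start = System.nanoTime();
        for (int i = 0; i < ITERATIONS; i++) {
            calc.priceFor("SKU-" + i, 3);
        }
        double avgMicros = (System.nanoTime() - start) / 1_000.0 / ITERATIONS;

        // Fails the pull request if the core calculation regresses well past its budget.
        assertTrue(avgMicros < MAX_AVG_MICROS,
                "pricing averaged " + avgMicros + " us, budget is " + MAX_AVG_MICROS + " us");
    }

    /** Hypothetical stand-in for the real pricing logic. */
    static class PricingCalculator {
        double priceFor(String sku, int quantity) {
            return sku.hashCode() % 100 + 9.99 * quantity;
        }
    }
}
```

Wall-clock timing on a shared CI runner is noisy, so budgets like this are kept generous, with the heavier service-level and integration tests sitting behind them.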
For chaos engineering, the first thing we did was hypothesis formation. We started by forming clear, testable hypotheses about how the system might fail. For example: if the inventory service goes down, can checkout still succeed using cached inventory data? Or if database latency spikes for 30 seconds, will user sessions expire gracefully or will performance degrade? We focused these tests on critical business flows: sign-up, login, checkout, and data ingestion. Once a hypothesis was formed, we designed experiments to simulate real failure modes. That could mean killing a pod, injecting latency, or simulating a dependency failure. The key here was control: we scoped the blast radius, applied the experiments in isolated environments first, and monitored every step. In one case we introduced 30 seconds of latency into our payment provider integration and watched how our retry policies handled it.

We executed these tests progressively and in a controlled manner, starting in dev, moving to staging, and then selectively into production using canary deployments. Each experiment had auto-termination criteria: if key metrics like error rate or latency crossed safe thresholds, the test stopped immediately. This ensured we did not cause actual harm, especially in production, while still uncovering real risks. And most importantly, we did not stop at discovery; we converted findings into actions. We added circuit breakers where needed, improved retry logic, and made our fallbacks smarter. For example, one test revealed a slow downstream search API stalling the homepage; we implemented a timeout plus a cached fallback, cutting failover time from minutes to seconds. We also automated many of these recovery patterns, so systems could self-heal instead of waiting for human intervention. Overall, chaos engineering gave us a proactive resiliency mindset: we stopped fearing failure and started designing systems that could withstand and adapt to it.

That wraps up my session. Thank you all for being here and taking the time to explore this journey with me. We have gone far beyond just tuning response times or adding a few dashboards. What we have seen here is a complete, layered approach, one that brings together architecture, code, infrastructure, and automation. By engineering with reliability in mind right from the ground up, we can build cloud-native systems that are not only fast and scalable but resilient by design. My hope is that today's talk sparked new ideas for how your teams can shift left on performance, break silos, and build systems that perform well. Thank you again for joining me.
...

Sudhakar Reddy Narra

Senior Staff Performance Engineer @ ServiceNow

Sudhakar Reddy Narra's LinkedIn account


