Cloud Native Resilience: Building Scalable and Fault-Tolerant Systems

Abstract

Learn how to design cloud-native systems that are both resilient and scalable. Drawing from experience at Meta, Amazon, and PayPal, this talk explores microservices, containerization, and automation strategies to ensure fault tolerance and high availability in dynamic cloud environments.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Hey, good morning, everybody. My name is Praful Konduru. I'm a software engineer with over 10 years of experience building distributed and scalable systems, and I'm excited to be here today. In my session, we'll be exploring cloud native resilience. I'll talk about how I build scalable and fault-tolerant systems, and I hope my experience has valuable lessons for you.
You may ask: what does cloud native resilience actually mean? The short answer is that we mean designing systems that not only scale efficiently, but also maintain their performance and availability even when unexpected issues arise.
During this talk, we'll cover several key areas. I'll start off by introducing cloud native concepts, and then I'll move on to the core design principles that support resilience. The most important topic of all is the role of microservices architecture, containerization, and automation. Finally, we'll dive into some practical code examples and demonstrations. This presentation is geared towards tech professionals like yourselves, so you can expect in-depth content along with some hands-on examples. Please feel free to note any questions you have, and you can shoot me an email at the end. My goal is to equip you with practical insights and techniques that you can apply to build robust and fault-tolerant systems in your own cloud environments. So without further ado, let's get started.
As I mentioned on the previous slide, I'll now walk you through the agenda for today's session. We'll start with an introduction to cloud native concepts, the fundamental ideas behind building resilient systems. Next, we'll discuss the key design principles that help us build fault-tolerant applications, such as redundancy, loose coupling, and graceful degradation.
Then we'll explore the microservices architecture and how breaking an application into smaller, independent services can boost resilience and scalability. Following that, we'll cover containerization and orchestration tools like Docker and Kubernetes that simplify the deployment and management of cloud native systems. And finally, I'll talk about CI/CD pipelines and monitoring tools, so that when a system crashes we can recover quickly in a dynamic environment. On most of the slides I have code and demonstrations to help you understand the practical side of this.
Let's get started with the intro and define what cloud native actually means. People throw the word "cloud" all over the place, and a couple of years ago I didn't know what it actually meant. Cloud native is an approach to building and running applications that fully exploits the advantages of cloud computing. This means designing applications to be scalable, elastic, and resilient right from the ground up. There are a lot of ways to achieve this, and that is what my talk today will cover.
The key characteristics of cloud native systems are these. Scalability means effortlessly handling increases in workload by adding more resources. Elasticity means dynamically adjusting resources based on demand. And resilience means maintaining performance and availability even in the face of failure.
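To make the elasticity idea concrete, here is a minimal Java sketch of sizing the number of instances to demand. It is not from the talk: the class name `ElasticScaler`, the bounds, and the assumed capacity of 500 requests per second per instance are all hypothetical values chosen for illustration.

```java
// Hypothetical helper, not from the talk: picks an instance count for a given load.
final class ElasticScaler {
    private final int minReplicas;
    private final int maxReplicas;
    private final double requestsPerReplica; // assumed capacity of one instance

    ElasticScaler(int minReplicas, int maxReplicas, double requestsPerReplica) {
        if (minReplicas < 1 || maxReplicas < minReplicas || requestsPerReplica <= 0) {
            throw new IllegalArgumentException("invalid scaling bounds");
        }
        this.minReplicas = minReplicas;
        this.maxReplicas = maxReplicas;
        this.requestsPerReplica = requestsPerReplica;
    }

    /** Returns how many replicas to run for the given load, clamped to the bounds. */
    int desiredReplicas(double requestsPerSecond) {
        int needed = (int) Math.ceil(requestsPerSecond / requestsPerReplica);
        return Math.max(minReplicas, Math.min(maxReplicas, needed));
    }

    public static void main(String[] args) {
        ElasticScaler scaler = new ElasticScaler(2, 20, 500.0);
        System.out.println(scaler.desiredReplicas(120.0));   // quiet day: stays at the minimum
        System.out.println(scaler.desiredReplicas(4200.0));  // busy day: scales up
        System.out.println(scaler.desiredReplicas(50000.0)); // extreme load: capped at the maximum
    }
}
```

The lower bound keeps some redundancy even when traffic is low, and the upper bound caps cost. A real autoscaler would also smooth the input over time so the instance count does not flap.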
The benefits of building cloud native applications are faster innovation, since it's quick to deploy to the cloud, and cost-efficient use of resources. We can scale up and down based on demand. If on a certain day there are going to be a million customers on a website, you can estimate that in advance and scale your resources to match, and on off days you can scale back down. That way you're using your resources efficiently and not running your compute 24/7, which is a waste of money.
Moving on: why does resilience actually matter, and how do I define it? As I mentioned, resilience is the ability to handle and recover from failures gracefully. It matters for three reasons. First, high availability: we need to ensure that applications remain accessible, which is crucial for user satisfaction, because we don't want to lose customers just because our website is down. Second, the business impact is huge, because reduced downtime translates directly into lower financial losses and improved service reliability. And finally, the user experience: as a user, I don't want to see a huge website like Amazon down, because that severely impacts user trust. In the following slides, we'll discuss the design principles that help achieve this level of resilience, and I'll show you how these ideas are implemented in real-world examples.
In order to build resilient systems, it's essential to incorporate key design principles from the start. As an engineer building a cloud native application, the first thing to remember is redundancy: by duplicating crucial components, the system can continue operating even if one instance fails.
That's what people mean when you hear them talk about horizontal scaling. You have to have multiple copies of the server where your code is deployed, because if one of the servers goes down, you can move all the traffic to another server and prevent any downtime for your website.
Another important principle is loose coupling. What I mean by that is we have to design systems where components have minimal dependencies on each other, and ensure that a failure in one area doesn't cascade and bring down the entire system. This is another prime principle of resilient systems.
I also have to fail fast. I don't want to sit and deal with a server that is down, restart it, and who knows how long the restart is going to take. In the meantime I don't want to keep my user waiting, so I have to transfer all the traffic to another server. And this is what I mean by graceful degradation: I don't want users to just see a 404 on the screen and decide that this website isn't useful to them.
Another thing is monitoring and automated recovery. We have to build a robust mechanism that continuously monitors what is going on with our website, with smart automated systems that detect peak traffic and automatically scale the website up or down.
At this point, I feel it's the right time to dive into what a microservices architecture is. When you talk about cloud native, "microservices" is thrown around a lot. Why, you may ask? In a microservices setup, an application is broken down into extremely small, loosely coupled services that seem to magically just work together. Obviously there's a lot of work behind the interplay of all these different modules and microservices.
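The redundancy, fail-fast, and graceful-degradation principles from this part of the talk can be sketched together in a few lines of Java. This is an illustration under stated assumptions, not the speaker's code: `FailoverClient`, the replica suppliers, and the fallback string are hypothetical stand-ins for real service calls.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch: try redundant replicas in order, then degrade gracefully.
final class FailoverClient {
    private final List<Supplier<String>> replicas; // redundant copies of one service
    private final String fallback;                 // reduced response used when all fail

    FailoverClient(List<Supplier<String>> replicas, String fallback) {
        this.replicas = List.copyOf(replicas);
        this.fallback = fallback;
    }

    String fetch() {
        for (Supplier<String> replica : replicas) {
            try {
                return replica.get(); // first healthy replica answers
            } catch (RuntimeException e) {
                // Fail fast: do not wait on the broken replica, move on to the next one.
            }
        }
        return fallback; // graceful degradation instead of an error page
    }

    public static void main(String[] args) {
        Supplier<String> down = () -> { throw new IllegalStateException("replica down"); };
        Supplier<String> up = () -> "full product page";

        System.out.println(new FailoverClient(List.of(down, up), "cached product page").fetch());
        System.out.println(new FailoverClient(List.of(down, down), "cached product page").fetch());
    }
}
```

In practice a load balancer or service mesh usually does the rerouting, and a client like this would add timeouts so that a slow replica counts as a failed one.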
Each service should be designed to perform one extremely specific function. It can be as simple as: add these two integers, output the result, and pass that information on to another service. Because each service performs a specific function, it's easy to develop, test, deploy, and also debug. Debugging is easy because when something fails, I know which particular service to go to.
You can see the benefits on the screen. First, independent deployment: services can be updated without affecting the entire system. Second, scalability: as I mentioned on some of the previous slides, you can scale individual services up and down based on the incoming traffic, the volume of users, or the amount of compute you need. Say your application has 10 microservices and one of them is facing a sharp rise in volume. You can scale just that single microservice and you should be good. You're saving money by not unnecessarily scaling up all the other services. And finally, the third one is isolation: if one specific service fails, it doesn't bring your whole system down, unless it's a core service, the one users come to the website for. Isolation is crucial to maintaining overall system resilience.
In a moment I'll show you a code example that demonstrates how a basic microservice is implemented.
I'll talk about how these independent services interact and support a resilient architecture. As you can see on the screen, this is a very simple microservice I implemented using Java and Spring Boot. On this slide we have a basic code snippet that illustrates how to create a RESTful service. In this example, the greeting controller is annotated with @RestController, which indicates that it handles web requests. A web request here is, for example, when a user clicks a button and an API on the backend is called; this is the controller that handles that. Inside the controller, another annotation, @Autowired, is used to inject the greeting service, which is responsible for all the business logic. The "/greet" mapping annotation maps HTTP GET requests to the greet endpoint. When this endpoint is hit, the getGreeting method is called, and it returns a simple greeting message: "Hello, cloud native world." What I want to explain here is that by separating the controller from the service layer, we can achieve loose coupling between the components. This design not only simplifies development and testing, but also reinforces fault isolation. It's a basic example, just to give you a rough idea of what a microservice actually looks like.
Moving on, the next concept I want to talk about is containerization. Whenever you're developing cloud native applications, the first thing to remember is that you're building it as microservices from the ground up, and the next thing is containerization. A container is a method of packaging an application along with all of its dependencies into a single self-contained unit.
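The slide itself is not part of the transcript, so the following is only a rough reconstruction of the controller and service split described for the greeting example. To keep it runnable with the JDK alone, the Spring annotations named in the talk appear as comments. The method name `greeting()`, the exact message text, and the use of constructor injection are assumptions.

```java
// Plain-Java sketch of the controller/service separation; Spring annotations are
// shown only as comments so that this compiles without any framework on the classpath.

// In Spring Boot this class would typically be annotated with @Service.
class GreetingService {
    // Business logic lives here; the method name is a hypothetical choice.
    String greeting() {
        return "Hello, Cloud Native World!";
    }
}

// In Spring Boot this class would be annotated with @RestController.
class GreetingController {
    private final GreetingService greetingService;

    // Spring would inject the service here (the talk mentions @Autowired).
    GreetingController(GreetingService greetingService) {
        this.greetingService = greetingService;
    }

    // In Spring Boot this method would be mapped with @GetMapping("/greet").
    String getGreeting() {
        return greetingService.greeting();
    }

    public static void main(String[] args) {
        GreetingController controller = new GreetingController(new GreetingService());
        System.out.println(controller.getGreeting());
    }
}
```

Because the controller only depends on the service through its constructor, a test can pass in a stub service. This is the loose coupling the talk points to.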
The purpose of containerization is to ensure that the application runs reliably regardless of where it's deployed. As shown on the screen, some of the benefits are consistency, isolation, and portability. You will have come across Docker a lot, along with terms like "dockerization". That basically means placing a service in a closed container, something like Docker, along with all the dependencies that the service needs, so that the service and its dependencies are packaged and upgraded together. I don't need to worry about, oh, what if this dependency hasn't been upgraded yet on some machine and it breaks my system. Those are things I don't have to worry about if I dockerize my system.
Moving on to a quick demo of what I mean by containerizing a microservice using Docker. Step one is to create a microservice for your cloud native application. Step two is to dockerize it. On this slide, you'll see a Dockerfile that defines the steps to build our container. You'll see the OpenJDK base image, since this is a Java application; a WORKDIR instruction to set the working directory of the container; the JAR file, which bundles the application and the dependencies it needs; the EXPOSE instruction, which informs Docker which port the application will use, in this case port 8080; and finally the ENTRYPOINT that runs it. This is the most common file for a Docker container, and we need it to actually run the application. To run it, you first do a docker build on it with the tag greeting-service, which creates a Docker image tagged greeting-service. Then when you run something like docker run greeting-service, it starts a container that runs the service.
In the interest of time, let's move on. Our microservice is now containerized, so the next step is how we manage and orchestrate these containers at scale. This is where Kubernetes comes into the picture. What is Kubernetes? It's a powerful orchestration platform that automates deployment, scaling, and management. We definitely need automatic scaling for all our cloud native applications, and Kubernetes can dynamically adjust the number of instances that are running. It can monitor health and self-heal depending on whether a certain instance is up or down, and it can manage resources efficiently. So it's really important that we use Kubernetes along with Docker and microservices. As you can see, in Kubernetes we have pods, in which these containers run, and Kubernetes has various commands that help us deploy and maintain these services.
Moving on to this slide, we'll focus on automation strategies that enhance system resilience. For any cloud native application, we need to focus a lot on system resilience and on building CI/CD pipelines: automate your testing and run your build through a CI/CD pipeline. As soon as you push some code, we need an automated service that builds a JAR out of the service, runs a bunch of automated tests against it, and makes sure it is production ready. Without this, we could deploy a service without any testing involved. What if it just breaks? That would mess up the whole service, and we don't want that.
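The gating idea behind a CI/CD pipeline (build, then test, and deploy only if everything before it passed) can be illustrated with a small Java sketch. This is a conceptual model, not the configuration of any real CI system: `PipelineSketch` and its stage names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.BooleanSupplier;

// Hypothetical model of a pipeline: stages run in order and the first failure stops it.
final class PipelineSketch {
    private final Map<String, BooleanSupplier> stages = new LinkedHashMap<>();

    PipelineSketch stage(String name, BooleanSupplier step) {
        stages.put(name, step); // LinkedHashMap keeps the stages in insertion order
        return this;
    }

    /** Runs the stages in order; returns the first failing stage, or empty if all pass. */
    Optional<String> run() {
        for (Map.Entry<String, BooleanSupplier> stage : stages.entrySet()) {
            if (!stage.getValue().getAsBoolean()) {
                return Optional.of(stage.getKey()); // later stages never run
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        boolean[] deployed = {false};
        Optional<String> failed = new PipelineSketch()
                .stage("build", () -> true)
                .stage("test", () -> false) // simulate a failing test suite
                .stage("deploy", () -> { deployed[0] = true; return true; })
                .run();
        System.out.println("failed stage: " + failed.orElse("none") + ", deployed: " + deployed[0]);
    }
}
```

The point of the sketch is the ordering guarantee: because the test stage fails, the deploy stage is never reached, which is exactly the protection the talk describes.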
So we have to build out CI/CD pipelines for this. We also need self-healing infrastructure that auto-scales based on load and restarts the service whenever it goes down. And as you can see on the screen, there are monitoring integrations like Prometheus, Grafana, and the ELK stack on Amazon, for proactive alerts.
Moving on, this is a high-level view of how a CI/CD pipeline might look. If you use Git in your daily work, you'll know that we have several branches, including main. Whenever you want to create a pull request, you push your code, and once you submit, your CI/CD pipelines take over. They run a bunch of unit tests and automated tests. If it's a mobile platform, the pipeline generates an APK and takes a bunch of screenshots. These checks are really important to make sure nothing breaks. On the screen there is a set of commands; this is essentially a settings file. There are a lot of jobs that are run, and you can configure it so that your application is run on any platform. For example, you can run it on Ubuntu, Windows, or a Mac, and make sure your application runs properly on each of them.
Moving on, we'll discuss how integrating monitoring and self-healing capabilities helps the resilience of cloud native systems. Monitoring different metrics is useful because we can see how our system is doing in terms of CPU, memory, and network utilization. We want to make sure none of these goes beyond a certain point, and if one does, we need things in place to make sure our instances or containers aren't overloaded. We have several tools for this. Prometheus is set up for collecting metrics and alerting our SREs or developers, so that if a service is being overloaded, we can go in and do something to fix it.
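The "none of these goes beyond a certain point" check can be pictured as a simple threshold rule over a metrics sample. The Java sketch below is a conceptual illustration only. It does not use or imitate the Prometheus API, and the class `ThresholdAlerts`, the metric names, and the limits are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical threshold checker: flags metrics whose current value exceeds a limit.
final class ThresholdAlerts {
    private final Map<String, Double> limits = new LinkedHashMap<>();

    ThresholdAlerts limit(String metric, double maxValue) {
        limits.put(metric, maxValue);
        return this;
    }

    /** Returns one alert message per metric whose sampled value is above its limit. */
    List<String> evaluate(Map<String, Double> sample) {
        List<String> alerts = new ArrayList<>();
        for (Map.Entry<String, Double> limit : limits.entrySet()) {
            Double value = sample.get(limit.getKey());
            if (value != null && value > limit.getValue()) {
                alerts.add(limit.getKey() + " above limit");
            }
        }
        return alerts;
    }

    public static void main(String[] args) {
        ThresholdAlerts rules = new ThresholdAlerts()
                .limit("cpu", 0.80)     // fraction of CPU in use
                .limit("memory", 0.90); // fraction of memory in use

        // A sample in which only CPU is over its limit.
        System.out.println(rules.evaluate(Map.of("cpu", 0.95, "memory", 0.40)));
    }
}
```

A real monitoring stack would evaluate rules like these continuously and usually require the condition to hold for some duration before paging anyone, so that short spikes do not trigger alerts.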
Grafana is useful for visualization and dashboarding as well.
Moving on, let's quickly talk about a real-world case study. Let's look at a major e-commerce platform, like Amazon, that migrated to a cloud native architecture. This is purely hypothetical. Let's say that initially they faced frequent downtime, scalability issues during peak load times, and a really slow deployment process. Once they adopted a cloud native approach, they re-architected the whole system into microservices, containerized their applications using Docker, and implemented a robust CI/CD pipeline along with monitoring using Prometheus and Grafana. Some of the lessons they learned: building resilience from the ground up is essential for maintaining high availability, especially in a dynamic cloud environment, and continuous monitoring and feedback loops are crucial for ongoing improvement and quick recovery from unexpected failures. This case study underscores the importance of integrating these practices to make your cloud native application scalable.
That brings us to the end of this presentation. To quickly recap, we explored the fundamental principles of cloud native resilience, including the benefits of microservices, containerization using Docker, automation using CI/CD pipelines, and monitoring using Grafana. Thanks for coming to my talk.

Conf42 Cloud Native 2025 - Online

March 06 2025 - premiere 5PM GMT

Praful Konduru

Software Engineer @ Meta
