The Curious SRE: Cultivating a Mindset for System Resilience

Abstract

This talk explores the foundational mindset for building reliable systems. We’ll share practical experiences, examine the balance between breadth and depth in SRE, and emphasize holistic system thinking.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Building large-scale, fault-tolerant, and reliable systems is a long and hectic process. It goes through multiple iterations, but everything that's large and long starts with some basic building blocks, and one of those building blocks is having the right mindset. This talk explores framing your thoughts, having the right mindset, and cultivating it over time so that you get your infrastructure right. Hello, everyone. I work for Google on Google Search infrastructure, and in this talk I will walk you through the lessons I have learned over the years building highly reliable, fault-tolerant, and scalable web-scale applications. My experience ranges from small-scale startups to a behemoth like Google, and I'll be laying down some basic principles that you need in order to decide what your thought process should be when building any large-scale system. Before we dive deeper into how to cultivate the mindset of an SRE, let's try to understand what the definition of an SRE is and how it has evolved. Over the last 30 to 35 years, a lot has been going on in the computing world. We started around mainframe computers, where people set up switches by hand. By 2000 there was the dot-com bubble, the dot-com boom, where thousands of dot-com sites came online. People started thinking of the internet as a new way of serving services and giving out information, and tons of companies came online and rode the dot-com boom. The internet was an era. And in the last 10 years, what I see is something like a cloud burst, where multi-tenant architectures and SaaS applications have really taken off.
You see tons of SaaS applications now thriving in the industry, and they all really started around [unclear]. Pick any of the big cloud providers; they are thriving today. That has made a drastic shift from what managing a mainframe computer was, or what managing a physical colo was, to what you need to do when you're dealing with virtual environments. You don't have to deal with all the physical environments, and to some degree the principles have stayed the same, but a big shift has happened. Moving ahead to the present era: it is kind of an AI era, where you see a big boost in agentic applications. What I see from here is that in the next 10 years, these agentic applications will be the next big thing, and probably something else will come up after that. But if you look at the relationship between a server and a client, it has evolved, and so have the technologies that pair these two things together. Look at very basic technologies like the evolution from 2G internet to 3G, 4G, and 5G. You are essentially talking about the number of bits sent over the network, and those are going up exponentially. As the number of bits sent over the network has grown, so have the threats, so have the technologies powering them, so have the customer expectations of the experiences you should be delivering, and so have the operations necessary to support this kind of heavy internet traffic. That's how the evolution of computing has changed the role of the engineer. I should certainly mention that security and compliance are also very important things that have evolved with this ever-evolving landscape.
As site reliability engineers, I think our perspective on security and compliance has changed a lot with things like GDPR, PCI, HIPAA, and FedRAMP. Those kinds of compliance requirements are making a difference: site reliability engineers nowadays have started thinking about how to make a system more reliable while not compromising on security and compliance. As the reliance on the internet has increased, so have the feature offerings and the applications served over the internet. It started with serving very basic informational content. Now we are in an age where you are delivering very critical, very sensitive information over the internet, and a lot of lives and a lot of people depend on it. There is a big spectrum, from delivering gamified, TikTok-style content to delivering critical human care. So it becomes more and more necessary to ensure that your systems are reliable, because real people are depending on the system for their real needs. Looking at this, I feel that what has really made SREs think is that, as systems evolve and new things are built, we need to evolve with them. As we build this mindset, the first and foremost question I would ask myself is: what is the size of the problem I am trying to solve? When I say the size of the problem, it essentially boils down to the size of the company, the number of users we are talking about, and the company's priorities, and whether the goals of your role are aligned with those priorities. Take companies of different sizes. Say it's a small-scale, seed-stage startup where resources are pretty constrained and you have a limited number of users.
The priority of the company is delivering X number of features in Y number of days, and the X and Y can be unimaginable when you come from a large organization to a small one, where the delivery speed is very different. For me, the most important thing for a small-scale organization would be optimizing for delivery speed. Sometimes it's fine to compromise on some of the reliability aspects there and instead invest so that you can deliver those features quickly to end users: automation for delivery, and investment in the build process so that features ship quickly and reliably. That would probably be the most important thing for a smaller-scale organization. But as the company scales, as you go from a very early stage to a growth-stage company, Series B, C, D, where you have a larger number of users, I think the reliability aspects start taking more precedence. You need to be aware that those users depend on the user experience along with the features. So you need to wear different hats to ensure that the reliability of your product is where it needs to be, and at the same time invest in the right set of tools to serve that set of users. You are investing in the right SLOs: defining the right SLOs, finding the right reliability metrics, monitoring them over time, and ensuring you have the right set of technologies to support those metrics. You are displaying those metrics on your site so that people know they can trust your application. This is available on pretty much all of the important SaaS applications today. They display their reliability metrics, and that is more like a selling point for these companies, where they demonstrate that, hey, we have been 99.99% available for the past 60 days, and that's the reason you should be buying our product.
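To make an availability figure like this concrete, here is a minimal sketch of how a target translates into a downtime budget over a reporting window. The function names `allowed_downtime` and `availability_pct` and the request counts in the example are assumptions made for illustration; they are not from the talk or from any particular status-page product.

```python
from datetime import timedelta


def allowed_downtime(target_pct: float, window: timedelta) -> timedelta:
    """Return the total downtime a given availability target permits over a window."""
    if not 0.0 <= target_pct <= 100.0:
        raise ValueError("target_pct must be between 0 and 100")
    unavailable_fraction = (100.0 - target_pct) / 100.0
    return timedelta(seconds=window.total_seconds() * unavailable_fraction)


def availability_pct(total_requests: int, failed_requests: int) -> float:
    """Return request-based availability as a percentage (hypothetical metric definition)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")
    return 100.0 * (total_requests - failed_requests) / total_requests


if __name__ == "__main__":
    window = timedelta(days=60)  # the 60-day window mentioned in the talk
    for target in (99.0, 99.9, 99.99):
        budget = allowed_downtime(target, window)
        print(f"{target}% over 60 days allows {budget.total_seconds() / 60:.1f} minutes of downtime")
    # Made-up request counts, only to show the measurement side.
    print(f"measured availability: {availability_pct(1_000_000, 80):.3f}%")
```

Under these assumptions, a 99.99% target over 60 days leaves roughly 8.6 minutes of downtime in total, which is why a status page that advertises it is making a strong claim.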
Most of the security companies use this metric as a unique selling point, and that is something you should be considering at that stage of the organization. Accordingly, the security and compliance requirements also evolve as companies get bigger. Early on, maybe you can get by with some basic PCI DSS kind of compliance, but as the company matures, maybe FedRAMP could be the next priority, based on the set of customers you're addressing. Different security challenges might need investment in different tools for delivering secure experiences, so you need to invest in that next level of tools and technologies at that stage of the company. That's essentially what happens: as you grow from a very small organization to a very large one, you need to prioritize your reliability story according to the size and the goals of the organization, so that there is some coherence between what we are building, how it is being consumed, and what the end goal is. This is a very core aspect of design thinking as far as reliability is concerned, and it is important for a site reliability engineer: you need to evolve with the company. As you define what an SRE mindset looks like and how to cultivate it, here are some of the core principles required for defining your reliability story. The first thing I would ask myself is what the customer experience is going to look like. Who is my end user, what are they expecting from my application, and how can I build my reliability story around that? So building an SRE mindset really starts with customer experience first: who is your customer, and what are they expecting from the application? To put things in perspective, let me highlight an example.
For example, the company I presently work for delivers some N number of results, the 10 blue links, in X number of seconds. For us, the most critical aspect is delivering those N results within a certain timeframe. We need to be as accurate as possible, and at the same time we need to deliver those experiences to the end user within one second. That's where most of our optimization and energy goes. Whereas at one of the previous organizations I worked for, latency was not much of a concern. Back then we really optimized for reliability. We had a status page that said we had been up 99.99% of the time over the last 30 days, and that was one of our metrics: you can reliably trust us and build your application on top of us. So each reliability story essentially starts with customer experience. Once you get a grip on the customer experience, you can invest in the right set of tools and technologies, and define your metrics of success for reliability based on what the customer experience is going to look like. The second thing, which I think is the most important thing you should do, not just from the SRE perspective but even from a career perspective, is defining before you dive. Planning is the most important aspect of any career, at any phase, and for site reliability engineers I think it is the most important thing to do before taking on any project. Define the metrics of success. Write it down: these are the five things I will be addressing with this X, Y, Z project, these are my end goals, these are the metrics and the SLO definitions, and this is how I'm going to chalk out the entire landscape.
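One way to write such a success metric down is as a latency SLO check. The sketch below is only an illustration under stated assumptions: the function `latency_slo_met`, the sample latencies, and the 99% target are hypothetical, and only the one-second threshold comes from the example above.

```python
from typing import Sequence, Tuple


def latency_slo_met(
    latencies_ms: Sequence[float],
    threshold_ms: float,
    target_fraction: float,
) -> Tuple[float, bool]:
    """Return (fraction of requests within the threshold, whether the target is met)."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    within = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    fraction = within / len(latencies_ms)
    return fraction, fraction >= target_fraction


if __name__ == "__main__":
    # Made-up samples; a real system would read these from its monitoring data.
    samples = [120.0, 300.0, 950.0, 1400.0]
    fraction, met = latency_slo_met(samples, threshold_ms=1000.0, target_fraction=0.99)
    print(f"{fraction:.0%} of requests within 1000 ms, SLO met: {met}")
```

Writing the threshold and target down as explicit parameters is the point: it gives reviewers something specific to agree or disagree with before the project starts.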
Defining well in advance doesn't by itself solve the problem, but it at least keeps you on track. It keeps you from getting distracted by something you shouldn't be addressing. It helps people review your plans, and it helps people guide you. So defining before you dive is one of the most important things an SRE should consider from the mindset-building perspective. The third thing that I think is important is balancing velocity, or innovation, with reliability. To some degree I explained this in my previous slide: as the maturity and size of the organization grows from one stage to the next, your reliability story should evolve. But there should be a threshold where you are balancing velocity and innovation with reliability, because real people depend on your application for real things. Balancing velocity with the right reliability knobs is very crucial. The fourth thing that I think is crucial as a site reliability engineer is keeping at the top of my mind that things are always going to fail. You cannot build for success; you have to design for failure. There are always going to be edge scenarios that you never considered, which always stay off your plate, and those are something you should plan for. This is where design thinking comes into play: you are designing for failure, not for success, and that way you bring those critical thoughts to mind: what are the different ways this application can fail, and how could I avoid them?
One of the things we do a lot at Google is running tests that simulate failures in an application before you have productionized it or onboarded a new feature: running through different scenarios and trying to simulate the failure scenarios you can come across. First, is it degrading the customer experience? Second, do we have the right alerting and monitoring set up so that we get alerted well in advance, before the customer notices? And third, do we have the right mitigation strategies in place so that, if there is a fallout, we have a way to mitigate it before customers feel the heat and you get a bad reputation out there? So these are the different strategies: having the right guardrails for an application in the build phase and in the delivery phase, and defining the different places where things can fail. Your network can fail, your disks can fail, your servers can fail, things can go out of order. The same goes for data centers. At my previous company, we ran a multi-AZ and multi-data-center infrastructure, just to ensure that if a specific data center or a specific region was down, we were always taken care of by another region. There are failover scenarios, there is replication, there are backups, there is a standby backup, there is a third backup, so that you never lose your data. All of these things stem from the same core idea: you need to think about failure while you are designing the application. Designing for failure is one of the most crucial aspects of design thinking as far as site reliability is concerned.
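The failover idea can be sketched as a small simulation. This is not how Google or the speaker's previous company implements it; the names `RegionDown`, `make_backend`, and `serve_with_failover`, and the region labels, are hypothetical and exist only to show a request falling through to a healthy region when the first one is down.

```python
from typing import Callable, List, Sequence


class RegionDown(Exception):
    """Raised by a simulated backend when its region is unavailable."""


def make_backend(region: str, healthy: bool) -> Callable[[str], str]:
    """Build a simulated backend that either serves a request or fails."""
    def handle(request: str) -> str:
        if not healthy:
            raise RegionDown(region)
        return f"{region} served {request}"
    return handle


def serve_with_failover(request: str, backends: Sequence[Callable[[str], str]]) -> str:
    """Try each backend in priority order and return the first successful response."""
    failed_regions: List[str] = []
    for backend in backends:
        try:
            return backend(request)
        except RegionDown as exc:
            # Record the failure and fall through to the next region.
            failed_regions.append(str(exc))
    raise RuntimeError(f"all regions failed: {failed_regions}")


if __name__ == "__main__":
    # Simulate the primary region being down and a secondary taking over.
    backends = [make_backend("region-a", healthy=False),
                make_backend("region-b", healthy=True)]
    print(serve_with_failover("req-1", backends))
```

Running a scenario like this before launch answers the three questions above in miniature: whether the user still gets a response, whether the failure is recorded somewhere an alert could use, and whether a mitigation path exists.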
Another idea that has really helped me, and that I think is very important to the design thinking around reliability, is keeping in mind that reliability is a continuous process. It's not something you define at the start and then achieve at a certain point. It's like an infinite curve that never really reaches the goal, but each progression takes you closer to your destination. You should think of it as a continuously evolving process where each incremental improvement takes you closer and closer to your end goal of delivering a super awesome, reliable experience for your customers. And where do you start? It's very difficult at the start. It looks like, oh, you're going to deliver this pleasant experience for millions of users, but where do you start? I think every great automation started with a very hacky script: something that was done manually once, then a second time, and the third time someone thought of writing a Bash script to get things done. That's where automation really starts, and it's okay to be hacky. It's okay to write your Bash script. It's okay to start with a very buggy script, but that's the first step you need to take before you start investing in better tools and better processes. Keep learning, keep evolving, but first, you need to start somewhere. Second, keep your continuous involvement and continuous improvement going, because this is a continuous cycle. It is never going to end. Each incremental step is going to take you closer to your end goal, and you need to keep yourself engaged and involved in improving reliability. And for that, I think you really need to invest in the right set of metrics.
As I said in the previous slide, define the right SLO metrics and the right reliability metrics. Maybe start with 99% reliability for an application, and eventually challenge yourself: next year we're going to target 99.9%, then 99.99% reliability or availability for our application. And take it from me, going from 99 to 99.9 is a huge curve, and going from 99.9 to 99.99 is a second exponential investment that you need to make. There are some crazy numbers out there at Google that we consider, and each incremental nine takes tons and tons of effort to achieve. But the point I want to make is that this is a continuously evolving process. It is going to keep on going, you need to keep yourself focused, and you need to start somewhere in building this reliability process. Oh yes, and don't expect perfect results at the start. It is going to evolve. It will reach a state where you're happy for that stage of your customer and that stage of your company. But eventually there will be a next evolution that you go through, and you might want to scrap a lot of things and build all over again. So it's a constant process. You need to keep evolving as the reliability story of your organization evolves. As we learn how to think, one of the most important questions to answer is: how do you learn? There's a vast amount of information out there. You can go all the way down the stack, to the kernel level, understanding how interrupts work, how operating systems are designed, and how paging and low-level memory mapping work.
Then there are things like how networks are designed and how the TCP/IP model works. You can look at the entire spectrum and go all the way to how modern agentic AI applications are designed, what an ML algorithm is really doing, and how ML models are deployed. My point is that this is a vast domain; you can always keep learning. There is also the security aspect. There are vulnerabilities, there are different attacks, and as a site reliability engineer you need to keep your ear to the ground and understand what's going on in the market, to be aware of the vulnerabilities out there. You need to take those into consideration while designing your applications. Compliance requirements are another aspect you need to keep up with. And while you are keeping up with all these traditional aspects of learning, a lot is evolving in the computer industry. Nvidia keeps rolling out new hardware, Google is coming up with new technologies, and there are new tools coming onto the market. Something you have invested in for the last five years suddenly changes. You were running your monolithic applications, or maybe you designed your VPCs in the cloud and built some EC2 applications, and suddenly there's a hot thing called Kubernetes that helps you do all those things in the span of a second. Maybe that is what is going to solve your problem. So the point is that you are continuously learning and continuously evolving. Where do you keep yourself grounded? That is the most important question.
That's where I believe in the concept of the T model of learning, where you keep your breadth going: you keep learning, you keep attending sessions and conferences, you keep understanding what is going on in the market. But at the same time, there is one single domain of your choice where you keep going deep and become an expert in that field. That depth is what is going to keep you going in the long run, and the many things happening on the breadth side will help you navigate your career in the right direction. Otherwise it gets very crazy. You cannot keep up with everything that's happening out there, and you cannot understand each and every thing and how it operates. And that's fine. Understanding that you don't know everything is way better than saying, I know everything and I can fix everything. As a site reliability engineer, that's one of the important lessons I have learned over the years: you cannot know everything. There will only be certain aspects of the system that you know very well. But that doesn't mean you should stay invested only in that aspect of the problem. You need to broaden your scope of learning, and at least have the right balance of depth and breadth across the different tools, technologies, and new evolutions happening in the market. This goes a long way. Another thing that is very core to a site reliability engineer is dealing with incidents. I always feel a site reliability engineer is like a soldier standing at the forefront, ensuring that the experiences of the end users are always protected and that issues are mitigated in the right way.
In this entire incident handling process, one great learning I have had over the years is blameless culture. It is a very important part of a site reliability engineer's learning to understand what blameless culture essentially means. It is very helpful when you are writing postmortems. When you write down in a postmortem why something happened and how it happened, it really addresses the team aspect. It is a collective problem-solving approach where you define what happened, how it could have been avoided, and what could be done so that it doesn't happen again, and you are not pointing fingers at other people. You are taking it as your own responsibility and owning things yourself. I think this really goes a long way. The second thing while dealing with fires in production is that you need to start thinking about going from a firefighting mode to a fire prevention mode. If there is a fire for the first time, maybe it couldn't be avoided. But if it happens a second time, then the mindset that we need to address this for the long term should come to your mind. I think that is very important, because that's how you keep yourself invested in new things rather than going back to the same old issues and fixing things over and over again. Also, from a prevention perspective, I always feel it is more important to invest in long-term projects. If something has happened once or twice, you should invest your energy in ensuring that it doesn't happen again and in finding the long-term solution, rather than shipping short-term mitigations. Short-term mitigations are good to stop the bleeding, but long-term solutions ensure that the problems never recur.
That way your energy is not wasted in figuring out the same problem again, and it's not always you who is dealing with it, so you are also saving your teammates from the fire. That's an important thing. The third and most important thing here is that you need to move away from tools and toward principles. What are the principles on which this observability and reliability story is built? Most of the time, the fundamental aspects of reliability, scalability, and observability have stayed the same, irrespective of what tools you have used and what technologies you are using to power your applications. You move from one company to another, from one application to another, and the principles on which these things operate always stay the same. So I would generally invest my energy in understanding and learning more about these principles rather than the actual tools. You can pretty much learn a tool in a few days, get by with it, and move on to a new tool. But investing your energy in understanding these principles is very important, and that's how you pass on the knowledge. It is very crucial as a site reliability engineer: if you have defined a certain principle, if you have learned something, you need to write it down and pass it on to the next SRE, or make it a defining principle for your team so that everyone acknowledges it and abides by it, so that it becomes more like a standard. Setting standards is very important for a site reliability engineer. Last, and most important for the life or the mindset of an SRE, is that you should be thinking about building a community and staying with the community, because I feel community learning is very important. It keeps you upgraded and abreast of what's happening in the field.
So going out, connecting with people, knowing what's happening out there, and sharing your thoughts and insights is important. Evangelizing what you have done within an organization is very important because it not only helps you grow, but also helps other people grow along with you. It's the collective mindset that solves a larger problem. I feel we are here because we are standing on the shoulders of thousands of great people who built the underlying technologies, and we are building the next layer of technology solutions on top of that. So I think it is very important for a site reliability engineer to develop a mindset of learning and, at the same time, sharing that knowledge and information so the outside world can evolve too. Learn, keep sharing, keep growing. That, I feel, is of crucial importance for a site reliability engineer. That's all I had for this session. I hope you enjoyed my thoughts and ideas about site reliability and design thinking. If you have any questions, please feel free to reach out to me; I'll be happy to guide you along. And thank you to the organizers for organizing this awesome event. Thank you.

Conf42 Site Reliability Engineering (SRE) 2025 - Online

April 17 2025 - premiere 5PM GMT

Saurabh Phaltane

Senior Site Reliability Engineer @ Google
