Conf42 Site Reliability Engineering 2023 - Online

Why SRE is the best way to improve efficiency in times of crisis like these

Abstract

After the Covid-19 pandemic we expected to have some peace, but on the contrary a war began and a bubble generated by Big Tech is bursting in the labor market. 2023 is being called the year of efficiency, so how can SRE be the way to deliver the efficiency needed? This is what we're going to explore!

Summary

  • Fabio: Why SRE is the best way to improve efficiency in times of crisis like the ones we are living through today. We will talk about history, what's going on in the financial markets, and why companies should invest their efforts in improving SRE practices.
  • We were living in a huge age of transformation, which we call the age of digital transformation. After the pandemic, we saw a bubble exploding around the world. What was a hot market is now a market where we don't have a lot of opportunities. ChatGPT is a revolution that we have been seeing in the last few months.
  • Why should companies invest in SRE? SRE will improve system reliability and speed up incident resolution. It would increase collaboration inside the company. It would also reduce downtime costs.
  • Why should consulting firms invest in SRE? They are under huge pressure, I believe higher pressure than other companies. They should adopt SRE because it will differentiate them. When we are under attack, this kind of reduction in incidents and downtime would help a company save a lot of jobs.
  • 6% of the market is totally immature in this practice. 32% is emerging, only emerging. 80% of the market has the opportunity to save money and improve their operations just by adopting SRE practices combined with DevOps.
  • What SREs dedicate most of their time to: almost 70%, reducing MTTR. Ensuring security vulnerabilities are detected and eliminated quickly. Designing experiments and running tests to reduce the risk of production failure. And you can see here how you can use your team in a better way.
  • On which tasks do SREs in your organization spend the largest amount of their time in an average week? 75% of the companies that answered this state of SRE survey work with SLOs. The SLO is the key indicator for us to work with SRE. There are some really good SLOs to implement in order to be successful.
  • As soon as you evolve in adopting DevOps and SRE, you should consider evolving to an AIOps environment. I believe that ChatGPT and all these new generative AIs will increase the adoption of AIOps. We should also work to keep cloud costs on the right track.
  • SRE increases the commitment and morale of the team. Besides the obvious qualitative benefits, SRE generates quantitative results. You should work well on defining the SLOs. That will be the most important thing in SRE adoption.
  • Fabio: Let's embrace SRE and make this world better. It was a huge pleasure to be here. Thank you, and have a nice year, every one of you. Bye.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, this is Fabio, and I will talk with you about why SRE is the best way to improve efficiency in times of crisis like the ones we are living through today. It really reminds me of that Foo Fighters song, "Times Like These", and we are all pretty sure that these times are not as easy as we would like. So we will deep dive into this theme. We will talk about history, what's going on in the financial markets, and why companies should invest their efforts in improving SRE practices inside their companies and with their teams. Right?
A little bit of history and things that happened in the last decades. We were living in a huge age of transformation, which we call the age of digital transformation. Everybody talked about it for a long, long time in many conferences and articles, and a lot of companies made a lot of money because of this, because every company in the world was running a digital transformation. And I put some things here that I believe are the most important regarding things that happened in the world, some technology advances and things like that. So I would like to start by talking about virtualization. For an old guy like me, at the beginning, for every project we had, we needed a server to work on the project. So we had to buy the server, install the server and then start working. And during this period we got virtualization, which is the mother of cloud computing. That's a huge event that we still see today and that changed everything around the world. We had a complete change in the way we did project management. We moved from waterfall to an agile methodology, where we are supposed to deliver software as soon as we can, as often as possible. As soon as a sprint was done, we should deploy software to our customers. So because of that we had the rise of DevOps. That was the end of the silos inside IT departments between operations and development.
So after that, and after the launch of the iPhone, we saw a huge adoption of mobile technologies. And a few years later, coming closer to our current time, we had the Covid-19 pandemic, which greatly accelerated digital transformation around the world. At the beginning we had some companies firing people, but soon, just a few months later, the market was really hot. A lot of companies were working to get digital in order to guarantee that everybody could keep working and consuming from inside their homes during the period of isolation and things like that. So we had a lot of opportunities during the pandemic, and during this period we had something really strange, where everybody was like, "I'm not so happy here today, and I will just do what I need to do to keep my job." And that was called quiet quitting. It was something that a lot of companies were concerned about. Another situation that we had was developers having many simultaneous jobs, developers working in two, three, four companies at the same time because the market was so hot, there were so many opportunities and everybody was trying to get their money from that. So we had this situation with overworking, as I like to say. And regarding technology, at the end of the last few years, everybody was talking about Web 3.0, the metaverse and NFTs.
And now what are we talking about? What changed a lot during these last few months? We are living now in what I like to call the age of digital efficiency. We are not concerned anymore with just being digital. We are concerned with being efficient in the digital way. Why is that happening? After the pandemic, we saw a bubble exploding around the world. In the economic scenario, we have a world in recession in many countries, and we have inflation that we didn't see for many years. A lot of companies are cutting their budgets.
Everybody that was working on managing their budgets for 2023 is reviewing the results and how much they would invest this year, and then the wave of layoffs started, as I will explore a little more in the next few slides. And we saw a new technology emerging that was not in our rear view mirror. Nobody was expecting that we would have this change in technology. But ChatGPT is a revolution that we have been seeing in the last few months, and it's changing everything. Everybody's feeling pressured to be more effective, and ChatGPT could also be an opportunity moving forward.
We have here the layoffs. There is panic in the world. Panic, panic, panic around the world. Because what was a hot market is now a market where we don't have a lot of opportunities. So just a few examples. Amazon during the last few months eliminated around 27,000 positions, right? 27,000 employees were fired from Amazon. Another big tech here, Microsoft, laid off 10,000 employees. Salesforce reduced its workforce by 10%. Meta, Facebook, eliminated around 13% of their employees during the last year, and they expect to do more during this one. Accenture, a huge consulting company, laid off 19,000 people. So what does this bring to us? Everybody's losing their job. Many people, not everybody for sure, because we have billions of people in the world, but we have a lot of people losing their jobs. And that's something that has put a lot of pressure inside companies to think: what should we do to guarantee that we won't have to fire anyone? Nobody likes to fire people. I'm an executive. I had to fire some people in the past, and that's not a pleasant situation. So nobody wants to be in that position. And the objective of my lecture here is how we can use SRE to prevent this kind of situation. So let's see in the next few slides. As I said before, 2023 is the year of efficiency.
Everybody's trying to save money, and I'd like to share something with you guys. I was working in a huge company a few months ago, and I was responsible for a savings strategy. But what was my company's expectation for generating savings? Adopting Power Platform and dashboards. How was I supposed to improve developer efficiency just by doing that? It's not possible. And that's the reason why I'm not working there anymore, because that's something that I strongly don't believe in. I believe that you should do something more structural. And one of my hypotheses is investing in SRE to do so. And let's see what I can bring to you in order to do that.
But in the end, what does this whole 2023 crisis bring to us? What does it mean in the end? Obviously, little or no investment. People and companies will only invest what they can, and only in something they really should invest in, because nobody wants to take the risk of losing money and then having to fire someone. Everybody's searching for savings in operations: okay, I will not grow, so how can I save money inside my operations? How can I automate my operations and generate some savings? How can I better use my team to work better and generate more, right? And as we all know, some companies are taking this moment to eliminate some people who were not performing well, people that they knew were working in many companies at the same time, things like that. Companies are taking this moment to say: I will do some kind of diet here with my employees. I have somebody who is not performing well, or maybe they are working here and in two more consulting firms. That's not something that should be acceptable, but people were doing that. So companies are taking this opportunity too. And what is on the table for us? What should we as technologists be thinking of? We should be thinking about efficiency and adopting AI.
Everybody's talking about ChatGPT, but at this very moment, I cannot say that it is very easy for us to adopt ChatGPT in order to change the game and save people. But what I can say is that if we invest our efforts in efficiency and in adopting SRE the right way, I strongly believe that we can do a good job and save some positions, right? And we will explore more.
And why should companies invest in SRE? First of all, SRE will improve system reliability. If your system is more reliable, you probably won't have many incidents. And then we go to the second point, faster incident resolution. If you solve your incidents faster, your products will be available to your customers for more time. You won't lose any sales, things like that. So one thing leads to the other. I have a more reliable system, so I will resolve my incidents faster. I will increase my agility because, as we all know here, SRE and DevOps are kind of brothers, twin brothers that share similar missions. So our agility will increase significantly. That would be something really important for companies. SRE would increase collaboration inside the company. When operations teams and development teams work together, they tend to collaborate more, they build more confidence in each other, and so they will be able to deliver better software and improve the whole environment inside the IT department. And I believe that it would really diminish the pressure on these teams. Everybody in technology in these last months is feeling very pressured, afraid of losing their jobs. And when we work in this collaborative environment, it brings some peace to these people, and they will see the results of their efforts, right? That would be really great for your company. And of course, if we have all of this, all of our customers will be more satisfied. The systems that they use will be available for more time. We won't have incidents, things like that.
So customer satisfaction should increase. And that would probably help us with our NPS and other indicators that we might have in our company. Okay, continuing. Why should companies invest? Reduced downtime costs. Every time our application is down, we are losing money. With SRE, we won't lose that money. We will increase our efficiency. It's the theme of our lecture. So when you implement it, when you are working with SRE in a mature way, you will increase your whole efficiency. Your whole team will work better. You will deliver more software. If you don't have to spend so much time fixing bugs or working on incidents, there's a high probability that you can spend more time developing new features and bringing more business to your company. You will have better resource allocation, because people won't be spending time fixing bugs, but developing new features or improving your environment. It's a virtuous cycle. You will improve your scalability. When you have a moment with a lot of traffic on your platforms, your scalability will be doing very well, because your team will be able to invest time in improving it. And for sure you will reduce your maintenance costs, because you won't have as many bugs, you won't have problems in your infrastructure, you will scale faster, and so all your maintenance costs should be reduced.
And next, this is really interesting, because we always think about SRE in terms of improving our internal environment and the services that we provide to our customers. But what about consulting firms? Why should they invest in SRE? They are under huge pressure, I believe higher pressure than other companies, because when we have any crisis, the first thing companies cut is the investment in technology. And that's the main reason we have consulting firms.
So a lot of them are under huge pressure from customers, from clients canceling their projects. And why should they invest in SRE in their development teams and their practices? They should adopt SRE because it will differentiate them. And that's what will be on the table for consulting firms. Because now that we have ChatGPT and everybody is in a time of crisis, the main factor in deciding between one company and another is: paying the same money, where will I get more code, more applications? Why would it be better for me to invest in this company instead of the other? And a consulting firm that adopts SRE in its practices and in its culture will probably deliver more software, and not only deliver more, but deliver it in a better way, with better code and fewer bugs, thinking about scalability. The software that it delivers will reduce the mean time to repair, will improve the security of the applications, and will be more reliable from the perspective of the application and of the architecture, right? So everyone should invest in SRE during this period.
When we talk about SRE, we always have a lot of qualitative results. We talked about many of them in the last few slides, but I'd like to bring some quantitative results for you. For example, Google is the father, the creator of SRE. And Google has reported that SRE has helped them reduce their incident rate by 50%, which is a lot, and improve their reliability to 99.95%. LinkedIn, after adopting SRE, reduced its incident rate by 85%. That's a lot. And the company also reported that it was able to improve its MTTR by 75%. Netflix, everybody likes Netflix, right? Another company that's under huge pressure, not only because of everything that's going on in the market, but because of the subscription crisis, because everybody was sharing their subscriptions. But Netflix has adopted SRE since 2010, and they reported that it has helped them achieve an availability of 99.99%.
That's a lot. And Netflix also reports that it has reduced its downtime by 90% after adopting SRE. And last but not least, Dropbox. They reduced their outages by 90% after adopting SRE, and they reduced the number of incidents by 75%. When we are under attack, in a moment where everybody needs to cut to the bone, this kind of reduction would help a company save a lot of jobs, right?
So let's continue here. I would like to share some data from the market related to the state of SRE, and why we have a huge opportunity. Take a look at how the adoption of SRE is going. We have 6% of the market that's totally immature in this practice, and 32% that's emerging, only emerging. We have 42% that's maturing. That's a lot. If we sum all of these, we have 80% of the market that has not completely adopted SRE. Can you see the size of the opportunity that we have here? 80% of the market has the opportunity to save money and improve their operations just by adopting SRE practices combined with DevOps, et cetera. It's a lot.
Let's see some other important information regarding the SREs, what they are dedicating most of their time to, and how this relates to efficiency. Almost 70%, reducing MTTR. It's a lot. When you're spending all this time, it will bring results. 67%, reducing MTTR. 60%, building and maintaining automation code. Automation generates a lot of savings and efficiency, and frees time to spend on more important activities for your team and your company. Ensuring security vulnerabilities are detected and eliminated quickly. Security is a huge problem for tech companies. So you have this SRE team spending more than 50% of their time. Designing experiments and running tests to reduce the risk of production failure. Nobody wants to have failures in production. And you can see here all of this information about how you can use your team in a better way.
What are the expectations and demands on SRE, what do they want to achieve? And on which of the following tasks do SREs in your organization spend the largest amount of their time in an average week? The same as we saw before. Reducing MTTR, things like that. Building automation code. It's really good, it's an investment. You're investing a lot of time here, a lot of effort, so that in the future you can save a lot of time in your operation on incidents and outages.
How does your organization evaluate service levels for its applications and infrastructure? That's something that I would like to point out here. A lot of companies work with OKRs and key performance indicators, but the heart of SRE is the SLO. So 75% of the companies that answered this state of SRE survey work with SLOs. And why is that so important? Because, as everybody knows here in this conference, the SLO is the key indicator for us to work with SRE. And we have a lot of difficulties working with it. So investing in SRE requires investing a lot of time in defining SLOs the right way. We have a huge challenge here in defining SLOs and getting this information, because we have too much information, too much data inside our companies, and we have to clean this data. Like in all the data science that we know, we have to work a lot to get this data into good shape in order to be effective in managing our SLOs. And the SLOs are the core of the success of SRE, because with the SLO we will have our error budget, and the error budget is what will help us provide a safer environment where we can run some experiments and where we can even get things wrong sometimes. So, the difficulties that we have creating and defining SLOs: we have too much data, too many data sources, too many metrics, and monitoring tools that don't let us easily get that SLO. So you have to invest a lot of time and effort in defining and really getting your SLOs. Continuing here.
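To make the link between an SLO and its error budget concrete, here is a minimal sketch in Python. It is not taken from the talk: the class name `ErrorBudget`, the helper `allowed_downtime_minutes`, the request-based success SLO, and the 30-day window are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical helper: error budget for a request-based success SLO."""

    slo_target: float      # assumed form: 0.999 means 99.9% of requests succeed
    total_requests: int    # requests observed in the SLO window
    failed_requests: int   # requests that violated the SLO in that window

    @property
    def allowed_failures(self) -> float:
        # The error budget is the share of requests the SLO lets us fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # Fraction of the budget still unspent (negative once overspent).
        if self.allowed_failures == 0:
            return 0.0
        return 1.0 - self.failed_requests / self.allowed_failures


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Downtime an availability SLO tolerates over the window, in minutes."""
    return (1.0 - slo_target) * window_days * 24 * 60


if __name__ == "__main__":
    budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    print(f"Allowed failures: {budget.allowed_failures:.0f}")
    print(f"Budget remaining: {budget.remaining_fraction:.0%}")
    print(f"99.99% over 30 days allows {allowed_downtime_minutes(0.9999):.2f} minutes of downtime")
```

While budget remains, a team can afford experiments and riskier releases; once it is spent, reliability work takes priority. That is one common way to act on an error budget, not the only one.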
Here are some really good SLOs to implement in order to be successful in your SRE. And here we have some SLOs from the business point of view, and even the mobile ones. For the business and end-user-centric ones, right, that's another buzzword: always availability. We have to measure engagement, we have to measure user satisfaction, the conversion of our platform, how is that going? And of course we have our performance SLOs. How are our utilization, response time, traffic, saturation and success rate? Every one of these SLOs is really important for us from the technical point of view. And for mobile applications, everyone is mobile today, right? So: app adoption, availability of the app, response time. You should provide a good experience, and SRE will help you a lot in improving this response time. Success rate, crashes and, of course, app rating. Everybody wants to know how your company is rated on the App Store.
And how do companies identify the targets for each of their SLOs? 26% do that based on end-user experience, 24% based on historical data and industry standards, and 20% based on however the system is doing today. And who in the company helps in defining these SLOs? The SRE team is responsible for 80%. Security, which is really important, and they really contribute a lot to SRE adoption and to our success. 49% from the business 47, infrastructure 45, DevOps 41, operations platform 36, development 33 and application 32.
Here we have some other opportunities. That's not the theme of this lecture, but as soon as you evolve in adopting DevOps and SRE, you should consider evolving to an AIOps environment, using everything that you can to automate the response to everything that happens in your operation, and the platforms that you might use would help you provide that.
And I strongly believe that with all the advances that we have with ChatGPT, that would become a reality. I believe that ChatGPT and all these new generative AIs will increase the adoption of AIOps. And last but not least, FinOps. We have a lot of unnecessary expenditure on cloud costs. So we should work to keep this on the right track, deploying exactly what is needed for each application and for each environment. So we should work to keep FinOps working really well alongside SRE and DevOps. Right.
We are coming to the end of this lecture, and I would like to share with you some key takeaways that I'd really like you guys to keep in mind for this year and probably the next one. These should be harder years for us working with technology. Right? SRE increases the commitment and morale of the team. We are in a moment where people are afraid. They are not feeling confident, they are afraid of losing their jobs. SRE increases their commitment to working together and improves their morale. That will be great for your company. Besides the obvious qualitative benefits, as we've seen, SRE generates quantitative results. We believe, and we should insist inside our companies, that we have quantitative results that are feasible and that generate economic and financial returns for our company. We will increase our agility as a company. We will be faster, and so we should invest in SRE. Even for consulting firms, it will be a huge differentiator. You can sell SRE projects and you can adopt SRE practices in your development, so your team will really be differentiated. And in the end, defining and working towards achieving SLOs should be the main objective in the adoption. You should work well on defining the SLOs. That will be the most important thing in SRE adoption. Right? The SLO will provide you the error budget, and everything that you can do will be based on that. So that's it, folks. That's all for this lecture.
Again, my name is Fabio. Here you have the QR code for my LinkedIn. It was a huge pleasure to be here. I'm really thankful for the opportunity to share my knowledge and my experience with you guys. So let's embrace SRE and make this world better. Thank you, and have a nice year, every one of you.

Fabio Alves

MBA Professor @ FIAP

Fabio Alves's LinkedIn account


