Conf42 Site Reliability Engineering (SRE) 2024 - Online

Building more reliable product through SRE community of practices

Abstract

In today’s fast-paced digital landscape, reliability is the cornerstone of success. Elevate your product reliability with SRE communities! Join my talk to discover how harnessing collective expertise drives innovation and ensures seamless user experiences. Let’s build robust products together! #SRE

Summary

  • Jorge Castro: My talk is "Building more reliable products through SRE community of practices: connecting people makes better continuous delivery". Please add me on LinkedIn. It would be awesome if we could keep in touch after this session.
  • In this talk we share our experiences facing the business challenge of building more reliable products to meet customer needs at enterprise level. The idea is that a community helps us foster team collaboration, share knowledge, and close skill gaps.
  • Site reliability engineering is a framework for structuring operations and managing the reliability of products in production. SRE focuses on running systems in production. Another key point about SRE is the incident response process. This is a short summary of what SRE is.
  • The SRE principles: SRE needs SLOs with consequences. SRE must have time to make tomorrow better. SRE teams have the ability to regulate their workload. And number four, failure is an opportunity to improve: failure is not something bad, it's something that we need to use.
  • Once upon a time we were working at a large enterprise. We had global and diverse teams involved in continuous delivery, but we noticed that we suffered from a lack of collaboration and sharing. So we decided to apply communities of practice (CoPs).
  • We decided to create our SRE (site reliability engineering) community of practices. Our second thought was about the learning experience: experiential learning, learning by doing, or "walk the talk" learning. Here are some ideas about how to implement this approach in your teams.
  • Tips on how to create a successful SRE CoP: focus on the purpose, audience, goal, and expectations of the community. Be sure to align community goals with business goals. We check the community canvas topic by topic and share our experience.
  • An SRE CoP can help you build, enable, and develop SRE enterprise capabilities as part of your business goals. If your community fosters team collaboration, experimentation, and an outcome-based approach, the people inside the community are going to adopt that. Your business grows as your communities and people grow.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, hello. My name is Jorge Castro. I work as a transformation leader. I have some experience in the DevOps, SRE, quality, and Agile world, working at large enterprises and with different kinds of programs, IT, testing, and so forth. I'm very happy to be part of the event, and I hope you enjoy my talk. The name of my talk is "Building more reliable products through SRE community of practices: connecting people makes better continuous delivery". During my talk I'm going to share with you real experiences helping and working with my clients on the challenge of building more reliable products and services. The SRE community of practices helped us reach that goal. First of all, my introduction. As I mentioned before, my name is Jorge Castro. I work as an agility, DevOps, software engineering, testing, and data transformation lead, as an agile coach, and as a program manager. I am also very lucky because I have had the chance to be a speaker and keynote speaker at some events about agile, testing, and so forth. You can see here my contact information and my LinkedIn account. Please add me on LinkedIn. It would be awesome if we could keep in touch after this session, build community, and share experiences. "Sharing is caring" is my mindset; I believe in that, and actually that is the reason why I'm here, sharing knowledge. In this talk we are going to share our experiences facing the business challenge of building more reliable products to meet customer needs at enterprise level. As you know, when you work with teams, or at team-of-teams level, in different companies, size matters a lot, especially if you are talking about large enterprises. My experience is basically helping these large enterprises build more reliable products by designing and building communities of practice, in this case SRE communities of practice.
The idea is that when you face this challenge of building more reliable products, you are going to find several obstacles, like lack of team collaboration and so forth. Something key here is that a community helps us foster team collaboration, share knowledge, close skill gaps, and promote hands-on work while we promote a "walk the talk" culture in different situations. So that's part of the story. We are going to share with you our real experiences from the trenches, dealing with these bottlenecks with software development teams. Let's take care of the basics. I assume that most of you know what SRE is, but anyway, I think we need to start with the basics. What is site reliability engineering? It's a framework for structuring operations, to manage the reliability of products in production. SRE is what happens when you ask a software engineer to design an operations function. SRE focuses on running systems in production, and basically this work is done by development teams. Another concept that is quite important in SRE is service level objectives, or SLOs, which are basically agreements about the expected reliability and availability of products and services in production. It's a kind of agreement between development teams, operations, and our customers, and it's quite important for SRE purposes. Another key point about SRE is incident response processes. We are talking about production, and as you know, in production we have a lot of situations, especially bugs or incidents. So SRE is also about the whole process of catching, finding, sorting out, and fixing the root cause of issues in production. So yes, incident response is a very important topic in SRE. This is a kind of summary of what SRE is, of this framework.
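To make the SLO idea above concrete, here is a minimal sketch in Python of checking measured availability against an SLO target. It is not taken from the talk: the function name `evaluate_slo`, the `SloReport` fields, the request counts, and the 99.9% target are all hypothetical, and the "error budget" (the allowed failure fraction, one minus the target) is a common SRE convention that the speaker does not mention explicitly.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    availability: float     # measured fraction of successful requests
    error_budget: float     # allowed failure fraction (1 - target)
    budget_consumed: float  # share of the error budget spent (can exceed 1.0)
    slo_met: bool           # True when availability is at or above the target


def evaluate_slo(total_requests: int, failed_requests: int, target: float) -> SloReport:
    """Compare measured availability with an SLO target (hypothetical helper)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")
    if not 0 < target < 1:
        raise ValueError("target must be a fraction strictly between 0 and 1")

    availability = (total_requests - failed_requests) / total_requests
    error_budget = 1 - target
    failure_rate = failed_requests / total_requests
    budget_consumed = failure_rate / error_budget
    return SloReport(availability, error_budget, budget_consumed, availability >= target)


if __name__ == "__main__":
    # Illustrative numbers only: one million requests, 500 failures, 99.9% target.
    report = evaluate_slo(total_requests=1_000_000, failed_requests=500, target=0.999)
    print(f"availability: {report.availability:.4%}")
    print(f"error budget consumed: {report.budget_consumed:.0%}")
    print(f"SLO met: {report.slo_met}")
```

A report like this is one possible way to give an SLO "consequences": when the budget is fully consumed, the teams that agreed on the SLO can decide together what happens next.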
So I hope this can help us align our basic knowledge about SRE. Now, about the principles, which are quite important. Principles are important in any kind of framework, methodology, or mindset. Number one, SRE needs SLOs with consequences. Consequences, yes, that's quite important. As I mentioned before, SLOs, service level objectives, are agreements about the expected availability and reliability of our products, and those agreements are among development, customers, and operations. So if you don't achieve some specific SLO, there are consequences for the service, for the quality of your product, for the trust of your clients, for how trusted you are in terms of the quality of your products. Number two, SRE must have time to make tomorrow better. I think that is quite important because, like other frameworks that may have similar roots or origins in the lean mindset, or that have things in common with Agile, DevOps, and so forth, SRE is also about continuous improvement. Be ready for that: analyze metrics, analyze processes, analyze what is happening end to end in software development, to make better products in the near future. So that is a very important principle, which is aligned to the kaizen mindset, this continuous improvement mindset, to improve the operation of your products in production: the quality, the availability, and the reliability. Number three, SRE teams have the ability to regulate their workload. That is quite important, and actually I like to relate this third principle to the cognitive load approach. SRE teams, like any kind of team, should be able to regulate and manage their workload well, to avoid the famous cognitive overload, and to avoid overloading the team and producing a negative impact on the ways of working.
And on quality and availability, of course; that is quite important. And number four, failure is an option. Sorry: failure is an opportunity to improve. That is quite important. Actually this is part of, I'm not going to say a new mindset, but at least a good mindset that we need to sell and foster in our development teams: that failure is not something bad. It's something that we need to use if we want to be masters of something, if we want to get the best quality, if we want to do better and better in each iteration. Most probably failure is the path. So it's quite important that we foster this kind of mindset in our teams. And I relate this fourth principle to the psychological safety approach, because it's about feeling fine about failing, but with the idea that you need to take the best from that failure to improve the future, to improve your product. So, after this alignment on basic SRE concepts, we can talk about a real story, a real experience. Real life, real business: dealing with customers, with developers, testers, production, issues, and so forth. Once upon a time we were working at a large enterprise, a large IT enterprise. What happened in that company is this. We had global and diverse teams involved in continuous delivery. We had people from Latin America, from Europe, from different countries. So maybe more than 600 people working on different products, moving code to production, maintaining code, doing quality, and so forth. And of course, software is about people. I think that is key to understand if you are in this business. So in our teams we had people with different cultures, time zones, skills, and ways of working, and people in different roles as well.
That is something quite important if you want to develop whatever practice or enterprise capability in your company. My first advice is to understand the reality of your teams: how global your teams are, and how diverse they are in terms of technology, locations, time zones, skills, ways of working, and so forth. That was part of our situation with this company. In that company, we had some problems with our mobile products, with our web applications, and with our in-house applications. Those problems appeared after going to production. We faced that, to be honest with you. And we faced other problems too. We did not have reliable products and services: we had problems in production, so our products were not reliable. Low product availability as well: at times our products were not available. Lack of enterprise capabilities: we had only a few people with strong capabilities in, for example, SRE. We didn't have a pool of engineers with SRE capabilities, so that was a problem. Low organizational resilience, which I think is quite important, because if you don't have that, most probably when you face a change in your architecture, infrastructure, platforms, and so forth, you are going to suffer a lot of pain during the change. And finally, lack of collaboration and sharing. We had some people who knew the business of SRE; they were very technical. As I mentioned, we had people from different countries. But we noticed that we suffered from a lack of collaboration and sharing. People were not working together. It looked like we worked for different companies, and we didn't share goals. That was the context we needed to change. And as you know, change is hard, as Nancy Heart said, but not changing is worse. I totally agree with her.
So we decided to change, of course. Because of the situation I explained before, we decided to apply communities of practice, CoPs. I like the definition from Etienne Wenger and Beverly Wenger. According to them, CoPs are groups of people who share a concern or a passion for something they do and learn how to do it better as they interact regularly. I like this idea because in the end that is the approach we wanted to sell to our client, to our company. Community is about people: people taking care of a problem, a real business problem. And communities are people learning together to improve something, have fun, and interact in a positive, proactive way. About this first approach, we said: OK, we have these communities of practice, and we have these challenges about SRE: unreliable products, lack of collaboration, low availability, and so forth. We had our first thoughts about it. Number one, our community should help us build ways of working, SRE ways of working. Foster experimentation as well, because we noticed that most of the problems we had were because people didn't want to try SRE practices, DevOps practices, new tooling, and so forth. So we had to foster experimentation. Another critical topic is collaboration. As you know, the most important asset in any kind of company, more than software, is people and their skills and knowledge. And if you want to develop this kind of capability across your entire organization, collaboration should be part of your DNA as a company. So that was part of what we were looking for in this community. And finally, build outcome-centered planning. We were very sure about designing a community not only for bringing people together and sharing stuff. We wanted to impact the business.
We wanted people to design and run the community and look for results that have an impact on our outcomes. That was part of the approach. So we decided to create our SRE, site reliability engineering, community of practices. Our second thought was about the learning experience: experiential learning, learning by doing, or "walk the talk" learning. The idea was that we need to learn new things; we need to prepare people to learn more and build new SRE capabilities through our community. For that purpose, we followed this learning-by-doing approach. First, concrete experience: in our community we shared real experiences and real situations, including problems with our clients. Second, reflective observation on the experience: we analyzed the good things, the bad things, the context of the experience, the metrics involved, the people involved, and the whole situation, because we need to get that experience from the knowledge shared in our community. Third, abstract conceptualization, which is concluding and learning from the experience. Basically: given this new situation, new skills, new practices, I analyze the context, the experience, the metrics, and so forth, and I conclude with some ideas about the best moves to implement this approach in my teams, maybe running some workshops, promoting some gamification approaches, or doing some mentoring and coaching. And finally, learning by doing: active experimentation. At this point, something key is psychological safety. Basically, trust your experiment, don't panic about failures, run the experiment, and do it right. That is the most important part. Do experiments in small contexts first, and then, if that works, you can scale the solution.
That was our approach to learning by doing. Now, about the team. I think it's a quite traditional team. We have our community with different engineers from different countries, business units, and so forth, and we have a core team inside the community. The core team was in charge of designing, facilitating, and organizing at least the first sessions and the first steps of the community, because our purpose was to rotate this core organizing team, so anyone in the community could have the chance to organize some sessions of the CoP. We have the lead, who is basically the person in charge of leading this whole approach: dealing with upper managers, with stakeholders, and with other communities, designing and fostering the best practices inside the community, and driving the community in terms of value, impact, and the development of the practice. So the lead is a very important role. As part of this approach, we had the backlog of the community, with all the challenges that we wanted to work on and sort out with our community: lack of some skills, certifications, some business implementations, some SRE customer challenges, and so forth. That was part of our CoP backlog: the gaps in our current capabilities in terms of SRE. As part of that, we also have OKRs for our SRE community of practices. Something that was quite important was learning from the past, especially from failures. This was not the first attempt to create a community inside the company, so that is why learning from the past was quite important in our experience. On this topic, please be sure that you understand and share this advice with your team members, your stakeholders, and so forth: a CoP is an investment.
It's an investment of time, an investment of talent, and so forth, so you need to handle it that way. And then: the SRE CoP is aligned to the business strategy. That is quite important. You need to understand the business context and the business challenges, so that with your SRE community of practices, and the OKRs inside your community, you produce an impact on these business goals. The business challenges are going to be more products, velocity, quality, reliability, winning more clients, and so forth, and I'm pretty sure that an SRE CoP can help you with that. Some examples of OKRs that we designed in our community. Objective number one: improve reliability and availability. As an example key result: achieve an X percent reduction in the number of incidents impacting production services. Objective number two: improve team collaboration. Key result: launch X cross-functional workshops or hackathons with global groups from different teams. Objective number three: increase SRE enterprise capabilities. Key result: increase participation in SRE-related training courses or certifications by X percent within the community. Those are examples that we used in our community. You can add more, or you can choose different ones, but please remember that you need to define your OKRs depending on your business challenges and the business strategy you are aligned to. Now, this is a very important tool that you can use to design your community: the minimum viable community, the MVC. As you can see here, it is a canvas that helps you design the first version of your SRE community. Actually, you can use it for any kind of community, but in this case we used it for the SRE community. Now we are going to check it topic by topic and share our experience. Number one is the purpose.
In our case, our purpose was bringing together experts and enthusiasts to share knowledge, skills, and experiences related to improving the reliability and performance of digital services, and to build a doers culture. That is quite important, because more than bringing people together to share knowledge and help each other, we also want to make builders. We want to build doers, who in the end are the ones who create impact through experiments, through trying new things, and through dealing with real problems in production or in the business. That was the purpose of our community. Number two, the audience. The audience of the community was our SRE engineers, developers, DevOps engineers, operations engineers, and so forth. All the people involved in end-to-end software development and production were our public, our team members in the community. Number three, core values. We promote the values of sharing knowledge, experimentation, collaboration, and being outcome-based, which was quite important for the future success of our community. Number four, the goal. The OKRs that I showed before are examples of the goal. Please be sure to align your community goals to the transformation you are doing in your company and to your business goals. Number five, which is quite important, is expectations, and basically it's about the community member experience. In the market we have customer experience and developer experience, and this topic is about community member experience, which is a function of reality and expectation. That is quite important because we said before that the SRE community is an investment.
We also said that you need to align your SRE community OKRs or goals to your business strategy, and that you need an outcome-based approach inside your community and in all the activities you are going to do: training sessions, workshops, and so forth. So that is quite important as well. Your team members, the people who are going to be part of the community, are your clients. So you need to take care of your clients, of what they think about the community, and of what they expect from it. It's a key topic. For example, at the beginning of the community we ran feedback loops and got this feedback from our SRE engineers and our former team members. Very interesting. As you can see here, people are basically saying that they don't want more PPTs or more talks from the community. They want real experiences and hands-on approaches, and they also want to hear about real failures and real victories or success stories in SRE projects. That was quite important for us, especially for designing our community. Number six, the roles. As I mentioned, we had the CoP lead and the core team. Number seven, the rules: basically the schedule, participation, core team agreements, and so forth. It's about all the topics you need to set up with your teams in terms of how the community operates. Number eight, goals: how to prioritize the backlog, OKR updates, learning initiatives, decision making, and so forth. Number nine, communication: basically the channels to communicate inside your community, such as Slack, Teams, internal social networking, etc. A very important topic in all this is metrics. Some metrics recommendations.
Basically three. First, we recommend metrics of sharing and collaboration: how your teams collaborate with each other. Second, the number of experiments, kaizen experiments for example. And finally the outcomes, which are quite important: the quality, the speed, the savings, the reliability that you are reaching because of your community and its operation. How do we make a CoP last longer and be more engaging? I think that is a good topic, because we noticed in the previous CoP attempts that the CoP was strong at the beginning, but after some iterations it disappeared. We wanted to change that. To make longer-lasting communities, we applied the minimum enjoyable game. So we applied gamification: we combined some gamification approaches with lean startup approaches to design the most valuable and simple games inside the community, to foster collaboration, learning, and so forth, and to make our team members enjoy the experience. For that, I recommend the Octalysis framework. It's a game design and human-focused design framework. Very useful. As part of that we created a game inside the community, the reliability league game, which is basically a combination of game design, the Octalysis framework, human-focused design approaches, and lean startup. The game was quite simple. The people with strong skills in SRE were the Batmans in this game, and each Batman had a sidekick, an SRE sidekick. Those SRE sidekicks were the juniors or the developers who needed to develop SRE capabilities. So Batman worked together with the sidekick, and Batman did whatever he or she needed to do to create more heroes, to develop the sidekick and move their skills to another level. It was a lot of fun. We had a lot of Batmans, and we had a lot of sidekicks, Robins.
It was a lot of fun to work with that. Finally, what did we achieve? A lot of things. An increased number of experiments: that was quite important, doing more experiments in the company. Improved service availability: of course we improved that metric, which actually was a real pain in our business. We improved our turnover rate, because with this kind of approach, gamification and community, people feel different. With this kind of learning, they feel motivated to share with their teammates and have fun. Through gamification, it helped us improve the turnover rate. And finally, the developer experience: when we ran some feedback loops about the NPS of the community sessions, we got very good results about the experience of our developers. Finally, some lessons learned. Rotate the CoP core team. That is quite important: try to give more people the responsibility of facilitating different sessions. You are what your community is. That's true: if your community fosters team collaboration, experimentation, and an outcome-based approach, the people inside the community are going to adopt that. Your business grows as your communities and people grow. If you can impact your business, I'm pretty sure that your community is going to grow, not only in people but maybe also in budget and resources. And finally, a CoP improves developer experience. So if you are facing developers leaving, or bad numbers in terms of developer experience, I recommend using CoPs and also gamification. An SRE CoP can help you build, enable, and develop SRE enterprise capabilities as part of your business goals, while building social and technical learning spaces where people benefit and also have fun. People- and business-oriented collaboration inspires people to become doers.
And those doers make it possible to build reliable products. Finally, some books that I recommend. Those are really nice books; you can search for them on the Internet. Enjoy them. Finally, please remember, don't forget, we have dreams. So help and share more. Sharing is caring, and also have fun, continuous fun. That's it. I really appreciate your time, and I hope you enjoyed the talk. Thank you very much, and please reach out to me after the session and add me on LinkedIn.
...

Jorge Luis Castro Toribio

Lead of Transformation Strategy @ NTT DATA



