Resilient Systems, Resilient Teams: A CTO’s Blueprint for SRE Excellence

Abstract

Learn from Eugene Korneev, CTO at MyGig, as he unravels the secrets to building resilient systems and teams. From educational platforms to cloud gaming, discover a leader’s approach to SRE, blending architecture, security, and team dynamics for unbeatable system reliability.

Summary

Evgenii Korneev: How to assemble and organize a development team in a way that enhances both the resilience and the quality of your product. We will discuss whether SRE is necessary for small and medium-sized teams. The talk is addressed more to engineering team leaders than to engineers themselves.
SRE is not about creating a team of on-call engineers. It is a set of fundamental principles that can be implemented right from the start. Understand what level of availability suits your product. Adopt modern approaches to DevOps. Try to automate the processes of code compilation and deployment.
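To make "what level of availability suits your product" concrete, here is a minimal sketch that converts an availability target into the downtime it permits and computes a simple request-based availability figure. The function names, the 30-day window, and the example targets are illustrative assumptions, not values from the talk.

```python
from datetime import timedelta


def allowed_downtime(target: float, window: timedelta) -> timedelta:
    """Return the downtime a given availability target permits within a window."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be a fraction in (0, 1]")
    return window * (1.0 - target)


def measured_availability(successful: int, total: int) -> float:
    """Fraction of requests served successfully; 1.0 when there was no traffic."""
    if total == 0:
        return 1.0
    return successful / total


if __name__ == "__main__":
    window = timedelta(days=30)  # assumed measurement window
    for target in (0.99, 0.999, 0.9999):  # illustrative targets only
        print(f"{target:.2%} -> {allowed_downtime(target, window)}")
```

Comparing `measured_availability` against the chosen target is one possible basis for the alerts the team is asked to respond to; what counts as a "successful" request is something each product has to define for itself.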
As your product grows, so does the number of subsystems and components. With increasing complexity, the amount of time required to identify and resolve failures also increases. One obvious solution is to appoint on-call developers from existing teams. But to prevent a week of on-call duty from becoming a torment, don't overload them.
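The transcript describes an on-call roster with a fallback rule: if this week's on-call person is unavailable, the next person on the list responds, and anyone who responds out of turn is moved to the end of the list. A minimal sketch of that rule follows. The class name, the `is_available` callback, and the `end_of_week` rotation step are assumptions added for illustration; the talk does not specify how the weekly handover is implemented.

```python
from collections.abc import Callable


class OnCallRoster:
    """Ordered on-call list; the first entry is this week's on-call person."""

    def __init__(self, people: list[str]) -> None:
        if not people:
            raise ValueError("roster must not be empty")
        self.order = list(people)

    def respond(self, is_available: Callable[[str], bool]) -> str:
        """Return the first available person, walking the list from the top.

        Someone who responds out of turn is moved to the end of the list,
        which delays their own scheduled week.
        """
        for index, person in enumerate(self.order):
            if is_available(person):
                if index > 0:  # responded out of turn
                    self.order.append(self.order.pop(index))
                return person
        raise RuntimeError("nobody on the roster is available")

    def end_of_week(self) -> None:
        """Assumed weekly rotation: the current on-call person goes last."""
        self.order.append(self.order.pop(0))
```

For example, with a hypothetical roster `["ana", "boris", "chen", "dana"]` where the first two people cannot be reached, `respond` returns `"chen"` and the list becomes `["ana", "boris", "dana", "chen"]`.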
For a dedicated SRE role, there is no strict criterion, but I can highlight several points that you should evaluate. How strict are the availability requirements for your product? How heavy is the load on the service, or how popular is it? Can you afford to have a staff member whose peak activity occurs only during major outages?
Communicate in shared chats. Try to ensure your employees discuss work-related things as little as possible in private messages. Organize all-team meetups. Document at least the most important and complex parts of the system.
Every team should have at least one specialist who is significantly stronger than the average market-level senior. An experienced and skilled developer not only completes more and better tasks, but also has extensive problem-solving experience. It's important to understand that achieving high availability for your product requires effort in all directions.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Well, I hope you are having a wonderful day. My name is Evgenii Korneev, and today I would like to talk with you about how to assemble and organize a development team in a way that enhances both the resilience and the quality of your product. Let's begin by looking at the topics I want to cover in my brief presentation. I want to clarify that this talk is addressed more to engineering team leaders than to engineers themselves. We will explore the lessons I learned from building product teams that developed high-load services. We will discuss whether SRE is necessary for small and medium-sized teams. We will consider the obvious alternative to creating an SRE department: developer on-call duties. We will try to determine at what point a company needs to hire an SRE engineer. We will also touch on how crucial communication and people are in the implementation of SRE principles.
Okay, let's start. When it comes to small teams, it's crucial to understand that SRE is not about creating a team of on-call engineers who will save your service from crashes day and night. It is a set of fundamental principles, principles that can be implemented right from the start. If your product is already in production and you are even slightly concerned about its availability to users, start by defining an SLA. Understand what level of availability suits your product. Define for yourself what availability means to you: what in your product must fail for it to be considered down. Set up monitoring and alerts accordingly. Inform your team about the target metrics, that they need to respond to these alerts, and that they are important to the business. Try to automate the processes of code compilation and deployment right from the start. The fewer manual actions, the fewer mistakes. Ensure your code is covered by tests. Teams that overlook testing in the rush to launch their product quickly are still common. Sometimes this is justified, but often having tests not only enhances stability but also accelerates further development. Adopt modern approaches to DevOps. Availability issues often fall on the shoulders of DevOps engineers, so make sure the production infrastructure you create for your product is resilient by itself. Limit access to production systems, especially databases. In the ideal scenario, no developer should have such access, only the operations team. This will limit potential leaks as well as incidents related to unauthorized data changes. This seems like an obvious point, but many small teams overlook it. Create tools for servicing your service: dashboards, admin panels, feature toggles, et cetera. This will help you achieve the previous point. As my experience has shown, these simple principles are sufficient at the beginning of the product's journey, as it has not yet turned into a complex and multifaceted system.
Okay, let's talk about on-call duties. As your product grows, so does the number of subsystems and components. With the increasing complexity of your system, the amount of time required to identify and resolve failures also increases. One obvious solution you may consider is to appoint on-call developers from existing teams. This means an employee works as usual, but in case of failure, they are the first to respond to critical alerts. It is usually assumed that such an employee prioritizes work slightly over life in our beloved work-life balance during the on-call period. This approach presents a couple of pitfalls that create an interesting dilemma. On one side, if the on-call person only responds to alerts about product unavailability, then during periods without incidents, developers can become too relaxed: "Everything is stable. I can go to the pool at lunch. It's unlikely anything will happen." And as usually happens, that's exactly when something does happen. You could assign more duties to the on-call person so they don't become complacent, such as responding to all production bugs. But then each on-call period becomes grueling, leading to rapid burnout. I strive to care for the mental health of my employees, so I opt for the first approach. To hedge against risks, I use an on-call roster. If this week's on-call person is somehow unavailable, the next person on the list responds, and if they are also unavailable, then the next. Moreover, anyone who has to respond out of turn gets moved to the end of the list. This delay serves as a pleasant reward. A week is the ideal duration for an on-call shift. Our team tried rotating on-call duties daily, but when a service fails, passing the incident analysis to the next employee becomes inconvenient. However, to prevent a week of on-call duty from becoming a torment, do not overload the on-call person with additional duties. They should only respond to the alerts you defined as significant and disruptive to the normal operation of the service. All other incidents will follow your standard workflow.
Okay, let's discuss when you might need to assign a dedicated SRE role in your team. There is no strict criterion, but I can highlight several points that you should evaluate to make this decision. First point: how strict are the availability requirements for your product? For instance, if you are operating a fintech service, these requirements might be higher since you are likely handling user funds. Second point: how heavy is the load on the service, or how popular is it? Failures in popular services lead to more significant reputational losses than failures in those with a smaller customer base. Next point: how well is your development team performing, both in terms of productivity and the quality of the product released? If on-call duties are time-consuming and you can't find a way to improve the situation, this might be an important signal. Next point: how complex is your system? As your product and team grow, the overall complexity of the system increases and on-call shifts alone no longer suffice. Developers begin to know less about different parts of the system and become less independent in handling incidents. The last point: what does your budget allow? Can you afford to have a staff member whose peak activity occurs only during major outages? As experience shows, the first SRE engineer in the team is usually a former developer from within the same team, but the vacancy created still needs to be filled. So watch your budget. If, after assessing these criteria, you conclude on at least three of them that you need an SRE team, then it's likely true. However, even before that, understand whether you can fix some issues with other methods. Could you implement a test coverage plan to enhance quality, or refactor some services to reduce the complexity of the system?
Okay, let's talk about communication. To ensure your team performs well in creating a resilient service, it's important to keep them informed within a unified information field. First, communicate in shared chats. In our company, each development team has a chat where employees discuss tasks and technical solutions. This allows all members to see current issues, decisions made, and their impacts on the system. Try to ensure your employees discuss work-related things as little as possible in private messages. Next, organize all-team meetups. At these meetups, employees can talk not only about new technologies, but also about how they solved major tasks and the outcomes of those solutions. Team leader meetups can be used to discuss team interactions, major system design decisions, and how collaboration can enhance product stability. Then, documentation and knowledge transfer: to ensure that system support quality is maintained in the event of staff rotation, document at least the most important and complex parts of the system. Make sure this documentation is included in the onboarding process for new employees, and after their probation period, check how well they have understood it.
A bit about people. I truly believe that every team should have at least one specialist who is significantly stronger than the average market-level senior. An experienced and skilled developer not only completes more and better tasks, but also has extensive problem-solving experience. When we talk about SRE, we often discuss readiness for various types of problems. Specialists with more than ten years of experience have encountered these problems far more often than recent graduates, even from the best universities. And believe me, this experience will more than once save you money and possibly even your business. Even if your budget is limited, it's better to sacrifice an extra position on the staff and hire one experienced and costly senior specialist.
To finalize my speech: it's important to understand that achieving high availability for your product requires effort in all directions. So let's summarize what I have discussed today. Determine your availability criteria and SLI. Conduct improvements in alerting and monitoring. Ensure maximum transparency in team communication; there is no room for secrets in an engineering team. Improve code quality: testing, code reviews, quality control, CI/CD. This will all be fundamental. Establish an on-call duty policy. As the project grows, assess the need to create an SRE department. Develop an SRE engineer within your company and, if necessary, build an SRE team around them, which they will train. This is a challenging path, especially for teams that already have an established culture, but it's worth it.
That concludes my presentation. I hope you found it interesting and that it will be helpful in the future. If you would like to discuss any topic, this or another, feel free to contact me; my contacts are on the screen. Thank you and have a good day.

Conf42 Site Reliability Engineering (SRE) 2024 - Online

May 09 2024

Evgenii Korneev

CTO @ MyGig
