Conf42 Site Reliability Engineering 2021 - Online

Enterprise SRE adoption framework

Abstract

This talk is about a new enterprise SRE adoption framework, named Arctic. Given the growing focus on infrastructure and service/application reliability, more and more enterprises are adopting Site Reliability Engineering (SRE). It will be beneficial for enterprises to use a framework for SRE adoption like Scrum, XP or Kanban that exists for Agile adoption. Without the availability of framework(s) to help in adoption, it will be challenging for enterprises as they need to spend a lot of effort upfront to understand how to go about the SRE adoption and do the planning before they begin the actual journey.

This talk includes the following things:
  • The two pillars of the framework
  • Other frameworks/concepts that can go hand-in-hand with this
  • What to look for when hiring SREs, both in terms of personality types and skill sets
  • A way to do the goal setting for the transformation

Note that, as of the date of submission for this talk, this framework has not been used in any enterprise and has been conceptualised very recently. The hope is to seed the thought around frameworks for SRE adoption, present the current version of this framework to the larger SRE community, gather feedback and start the usage of this framework by enterprises.

This talk suits various audiences: those who have already started their SRE journey, those who are looking to start on it, and even those who are still exploring to understand more about SRE.

What is the problem that I am trying to address? Currently, there is no standard framework for SRE adoption similar to the frameworks like Scrum, XP, Kanban, etc that exist for Agile Adoption. Having a standardised framework will eliminate quite a bit of upfront effort thinking about “how to adopt SRE” at enterprises.

Summary

  • Vishnu Vardhan Chikoti is a senior manager, SRE at Fanatics Inc. He has 16 years of experience across site reliability engineering, product development and business analysis. Arctic is a new SRE adoption framework that he has conceptualized.
  • Arctic is a framework that tries to set the basic structure, a foundation, for SRE adoption at enterprises. The two pillars of Arctic are visibility and accountability.
  • Monitoring has many levels nowadays, given very complex architectures. Observability by itself has three pillars: traces, logging and metrics. Other practices include error budget policy and incident response. Capacity planning is about how we plan for the infrastructure needs on a normal day.
  • Security best practices matter because so many security incidents are happening nowadays. To do all these practices, tools and platforms need to be in place. What tools are being used, and how effectively are they being used?
  • There is an option to split SRE by function, for example an infrastructure SRE who focuses on infrastructure. There is also the concept of embedded SRE, where a central SRE team has SREs embedded into the product engineering teams. The culture, principles, policies and procedures need to be standardized.
  • SRE is a broad role which includes knowledge from engineering and operations. Agile frameworks and design thinking can be helpful. Various personality types are required for a successful SRE transformation.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, my name is Vishnu Vardhan Chikoti, and in this talk I am going to introduce Arctic. Arctic is a new SRE adoption framework that I recently conceptualized to help with SRE adoption at enterprises.
About me: I have about 16 years of experience, which is diverse across site reliability engineering, product development and business analysis. For the initial part of my career I was in product development, business analyst and tech BA kinds of roles, and then I did a career pivot towards the site reliability engineering area. Currently I work as a senior manager, SRE at Fanatics Inc. Prior to Fanatics I worked with Broadridge, Bank of America, Tetora Consulting and DBS Bank. I also co-authored a book named Hands-On Site Reliability Engineering, which was published very recently, in July 2021. I have a blog, xfgeek.com, with content across capital markets (which was the initial part of my career), technology and agile. You can look it up if you're interested. From a location perspective, I have been in Hyderabad almost all the time for the last 20 years; apart from maybe a few months, I am always in Hyderabad.
Coming to Arctic: I always start with the question of why, and I would like to start this talk with a question of why as well. Why do we actually need a framework for SRE adoption? When it comes to SRE, there are different views on what SRE is. For some people SRE is about availability; for others SRE is about golden signals, SRE is about automation for operations, SRE is about infrastructure automation, or SRE is just a new title for production support analyst.
The list goes on, but I have given a few examples here. Then there are different questions about SRE. How is SRE different from ITIL? How is SRE different from DevOps? How do we structure SRE teams? Is capacity planning taken care of by SREs when it is already done during P&V testing? I have already done capacity planning in P&V testing, so what is it that SREs are going to do at a later point? Can we have multiple SLOs for the same service? Is it fine if we just measure SLOs for our critical services? This list of questions also goes on and on; these are a few examples. To answer some of these questions and to correct some of these views, there are books, there are videos, there are blogs. Now we will look at a framework.
What is a framework? The concept of a framework is not new. We have frameworks like Spring Boot for Java and Flask for Python, and for agile adoption there is a framework called Scrum. If you look at the definition, a framework is a basic structure, a foundation that is set and on which we can build. This is where Arctic fits in. It is a framework that tries to set the basic structure, a foundation, for SRE adoption at enterprises.
Now, hello Arctic. We will look at what Arctic is and what its two pillars are. The two pillars of Arctic are visibility and accountability. These are two key things to look at so that we have a successful transformation. So what is visibility required on? SRE is about practices, tooling and platforms, and policies or procedures. SRE is also about culture and about principles, but those are not explicitly mentioned as part of visibility in this framework. Culture is implicit: without the cultural change, the transformation cannot happen. And the principles should be understood. Now, when it comes to the practices:
SRE has a lot of practices under it, and there is no need to boil the ocean and start on all of these practices on day one. An exercise can be done to look at what practices are already in place. It is natural that some of them are already practiced in the organization, either as part of its product engineering standards or as part of other frameworks like ITIL.
Now, what is monitoring? Nowadays we have very complex architectures: cloud infrastructure, VMs on that, and platforms built on top of that. Sometimes it is not even just directly deployed services; there are other services deployed there that are consumed as SaaS. There are CDNs, there are containers, there are auto-scaling environments, there is DNS. So there are a lot of things involved from the moment a request leaves the browser or the device of the user, hits the production services, and the response returns. At what level is the monitoring? Is there end-user monitoring? Is there infra monitoring? Is there APM? Is there database monitoring? So there are a lot of things within monitoring.
Then there is observability, which is the actual data that serves the purpose of monitoring. Observability by itself has three pillars: traces, logging and metrics. One distinction between monitoring and observability that I like is that monitoring is the microscope and observability is the slide under the microscope, which gives the clarity.
Then it is about SLOs. Do we have defined SLOs, and does the team actually understand what SLOs are? At what level are they defined? Then measured SLIs and error budgets: once the SLOs are defined, are we actually measuring the SLIs, and do we have error budgets in place that are also measured? We will also talk about error budget policy later. Next is incident response.
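As a rough illustration of the SLO, SLI and error budget questions above, here is a minimal Python sketch. It is not part of Arctic or of the talk; the function name `error_budget_report`, the request-based availability SLI and the example numbers are all assumptions chosen for illustration.

```python
def error_budget_report(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Compare a measured availability SLI against its SLO (hypothetical helper)."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total_requests <= 0 or not 0 <= failed_requests <= total_requests:
        raise ValueError("request counts are inconsistent")

    # SLI: the fraction of requests that succeeded in the measurement window.
    sli = (total_requests - failed_requests) / total_requests
    # Error budget: the number of failures the SLO tolerates in that window.
    allowed_failures = (1 - slo_target) * total_requests
    budget_consumed = failed_requests / allowed_failures

    return {
        "sli": sli,
        "slo_met": sli >= slo_target,
        "allowed_failures": allowed_failures,
        "budget_consumed": budget_consumed,  # 1.0 means the budget is exhausted
    }


if __name__ == "__main__":
    # Assumed example: a 99.9% SLO over one million requests with 250 failures.
    print(error_budget_report(slo_target=0.999, total_requests=1_000_000, failed_requests=250))
```

The sketch rejects a 100% target on purpose, since a target of 1.0 leaves no budget to measure against.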
When an incident happens, how is it notified, how are the teams coming into action, how are they triaging and how are they resolving it? All of that is incident response. Then incident management: how are the communications happening? Are we informing the stakeholders, and which groups of users are informed? And how are severity and priority determined? So there are a lot of things from an incident management perspective.
Then postmortems. Postmortems might already be done, but from an SRE perspective it is important to do postmortems in a blameless way. It is not a blame game about who made the mistake; it is about how it happened and how we can avoid it in the future.
Change management: a change is not necessarily a code change. It can be a configuration change, patching, upgrades, or any other change. How are these changes actually being done, how are they communicated, how are they approved, how are they validated? There is so much that goes into change management. Release management is about how the releases are happening and how the deployments are done: is it blue-green, is it canary?
Eliminating toil: toil is basically manual, repetitive work that can be automated away. How much toil exists? Is the toil being tracked? Are there efforts to automate it, and at what level is it being automated?
Capacity planning is about how we are planning for the infrastructure needs on a normal day, and how the infrastructure is going to handle a peak day or a high-volume day. If an unplanned high volume happens, how is it going to be handled? Do we have elastic environments? Those kinds of things go there. And infrastructure automation: how is the infrastructure being provisioned? Is it manual, or is there automation in that?
AIOps: nowadays, with data, machine learning and all of these modern tools available, you need not do everything from scratch; there are libraries and frameworks available in that space that can be used. With AIOps we can do things like auto-remediation and alert correlation, and there can be other areas as well.
Then ChatOps. Nowadays it is all about chat tools like Slack, Teams, Telegram or WhatsApp; take any chat application. Are these tools being used efficiently, with information being sent to operations or SREs through chat? And can SREs actually take some action directly from the chat window?
Then again, with modern complex infrastructures, how confident are we in our own infrastructure and services? Can we actually handle failures that are unplanned or unknown to us? That is where chaos engineering helps: simulate some of those scenarios with fault injection, explore the weaknesses and fix them.
Security best practices: nowadays there are so many security incidents happening, and it is of utmost importance that customer data, company data and the services are protected, whether from DDoS or from any breaches.
And regulatory standards: depending on the type of business, the market where the business operates and the type of products, there are regulatory standards that need to be followed. How is the compliance with those standards? From an SRE perspective, this is more about technical standards, not really about business-related standards. Then tools and platforms.
To do all these practices, there is a need to have tools and platforms in place. For monitoring we need dashboarding and visualization, tools that ship data, tools that help in transformation and tools that help in storage. Similarly, for observability a lot of tooling is required, and it is combined to achieve both monitoring and observability. Then there are frameworks and libraries like OpenTracing or OpenTelemetry that can be used for tracing.
And alerting: how is the alerting being done? The same goes for on-call management: how is the on-call person being reached? Is it automated or manual? If automated, through what tool?
Alert correlation: there can be a number of alerts caused by the same underlying problem. Are these alerts being correlated, so that you finally have one single incident out of that particular set of alerts? For example, if a data center hosting 100 VMs is not available, all of them will start reporting that they are not reachable. Similarly, if there is a network problem in a particular area, the entire region will have problems. How are those being correlated?
Runtime platforms: nowadays it is all about deploying services as containers or on platforms like Kubernetes, OpenShift or Pivotal Cloud Foundry. Then there are chat applications like Slack or Teams that are used to communicate, as I previously stated. And ticketing: when an incident happens or when a change has to happen, how are those tickets being created? Is it automated or manual? It is not always possible to automate, so what extent of automation is already available? And self-healing.
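The data center example above, where many unreachable hosts collapse into one incident, can be sketched in a few lines of Python. This is only an illustrative grouping rule, not how any particular alerting product works; the `Alert` fields, the `correlate` function and the `min_group_size` threshold are assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    host: str
    datacenter: str
    message: str


def correlate(alerts, min_group_size=3):
    """Collapse alerts that share a data center into one incident (naive rule)."""
    by_datacenter = defaultdict(list)
    for alert in alerts:
        by_datacenter[alert.datacenter].append(alert)

    incidents = []
    for datacenter, group in by_datacenter.items():
        if len(group) >= min_group_size:
            # Many alerts from one place: assume a shared underlying cause.
            incidents.append({
                "scope": datacenter,
                "summary": f"{len(group)} alerts in {datacenter}",
                "alerts": group,
            })
        else:
            # Too few alerts to assume a common cause: keep them separate.
            for alert in group:
                incidents.append({
                    "scope": alert.host,
                    "summary": alert.message,
                    "alerts": [alert],
                })
    return incidents
```

With 100 "unreachable" alerts from one data center and one unrelated disk alert elsewhere, this rule yields two incidents instead of 101. Real correlation usually also considers time windows and topology, which this sketch leaves out.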
In order to do auto-remediation or self-healing, there are many tools now available, and some of them need to be integrated with in-house monitoring or alerting tools. To what extent are they being used? Then CI/CD tools are required to take care of releases, source control, merges, builds, and things like storing artifacts. There are a lot of tools available: what tools are being used, and how effectively are they being used? Again, there are tools required from a change management perspective and from an infrastructure provisioning perspective.
There is backup and recovery: how often are backups happening, how effective are they, and how soon can a backup be restored when required? To what extent is it automated, and to what extent is it manual? Then patching: patches are always there, whether security patches, OS patches, or upgrades related to end of life or end of support. There is a lot of patching, updating and configuration required. To what extent is this also automated?
There are also use cases around natural language understanding, for example in chat applications. Can an SRE just type in a command, "please restart this XYZ service"? It can also be said in a different way: "please reboot XYZ service" or "please bounce XYZ service". The intent is the same: rebooting that particular XYZ service. Can the chat application actually understand that command through NLU?
Then fault injection, which is useful for chaos experiments. There are tools available to inject faults at a network level, at a VM level, or at a platform level like Kubernetes. It can be done at various levels depending on the type of infrastructure in an organization. Again, all these tools and platforms need not be one tool, because there is no one size fits all.
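To make the chat command example above concrete (restart, reboot and bounce all meaning the same thing), here is a deliberately simple Python sketch. It uses keyword matching as a stand-in for real NLU, which would normally rely on a trained model; the verb list, the filler words, the `restart_service` intent name and the `parse_command` function are all hypothetical.

```python
import re

# Different verbs that express the same intent (assumed list).
RESTART_VERBS = {"restart", "reboot", "bounce"}
# Words to skip when looking for the service name (assumed list).
FILLER_WORDS = {"please", "this", "the", "that", "service"}


def parse_command(text):
    """Map a free-form chat message to a restart intent, or None if no match."""
    words = re.findall(r"[a-z0-9_-]+", text.lower())
    for index, word in enumerate(words):
        if word in RESTART_VERBS:
            # The first non-filler word after the verb is taken as the target.
            for candidate in words[index + 1:]:
                if candidate not in FILLER_WORDS:
                    return {"intent": "restart_service", "service": candidate}
    return None
```

Under these assumptions, "Please restart this XYZ service" and "please bounce XYZ service" both resolve to the same intent for service `xyz`, while a message with no restart verb returns `None`.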
Depending on the type of infrastructure, services and businesses, different sets of tools can be used.
Policies and procedures: SRE has a heavy focus, as I said, on incident management, change management and error budget policies, such as what happens if the error budget is exhausted. Similarly, there is the SRE onboarding procedure: how does a service actually get onboarded to SRE? What is the procedure around that? That is about the visibility of the practices, tools and policies.
Next we will look at metrics. After all this, what is the value out of the SRE transformation? The first thing to look at is how much toil got eliminated. By eliminating toil we would have saved manual effort and improved efficiency. Efficiency cannot directly derive a dollar value, but at least the manual effort saved can derive some blue or green dollar value. Then there is the reduction in MTTA, the mean time to acknowledge: how soon is an incident getting acknowledged before SRE and after SRE? The faster the acknowledgement, the faster the recovery. It all adds up from each stage: how soon something is detected, how soon an incident is acknowledged, and how soon we are able to get to an insight into the problem. The time taken to insight is helped by having the right level of observability: to triage, we need sufficient data to find what exactly the problem is. Finally there is the recovery: how soon are we able to make a fix and deploy, or take any other recovery action? It is not always a fix and deploy. It might be a restart or a rerun of something, or sometimes a complete rollback.
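The stage-by-stage timings described above can be computed from incident timestamps. The following Python sketch is an assumption-laden illustration rather than a standard definition: the `Incident` fields are hypothetical, and both means are measured from detection time, which is only one of several conventions in use.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    detected_at: datetime
    acknowledged_at: datetime
    recovered_at: datetime


def mean_time_to_acknowledge(incidents) -> timedelta:
    """Average delay between detection and acknowledgement."""
    seconds = [(i.acknowledged_at - i.detected_at).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


def mean_time_to_recover(incidents) -> timedelta:
    """Average delay between detection and recovery (fix, restart or rollback)."""
    seconds = [(i.recovered_at - i.detected_at).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))
```

Comparing these values for incidents before and after the SRE transformation gives the kind of reduction in MTTA and recovery time that the talk suggests tracking.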
How soon is that recovery actually happening? Then mean time between failures: when we know a failure has happened, how soon does the failure happen again, and what are we doing to fix known failures? Then reduction in postmortem action items: with proper blameless postmortems in place, postmortem action items are resolved faster. And how soon are SLOs getting breached? We may have the best architecture or the best kind of services, but the SLOs are still getting breached. So what exactly is the problem? Where is it going wrong? We need to look at that and fix it. And how soon are error budgets getting exhausted? It is the same thing. So that is about metrics.
Then there are benefits as well. With proper capacity planning, we have better utilized and better planned infrastructure. We have improved tech staff experience, be it developers or SREs, through toil elimination, effectively handling incidents and avoiding repeated incidents, so productivity goes up.
And business launches: when I was part of product development and business analysis, I was part of a number of business launches around new markets, new products or new verticals; sometimes it was not even a business launch but the launch of a new regulatory report. There is so much nervousness on the final day: will it all work as expected? If we have SRE concepts in place and everything is built with a shift-left mindset, where we are confident that what we have built is reliable enough, the experience at business launch improves.
Nowadays there are sites that show downtime messages, and there are messages about poor experience posted on social media. So the reputation improves when these kinds of issues are reduced.
And accountability.
When it comes to accountability, how do you actually structure an SRE team? Do you have a central SRE team that takes care of everything required from an SRE perspective? If that becomes a bottleneck in a very large organization, there is an option to split SRE by function: an infrastructure SRE team that focuses on infrastructure, a data SRE team that focuses on the data side of things, and an SRE tools team that focuses on building in-house tools or bringing in vendor tools and integrating them. It is not only about bringing in tools, but also about integrating them in the right way and integrating them with internal systems. However many external things you bring in, there is always that internal factor that you need to consider and integrate.
Then there is the concept of embedded SRE, where there is a central SRE team that has SREs embedded into the product engineering teams. They work very closely with the product engineering teams with a shift-left mindset, where the reliability aspects are built upfront.
Then federated SRE: in large organizations, when it is difficult to maintain a central SRE team, or even an embedded SRE model, they can look at a federated model where each vertical or department has its own SRE team, building its own tools that suit that particular vertical or department. But the recommendation would be to maintain the same set of policies and standards that are set by the central SRE team. The tooling can vary based on technology, but the culture, principles, policies and procedures need to be standardized.
Now, roles and responsibilities: depending on the number of SRE teams and how they are split and structured, we need to make sure that nothing slips through between these teams and nothing is left without a proper owner.
It is important to look at the R&R (roles and responsibilities) for various things. During SRE transformation, when there are existing services, who is the decision maker who decides which services or applications should onboard to SRE first? How is the actual onboarding going to be done, and who is responsible for that?
Then communication about new launches: how does SRE actually know that there is a new vertical coming in or that there are new business launches? It is not always related to a code release. I have seen a number of cases where new business launches, new products or new flows are not tied to a release and are simply tied to a flag in the code. A user can turn that flag on from a UI, or some flags are enabled behind the scenes through a configuration change, and everything starts flowing through. So it is important to make sure that this communication is sent through properly.
And conflict resolution: in larger organizations, there is a possibility of priority conflicts or other conflicts between SRE and any other team, or between SRE teams themselves. So it is important to identify who will be the final authority to help resolve these conflicts. That is about Arctic: its visibility and accountability.
No framework can stand on its own; it needs to be combined with other concepts and frameworks for successful results. So what frameworks are useful for SREs? The first is agile. Why are agile frameworks important for SREs? As we saw, there can be an SRE tools team. For such a team it is product-development kind of work, so they can look at adopting Scrum for their development of tools. SREs with both interrupt work and engineering work can look at the Kanban model, where they have a Kanban queue from which they are clearing their tasks.
There is Extreme Programming as well, and for rapid prototyping they can also use the rapid application development model. Which of these frameworks is useful depends on how the SRE teams are structured and what framework suits which type of SRE team.
We talked about SRE helping to improve tech staff experience. How do we actually measure that? Recently a framework called SPACE has been introduced by Microsoft Velocity Lab, and it can be used for measuring this. Go check that out if you are interested.
And concepts of product engineering: SRE uses a lot of product engineering concepts like architecture, high availability, microservices, micro frontends, blue-green deployments and canary deployments; the list goes on. SREs work so closely with the product engineering teams that they also understand the product engineering concepts and can help guide the engineering teams where required, if something is not being followed.
And design thinking: if we are looking to introduce a new practice or a new tool, build something new, or bring a very creative, innovative feature, practice or way of doing things into the organization, design thinking can be helpful. With design thinking, when it comes to bringing in something new, we look at its business viability, its technical feasibility and its human desirability, that is, how much adoption there will be after it is brought in. Then it is about empathizing with users when something needs to be built, and then ideating, prototyping and iterating. And there is also this thing called sprint zero or design sprints.
If SREs are looking at building some nice management dashboards, especially for the visibility aspect or the metrics aspect that I mentioned, they can do a sprint zero or design sprint, where they build initial prototypes before even getting into development.
And the chicken-and-egg problem. SREs do talk about building an incident knowledge base. The chicken-and-egg problem is this: do producers build something first and then bring in the consumers, or do we get the consumers even before the producers have completely built what they are trying to build? This problem can be solved in different ways. For example, if there is an incident knowledge base but it does not have sufficient content, the adoption will not be there. So build the right level of knowledge base before spreading the word further. Similarly, it can be about common frameworks and tools that are built to be consumed by other product engineering teams: if we go for adoption before they are fully built, adoption will not happen. So it is a tricky situation where you need to balance at what stage you bring in the users to adopt it.
Then personalities and skills. There are various personality types required for a successful SRE transformation. It needs evangelists who can go in and talk about SRE and explain why SRE, or even a specific practice within it, will benefit the organization or the product engineering teams. Then there are strategists who can make plans around how to do this. And there are specialists, technical or otherwise, who can help in the individual aspects.
So there are personalities and skills. Skills-wise, SRE, as I said, is a pretty broad role that includes knowledge from engineering and operations. By definition, SRE is what happens to an operations team when it is run with a software engineering mindset. So there is a wide range of skills required, right from understanding different types of architectures, infrastructure, testing, CI/CD tools, and blue-green and canary deployments, to chaos engineering, performance testing, monitoring, observability, auto-remediation, capacity planning and some amount of machine learning. The more the SRE knows, the more value the SRE can add to an organization. It is not always possible to find someone who knows everything, so it can be a balancing act where a few SREs focus on one area each. It depends on how the organization would like to structure things. Then there can be cross-training, and they can always upskill. SRE is always about watching out for what is new in the market and then getting the organization to that level.
And what are the things to avoid? One is bandwagon bias: use the right tools and the right platforms for the purpose that we are looking at. There is no need to do something just because someone else is doing it. Then, no over-engineering. SREs themselves accept that failures are normal; we measure failures and keep them under control at the level that is required. 100% reliability is the wrong target, and that is one of the principles of SRE. So have the right set of SLOs defined and agreed by users, and engineer the service to the level needed to meet or cross them. Then, coexistence of traditional and SRE policies: the organization might already be using certain policies. When it has migrated over to SRE, don't keep them together. Once it's a transformation, it's a transformation. So yeah, those are the things for my talk.
For any further questions, please feel free to reach out to me on Discord. Thank you.

Vishnu Vardhan Chikoti

Senior SRE Manager @ Fanatics
