Conf42 Site Reliability Engineering 2023 - Online

CICD - The SRE-DevOps Overlay


Abstract

The boundaries between site reliability engineering and DevOps practices merge and are often intertwined in most organizations. We are here to highlight one of the most critical aspects of this overlap, continuous integration and continuous delivery, and how it is a prerequisite for many core SRE practices.

Summary

  • Our talk is about CI/CD and how a number of SRE and DevOps practices revolve around it. My book on architecting cloud native serverless solutions is coming out this June. If you are interested, please watch out for the release or connect with me.
  • There are certain principles that all SRE organizations draw from our foundational book, the Google SRE book. Eliminating toil is an important pillar of SRE. Breaking complex distributed systems down into simplified services is key to managing a better infrastructure and bringing better reliability.
  • DevOps bridges the gap between traditional dev and Ops teams. It builds tools and platforms for continuous improvements. This is not an exhaustive list, but rather the most common and the most prioritized practices.
  • "class SRE implements DevOps". There are many key areas where SRE and DevOps align. Both accept change as the medium for business and organizational progress. Both use data for decision making, along with observability tools.
  • Change is what brings business value. While all changes are well intended, they don't always bring value. Sometimes code and config changes can introduce bugs or even cause production outages. Change management allows SREs and developers to assess the risk of change.
  • CI/CD is the vehicle of all changes in your software infrastructure. It involves verifying and testing the code, ensuring compliance, building artifacts, and deploying the code into different environments. In an organization that has both SRE and DevOps teams, DevOps usually drives functional testing while SRE concentrates on non-functional testing.
  • Infrastructure as code allows you to declare your infrastructure components as code and then have an automated system apply those changes to your infrastructure. A large number of DevOps and SRE practices can be implemented via CI/CD. In the second part of this talk Garima discusses the more advanced concepts and takes a futuristic look into CI/CD.
  • The CDF is a vendor-neutral body that hosts a number of popular open source projects in the CI/CD space. Its mission is to bring together vendors, developers and end users to advance the standardization and best practices of CI/CD. Safeer and Garima are community ambassadors of the CDF.
  • Garima Bajpai: I'm here to talk about continuous integration and delivery, the SRE DevOps overlay. Garima is the founder of the DevOps Community of Practice in Canada. One of Garima's core assignments for this year is publishing a book on strategizing continuous delivery in the cloud.
  • There are challenges and opportunities for both sides, DevOps and site reliability engineering. People are falling behind in many aspects of learning due to a lack of self-development, which adds to a major skill shortage.
  • Site reliability engineering and DevOps would be pivoting to an optimal operating model. To fully realize the potential of DevOps at scale, the integration of SRE practices is essential. Balancing investment in tools and upskilling for reliability vis-à-vis rapid innovation is needed.
  • Continuous delivery and associated practices can provide that common core for the SRE and DevOps overlay. Gartner believes that in 2025, 60% of the organizations procuring mission-critical software solutions will mandate SBOM disclosure. Functional advancements include supercloud, providing interoperability specifically for continuous delivery capability.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, thank you for joining us. Our talk is about CI/CD and how a number of SRE and DevOps practices revolve around it. Let me start with a quick introduction. So this is pretty much who I am. For most of my career I have been in SRE and DevOps. Just a small and shameless promotion: my book on architecting cloud native serverless solutions is coming out this June. If you are interested, please watch out for the release or connect with me. Now let us move on to the talk. SRE principles: I won't go into the definitions of SRE or DevOps. We are in an SRE conference and I'm pretty sure all of you are aware of them. I'm also sure that we all acknowledge the fact that SREs and SRE organizations come in all shapes and forms, and what they do and how they do things are defined to a large extent by their organization and its culture. But there are certain principles that all SRE organizations draw from our foundational book, the Google SRE book. I will do a quick recap of those in this slide. Embracing risk: SREs estimate the cost of reliability, assess and manage the risk involved in improving the reliability, and use the error budget smartly. SLOs, or service level objectives: an SLO is a direct measure of the customer's experience and hence it translates to the reliability of your service. Eliminating toil: eliminating toil is an important pillar of SRE. Toil is the repetitive and manual work required to keep your production services up and running, and it can be eliminated using automation. Monitoring: well, nowadays we call it observability, and it is the key to measuring all the critical vitals of your systems. Release engineering: standardizing the build and release of software into production. Automation: automating the repetitive work to improve developer velocity and productivity, including building platforms. Simplicity: we all work with complex distributed systems.
Breaking them down into simplified services is key to managing a better infrastructure and bringing better reliability. Now, there is a lot more to the SRE function than these seven points, but these are the most important and the most often prioritized ones. DevOps practices: in this slide we are going to look into the DevOps practices. But unlike with the SRE book, there is no one standard list of recommended DevOps practices. So I have condensed some of the commonly found practices into this list. Communication and collaboration: by definition, DevOps bridges the gap between traditional dev and ops teams. This is achieved through collaboration at all stages of the software development lifecycle. Agile methodologies like Scrum and Kanban are very critical in this phase. Continuous improvements: continuous improvements involve gradually rolling out small changes so that the development teams can iterate on their products and services fast. This involves tools and practices like continuous integration, test-driven development, continuous delivery, et cetera. Monitoring: we see a recurring theme here. We need observability to assess whether the steps we take for continuous improvements are fruitful, we also need it for our production systems, and it gives us a continuous feedback loop. Automation: similar to the SRE principle, DevOps also builds tools and platforms for continuous improvements. Remember, just like the SRE principles, this is not an exhaustive list, but rather the most common and the most prioritized practices. And some of these points could even be broken down further into their own lists. Now that we have looked into both SRE and DevOps principles, let us take a quick look into how they relate to each other. "class SRE implements DevOps": when we discuss the relationship between SRE and DevOps, this is a statement that comes up very often.
The idea this statement projects is that DevOps is a set of high-level principles that should guide the SDLC and the agile practices, and SRE implements many of these principles by adapting them to distributed production services. There are many key areas where SRE and DevOps align. For example, both value collaboration and communication, and they use them to build teams and set organizational culture, and both operate on a shared ownership model along with the developers. Both accept change as the medium for business and organizational progress, and understand and accept the risk that comes with changes. So this is a key principle: changes are necessary, and changes can bring failures. Change management and software release with the right tooling and controls, using CI/CD, is a critical piece of the entire software development lifecycle and hence a part of both SRE and DevOps. Automation is key for toil reduction and developer productivity. Building developer tools as well as platforms for managing production services is part of the automation initiative. This is where platform engineering also comes into the picture. SRE and DevOps use data for decision making, along with observability tools. SRE focuses mostly on SLOs while DevOps focuses on DORA metrics. This is just the primary focal point; observability covers a lot more than this and is required for running production services. There are more areas where SRE and DevOps practices align, but this should give you a good idea. Now if you observe these relations, and DevOps and SRE in general, you will see that there is a recurring theme emerging, and that is change. These changes could be code, configuration or infrastructure. Now why is change important? Let's take a quick look. Change is what brings business value. Any new features, any stability improvements, and any other sort of changes all work towards this one goal.
Ultimately, if a change doesn't bring value directly or indirectly to the business, it is not a change worth pursuing. Now, does this mean that change is always positive? Let us see. Changes can lead to production outages, bugs, SLA breaches and a lot more. While all changes are well intended, they don't always bring value. Sometimes code and config changes can introduce bugs or even cause production outages. Now, the outages, or incidents as we call them in the SRE world, can directly impact customers. Sometimes they can also set back the engineering team by a few hours or even days while they are busy fixing those incidents. SRE builds incident management practices to deal with incidents effectively. But in a software system, as long as there are changes, the chances of them causing incidents will also remain strong. So if we need to avoid service disruptions and customer impact as much as possible, we need to take one step back, to the stage before the code hits production. Let us see how that is done. We roll out changes that have a positive impact on our products or services. But once we decide that these are changes we need to push through, we need to go into the discipline of change management. Change management allows SREs and developers to assess the risk of a change, evaluate the acceptable risk, and roll it out to production. Tracking of changes, along with their evaluation and acceptance, is part of the necessary bookkeeping in change management. This is usually achieved using a software change management (SCM) repository service like Git, along with pull requests and peer reviews. And I'm sure all of you are familiar with this workflow. The SCM process stops at this point. From here, SREs and DevOps have to take it further with release management. This involves testing, building and deployment. This takes us to the central point in our talk: CI/CD.
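Editorial note on "acceptable risk": one way to make it concrete is the error budget mentioned in the SRE recap earlier in the talk. The following is a minimal Python sketch, not something shown by the speakers. The function names and the per-risk thresholds are hypothetical assumptions; a real release policy would be agreed per service.

```python
# Minimal sketch: gate a change on the remaining error budget.
# All names and thresholds here are illustrative assumptions.

def error_budget_remaining(slo_target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


# Hypothetical policy: riskier changes need more budget left.
MIN_BUDGET_BY_RISK = {"low": 0.0, "medium": 0.25, "high": 0.5}


def can_release(remaining_budget: float, risk: str) -> bool:
    """Allow the rollout only if enough error budget is left for this risk level."""
    return remaining_budget > MIN_BUDGET_BY_RISK[risk]


if __name__ == "__main__":
    remaining = error_budget_remaining(0.999, 1_000_000, 400)
    print(f"budget left: {remaining:.0%}, high-risk change allowed: "
          f"{can_release(remaining, 'high')}")
```

With a 99.9% SLO over one million requests, the budget is roughly 1000 failed requests, so 400 failures leave about 60% of the budget unspent.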
So the central theme of everything we discussed so far is change. Everything we covered about incident management, change management, and SRE and DevOps principles revolves around changes. But how do we manage and roll out changes effectively? The answer is obvious, and that is where CI/CD comes into the picture. It is the vehicle of all changes in your software infrastructure. SCM is the foundation of CI/CD, but most of the tooling for SCM is built around your issue tracker and code repositories. CI/CD will integrate with these tools, but it involves a lot more than change management. At a high level, it involves verifying and testing the code, ensuring compliance, building artifacts, and deploying the code into different environments. And finally, that code will make its way into production if it is production worthy. In the title of our presentation, we called CI/CD the SRE DevOps overlay. What do we imply with that? It is the idea that a large spectrum of SRE and DevOps responsibilities are influenced by, or are even a byproduct of, changes, and the change cycle is managed through CI/CD pipelines. Now let us do a dissection of various aspects of CI/CD and how they relate to DevOps and SRE. Before we move into those important points, recall that both SRE and DevOps advise rolling out small incremental changes to avoid impact and manage changes effectively. The only way to achieve this is through a fully automated CI/CD pipeline. Testing: this is an obvious one, but we never do this sufficiently. Functional testing improves our confidence in the functionality of our application. Non-functional testing, on the other hand, improves our confidence in the application's stability, scalability and security.
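Editorial note: the high-level stages named above (test, compliance, build, deploy) can be pictured as an ordered list of gates where a change stops at the first failure. This is a toy Python sketch under stated assumptions, not the design of any real CI/CD product. The stage functions, the `ALLOWED_LICENSES` rule and the fields of the `change` dictionary are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical compliance rule used only for this illustration.
ALLOWED_LICENSES = {"Apache-2.0", "MIT"}


def run_tests(change: Dict) -> bool:
    """Stand-in for unit and integration tests."""
    return bool(change.get("tests_pass", False))


def check_compliance(change: Dict) -> bool:
    """Stand-in for a policy or license check."""
    return change.get("license") in ALLOWED_LICENSES


def build_artifact(change: Dict) -> bool:
    """Record the name of the artifact a real build step would produce."""
    change["artifact"] = f"{change['name']}-{change['version']}.tar.gz"
    return True


def deploy(change: Dict) -> bool:
    """Record the environments a real deploy step would roll out to."""
    change["deployed_to"] = ["staging", "production"]
    return True


STAGES: List[Tuple[str, Callable[[Dict], bool]]] = [
    ("test", run_tests),
    ("compliance", check_compliance),
    ("build", build_artifact),
    ("deploy", deploy),
]


def run_pipeline(change: Dict,
                 stages: List[Tuple[str, Callable[[Dict], bool]]] = STAGES
                 ) -> Tuple[List[str], Optional[str]]:
    """Run stages in order; return (completed stage names, failed stage or None)."""
    completed: List[str] = []
    for name, stage in stages:
        if not stage(change):
            return completed, name  # stop at the first failing gate
        completed.append(name)
    return completed, None


if __name__ == "__main__":
    change = {"name": "svc", "version": "1.2.0",
              "tests_pass": True, "license": "MIT"}
    print(run_pipeline(change))
```

Because each stage is just an entry in a list, adding a non-functional test stage after deployment is a matter of appending another gate.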
Most of the functional tests, like unit and integration tests, are automated as part of the continuous integration phase, whereas most non-functional tests are automated at various stages of deployment to various environments, including production, staging, UAT and whatever else you might have. Some non-functional tests, like performance testing, are also done after production deployment, and this also generates a lot of reports on code quality, test coverage and so on. While DevOps drives the adoption of most of the functional tests, SRE concentrates on the non-functional part of the equation. This is not a set boundary, but in an organization that has both SRE and DevOps teams, this is how it usually evolves. Now, there is no single standard for application configuration. You could use any of these listed methods or combine many of them. But remember, one of the key SRE principles is simplicity in design, configuration, business logic, et cetera. Whatever you choose, be consistent with it and make sure that any changes to your configuration values can be versioned and audited. Infrastructure and configuration: there are different ways in which you can provision your infrastructure and configure the application. In traditional IT infrastructure, configuration management was the standard way to provision and configure your servers and applications. If you have workloads running on bare metal or VMs, make sure to use a configuration management tool and commit your recipes to version control. These recipes could be Ansible playbooks, or they could be Salt states in the SaltStack ecosystem, but you get the idea. Now, the era of cloud management and Kubernetes has brought in newer ways to provision and configure infrastructure. Infrastructure as code allows you to declare your infrastructure components as code and then have an automated system apply those changes to your infrastructure.
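Editorial note: the "declare, then let an automated system apply" idea can be sketched as a diff between a declared state and the actual state. This Python sketch only illustrates the concept; it is not how Terraform, CloudFormation or any other tool works internally, and the resource names and the `plan` and `apply` functions are hypothetical.

```python
from typing import Dict, List, Tuple

State = Dict[str, dict]  # resource name -> its declared or observed settings


def plan(desired: State, actual: State) -> List[Tuple[str, str]]:
    """Compute the actions needed to move `actual` to `desired`."""
    actions: List[Tuple[str, str]] = []
    for name in sorted(desired):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != desired[name]:
            actions.append(("update", name))
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name))  # present but no longer declared
    return actions


def apply(desired: State, actual: State) -> State:
    """Return the new actual state after carrying out the plan."""
    result = dict(actual)
    for action, name in plan(desired, actual):
        if action == "delete":
            del result[name]
        else:  # create or update
            result[name] = desired[name]
    return result


if __name__ == "__main__":
    # In practice the desired state would come from files versioned in Git.
    desired = {"web": {"replicas": 3}, "db": {"size": "large"}}
    actual = {"web": {"replicas": 2}, "cache": {"size": "small"}}
    print(plan(desired, actual))
    print(apply(desired, actual))
```

Running `plan` again after `apply` yields an empty list, which is the property an agent can rely on when it repeatedly reconciles drift against the declared state.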
Network as code is a subset of infrastructure as code and applies the same principles to network device configurations. There are generic tools like Terraform as well as vendor-specific tools like AWS CloudFormation that are used for infrastructure as code management. But irrespective of the technologies used, this enables SRE and DevOps to treat their infrastructure resources as code and manage their lifecycle. GitOps and CI/CD: the evolution of IaC, or infrastructure as code, gave birth to another idea, GitOps. It is a new philosophy for managing systems and resources. This philosophy can be broken into the following elements. Desired state: the declaration of the desired state defines the end state of your infrastructure and its resources. Irrespective of the tech used to declare these resources, the changes should be versioned and immutable, which naturally leads us to storing the declarations in Git. There will be system-specific software agents that pull any changes automatically and apply them to the destination system. These same agents will also watch the state of the system in real time and reconcile any drift to the desired state. GitOps can be implemented in a number of ways. As long as you have a Git-based version control system that allows either pull or push based notification, that should be enough for agents to discover the changes and apply them. But what is the most natural way to do this? Of course it is CI/CD. If you commit the changes to the desired state into Git, the changes can be picked up by a CI/CD system, which treats them like any other application code and proceeds to apply them. This is quite convenient and uses all the tools and workflows that SRE uses in application deployment. GitOps and the X-as-code revolution: the evolution of GitOps and infrastructure as code brought out more practices that can be declared as code and implemented.
Policy as code allows you to enforce organizational and security policies on your code, configuration and resources. SLO as code: defining and tracking SLOs for a large number of services is very tricky. Codifying them as SLO as code makes it much easier. Dashboard as code is something that has existed for a while. You might have seen this with Grafana and similar tools. Similarly, you could configure your observability sidecars using code. Even the CI/CD pipelines themselves can be defined as code, and this will help you when you onboard new projects. There is more to this list, but this is how you bring SRE and DevOps together through CI/CD. So, to summarize what we discussed so far: a large number of DevOps and SRE practices can be implemented via CI/CD. This helps in standardizing the best practices and improves the reliability posture of your services. Make sure that your CI/CD evolves to accommodate these practices as your SRE and engineering maturity grows. There are a lot of advanced SRE and DevOps practices with CI/CD, and in the second part of this talk Garima will discuss the more advanced concepts and take a futuristic look into CI/CD. But before we go into that, I would like to take a minute to talk about our mutual association with the CDF, the Continuous Delivery Foundation. The CDF is a foundation under the Linux Foundation umbrella, similar to the CNCF. It is a vendor-neutral body that hosts a number of popular open source projects in the CI/CD space. Its mission is to bring together vendors, developers and end users to advance the standardization and best practices of CI/CD. This is the official definition of what the CDF is, but the next slide might give you a better idea. These are the projects that are currently managed by the CDF. Jenkins, Jenkins X, Screwdriver, Spinnaker and Tekton power CI and CD pipelines. Pyrsia is a decentralized package network based on blockchain.
Ortelius is a microservice catalog with supply chain intelligence and domain-driven design support. Shipwright is a framework for building container images, and CDEvents is a common specification for continuous delivery events. I'm sure this gives you a better idea of what the CDF is. Garima and I are community ambassadors of the CDF and work towards better community adoption and standardization of our projects as well as general CI/CD best practices. Over to you, Garima. Thank you everyone. Hello everyone, I'm Garima Bajpai. I'm here to talk about continuous integration and delivery, the SRE DevOps overlay. This is a joint topic which Safeer and I have taken up together. You must have a good view of what the topic is all about through Safeer. Now I would like to talk about it from a high-level, futuristic perspective: why should you care about this topic? Before we get started on the conversation I'm going to have with you today, I would like to introduce myself. I'm Garima Bajpai, based out of Ottawa, Canada. I am the founder of the DevOps Community of Practice here in Canada, which has several chapters. It has around 1500 members at various locations. If you have not checked out this community of practice, I do recommend doing that. I'm also the chair of the Continuous Delivery Foundation Ambassador Group, a group of practitioners, primarily in the continuous delivery space, which is fostering change, evolution and a future perspective of continuous delivery when it comes to open source technology and tools. I'm also a course creator and content provider with various affiliated organizations. If you want to check out my work, you can go to the DevOps Institute as well as the Continuous Delivery Foundation for references, content and courses created primarily on DevOps and SRE.
I was also nominated in the DevOps Dozen 2022 Community Awards for top DevOps evangelist, and one of my core assignments for this year is publishing a book on strategizing continuous delivery in the cloud with Packt. It is coming out in July. If you haven't checked it out, I would recommend doing that as well. It is available on Amazon for pre-ordering. So now let me get started with site reliability engineering and DevOps. It is amazing and hard to believe that we only started a decade ago with all this, right? With most of the practices and concepts in the continuous delivery space, there is exponential growth of tools. There is also increasing complexity. It is inevitable that the complexity increases with rapid adoption, and it is bringing new operational challenges, risks, overhead and cognitive load on the practitioners. Moreover, cost: with more and more services and applications moving to the cloud, financial practices and engineering practices are getting highly integrated. So if you have not heard of FinOps or cost optimization drives in big and small organizations, I think it's high time to check out that movement as well. And lastly, I would say skill shortage: there are many aspects of learning where people are falling behind due to a lack of self-development, which adds to a major skill shortage. So there are challenges and opportunities for both sides, DevOps and site reliability engineering. But before we move forward, we would also like to understand how we reached this stage. So if you think about how these two movements got started: from a DevOps perspective, the DevOps practices were primarily developer centric. It was a push from developer productivity, and the evangelists started looking at how we shorten the lead time for our software delivery or incremental deliveries. And that's where a lot of practices got introduced and adopted.
Whereas there was a specific set of industry practices and evangelists looking at how you bring an enhanced reliability posture when you talk about decentralizing, or when you talk about bringing DevOps practices to the table with flow, feedback and experimentation. So there was a constant push on how you bring customer experience, reliability posture and increasing stability along with the increase in developer productivity in the same context. So we see that these two mutually reinforcing practices have been in the making for several years now, and we are at a stage where we would like to discuss how site reliability engineering and DevOps would be pivoting to an optimal operating model. To fully realize the potential of, let's say, DevOps at scale, the integration of SRE practices is essential. Everybody agrees on that today. Balancing investment in tools and upskilling for reliability vis-à-vis rapid innovation is needed to bring that optimal operating model in place. And could continuous delivery be that common core, the common trigger, the binding force behind that optimal operating model? Let's explore more. Before we go further into this conversation, it is also important that we talk about the law of diminishing returns. Why do we do so in this talk? Because if you think about DevOps practices or SRE practices, there is a certain set of outputs which is envisioned, which is expected, or which is aligned to the business goals. And there is a substantial amount of time and effort needed as an input to steer that right. But the more we move towards incremental deliveries or enforcing these practices, the more we will realize that there is a point of maximum yield: however much input we provide, with time, with people, with efforts, with practices, with tools, we have reached a maximum yield point.
And from there our productivity would drop to negative returns because of the complexity which is getting introduced, the number of tools which we have adopted, and also due to other factors which we have talked about, like cost and the skill gap. So we need to understand that there is an optimal operating model which needs to be put in place for individuals, communities and systems to be sustainable. What can that optimal operating model be? And how do I know that my organization is ready for the SRE DevOps overlay which we are talking about here, which can bring that optimal operating model into existence? There are a few things which we can do as individuals, as teams, as organizations, and these are questions which we will probably have to ask ourselves. First of all, the main business objective: why are we deploying, and what is our mission and vision? What kind of applications are we deploying? Are they monolithic or microservices? Where are we deploying these applications? What are the core objectives for the organization? How many cloud providers are involved? How could we keep them converged? And lastly, how often do I want to deploy, and why? If we start answering these questions, we will come to a point where we are able to assess our state of the nation from an organization, individual or community perspective: whether we are ready for that DevOps SRE overlay and can start talking about the common core, which is continuous delivery and associated practices. When we talk about that common core, essentially we are talking about four principles. The first is declarative: the entire system has to be described declaratively. The second principle is versioned and immutable: the canonical desired state is versioned, and it does not matter which tool is used. The third principle is pulled automatically.
So think about how much operational overload we have in the system; if approved changes are pulled automatically and applied to the system, that would be one of the principles which can help you get to that optimal operating model. And the fourth principle is continuously reconciled: software agents ensure correctness and alert on divergence. So think about whether these four principles make sense for your delivery. Let's continue this dialogue. We have reached a point where we have built some consensus on the common core. Continuous delivery and associated practices like continuous integration and deployment can provide that common core for the SRE and DevOps overlay. Now, we also have, I would say, a challenge in terms of how we measure this progress. If you think about it from a measurement perspective, a lot has been done, and mostly we talk about industry practices or best practices around DORA. I would also like to highlight through my talk that if you are looking at that common core, the optimal operating model, it's time for you to go beyond DORA, and I will highlight some of the functional advancements which are associated with this. The first is the functional advancement of supercloud, sometimes referred to as cross-cloud, providing interoperability specifically for continuous delivery capability. So we'll have to assess that posture moving forward. Using an event-driven approach and introducing a high level of reusability, flexibility and full-stack interoperability for the complete software lifecycle is the second one. The third functional advancement I would emphasize is progressive delivery with machine learning capability, reducing the challenge of adopting progressive delivery. The fourth bullet item, which I call the silver bullet, is SBOM for the future: as the SBOM, the software bill of materials, continues to evolve, so do the framework for data exchange and the need for a standard format.
So these are the functional advancements which you can correlate with your measurement of success in getting to that optimal operating model. I would highlight the SBOM as one of the core, foundational capabilities which will help you manage the complexity and securability of modern software deployment. Gartner believes that in 2025, 60% of the organizations procuring mission-critical software solutions will mandate SBOM disclosure in their license and support agreements, up from less than 5% in 2022. It is essential to make it a measurable, tangible part of your delivery as you go along. There are operational advancements as well which we should look at for measurement, portability being one of them. There is a variety of languages, platforms and frameworks being used today. How do we make portability one of the measuring criteria for our software delivery components, to reduce the cognitive workload not only for developers but also for SRE practitioners? Next is flow-optimized and observable deliveries. Whenever we decentralize our delivery, going from monolithic to microservices, we have a distributed system to manage, and we need to have a flow-optimized and observable system. So have you introduced that component in your software delivery? That can also be one of the measurable aspects of the common core. I will also talk about resource-optimized and resilient delivery. That means an optimized infrastructure posture. It is not okay to add fixed cost to your products as you go along, so you will have to look at that optimization at the infrastructure layer. And lastly, I would say real-time and dynamic: how real-time and dynamic are your software delivery capabilities, so that they can support rapid scale-up and scale-down and address real-time requirements? When we talk about all these changes, we also talk about infinite bandwidth and zero latency in the demand and supply value chain system.
So we need to ensure that we have some supporting, measurable, general-purpose pipelines and services which can serve our fundamental needs and be reusable. We cannot forget small and medium business services, so think small before you think big. And we need high-performance community collaboration hubs, which are mostly needed for entrepreneurs, evangelists and practitioners to foster collaboration and create a possible futuristic approach to all this. I will not talk more about this, but state-of-the-art AI capabilities, going beyond cloud native to edge native, full-stack interoperability and enhanced reusability can be some of the key areas of focus for the next generation of practitioners. When we talk about the overlay of DevOps and SRE, we also think about the net zero commitment. One of the critical threads which ties everything back together is the net zero commitment, and SRE can lead the way there. What if we could create a marketplace of carbon-neutral products and services where we can cascade the impact? If we intend to do so, we might have to consider future evolutionary changes, and some of the next steps could be as follows: mapping the carbon footprint of your products and services and identifying hotspot features; guidelines for financing of ICT products and services; and certification and decarbonization of ICT services and products through tools and processes. So we have tried to bring the here-and-now aspects of the SRE DevOps overlay through Safeer's conversation, and also tried to show what is in store for the future from an optimal operating model perspective when we talk about the SRE DevOps overlay, and how the continuous delivery stack can create that pivot, that binding force, around all this for the future. If you liked this conversation, do follow me on LinkedIn or get in touch with me or connect through our social handles.
But for now I would say goodbye and thank you for listening to our talk.

Safeer CM

Senior Staff SRE @ Flipkart, CD Foundation


Garima Bajpai

Founder - Canada DevOps Community of Practice @ Crowdbyte Solutions



