Conf42 Site Reliability Engineering 2023 - Online

Chaos Engineering for SRE Enablement


Abstract

Chaos Engineering is the “new kid on the block” in SRE. This talk explains how we at HCL Cloud Native Labs are on a journey to enable Enterprise Delivery teams in Chaos Engineering. It is a new paradigm to enable for an organisation of this scale, and hence a path-breaking journey to embark on.

Summary

  • Chandra Dikshit is an SRE architect at HCL Cloud Native Labs, London. He walks through the labs' journey with chaos engineering, through which they have enabled SRE upskilling and SRE practice enablement across various delivery teams and clients.
  • Our SRE services fall under three categories: SRE enablement services, under which we do skill assessments of a client's teams; SRE consulting services, which we provide quite a bit to our clients; and maturity assessments. Since we started this program, we have enabled more than 1,000 SREs.
  • Chaos engineering has been a niche, up-and-coming practice. It started at Netflix and has since developed into a more methodical, more customised practice. We've been running 20-plus cohorts since then and have done around 100-plus client showcases.
  • An example of how a chaos workflow works in a cloud native environment: it can contain multiple experiments, and you can derive value and make decisions on the basis of the results. This is how chaos engineering can make your service more reliable.
  • There is now a rich selection of chaos engineering tools available. These are quite cloud native: all of them are compatible with almost all versions of Kubernetes, and enterprise versions are available for many of them. The chaos engineering practice has been a wonderful enabler in that journey.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. Welcome to this session on SRE enablement through Chaos Engineering at Conf42 SRE 2023. I am Chandra Dikshit, an SRE architect at HCL Cloud Native Labs, London. In this session I'll be walking you through our journey with chaos engineering, through which we have enabled SRE upskilling and SRE practice enablement across various delivery teams and clients here at HCL Tech.

So first of all, let me take you through what we do here at HCL Cloud Native Labs. We are the thought leaders for cloud native programs and strategies, along with cloud native engineering, here at HCL Tech. We are located in three locations: the US, the UK and India. We work across four key areas, as you can see on the screen: strategy and direction, art of the possible, adoption and enablement, and cultural transformation. You can see that the range of areas and practices we work across is the complete spectrum of cloud native engineering. We start from strategy, go on to building the state of mind, workforce upskilling and modernization, and on to showcasing what is new in the ecosystem and the industry from the hyperscalers, and how that can be adapted to our clients' ecosystems and their specific scenarios. We are a very skilled team of engineers, architects, strategists and technologists, and as I said, we work across the board on anything cloud native, and therefore we interact with a lot of clients. One area I am specifically inclined towards is the cultural transformation part: workforce modernization, the cloud native state of mind, and specifically within that, DevOps and SRE culture building. I'm part of a unit where we upskill our colleagues in SRE and DevOps practices, and a very key part of that has been chaos engineering. We'll expand on that in a bit.

So first, what SRE services do we provide out of Cloud Native Labs? Our SRE services fall under three categories. First, SRE enablement services, under which we do skill assessment of a client's team or business unit, and then enablement: we design programs and custom learning journeys and run certification programs, providing end-to-end management from the labs. Second, we do quite a bit of SRE consulting for our clients, wherein we go into their environments, assess their tool sets and the maturity at which they are running their operations and development areas, and how mature their processes are on an SRE maturity scale or index. We then work with them to design programs that enable SRE practices such as chaos engineering, SLIs/SLOs, observability enhancements, et cetera. And third, there are maturity assessment services, which are purely a consulting service: going in, looking at the current state, perhaps doing a third-party assessment of architecture, automation setups, SRE setups, et cetera, and then maybe doing some coaching as well, just to guide the client onto the right path. In doing so, we work with a lot of customers. Since we started, we have interacted with 100-plus customers in the SRE space from the lab, and since we started this program about two years ago, we have enabled more than 1,000 SREs, who are certified through the certification program that we run here at the labs.
So it's done at quite an extensive level, and it has now become the de facto standard of SRE enablement across our organisation. Coming to chaos engineering, then: what has our journey with chaos engineering been? Chaos engineering has been a niche, up-and-coming practice. It started quite a while back at Netflix, which around 2016-17 explained what they do with tools like Chaos Monkey, and since then it has developed into a more methodical, more customised practice.

We started with the practice around August 2021. We started exploring products, looking at the different concepts and aspects of the practice, the value it brings, and basically the art of the possible around chaos: how we can benefit and what value we can bring to our customers. We started from there, utilizing a few tools in a smaller capacity and doing basic experiments. By the end of 2021 we were running what can be called simple chaos experiments, like pod delete and node shutdown, but we had also started doing them in a more as-code kind of scenario: integrating them with our CI/CD pipelines, et cetera. So we were doing chaos engineering, but at the same time we were looking to automate their execution so that they could be attached to an already existing flow. Furthermore, late in 2021, around Christmas particularly, we started playing around with workflow scenarios where you can combine multiple experiments. The intention was to create something very close to what an actual fault in a production scenario can be, see how close you can get to that scenario, and basically test your services and their resiliency, and then work on them. Finally, around February-March 2022, we included this as a standard part of the offering that we demonstrate to our clients and include in our SRE cohorts and SRE enablement trainings. Since then we have been maturing this, implementing and exploring it on the hyperscalers with tools like Azure Chaos Studio and AWS Fault Injection Simulator. We've obviously also been exploring the cloud native tools, Chaos Mesh and Litmus Chaos, and standalone tools like Gremlin; now Harness has also come up. So we've been exploring these tools quite a bit, using them and building them into our client demos as well, and they have been received quite well. This is a new practice, so we had to explain it and build some material around it to convey its value to a prospective client or to colleagues, but it has been well received, and you can see from the numbers here: we've been running 20-plus cohorts since then and have done around 100-plus client showcases, so it has been a major part of that.

Then, how have we progressed? As I was explaining on the previous slide, we started with simple experiments. We wanted a taste of how you inject faults. We started in VM-based scenarios, something replicating Chaos Monkey, maybe taking down a certain part of a VM-based or data-centre-based stack. From there we went on to Kubernetes-based, cloud native stacks: taking down pods, blacking out particular services, dropping DNS, et cetera. So those were simple experiments, and they were being manually driven.
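To give a flavour of what a "simple experiment as code" of this kind can look like, below is a minimal sketch of a pod-delete experiment expressed as a Litmus Chaos custom resource. This is not taken from the talk: the names, namespaces, labels and durations are illustrative assumptions, and field names vary slightly between Litmus versions.

```yaml
# Minimal sketch of a pod-delete chaos experiment as code (Litmus Chaos 2.x-style CRD).
# All names and target labels are hypothetical.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: frontend-pod-delete          # hypothetical experiment name
  namespace: litmus
spec:
  engineState: active                # set to "stop" to halt a running experiment
  appinfo:
    appns: shop                      # namespace of the target application (assumed)
    applabel: app=frontend           # label selecting the pods to disrupt (assumed)
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION   # how long the experiment runs, in seconds
              value: "60"
            - name: CHAOS_INTERVAL         # delay between successive pod deletions
              value: "10"
            - name: FORCE                  # graceful (false) vs forced pod deletion
              value: "false"
```

Because this is just a YAML manifest, it can be versioned in Git and applied from a pipeline, which is what makes the "attach it to an existing flow" approach described above possible.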
From there we went into workflow-based experiments, which were more logical: for example, taking an application with five microservices and dropping their services one by one, or deleting pods one by one, and cyclically observing the impact and how well those microservices recover. We started thinking in terms of CI/CD integrations, Argo integrations and GitHub integrations, basically making these more automated workflows, something we can attach to the end of an existing delivery or deployment workflow. The idea, from a developer velocity point of view, is that every time a new version of a microservice or a Java-type service is deployed, the flows designed for chaos run after the service has been deployed and exercise its resiliency, and if that goes through, we simply go ahead with the deployment. The concepts we have taken care of while designing these experiments have been things like: design a hypothesis, select a blast radius, then test and observe, and based on the insights you improve your service and start all over again.

We have also expanded to the hyperscalers. The technical domains we have covered include the hyperscalers, Azure and AWS; Kubernetes-based environments in particular have been a big hit, since there are quite a few tool sets which cater to cloud native stacks and environments. We have also developed some solutions for chaos experimentation on private cloud or on-prem, because quite a few of the clients and environments we work with are on-prem or private cloud.

Next, I have an example of how a chaos workflow works in a cloud native environment, and how we demonstrate this kind of value and chaos engineering impact: how we explain or emphasize, particularly to SREs, that this is how chaos engineering can make your service more reliable. In this example, what we're showing is how we can run a chaos workflow which can contain multiple experiments, and then derive value and make decisions on the basis of that. This example was done with the following tool sets: the chaos tool set was Litmus Chaos, plus GitHub Actions, Argo Workflows (which actually ships as part of Litmus Chaos), and Grafana and Prometheus for observability. The application we used is the block shop microservices application. The workflow works like this: you design your experiment, or a complex workflow of experiments, like take down one service first, see the impact on other services, then take down another service, see the impact, and keep going like that. It can be written in the form of a YAML workflow file, and then the developer can simply check it in. Once it is checked in, a GitHub Action is triggered, and that GitHub Action submits the workflow to the Argo workflow server, which starts the Argo workflow. The Argo workflow is the one that executes this complex set of experiments. It does that by creating custom resources provided by Litmus: first the ChaosExperiments, which are definitions of the actual experiments you want to run.
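Before continuing with the Litmus custom resources, here is a rough sketch of what the GitHub Actions trigger in a flow like this could look like: on check-in of a chaos workflow file, the pipeline submits it to the cluster running the Argo workflow server. This is not the team's actual pipeline; the file paths, secret name and submission command are assumptions.

```yaml
# Hypothetical GitHub Actions workflow: submit a checked-in chaos workflow to Argo.
name: run-chaos-workflow
on:
  push:
    paths:
      - "chaos/workflows/**.yaml"        # assumed location of chaos workflow manifests
jobs:
  submit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Configure cluster access
        run: |
          # Write a kubeconfig from a repository secret (assumed to exist).
          mkdir -p ~/.kube
          echo "${{ secrets.KUBECONFIG }}" > ~/.kube/config
      - name: Submit chaos workflow
        run: |
          # Applying the Argo Workflow manifest starts the chaos run;
          # `argo submit` would work equally well if the Argo CLI is installed.
          kubectl apply -n litmus -f chaos/workflows/shop-resilience.yaml
```

The point of the sketch is simply that the chaos run is triggered by the same check-in mechanics as any other deployment artifact, so it can sit at the end of an existing delivery workflow.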
The workflow then creates the ChaosEngine, which is basically the running instance of the workflow, and finally it generates something called ChaosResults, which are basically the outcome that tells you how those experiments and workflows have fared. So Argo triggers the workflow and generates those CRDs, and once those CRDs are generated, they in turn create the Kubernetes-native objects like jobs and running pods, which then execute your experiments. Once that happens, it obviously impacts the application, and the impact can be captured through the golden signals, which can be observed on your observability dashboard, in this case Grafana. One thing we can also include, and we have done that, is abort-type scenarios: if your blast radius is getting too big, or if the impact on a golden signal like latency becomes too large and the service starts to go down, the CRDs can be aborted and the chaos engine can be stopped. Those kinds of safeguards can also be incorporated. And this is just one example with Litmus Chaos; the same kinds of things can be done with other tool sets as well, and I'll talk about those tool sets on the next slide.

So, talking about tool sets: there is a rich selection of chaos engineering tools now available, and that is a very good thing for practicing chaos engineering in current ecosystems, whether CNCF ecosystems or the hyperscalers, or even individual enterprise players like Gremlin and Harness. So it's a really good time to practice chaos engineering, I'd say. What we like about the tools available right now is that quite a few of them are open source, so you are free to experiment with them, and once you are happy with a product and it's the one you want to roll out into your environment or your client's environment, there are enterprise versions available for many of them. These tools are quite cloud native, so they are compatible with almost all versions of Kubernetes: lightweight Kubernetes, managed Kubernetes, as well as on-prem and cloud platforms, so the range is quite huge. There is detailed API support, so you can do these things programmatically, like I was explaining: through workflows and YAML files checked into your GitHub repositories that trigger your CI/CD pipelines, and, in Kubernetes terms, through custom resource support so that you can control these through operators, et cetera. There is support for best practices like GitOps through Argo, observability through Prometheus, and GitHub Actions and other CI/CD tools, and they are SRE compatible as well. Not least, there is great documentation, which is quite key: many times you see open source products without such good documentation, but these tools, I think, are quite good with that, which especially helps when you are exploring and looking to adapt them to your particular scenarios and use cases. And finally, the enterprise versions of many of these tools are now available, so once you come to a stage where you want to take these practices to production, or you want to adopt them, introduce them and roll them out to your organisation, you don't really have to worry about being on open source and not being able to find specific support: there are enterprise versions available.
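To make the "abort when the blast radius gets too big" idea a little more concrete, below is a rough sketch of how such a guardrail can be expressed as a Litmus probe on an experiment: a Prometheus query watches a golden signal (p99 latency is assumed here) and, if the check fails, the chaos run is stopped rather than allowed to continue. The endpoint, query, thresholds and probe schema details are assumptions and vary between Litmus versions.

```yaml
# Sketch of a guardrail probe inside a ChaosEngine's experiment spec (schema varies by Litmus version).
experiments:
  - name: pod-delete
    spec:
      probe:
        - name: p99-latency-guardrail            # hypothetical probe name
          type: promProbe
          mode: Continuous                       # evaluated repeatedly while chaos runs
          promProbe/inputs:
            endpoint: http://prometheus.monitoring.svc:9090   # assumed Prometheus URL
            query: 'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="frontend"}[1m])) by (le))'
            comparator:
              criteria: "<"                      # probe passes only while the value stays below...
              value: "0.5"                       # ...an assumed 500 ms latency budget
          runProperties:
            probeTimeout: 5
            interval: 10
            retry: 1
            stopOnFailure: true                  # abort the chaos run instead of letting it continue
```

The probe outcome also flows into the ChaosResult, so the same mechanism that protects the service during the run also records whether the resiliency hypothesis held.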
The hyperscalers are also coming up with their own tool sets now, which is another thing, because a tool set like Azure Chaos Studio is quite native to Azure services: it is built to be native, so if you want to run experiments in your Azure environments, Azure Chaos Studio works really well. So basically what we wanted to showcase through this presentation is what our journey with chaos engineering has been, and particularly how it has helped us roll out chaos engineering and enhance the understanding of our SREs and our clients on their journey of SRE adoption and chaos engineering tooling. The chaos engineering practice has been a wonderful enabler in that journey. With that, I would say thank you very much. If you want to reach out to us, you can reach us at SRE_cnl@etscl.com. Thank you very much.
...

Chandra Dixit

SRE Architect @ HCL Cloud Native Labs



