Conf42 Chaos Engineering 2024 - Online

Chaos Engineering and Service Ownership at Enterprise Scale

Abstract

Hundreds of AWS accounts, tens of thousands of containers. How did Salesforce shift left to empower service owners to take charge of their chaos at scale? And where do Game Days fit in? A platform-driven approach increases speed and accuracy for Service Owners and central “Game Day”/SRE Teams alike.

Summary

  • Salesforce is delivering a toolkit to service owners that lets them safely run chaos experiments whenever they'd like. At the same time, Salesforce has been undergoing broader transformations in infrastructure and service ownership. Shifting left in the development cycle should reduce the turnaround time on discovering and fixing issues.
  • Service owners can take charge of concrete technical fixes on their own service. Game day teams that have a wide-ranging view of the infrastructure can support things like compliance exercises or run organizational and people chaos. We're really passionate about giving this power to service owners.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi there, this is Chaos Engineering and Service Ownership at Enterprise Scale. My name's Jay and I'm a lead chaos engineer at Salesforce. I've been with our site reliability team for about six years, working on public cloud operations, incident response, and our chaos game day program. Now I'm an engineer on the chaos platform team, where we're delivering a toolkit to service owners that lets them safely run chaos experiments whenever they'd like. I'm here today to share with you some of what we've learned while shifting left and scaling up in the public cloud.

But our story actually starts about nine years ago, in 2015. In those days, Salesforce infrastructure was hosted out of first-party data centers. We had a simpler application footprint, security controls, and ownership models. In short, SRE owned the availability of pretty much everything, and SREs had widely scoped privileged shell access so they could log into servers and take actions for incident mitigation, maintenance, and so forth. Our chaos game day program made use of this, and we would rely on someone with that privileged shell access to log in to a server and shut it off or kill processes on the node. Or sometimes we relied on a tight partnership with network and data center engineers to do things like turn off network switch ports or remove power from hosts manually. So a game day back in 2015 might have looked like this: an SRE would shell in and run a privileged command, and at the same time the game day team would be observing critical host or app metrics and writing down their findings.

But at the same time, Salesforce was undergoing some other transformations that involved infrastructure and service ownership. New business needs demanded larger and more flexible infrastructure: things like new products and sales growth, as well as new regulatory requirements across the world. This led to the birth of the public cloud infrastructure known as Hyperforce. For internal engineers, Hyperforce enforces a lot of new internal requirements and operational practices. The new infrastructure brought a bevy of foundational shared services that were used to ensure a safe, zero-trust environment for developers to build on top of. It also eliminated most manual interactive touch points, such as shell access. Underlying these changes was also the embrace of service ownership. Salesforce decided early on in the process that service owners would be responsible for the end-to-end operational health of their service and would no longer be able to hand that burden off to SRE.

So in short, we all had to learn new ways of working in the public cloud, which simply didn't apply to our first-party data centers. And just as we reimagined the infra and app ownership models, we needed to do the same for operational practices such as chaos engineering. We started to imagine what chaos engineering as a part of service ownership would look like, and we uncovered a few challenges. First off, in a service ownership world, SRE has less of a centralized role than before. A centralized game day team would have difficulty learning all the architectures and edge cases of new or newly redesigned services for the new public cloud architecture. And finally, like I mentioned, new technical constraints around privileged access made previous chaos approaches unsuitable for Salesforce. But if we started to look at shifting our approach, we could frame these issues differently.
First off, service owners know their service better than anybody else, so they'll be the best equipped to talk about potential failure points as well as potential remediations for them. Also, shifting left in the development cycle should reduce the turnaround time on discovering and fixing issues: moving the work further left in the development lifecycle lets a service owner respond to a problem faster than if they had waited until it reached prod and a customer found it. Finally, we decided that we should deliver a chaos engineering platform that lets service owners run chaos experiments safely and easily, eliminating the need for manual interactive touch points. This was a great idea, and it's working out well for us. But there were some major scale and shift-left challenges to address: the size and shape of our AWS footprint, the need to granularly attack multi-tenant compute clusters, the need to discover inventory and manage RBAC access, and finally, maintaining safety and observability while doing all of the above.

Let's talk specifically about our AWS footprint. The core CRM product that most people think of when they think of Salesforce is hundreds of services spanning nearly 80 AWS accounts. For one contrived example, a service might have its application container running on a shared compute cluster in one account, but then have its database, cache, and S3 buckets all in different accounts. So to begin with, it's infeasible for humans to log into every AWS account we have for core and do failure experiments; at the very least, we need to script that. That translates to a requirement for a privileged chaos engineering platform that can run AWS attacks in multiple accounts simultaneously.

The second challenge is around the multi-tenant Kubernetes clusters that I alluded to before. Services here are deployed across many namespaces, but also many clusters, so service owners should be able to attack their service in their service namespace, but across the multiple clusters that it might be deployed into. At the same time, service owners shouldn't be allowed to attack the shared services or the cluster worker nodes themselves. And finally, service owners may know or care less about Kubernetes infrastructure than we assume. So, translating to requirements: we need a privileged chaos engineering platform that can orchestrate attacks in multiple namespaces and clusters simultaneously, just like how we need to drive multiple AWS accounts. Also, we need the platform to provide failures without requiring ad hoc cluster configs and service accounts. Basically, the service owner should not need to know all the inner workings of the Kubernetes cluster to be able to run chaos attacks on their service.
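To make the multi-account requirement concrete, here is a minimal sketch of driving the same fault across several AWS accounts at once, assuming each target account exposes a role the chaos platform is allowed to assume. The account IDs, role name, and tag scheme are hypothetical placeholders, not Salesforce's actual configuration.

```python
"""Illustrative only: stop a service's tagged EC2 instances in many accounts."""
import boto3

ACCOUNT_IDS = ["111111111111", "222222222222"]   # placeholder account IDs
CHAOS_ROLE = "chaos-platform-execution"          # hypothetical execution role
SERVICE_TAG_KEY, SERVICE_TAG_VALUE = "service", "example-service"  # assumed tags


def session_for(account_id: str) -> boto3.Session:
    """Assume the chaos execution role in one target account."""
    creds = boto3.client("sts").assume_role(
        RoleArn=f"arn:aws:iam::{account_id}:role/{CHAOS_ROLE}",
        RoleSessionName="chaos-experiment",
    )["Credentials"]
    return boto3.Session(
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )


def stop_tagged_instances(account_id: str) -> list[str]:
    """Stop only the EC2 instances that carry the target service's tag."""
    ec2 = session_for(account_id).client("ec2")
    reservations = ec2.describe_instances(
        Filters=[{"Name": f"tag:{SERVICE_TAG_KEY}", "Values": [SERVICE_TAG_VALUE]}]
    )["Reservations"]
    ids = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if ids:
        ec2.stop_instances(InstanceIds=ids)
    return ids


for account in ACCOUNT_IDS:
    print(f"{account}: stopped {stop_tagged_instances(account)}")
```

A real platform would add scheduling, rollback, and audit logging on top of this, but the core idea is simply scripted, role-based access fanned out across accounts instead of humans logging in.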
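For the namespace-scoped Kubernetes requirement, a rough sketch along these lines: the platform itself holds the cluster credentials and deletes pods only inside the target service's namespace, across every cluster the service runs in. The contexts, namespace, label selector, and protected-namespace list are illustrative assumptions.

```python
"""Illustrative only: a pod-kill confined to one service's namespace per cluster."""
from kubernetes import client, config

PROTECTED_NAMESPACES = {"kube-system", "shared-services"}  # never in the blast radius
CLUSTERS = ["cluster-a", "cluster-b"]                      # hypothetical kubeconfig contexts


def kill_service_pods(context: str, namespace: str, label_selector: str, max_pods: int = 1):
    """Delete up to max_pods pods matching the service's labels in one namespace of one cluster."""
    if namespace in PROTECTED_NAMESPACES:
        raise PermissionError(f"namespace '{namespace}' is outside the allowed blast radius")
    # The platform holds this access, so service owners never need cluster credentials.
    config.load_kube_config(context=context)
    core = client.CoreV1Api()
    pods = core.list_namespaced_pod(namespace, label_selector=label_selector).items
    for pod in pods[:max_pods]:
        core.delete_namespaced_pod(pod.metadata.name, namespace)
        print(f"{context}: killed {namespace}/{pod.metadata.name}")


# Attack the same service namespace in every cluster it is deployed to.
for ctx in CLUSTERS:
    kill_service_pods(ctx, "example-service", "app=example-service")
```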
The third challenge is around inventory and role-based access. It's obviously a challenge to discover and account for all the different resource types that a service team might own. As I've said, we're spanning nearly 80 AWS accounts, and within those we've got EC2 instances, S3 buckets, and RDS databases in addition to the containers, so discovering and making a sensible inventory of those is a challenge. Enforcing role-based access could be a challenge too: we need to control the blast radius and know that when a service owner logs into the platform, they should only be allowed to run chaos attacks on a service that they own, again removing shared infra and service components from the blast radius. So for requirements, the chaos engineering platform should discover and group all sorts of infrastructure resources and applications. The chaos platform should also integrate with SSO to match service owners to their services, and it needs to make use of the opinionated Hyperforce tagging and labeling scheme to make sure that we're grouping services and service owners together appropriately.

And the fourth challenge is around safety, observability, and outcomes. We started to ask ourselves questions like: what would happen if there was an ongoing incident, maintenance, or patching? It might be unsafe for service owners to run experiments at that time. And also, how should service owners measure the success of their chaos experiments? How do we track improvement? So, translating to requirements, the chaos engineering platform should integrate with our change and incident management database and refuse to attack if it's unsafe. And for the observability point, service owners should measure their chaos experiments through the same SLOs and monitors that they use in production. This creates a positive feedback loop, so that learnings can be fed back into SRE principles for production operations, but also used to evaluate service health and make tweaks coming out of the chaos experimentation.

So if you're facing similar challenges, these are my recommendations for a self-service chaos platform. Number one, pick a tool that is multi-substrate and can support future flexibility. You never know what sort of regulatory or business changes are ahead; for example, you might acquire a company that uses a different cloud. I've talked about AWS quite a bit today, but Hyperforce is designed to run on other public cloud providers, and if we need to support another provider tomorrow, we want to be able to do that. The chaos tooling that we've got supports that. Second, make use of RBAC, tags, labels, and so on to control the blast radius and limit your attack access. There should be no question who's logging into the platform and what they are allowed to run an attack against. Third, consider prioritizing extensibility to integrate with custom systems, like we did for our change management system. Anything that offers custom extensions, webhooks, or subscription notifications can be a good candidate. Fourth, seek out a sophisticated toolbox of attacks to support both large-scale game day style experiments as well as precision, granular attacks on individual services. And finally, use SLOs. Make them part of your hypotheses, and make sure that service owners observe experiments as they would observe production. Any time there seems to be a disagreement between the outcome of a chaos experiment and the SLO is an opportunity to improve the definition of that SLO.
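One way to picture the blast radius and safety requirements is a pre-flight check like the sketch below: confirm through SSO group membership and the tagging scheme that the requester owns the target service, and refuse to attack if the change or incident management system reports anything overlapping it. Both lookup functions are placeholders for whatever SSO, service registry, and change/incident systems you actually run.

```python
"""Illustrative only: RBAC and safety checks that run before any attack starts."""
from dataclasses import dataclass


@dataclass
class AttackRequest:
    requester_groups: set[str]   # groups taken from the requester's SSO token
    target_service_tag: str      # "service" tag/label carried by the target resources


def owning_group(service_tag: str) -> str:
    """Placeholder: map a service tag to its owning team via a service registry."""
    return f"svc-owners-{service_tag}"


def active_blocking_events(service_tag: str) -> list[str]:
    """Placeholder: query change/incident management for open incidents,
    maintenance, or patching windows that overlap this service."""
    return []  # pretend the calendar is clear


def preflight(req: AttackRequest) -> None:
    """Raise instead of attacking when RBAC or safety checks fail."""
    if owning_group(req.target_service_tag) not in req.requester_groups:
        raise PermissionError("requester does not own the target service")
    blockers = active_blocking_events(req.target_service_tag)
    if blockers:
        raise RuntimeError(f"unsafe to attack right now: {blockers}")


preflight(AttackRequest({"svc-owners-example-service"}, "example-service"))
print("pre-flight checks passed; attack may proceed")
```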
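And the SLO recommendation can be as simple as encoding the hypothesis in terms of the production objective, as in this small sketch; the SLO name and numbers are made up for illustration, and the actual measurement would come from the same monitors the service owner already uses.

```python
"""Illustrative only: evaluate a chaos hypothesis against a production SLO."""
from dataclasses import dataclass


@dataclass
class SloHypothesis:
    slo_name: str        # e.g. "checkout-availability" (placeholder name)
    objective: float     # e.g. 99.9 means 99.9% of requests should succeed
    window_minutes: int  # evaluation window around the experiment


def evaluate(hypothesis: SloHypothesis, measured_sli: float) -> bool:
    """Return True if the service met its SLO during the experiment window."""
    met = measured_sli >= hypothesis.objective
    if not met:
        # A miss is a finding: either the service needs a fix, or the SLO
        # definition needs sharpening -- both are useful outcomes.
        print(f"SLO '{hypothesis.slo_name}' violated: "
              f"{measured_sli:.3f}% < {hypothesis.objective:.3f}%")
    return met


# Example: during a pod-kill experiment the service measured 99.7% availability
# against a 99.9% objective, so the hypothesis fails and a finding is filed.
print(evaluate(SloHypothesis("checkout-availability", 99.9, 30), 99.7))
```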
And I want to talk briefly about the ongoing role of game day exercises, because I don't mean to imply that we don't do game days anymore; we absolutely do. In this case, we optimize game days for purpose and for expertise. Service owners can take charge of concrete technical fixes on their own service, because they're so close to it and they understand exactly what the failure modes can be and which potential solution paths are easiest. But game day teams that have a wide-ranging view of the infrastructure can support things like compliance exercises, or run organizational and people chaos. So, for example, you might test your incident response mechanism, but make sure that the first person paged is unavailable. That could reveal some interesting insights about your incident response process, your pager readiness, et cetera. I would suggest the same for shared IT service chaos. Consider attacking your wiki or your operational runbook repository: what happens if the document repository with all of the instructions for incident remediation is slow or unavailable? I bet you'll come up with some ways that you can build redundancy around it as an organization during incident response. And finally, game day teams continue to have a role in tabletop exercises to help service owners scope their attacks. Because of their wide-ranging view, game day teams can suggest issues to service owners that have plagued other services in the past and give them recommendations on running generically in the public cloud.

So with that, I want to say thank you. I hope this was useful and helped shine a light on some of the things that we experienced as we shifted left and scaled up. We're really passionate about giving this power to service owners, because everybody has a hand in chaos and everybody has a hand in operating their service at scale. Thanks so much, and have a great rest of the conference.

Jay Hankins

Lead Member of Technical Staff @ Salesforce



