Conf42 Chaos Engineering 2024 - Online

Chaos Engineering at Citi Bank


Abstract

This topic covers Citi's journey to adopting Chaos Engineering and the benefits and challenges encountered. Additionally, it offers a very lightweight guideline model touching on industry tools versus Citi's in-house products and the value proposition, factoring in costs, compliance, SDLC, support, etc.

Summary

  • Charles Akl: Chaos engineering really is a discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production. It's a common practice in the industry, across tech giants like Google and LinkedIn, as well as across the banking industry.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
My name is Charles Akl, and I work at Citibank. I manage the USPB SRE team, and our next phase is transforming our L2 production support organization into an SRE, site reliability engineering, team. As part of that, chaos engineering is a very core principle that we'd like to adopt.

So chaos engineering really is a discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production. If you google that concept, you'll find it under principlesofchaos.org. To give some background, chaos engineering was pioneered at Netflix roughly around 2010, when they migrated from their legacy hardware to AWS and started inducing artificial failures, or at least this is one avenue where we see the initial views of it. Fast forward to today, 2023, and it's a common practice in the industry, across tech giants like Google and LinkedIn and across the banking industry. You have Kolton Andrus from Netflix and Matthew from Salesforce, who joined forces and started Gremlin. Gremlin is a tool we also use, which I'll briefly touch on. And today there is also a component of Lightspeed, which is Harness; that's another very nice tool that can be used for chaos engineering. Of course, there are other tools and other flavors in the industry, and we also have in-house tools, but in general, that's the idea behind it.

The benefits of chaos engineering are to promote innovation, elevate partnership, improve incident response, generate knowledge, and really increase reliability and resiliency, so that we can measure what the ecosystem means to the customer.

There are different flavors of what we do. We run a production game day once a month. As part of that exercise, we've written tons of Ansible playbooks, and through an in-house tool we do a one-touch failover and then run through the entire day out of a single data center. So we're really doing a production stress test of the ecosystem to see what threshold it can tolerate, and we have tons of applications that we do that to. Then we get the results and measure them; we have automated measurement, and now we're working on automated normalization as well.

Another flavor that we do in production is called Wheels of Misfortune. Wheels of Misfortune is a very fun exercise where we gather teams across sectors, from incident management, problem management, products, SREs, and performance, and we usually meet every couple of weeks for 30 minutes. We pick topics, usually major outages, though we can go all the way to cybersecurity. And then what happens after that? We conduct one exercise every quarter and record it, and it helps, because the stress level gets elevated when you have outages. We have a lot of takeaways: mean time to recover, do we have the right architecture diagram, do we have the right factoring, any opportunities for improvement? So it's kind of like a role play, and then you have volunteers and non-playing characters. That's a fun exercise. We record it, measure it, and come out with certain results to see what improvements can be done.

We also do chaos engineering in the lower environments, and we do it in Gremlin. Gremlin is one of the industry-standard platforms, available as SaaS. It allows you to inject failures at various layers of the system and to assess robustness using different types of attacks. You can run Gremlin on legacy physical JVMs on Linux servers and then measure the results. So let's say you have 100 transactions per second within a 30-minute time frame. At the ten-minute mark you measure the average response time, then you invoke a high-CPU attack and observe whether there is any impact to the I/O, and so on. We do that on a quarterly basis as well. And now we're integrating Gremlin with OpenShift, where you can measure the pods and see the different types of attacks that can be done.
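To make that measurement pattern concrete, here is a minimal sketch of a steady-state probe around an attack window. It is not Citi's tooling: the endpoint URL and the timings are hypothetical, and the high-CPU attack itself is assumed to be launched separately (for example from the Gremlin console).

```python
# Hypothetical sketch: measure average response time of an endpoint
# before and during a chaos attack window (e.g. a high-CPU attack).
# The endpoint URL, window length, and probe interval are made up.
import time
import statistics
import urllib.request

TARGET = "https://app.example.internal/checkout"  # hypothetical service endpoint
WINDOW_SECONDS = 10 * 60                          # 10-minute sampling window
INTERVAL = 1.0                                    # one probe per second

def sample_latencies(seconds: float) -> list[float]:
    """Probe the endpoint repeatedly and record response times in milliseconds."""
    latencies = []
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        start = time.monotonic()
        try:
            urllib.request.urlopen(TARGET, timeout=5).read()
            latencies.append((time.monotonic() - start) * 1000)
        except OSError:
            latencies.append(float("inf"))        # count failures as unbounded latency
        time.sleep(INTERVAL)
    return latencies

baseline = sample_latencies(WINDOW_SECONDS)       # steady state, before the attack
# ... trigger the high-CPU attack here, outside this script ...
under_attack = sample_latencies(WINDOW_SECONDS)   # same probe during the attack

print("baseline p50 ms:", statistics.median(baseline))
print("attack   p50 ms:", statistics.median(under_attack))
```

The point is simply to capture the same metric before and during the attack, so any degradation can be attributed to the injected fault rather than to normal variation.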
The other tool that we use is Chaos Monkey. Chaos Monkey is one of the original tools that was created by Netflix, and it's one of my favorite tools as well. It simulates failures by randomly terminating instances. So you can stop one of the namespace instances in OpenShift, for example, or in PCF (Cloud Foundry), Google Cloud, or AWS, or whatever ecosystem you're in, and then you measure what happened to the other layers and what the customer experience was.

We also have another tool called Ape Army; that's an in-house tool through which we execute different types of attacks. And then you can do basic manual tests where you manually manipulate the environment: you can change the YAML file, or if you have services that are Java-based, you can change a configuration or a parameter to disable specific components, or do a restart, and then measure the behavior.

So there are different types of attacks that can be done. You can do a resource attack, which is high CPU, high memory, or high I/O load. There are also state attacks, where you can do a shutdown, a process kill, or time travel. And you can do network attacks: latency, packet loss, or blackholing network connectivity.

One of the other tests that we do, not in production, is that we shut down one of the core services, whether it's a major database or a core mainframe component, and we start measuring what that means to the end user. Did they get the right message? Was their data available to them? So there are different flavors around it.
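As a rough illustration of the random-termination and state-attack style described above, here is a hypothetical sketch using the kubernetes Python client. The namespace, label selector, and dry-run default are made up, and this is not one of the in-house tools mentioned in the talk.

```python
# Hypothetical Chaos-Monkey-style experiment: pick one running pod in a
# namespace at random and (optionally) delete it, then rely on monitoring
# to show what the other layers and the customer experience did.
import random
from kubernetes import client, config

def terminate_random_pod(namespace: str, label_selector: str, dry_run: bool = True) -> str:
    config.load_kube_config()                     # or load_incluster_config() inside the cluster
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace=namespace, label_selector=label_selector).items
    running = [p for p in pods if p.status.phase == "Running"]
    if not running:
        raise RuntimeError("no running pods matched the selector")
    victim = random.choice(running).metadata.name
    if not dry_run:
        v1.delete_namespaced_pod(name=victim, namespace=namespace)
    return victim

# Example: pick (but do not actually kill) a pod from a hypothetical checkout service.
print(terminate_random_pod("payments", "app=checkout", dry_run=True))
```

Running it with dry_run=False during a controlled window, with observability in place, is what turns the script into an experiment: you watch what happens to the surrounding layers and to the end user when the instance disappears.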
And if you have any questions, please let me know. But thank you for listening in, and enjoy the conference.
...

Charles Akl

Head of NA SRE & NAM Core Production Management @ Citi Bank



