How to avoid black holes when aiming for resilience

0:09 Alex Williams

Hey everyone, Alex Williams, founder and publisher of The New Stack. I'm so excited to be here for Conf42. I have a guest today: Leonid Belkind, co-founder and chief technology officer at StackPulse. Leonid, I've been following StackPulse for the past few months and I look forward to talking with you because I know you put a big emphasis on resilience.

0:30 Leonid Belkind

Thank you so much, Alex. It's a pleasure to be here.

0:32 Alex Williams

Great. So when we think about SRE, and I explored this in an article I recently wrote for an e-book we're publishing, I was looking at where the overlaps are. Where are the overlaps between DevOps versus SRE versus Chaos Engineering, even versus Reliability Engineering? It seems like there are separate responsibilities, but there's also overlapping responsibility. So how do they fit into each other?

0:57 Leonid Belkind

So let's start maybe chronologically, in the order in which they appeared, and that will explain to us how they evolved. DevOps was kind of the first to appear. And DevOps is more than a responsibility. It's a way of doing things, right? It's combining Dev, the development of software, a responsibility that used to lie squarely with the engineering organizations, with Ops, operating software systems, a responsibility that used to be completely separate and used to belong to IT operations. So DevOps is the discipline that actually unites them, makes them use a common language, common tools, common processes, common KPIs. And this is something that has probably made a bigger impact than any other procedural advancement in the field of software products in the past decade, or decade and a half. So that's DevOps. Site Reliability Engineering is sort of an evolution of that. It's a discipline in software engineering, just like, let's say, user interface engineering, or maybe data pipeline engineering, right? It's a specialization of a software engineer. A site reliability engineer cares about building reliable infrastructure and reliable operations, just as a UI engineer cares about building a beautiful, well-designed and very usable user interface. So naturally, without the DevOps revolution, this whole notion of software engineers building reliable operations wouldn't have been possible, right? Because it was a whole separate notion, a whole separate ownership. So site reliability is indeed a profession, a sub-profession inside software engineering. And it was made possible thanks to the DevOps way of thinking, right? For organizations that still use the siloed approach between engineering and IT, for instance, it is very difficult to adopt Software Reliability and Site Reliability Engineering, because it really doesn't fit.
Now, Chaos Engineering is a discipline inside software testing, software quality verification, just as you could think of, let's say, business logic testing, verifying that our application functions as it should, or security testing. Resilience testing, which is sort of a wider term encompassing Chaos Engineering and more, verifies that our software systems are resilient to the various unexpected and turbulent conditions they may experience in production environments. So who does Chaos Engineering? Mostly, the responsibility for operating it and, of course, for taking the results of the experiments and implementing a response to them, lies squarely with site reliability engineers, right? These are the software people who develop not the logic of the application, but actually the fact that it thrives. So this is, sort of, the evolution, right? DevOps, Site Reliability Engineering, and Chaos Engineering as things that are being performed. Make sense?

3:40 Alex Williams

Yes, it does. And so where does Reliability Engineering come in?

3:44 Leonid Belkind

In essence, it's a much bigger thing, encompassing not just Chaos Engineering. Reliability engineering, if you come to think of it, is a goal that manifests itself at pretty much every point in the software development lifecycle. It begins with software architecture, where you build resilient architectures with no single point of failure, etc. It moves ahead to the actual coding, where you use defensive programming in order to build components that are resilient to variance and noise in their dependencies. And it moves all the way to testing, where you employ tactics or tools such as Chaos Engineering and so on, right? So resilience engineering, if you come to think of it, is a superset of all the engineering processes that different people undertake in order to make your software resilient.
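
To make the defensive programming point above concrete, here is a minimal Python sketch of a component that tolerates noise in a dependency. It is an illustration only, not something described in the interview: the helper `call_with_fallback`, the `FlakyDependency` class, and the retry and delay values are all hypothetical names and assumptions chosen for the example.

```python
import time


def call_with_fallback(fetch, fallback, retries=2, base_delay=0.01):
    """Call a dependency defensively: retry with backoff, then degrade."""
    for attempt in range(retries + 1):
        try:
            return fetch()
        except (ConnectionError, TimeoutError):
            if attempt < retries:
                # Exponential backoff before the next attempt.
                time.sleep(base_delay * (2 ** attempt))
    # Every attempt failed: return the fallback instead of propagating.
    return fallback


class FlakyDependency:
    """Hypothetical dependency that fails a fixed number of times, then succeeds."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def fetch(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("dependency unavailable")
        return ["item-1", "item-2"]


if __name__ == "__main__":
    noisy = FlakyDependency(failures=1)    # one transient failure
    print(call_with_fallback(noisy.fetch, fallback=[]))
    down = FlakyDependency(failures=10)    # effectively offline
    print(call_with_fallback(down.fetch, fallback=[]))
```

The design choice here is that the caller decides what a degraded answer looks like (an empty list, in this sketch), so an outage in the dependency narrows the feature rather than taking the caller down with it.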

4:30 Alex Williams

You know, this makes you think about how organizations structure themselves so you don't have redundancies. Doesn't it?

4:39 Leonid Belkind

Indeed. A lot of times, when an organization undergoes a shift from a certain way of operating into adopting Site Reliability Engineering and so on, many questions like that come under scrutiny, right? Redundancy in terms of knowledge, redundancy in terms of responsibility. Especially in this recent year, with the COVID-19 pandemic hitting pretty much every company anywhere in the world, questions like how do we operate our services efficiently, and how do we maintain service levels for our customers while, let's say, not stepping into the office, have all come under a very new light.

5:18 Alex Williams

Let's move into a discussion about resilience versus reliability and understanding the differences between the two. I think you touched on it a bit. How do you define resilience? How do you define reliability?

5:32 Leonid Belkind

So actually, the two terms are used interchangeably in many texts recently. But frankly, I believe that they are slightly different. And here's the difference. Resilience, naturally, is the ability of something, in our case of a software system, to withstand unexpected, turbulent conditions, right? Being resilient, being strong, functioning despite lots of difficult things going on around us. Reliability, in theory, is similar. But if you come to think of it, reliability is a perceived description of something, right? Reliability means you can rely on the software to be there for you. Naturally, if it's not resilient, I don't think you can have any conversation about its reliability. But even if it is resilient, that is not enough. In order for external parties to feel that something is reliable, its resilience needs to be consistently demonstrated, proven. So reliability is maybe a wider term that doesn't talk only about your software being strong and resilient, but also about your ability as an organization, as a service provider, to demonstrate and prove that it will be there for you, and you can rely on it.

6:40 Alex Williams

So you mentioned COVID, and one of the aftermaths of COVID has been much more use of online services. Use of particular categories of services has grown exponentially: essential services such as groceries, and even services such as Zoom. And that has meant a much deeper emphasis on these different issues that people face. How do you see those effects on how we think about resilience and reliability?

7:13 Leonid Belkind

First of all, I think you're spot on. During this last year, the consumption of digital services, not only by consumers but also by enterprises, skyrocketed above any imaginable mark, really. I think it really showcased the business value of reliability. More and more businesses found themselves in a situation where the digital services they provide were suddenly under huge pressure, right, from their consumers. And that was a good problem to have on one hand, right? Your business is booming. Many people want to consume your service. But was your infrastructure resilient? Could you actually keep on providing the service under these conditions? That was the great question. And those who managed to respond to it positively, those businesses actually grew, whereas many other businesses sort of glass-ceilinged their growth because they did not invest sufficiently in the resilience of their infrastructure.

8:10 Alex Williams

Yeah. So does that make trends more relevant, such as microservices?

8:13 Leonid Belkind

Okay, now there's a big question of whether it is easier to achieve resilience using a modern stack. And usually the answer is yes, it is, if you use it properly. Imagine, for instance, a situation where you have a huge monolithic legacy infrastructure, not even running in the cloud but running in bare-metal data centers, etc. And suddenly, you get 10 times more demand. Yeah, sure, you could scale those traditional IT platforms and infrastructures as well. But the cost, as well as the velocity, of that scale-up would be tremendously different from, let's say, providing exactly the same services...

8:48 Alex Williams

And there's a shortage of servers, and there's a shortage in the supply chain. Servers are harder to find.

8:54 Leonid Belkind

Exactly. And that impacts, again, the velocity at which you can scale up that kind of business. As opposed to, let's say, providing exactly the same service based on a modern, cloud native, microservice-based architecture, where you could always grow, let's say, within the capacity of your current infrastructure-as-a-service provider. And if, for whatever reason, that particular service provider is not able to provide you with additional capacity in a certain region, with a cloud native architecture it's pretty easy for you to jump and consume some spillover capacity from a different infrastructure-as-a-service provider, right? So the flexibility of your business grows tremendously. And with it, your ability to control your own margin. Because again, the expenses of growing are much, much lower. You are flexible, right? It's the notion of a breathing architecture. So today I have a spike in demand, and I will consume more resources. Let's say tomorrow the demand shrinks again, and I don't have to keep paying for all that huge infrastructure I built to withstand a spike, right?

9:53 Alex Williams

So let's get into your shoes, Leonid. You're a very young company, but you can't ignore these topics, because then you could be repeating the same mistakes that companies that have been around make when they suddenly have these spikes. I was writing some notes for another recording, and I equated that to the spike that might come from Reddit, or even the dual spike from Reddit and Hacker News. That's just one aspect of it. But then there might be some service upgrades that you're doing in the meantime. There may be some internal infrastructure that you're testing. You may be thinking about building out more models, so you can better use your data. So you must be thinking about all these things. How do you think about Chaos Engineering in all that?

10:35 Leonid Belkind

Actually, being a young company may come with a lot of challenges. But in this particular department, it puts us in a very privileged position that we are extremely appreciative of. We have invested from the get-go in building a very properly structured cloud native architecture, definitely using microservices and a service mesh. We employ very advanced, progressive, continuous deployment of business logic into our production. We made a lot of investments, for such a young service, in observability and controllability, which actually allow us not just to figure out what goes wrong with the service, but also to control it. And these investments, I think, were very non-trivial. I mean, given the flexibility that software has, many people approach building new software projects with a mindset of: "Let's focus on creating the minimum viable functionality first, and then we'll extend it and grow". We took a slightly different approach: we created the proper, scalable, growing, enterprise-grade infrastructure first, on top of which we are now delivering different functions that are not only minimal but actually very viable. So I think this platform-first approach has made us very well prepared for any storm that we would weather in production. Yeah, it was not necessarily the cheapest way to build the product. But then again, we were not necessarily looking for the cheapest way. We wanted to build a stable product, fully acknowledging that this investment will pay off.

12:01 Alex Williams

So you must be thinking about all these people we've talked about before: the DevOps team, the SRE team, the resilience engineers.

12:10 Leonid Belkind

Absolutely, and we are not only thinking about them, but we have actually established all these organizations. We have a dedicated team of site reliability engineers, including a site reliability architect, who has a very significant influence on the early stages of our software development. We run our operations in a NOC-less mode, again, very much targeted at this non-siloed approach. So indeed, this is one of the perks. We built something from the get-go to conform with this reality. It affects everything: it affects the architecture of your product, it affects the kind of people you're recruiting for your core team, it affects the organizational structure, and so on and so forth. You're very correct in that assumption.

12:54 Alex Williams

Yeah, software organizations that may be older would do well to think about a platform-first approach. Would you agree? That does pose challenges, but it's much easier than it was five years ago. So really, the path to viability is less and less expensive. Would you agree with that?

13:20 Leonid Belkind

Absolutely, it is less expensive. And also, it is gradual. We are actually working with a lot of such companies, companies that have had tremendous success building previous generations of software architectures and delivering digital services on top of them. And now they find themselves in a process of gradual migration to a cloud native platform. It is definitely not a switch you flip, right? Because that's not how technology works. But actually building a proper plan for how you gradually chip off, piece after piece, important parts of your business logic from your monolith and transform them into cloud native architectures, and then keep executing on it, controlling the amount of investment you make in maintaining the legacy versus building the next generation, this is the way to win. I mean, there used to be a notion of lift-and-shift migration to the cloud that many enterprises tried. I think within a very short period of time, it became clear that it doesn't really pay off to the extent people would expect it to. Because you really aren't rearchitecting your product, right? Lifting and shifting from a server that resides in your data center to a server that resides in somebody else's data center introduces benefits, but unfortunately, they are very capped. It's this chiseling off and rearchitecting, piece after piece after piece of your product's architecture, that starts paying off. And if you prioritize it properly, I've actually seen traditional companies where it started paying off really fast.

14:51 Alex Williams

I want to save something for the very end of our discussion, because I understand you do have an anecdote about Chaos Engineering. Maybe not from yourself, but other stories that you've heard. So just think about that for a minute. Before we get there, I want to move to the people, that people mindset and that organizational mindset. What is your perspective on that for customers of yours who are, maybe, a water utility, you know, with infrastructure that's really, really old? We're seeing that in Texas right now, and there are all these companies out there that are facing, you know, that other effect from the realities of a changing climate. There are just so many examples out there, across any industry. So when you're thinking about the people, when you're thinking about the mindset, when you're thinking about the organizations, what would you tell these people?

15:43 Leonid Belkind

First of all, I think I have a very positive outlook for them, if they exhibit flexibility, as we've seen in many such organizations. Water utilities, electrical utilities: these are not historically the most technologically progressive markets, right? That's probably why you picked that example. But still, we see an increasing number of such companies adopting the new approaches. First and foremost, they change organizationally, they change culturally. And we see how this actually brings with it a wave of improvement in the efficiency with which their organization operates. And, which I think is even more important on the people level, an improvement in people's satisfaction with, you know, their day-to-day job, their impact on the well-being of their customers, etc. And this is actually the place where magic happens, right? Because this is where a lot of initiative starts to come from the people themselves, right? Because who wouldn't want to feel better about themselves, providing a better service to their customers, etc.? But let's take a couple of cultural phenomena, for example, and talk a bit about their importance. I bet, Alex, you've heard about blameless culture; that's a term that's been used a lot. So blameless culture, right, was introduced in progressive companies, such as Google and others, particularly around investigating and analyzing incidents. The idea there was to avoid finger-pointing when we analyze something that went wrong, saying: "Wow, that's because of this component that Sarah and Jack developed, and it's so unstable, that's why the whole system went down". That's sort of putting the blame on somebody. And naturally, whenever that happens, human psychology kicks in: people become defensive, some would agree, some would disagree, personal relationships would kick in. And what would suffer, in the end, is the ability of the organization to actually investigate what really went wrong and what really needs to be fixed, right?
And blame does not have to be put squarely on people; it could still be on a component: "This particular thing, I don't know, the shopping cart in our e-commerce service, that's what causes all the downtime, and that's what causes us to lose business". That's still finger-pointing. Blameless culture is transforming these arguments into: "All right, so if we now re-engineer, let's say, the same shopping cart service, it will help us improve on our service level objectives. It will help us make our customers happier, and it will help our business grow". The trick, though, and you know, people get fascinated reading about these cultural shifts, adoption of SRE, adoption of a blameless culture, is to interpret them in a slightly more sophisticated way than just at face value. Because if you interpret, let's say, the same blameless culture at face value, you can actually hurt your ability to investigate things, because you will be constantly checking yourself: "Wait a second, by saying that this thing went wrong, am I actually blaming someone or not?" It's a matter of adopting an open culture, a culture where people aren't afraid of experimenting, aren't afraid of expressing opinions that, you know what, sometimes may not be 100% accurate, right? Who doesn't fail? Only the person that doesn't try, right? That's a proven fact. So it's the cultural change to openness, I think, that drives all this. And we've seen it clearly happening in even the most traditional verticals that you can imagine. Again, driven by the fact that eventually people feel better with their day-to-day when adopting this. So I think my outlook is extremely positive, as you can see.

19:20 Alex Williams

Leaning in for my last question, I want to ask for your anecdote on Chaos Engineering stories, and what are some of the ones you remember?

19:28 Leonid Belkind

Absolutely. I think the most amazing one I heard came from a good colleague of mine. He is a Vice President of Engineering at a very well-known and respected consumer service company. They are an example of a company that is successfully undergoing a gradual transition from a legacy architecture to a very modern cloud native one. At a certain point, they started introducing resilience testing, and they were very cautious about it. So I think their first step was taking a service that is completely non-mandatory under all possible business considerations, an internal service that collects telemetry data from the usage of their application, and they decided to test what the effect of it being offline would be. Naturally, their expectation was that they would lose the telemetry collected from their application for the window of time during which the service was offline, right? That's the expectation. To their absolute amazement, they figured out that the homepage of their public website goes blank, right? Not even the application; the public website went blank. They could not have guessed this effect, and it took them some time to figure it out, even from the historical perspective. But that was, sort of, like: "Well, you know what, we expected things to go wrong. That's why we do Chaos Engineering. We did not expect it to get that far, like, nowhere near".
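
The shape of that experiment can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the company's actual setup: `TelemetryClient`, the two `render_homepage_*` functions, and `run_experiment` are hypothetical names invented for the illustration. The experiment injects one fault, the telemetry service being offline, and checks whether the homepage still renders.

```python
class TelemetryClient:
    """Hypothetical stand-in for an internal, non-critical telemetry service."""

    def __init__(self):
        self.online = True
        self.events = []

    def record(self, event):
        if not self.online:
            raise ConnectionError("telemetry service offline")
        self.events.append(event)


def render_homepage_fragile(telemetry):
    # Hidden hard dependency: a telemetry failure propagates to the page.
    telemetry.record("homepage_view")
    return "<h1>Welcome</h1>"


def render_homepage_resilient(telemetry):
    try:
        telemetry.record("homepage_view")
    except ConnectionError:
        # Telemetry is optional: lose the data point, keep the page.
        pass
    return "<h1>Welcome</h1>"


def run_experiment(render):
    """Inject the fault (telemetry offline) and report whether the page renders."""
    telemetry = TelemetryClient()
    telemetry.online = False
    try:
        return bool(render(telemetry))
    except Exception:
        return False


if __name__ == "__main__":
    print("fragile survives:", run_experiment(render_homepage_fragile))
    print("resilient survives:", run_experiment(render_homepage_resilient))
```

In this sketch the fragile renderer fails the experiment and the resilient one passes, which mirrors the gap between what the team expected (lost telemetry only) and what they observed (a blank homepage).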

20:55 Alex Williams

And that's what you want from Chaos Engineering. Leonid, I want to thank you so much for your time talking today about these issues related to, really, reliability and resilience. And I want to thank you for your perspectives. Leonid Belkind is co-founder and chief technology officer at StackPulse. I'm Alex Williams, founder and publisher of The New Stack. I also want to thank the Conf42 team for making this possible.

21:22 Leonid Belkind

Thank you.
