Conf42 Chaos Engineering 2021 - Online

One year of SRE failures

Video size:

Abstract

Last year we pitched SRE to our management team and got the OK to get cracking. We’ve achieved a lot, but failed even more. This talk is a front-row seat to a blameless postmortem on the first year of SRE at bol.com, the largest online retailer in The Netherlands and Belgium.

Getting SRE right is hard. We tried in 2017, we failed. We tried in 2019, we failed. We picked ourselves up, dusted off, took the learnings and tried again in 2020. This time we’re here to stay. bol.com is the largest online retailing platform in the Netherlands and Belgium. We have about 10 million active daily users and innovate with about 700 software engineers.

This is the story of a scrappy team of Site Reliability Engineers trying to shift a large enterprise (about 3000 employees) to think SRE. Our successes, but far more interestingly, our failures and learnings.

Summary

  • Sre@ball. com is a large online retailing platform in the Netherlands and Belgium. We've been experimenting with SRE for about the last two years now. In this presentation, I want to go into our learning so far.
  • Bolocom's journey to SRE kind of started in 2009 when we had our agile transformation. With that migration, we also started to move towards a product driven organization. In 2020, Bolocom started with a dedicated, full time SRE team.
  • Bolcom's SRE team helps develop the tooling and the practices to enable all other teams to find this balance between reliability and innovation in the right way. SRE really is a way to get DevOps right.
  • So yeah, that doesn't come quickly and you really need to engage with all the levels of the organization. It's really important to celebrate your successes, otherwise you're not going to have the stamina to ride it out to actual success. Have lots of fun with 42 chaos engineering and I'm looking forward to any kind of questions.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
You. Hi, my name is Bart Angalau, and welcome to my presentation at 42 Chaos Engineering. Let's talk about site reliability engineers. So as I mentioned, I'm lead sre@ball.com. Which is a large online retailing platform in the Netherlands and Belgium, the largest one. And we've been experimenting with SRE for about the last two years now, I guess. So in this presentation, I kind of want to go into our learning so far, how we've approached it, etc. A little bit about who we are as a company. So as I mentioned, we're the largest online retailer in the Netherlands and Belgium were part of the Aholta has a group of companies, and we're a retailing platform. So we sell things ourselves online. But other retailers can also sell their items via our website. And that means that we sell anything that pretty much anything non food, so that can be pets and children's stuff or furniture. And we ship that all from both our partners warehouses, but also from our own warehouses. And we have a couple hundred thousands of pitched that we ship each day with like 10 million active users. We have a pretty involved and very interesting microservice landscape, and we work with about 600, 700 engineers to innovate there. And weve existed and been developing the back end since like 1999 or something. And we've had quite a long journey and we've been doing DevOps for quite a while already. So, yeah, how did we get to sre@bottle.com? So let's quickly go through the journey that we had as a company towards SRE, and then we can go into how that has shaped our vision on it and shaped what we're doing now and what we've learned from all of that. So Bolocom's journey to SRe kind of started in 2009 when we had our agile transformation and we started adopting things from scrum and the like. But as we developed and wanted to innovate more and more, originally we had our operations externally managed by a different partner. And you can imagine that that doesn't really benefit the release cycle. So in 2013, a big project was to bring these operations in house, and we established like a top of the line operations team that managed our mostly, I think we had four or five huge monoliths that we ran at this time, and they ran that in our own data center. And then in 2014, I joined bordeaux.com. It's not really applicable to the SRE journey, but it's just fun to put in there. And then around 2016, we had our DevOps transformation so we had a big project that we called man on the moon. And we had like a catchphrase that was, you build it, you run it, you love it. So this was where we really wanted to push the operations as far to the innovation team as possible. And that was quite successful, actually, at the time. So we provided quite a lot of tooling so that all the CI CD could work all the way towards production, theoretically. I'm not sure if anyone at that time was actually doing continuous deployment, but definitely awesome continuous integration. But also around that time, Google published the first satch liability engineering books. And it was this time that we first hosted our internal conference, which was like the Spaces Summit. And it was at that first internal conference that David Ferguson from Google also dropped by. He's like, this has a reefing that we've just published. You should totally be doing that. And quite a lot of people@ball.com agreed with him. And when in 2018, we were also moving towards the cloud, so big operations team for our own data center, we acknowledged that, yeah, maybe this cloud thing is actually kind of useful, so let's do this. And with that migration, we also started to move towards a product driven organization. So what we mean here is that instead of organizing our teams and our microservices in a way that it's organized around the technical elements, we really organized teams around the value chains that span all the different technical elements so that we can hopefully autonomously innovate within those products and have a less complex web of communication between different products. So now you just need to talk between different products instead of talking between all thousands of microservices. So that's the context of how we came to SRE. And as I was saying, from 2016 on, there were quite a few individuals in the company who were already quite excited about this concept. So in early 2019, we got all those together, and we started with a pilot on how can SRE work well for us now, we had several specific learnings from this pilot that I'm going to go back into later. But more concretely, the pilot was got so successful, parts of it were, parts of it were not so much. So based on learnings of that in 2020, we started with a dedicated, full time SRE team, and this is also a team I'm a member of. And it's with this step that we've really started to get the ball rolling to get this mindset shift that SRE can really give an organization. And I think which is also really one of the major added benefits of SRE over DevOps. Because where DevOps is kind of like a strategy for putting parts of your innovation cycle close together in a single role. SRE really provides a set of practices and structure for thinking about this combination of the two practices of development and operations that really allows you to integrate it for maximum user happiness, I think because over the last year and as part of the SRE team. Right? Yeah. So I don't think in this conference I really need to go into what the concept of SRE is. But just to really highlight it quickly, I think the most important part is that you acknowledge that reliability in any system is your number one feature, but that we also live in a Fuka world like volatile, uncertain, complex, ambiguous. And that these two aspects mean that in order to maximize your user happiness, you cannot just say, we have to reach this reliability and that's final. But you can also got say we can innovate whatever we want because all innovation will bring risk, and when you are not there, it doesn't matter how much you've innovated. So there's a fine balance between these two aspects that needs to be managed. DevOps, of course, theoretically sort of addresses that. But in my discussions with other companies and in research for how can we best solve this problem for Bolcom, we found that there's like two different ways to approach SRE. So you can approach it from the development angle and you can approach it from the operations angle. And what we often find is that developers managers often say they want DevOps, but then when they say that often what they actually want is just better CI CD, it's like a smoother journey from code written to code at the user. And what they often don't really want is like things like having to manage disaster recovery and doing post mortems, instant handling, fire extinguishing stuff like that. So that's then the call from the dev side and then from opsite, people are saying, no, we need to do SRE. And what they actually want often is like reproducible infrastructure, less toil infrastructure as coats, stuff like that. And they don't really want just like deaf by ammo and lose their jobs because I had like a management column at some point here, but it was just all less cost, less cost, less cost. And you often see this mismatch like, oh yeah, we now do DevOps, so we don't need ops anymore. But that's not the way it works because operations is a deep specialization with relevant insight into a proper functioning microservice landscape. And also the way you do these two things together is complex and SRE has a specific opinion about that. And if you're just going to be applying software engineering practices to your operation department, then you're robbing your developers of the opportunity to learn from production. And production is your most direct link to your users'dreams. So in your innovation cycle, you really need to include both your operations and your development. And as such, I think Google's mantra that SRE implements DevOps is vital. So SRe really is a way to get DevOps right. So that's also how we sre it@ball.com. And that means that for our SRE team, our dedicated full time SRE team, weve really set the mission that it has as a purpose to support our innovation team. So it helps develop the tooling and the practices to enable all other teams to find this balance between reliability and innovation in the right way. And that means that the team itself, in essence, does not have run responsibility. This is something that is also quite often challenged from goal because they literally told us like, yeah, the consultancy model, we've seen that fill many times, but for our skill and the way we are working, it really seems the best fit because bringing production, operation and development closer has been of such benefit. So this is the way we're approaching it and let's see what we've learned so far. So if we go back to this slide, and I already announced it quickly when we talked about this pilot, that there were some learnings. So what did this pilot entail? So there were five software and operations engineers and they were given 40% of their time to go forth an SRE because we thought SRE is awesome and it will make everything better. And so you just do this with like two days a week, that should be enough, right? So you can imagine with such a clear direction and purpose, this team really enabled things, or not. So, yeah, it didn't achieve as much as we'd hoped, but there was actually a different pilot running parallel to this which achieved a lot more, but also didn't really reach the impact that was desired. So after this didn't go so successfully, there was really like this idea of, yeah, so we tried it and it's not going to work for us. But a critical part of getting SRE right is learnings from failure. So that's also what we did here. We looked at what we did, analyzed it and there were like a couple concrete learnings from this. And for us, the most relevant one was that SRE as a part time role next to full time software engineer doesn't really work so well because the pressures of backlog innovation always outcompeted our SRE work. And also while we were doing this pilot, there were quite a few, especially managers, who were saying like, oh, lead SRE are a new way to do operations. Then you're really missing again that connection between innovation and operation, but you're just approaching it from the other side. But at the same time, in our investigations discussions, we did SRe that there were particular problems all around the organization that we could solve with SRE, but that we just didn't have the capacity for to solve them properly right now. So this specialty that SRE is and the solutions it offers, it was really something we did need. So based on that, we went for the full time central dedicated SRE team and we wrote a big plan and we sent it to it management and it was like, oh yeah, this is actually a pretty kind of good idea that you got here. So then we had the central SRE team and for 2020, the main purpose of the central SRE team was to increase knowledge about good DevOps practices SRE practices, increase observability of product value chains and fix a problem that we hadn't solved yet of how we do outside of office hour support for cloud applications. So from the knowledge site, we built some trainings and we did presentations. And it was through these different presentations and engagements with different teams that we kind of had some our first setbacks, maybe failures, in that we and the team, we all knew this subject matter intimately. So we were really familiar with the concepts and the benefits it can bring. So we were like, look, this is how it goes, let's go do it. Yeah. And then we gave a presentation like that and people were, yeah, let's go do it. Okay. And then a month later we gave another presentation. Yeah, so this is how it works and let's go do it. Yeah. But even people who were in those both presentations and were excited by both presentations, by the second presentation, there wasn't real progress because in the meantime they had been doing completely other work in a completely different context. So having this large 2500 person organization and getting them to completely rethink the way they think about reliability right now, because again and again the uptime versus availability discussion came up and that's just something that's not trivial to grasp. So we really had to start from scratch there every time we had a new engagement with a new team and then just repeat and repeat and repeat. And we can't expect other people who are got fully focused on reliability to have this same adoption and immersement of the concept beneath it as we who are full time working on it do. So this was like a big learning for us that we can't run ahead of the pack and then expect the pack to catch up. We really have to work together with them and then take them along for the ride. And then if we work with someone else, start again at the beginning and take them along for a ride. And then slowly you're building up a broader knowledge base about these subjects and it'll enter common knowledge. It's a long process and it's not comparable to how quickly a person can learn. So that's what we learned in that knowledge sharing aspect. And then for the value chain observability, first we wanted to go like, yeah, were going to take our most critical value chains and add sre to that. But then we were thinking it's like, well, maybe wouldn't it make sense to start with a smaller value chain so that we can have a quick example of how this works and then we can roll it out easier to other value chain. So we chose the content product catalog value chain. And this is like a selling page on our websop shop. And content just manages all product text assets and product image assets. So that's images like these, the title, the category, but also the main image here. And so we thought like, okay, this is a fairly isolated system. We can make a quick iteration here, and then we'll have a big success and share it with everyone. So we had this look into the whole system, and we looked for what were the most important parts of that system, and how can we get complete coverage. And it turns out we got 45 slis. Well, that's maybe a bit much. So we abstracted it a bit and we made a down select of five slis that we wanted to monitor to observe the value chain reliability on. So we were really focusing on slis that gave us the high over value chain availability. But if you look at this slide, you can see all kinds of different systems in different places. And there were actually many teams, there were like six different teams involved in innovating and working on these places. And by only having enabling slis that go over the whole stack, there was no clear owner of the single Slis. So if you're in a situation like us where there's no clear owner of the high over product slis, what you need is that you also have lower level slis and then a clear relation between the two so that you can engage both the teams on their system level, but also engage the product managers on the product level. And then with that clear relation between the two, you can still get the big benefits of a slice while also having the engagement of the people who are really involved on the different abstraction level. So that's something we kind of did wrong here and learned from. And partly also due to this mismatch in abstraction level, we never really got to the part where the teams were actually using these Slis to actively make that balance between reliability and innovation. We weve really got the aero budget policy properly set up and properly burned through our error budgets and then properly used the error budget policy to recover from that state. And we really think that this is very much due to the fact that we did not have that engagement on the low level, but also we didn't have the right engagement on the higher level. So not so much buy in from upper management, essentially. So what weve really learned here is that while the SLIs can give you great benefits and Slos et cetera, it's really important to engage with the whole vertical stack of the people that are involved in the value chain because you need buy in for the aero budget policy to roll it out over all the different teams. You need ownership on a product level from product management and tech leads, but you also need ownership on the teams to really engage with the SLIs and use that to basically act before there's a problem. So if we sum up all those learnings, we are not Google, our skill is not their skill. So we truly think it's okay to not be Google and do it in a way that uses their experience and the capacity that they offer, but users it in a way that works best for your own company. Then there is that SRE is a full time long term investment. It really is a different way of thinking about both reliability and innovation. And as such, it touches all kinds of aspects of your innovation way of working. So it really takes time to get that mindset shift. So yeah, that doesn't come quickly and you really need to engage with all the levels of the organization to get that properly rolled. But, and especially since it's such a long term process, it's really important to celebrate your successes, otherwise you're not going to have the stamina to ride it out to actual success. Right. So in a nutshell, that was what we learned at Bulletcom over the last one and a half years. I noticed that I forgot to put attributions for the graphics I used in this recording. I'll add them to the slide deck, and I'm sure the slides will be available. So thanks for joining my presentation. Have lots of fun with 42 chaos engineering and I'm looking forward to any kind of questions.
...

Bart Enkelaar

Lead SRE @ bol.com

Bart Enkelaar's LinkedIn account Bart Enkelaar's twitter account



Awesome tech events for

Priority access to all content

Video hallway track

Community chat

Exclusive promotions and giveaways