Conf42 DevOps 2023 - Online

How to move fast without breaking things!

Abstract

Good teams ship fast to innovate. Great ones ship fast while keeping quality at a high level. I'll share my experience working in Big Tech, innovating and moving fast while building and maintaining very large-scale distributed systems. We'll start from the big picture (how to approach such challenges, plus a bit of complexity theory) and then zoom in to more technical concepts that can be applied in different scenarios. You will take away techniques and concepts to apply on your next big challenge, including:

  • The kind of team culture that will maximise your chances for success
  • How to plan such work, while aiming to move really fast
  • Common trade-offs you will face and how to approach them
  • Resiliency techniques that can save lots of time & effort down the road
  • How to adopt systems thinking in our day-to-day as a team

Summary

  • Welcome to my talk. I'm recording this video from Spain, where I'm based. If you have any questions, comments, ideas, or just want to say hi, feel free to reach out to me after the talk.
  • Facebook adopted "move fast with stable infra." Different times call for different measures: when you ship things that break every once in a while, you have to go fix them, and if that happens often it can bring down the morale of your teams a lot.
  • I've been working in tech for the past ten years as a software engineer. Now I'm building at Shopify as part of the Shop app team. Setting your priorities clearly in the beginning will make your team work on the most important thing at any given time.
  • The concept was coined by my former employer, Thoughtworks. It's called evolutionary architecture, and it treats architecture as a living entity. We can apply a fitness function to our architecture and see if it matches our priorities. This is a good heuristic for designing a base architecture.
  • Aim to start with a base architecture of lower complexity, so that it's faster and easier to grow and let the architecture evolve. Spending 10% extra time to build the architecture with lower complexity will avoid 90% of the issues you'd otherwise face later.
  • First do the things that have high value and high complexity: parts that are complex to build, risky parts that hold unknowns. You should prioritize these. Really consider having a walking skeleton from week one.
  • Integrations are usually the most critical parts of a big, complex system; they're where most incidents and issues occur. Securing integrations divides in two: downstream versus upstream integrations. One thing that is really important is to implement these protections from the beginning.
  • The next thing is testing. There are two things we need to consider: load and diversity. One good practice for this is shadow releasing, a really good way of testing that will give you load and, most importantly, diversity.
  • The next one is building for resilience. It's a really good practice to map out possible problematic and error scenarios; it's crucial and super helpful to decide how to react to these before they actually happen. Other good patterns are auto-scaling and warm-up.
  • The concept of immutability has become highly popular in the last ten years with data storage. The next one is compartmentalizing, which lets us deprioritize less important, non-time-sensitive tasks. And the last one is runtime configuration management.
  • Add observability while you build, from day one, with alerting included. Performance testing is great for this: put your system under high load, high stress. If performance is critical for you, start testing early.
  • You can apply these to your day-to-day job. Reach out to me by email or via LinkedIn. I hope you enjoyed it. Really happy for the chance and hope to hear from you.

Transcript

This transcript was autogenerated.
Welcome to my talk. I'm recording this video from Spain, where I'm based. I hope you enjoy it, I hope you're having a nice day, and I would still very much like to interact with you. So if you have any questions, comments, ideas, or just want to say hi, feel free to reach out to me after the talk. I'll share my contact details right at the end. So let's go. Move fast without breaking things: you've probably heard the opposite of this phrase from where it started, maybe ten or fifteen years ago, when it became one of Facebook's mottos. And this is funny, actually, because I remember I first saw this phrase in one of my first offices in Istanbul, Turkey, where I was working as a software developer at one of the largest banks in Turkey. The phrase was written on the wall right next to my desk. And it was a bit confusing, because this bank was run with strong hierarchy, and stability and quality were everything. The perception of quality is everything, both individually and for the systems we build. And to be honest, we were dealing with money in a banking system, so there's not much room for error or to break things. So it was a bit confusing for me. But in the case of Facebook, it actually makes a lot of sense, and it's a great vision, which is what great leaders should do: they set great visions. I think "move fast and break things" states that pace is the priority, even at the cost of breaking things. They wanted their employees to feel safe, to not fear breaking things, because they wanted to innovate. Facebook created new user habits for us and wanted to go further. They wanted to invent newer, better, faster ways of interacting on the web. So that's why they went with it, up until it became this: more recently, a few years ago, Facebook adopted "move fast with stable infra."
So they're both valid approaches, but different times call for different measures. Why would any company prefer moving fast while still having stable infrastructure and stable performance over just pure pace? Of course, there's the obvious reason: the user base now expects a certain threshold of performance, and you want to comply with that standard of quality. But there are a few other, maybe less obvious things I want to mention here. The first of those is experimentation. I've seen this in my workplaces, where we shipped a new feature idea as an experiment, to get feedback and see how it would perform. The thing is, what we missed was that the performance and quality of that specific feature were a little worse than the rest of the product, because we just wanted to be fast. The experiment results said users didn't like the idea. But soon after, we found out that it wasn't that users didn't like the idea; it's just that the feature was slow, it wasn't performing well, it wasn't high quality, so users didn't value it and dropped out. So during experimentation, it's really crucial that you meet your current quality standards, so that users, even if they don't notice it consciously, don't just drop out or get annoyed with your product. The next one is morale. When you ship things that break every once in a while, you have to go fix them. And when you ship things that break often, you have to go fix them quite often and spend a lot of time on it. Every once in a while, this is fine, but if it happens frequently over a long course of time, in the mid and long term, it will actually bring down the morale of your teams a lot. This is also something you want to consider.
And the last thing I want to mention, something I really want to emphasize, is that more often than not, spending 10% extra time and effort will probably prevent 90% of the issues you would otherwise face. In my personal experience this is almost always true, and I think the content I'm about to share supports the idea. Maybe now it's a good time to explain why I'm the one talking about this, to explain my background and experience. I've been working in tech for the past ten years as a software engineer. I call myself a product-minded engineer: I love solving users' problems, and I've grown from a junior engineer to technically leading teams. I've mostly worked at Thoughtworks in the past, where I started in Turkey and then worked in Germany, India, and finally Spain, where I settled and have been for the past five years. I worked in various different domains: banking, e-commerce, pricing, and then I moved to New Relic, to the observability domain. I had an amazing few years there, and since the beginning of this year I've been working at Shopify. So I had the luck to build that muscle, and I'm really interested in complex systems. I'm fascinated by the characteristics of complex systems, their complexity, and especially how people struggle with or manage to deal with it. I've led, or been part of, teams that own and build really high-scale systems that demand really high quality; for example, New Relic processes and serves telemetry data.
They own perhaps the biggest Kafka cluster in the world, and it requires excellent operational quality, because that's the product other companies need when things are going tough, or during critical dates like Black Friday and Cyber Monday, which is happening now, as I'm recording this. Now I'm building at Shopify, and I'm part of the Shop app. My team is responsible for helping our users track their packages and their orders, and we process more than 20 million status updates to be able to let our users know the latest status of their orders. While managing this complexity, my teams have always been under pressure to move fast, and I want to share my experience of how we balanced those two things. So it all starts with planning, right? If you want to go really fast but also keep a high level of quality, you need to plan accordingly. And the first thing I've been really happy about, when we achieved it, was setting our priorities straight. There will come a day, for a lot of people or more than one person on the team, when you'll need to make a decision between time, performance, scope, quality, security, things like this. You'll need to choose one or the other; there's a trade-off coming. There will come a day when you need to make a choice within this trade-off. And I think a team should be doing the most important thing at any given time. That's my motto. To see what that means, imagine one Wednesday morning you show up to work and there are two things you, or someone on your team, can do: something to make your application more secure, or this other thing that's a performance optimization. Which one would you do?
Instead of just relying on individuals making the right choice every time, I think we should set our priorities straight and broadcast them clearly, so that this choice is straightforward, so that this choice is almost already known. Setting your priorities clearly in the beginning will, as I wrote right here, make your team work on the most important thing at any given time. And how do we achieve this? There are workshops like trade-off sliders and such, but you don't have to go for a full-on formal practice. It could be just a 20-minute talk; just make sure it reaches the stakeholders. And once that's done, of course, we have a problem, we have an idea, and we're going to design, right? In big tech, mostly, we already have a running complex architecture, and we're going to add to it: some logic, possibly some infrastructure. How do we go about this when we still want to move really fast? Many things can impact our architectural decisions, right? Whether we believe the system is going to grow in a certain way, whether we need to scale in a certain way, whether we may add some logic in a certain way, they all factor into this decision. There is one concept I wanted to talk to you about that was actually coined by my former employer, Thoughtworks. It's called evolutionary architecture. When you put the adjective "evolutionary" in front of "architecture", I think it implies a few things. One of them is that it makes architecture a living entity. Now we accept that architecture is a living entity and it changes, right? It evolves. And then we have a choice: we can let it drift away, let it evolve naturally, or we can change it consciously instead of letting it drift in time. So how does this get applied? Thoughtworks has defined it this way: you create a fitness function.
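A fitness function like this can even be made executable, so an automated check tells you when the architecture drifts from its priorities. The sketch below is purely illustrative: the metric names (`throughput_rps`, `p99_latency_ms`) and thresholds are hypothetical, not taken from the talk.

```python
# Architectural fitness functions: each priority becomes an executable check.
# Metric names and thresholds below are hypothetical examples.

def throughput_fitness(metrics: dict) -> bool:
    """High throughput is the top priority: fail when throughput drops."""
    return metrics["throughput_rps"] >= 50_000

def latency_fitness(metrics: dict) -> bool:
    """Latency matters too, but with a looser bound than throughput."""
    return metrics["p99_latency_ms"] <= 500

def evaluate(metrics: dict) -> list[str]:
    """Return the violated priorities, most important first."""
    checks = [("throughput", throughput_fitness), ("latency", latency_fitness)]
    return [name for name, check in checks if not check(metrics)]

# Healthy throughput but slightly high latency: only "latency" is flagged.
violations = evaluate({"throughput_rps": 60_000, "p99_latency_ms": 800})
```

Running this against metrics pulled from monitoring in CI makes the team's declared priorities something the pipeline enforces rather than something people remember.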
By the way, you don't have to apply this as a formal practice or adopt it at all, but I think it's really good to understand the concept. So you create a fitness function. A fitness function is a term borrowed from evolutionary biology, from survival of the fittest: it describes the likelihood of a species to survive. Here, the fitness function actually represents our priorities, right? We can apply this function to our architecture and see whether it matches our priorities or not. For example, in the example I put here, high throughput is more important than low latency, or data security is more important than usability. Once you define this, and make sure your team is clear about it, you can design your architecture accordingly. The next thing I want to mention is that I think this is a good heuristic for designing a base architecture, because, as we said, we now accept that our architecture, especially when we are moving fast, is a living entity that's going to grow and evolve. So I think it's a really good heuristic to start by keeping complexity low. And maybe I want your attention on the right side of the screen, where there's a chessboard. Why is that? I put it there because if you imagine looking at a chessboard, say a photograph of a chessboard in the middle of a game, and you're not one of the players, then, especially if you're not too experienced with chess, it's really hard to understand the strategies of each player, the purpose of each piece on the board, and what's going to happen next.
It's really hard to understand the behavior, what's going to come next, unless you're a really experienced chess player or really know the players. And that means there's hidden complexity right there. We don't know what to expect from the system; we don't know what is next. And this is exactly what we want to avoid in our systems. When you look at your architecture, you want the architecture to express its behavior, not hide it, to express the behavior as explicitly as possible, so that you can actually grow it. So if you want to make a trade-off while designing your architecture, between complexity and performance, or the amount of resources, I think it's a really good heuristic to start the base level of your architecture with lower complexity, so that it's faster and easier to grow and let the architecture evolve. I can connect this back to the quote I just gave you: more often than not, spending 10% extra time, perhaps to build the architecture with lower complexity, or with more resources, or accepting more latency, will avoid 90% of the issues you'd face later otherwise, and will let you grow and evolve your architecture faster. So next, you have this big design. Now you need to split it into parts, into chunks people can work on, that will complete the puzzle and become your product, right? How do we go about this? I think a great rule of thumb is to first do the things that have high value and high complexity. You should prioritize these types of parts. That basically translates into parts that connect the pipes and build a walking skeleton, for example, parts that are complex to build, risky parts that hold unknowns. So, again, the rule of thumb is: you face the tough things first. You face the tough parts that may uncover risks, where maybe your team lacks a skill.
Maybe it's complex logic that you need to build, maybe it's complex infrastructure; you want to face those things first. And also, to go back to the first point, instead of building the parts separately, imagine building the parts of a car separately and connecting them in the last week, at the last minute. I think you should really consider having a walking skeleton from week one, so that you get feedback, you experience running it, you know how it feels, and you may uncover things you wouldn't otherwise. This is something I cannot emphasize enough, the benefits of it. So the next part, moving now to more practical things, is integrations. In a big, complex system, integrations are usually the most critical and delicate parts, because these are the parts where most incidents and issues occur, and they need to be treated with real care and secured. We can divide how to secure integrations in two: downstream integrations versus upstream integrations, and there are different patterns you can apply. For downstream integrations, you can apply timeouts, retries, backoff policies, and circuit breaking. I'm not going into the details of them; I'll mention a few books where you can learn them in real depth and get experience with them. But the main idea is simple, and it's like a system design principle: if a part of your system breaks, the rest should keep working as well as possible. And in order to do that, you need to be able to isolate the failure, right? So if a downstream service is failing, instead of waiting for that service for 30 seconds, you should time out and then use that resource to do some other work instead of waiting. Or, for example, let's say circuit breaking.
If a downstream service is really having a hard time under high load, et cetera, instead of hammering it and getting errors all the time, you should do circuit breaking and give that downstream service time to breathe and come back up. Then you can try it again. For the upstream direction, it's basically the same idea, the same principle: if a part of the system breaks, you shouldn't let that failure leak into the rest of the system. There are practices called bulkheads, load shedding, and rate limiting. Bulkheads, for example, compartmentalize your resources so that a failure doesn't leak into the other resources. If you have some issue, like a high load that's eating up your resources, you compartmentalize it so that it doesn't eat up all your resources and impact other parts of your system. Load shedding and rate limiting, as well, make sure your system still performs at maximum capacity even under high load; they don't let your system get hammered, basically. One thing I feel is really important to mention here: I think you should implement these from the beginning. You should implement them while you build your integrations. I've seen the habit of adding these things while productifying the new implementation, one week before shipping, before releasing. I think that's an anti-pattern. You should add them as soon as you build the integration, so that you can actually test them, tune them, and know how they react, how the system reacts, because these things are sometimes really hard to predict as complexity increases in a bigger system. So you face the error scenarios, the tough situations, while building, as early as possible. The next thing is testing. I'm going to skip things like the test pyramid and how you write tests; I think we'll approach testing at a little higher level, in terms of how you test your product.
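The downstream patterns just described, retries with exponential backoff plus circuit breaking, can be combined in a small client wrapper. This is a minimal illustrative sketch, not a production implementation; the class and parameter names are my own, and a real client would also set transport-level timeouts on each call.

```python
import time

class CircuitBreaker:
    """Open the circuit after `max_failures` consecutive errors and
    fail fast until `reset_after` seconds give the service time to recover."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: allow a trial call only after the cool-down window.
        return time.monotonic() - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()

def call_with_retries(fn, breaker, attempts=3, base_delay=0.1):
    """Retry with exponential backoff, respecting the circuit breaker."""
    for attempt in range(attempts):
        if not breaker.allow():
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
            breaker.record_success()
            return result
        except Exception:
            breaker.record_failure()
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
```

Failing fast when the circuit is open is what frees your resources to do other work instead of waiting 30 seconds for a dead dependency.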
I think it's great if you have a staging or testing environment that you can ship to while building, so that before you release to your users, you can actually get feedback and see how your system behaves. There are two things we need to make sure we cover here: load and diversity. Load, as in we need to make sure we test our system with the load it's going to see once we release, because we need to face those scenarios before our users do. And the other one is diversity. If you're dogfooding, testing your features yourself, don't test with just one user; try to create scenarios as diverse as what real life will present. One really good practice you can apply for this is shadow releasing. Shadow releasing means you release a feature, and I think big tech does this all the time, but only to a portion of your users, without the big announcement and without marketing it. You just release it for a portion of your users, so that you can cover the most scenarios, get the most feedback, and find edge and error cases. This is a really good way of testing that will give you load and, most importantly, the diversity you need to make sure your product is working well. The next one is building for resilience. This is a really good heuristic, and I want to put as much content as I can on the slide, so maybe you can even take a screenshot. Building for resilience: what does it mean? I think it's a really good practice to map out possible problematic and error scenarios. What could those scenarios be? For example, there could be a sudden increase in the ingress load, and your database may become a bottleneck. You should map this out and imagine how it would play out. Or, for example, one of your downstream API calls is being throttled: what happens next?
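A quick aside on the shadow releasing practice mentioned above, before the talk continues with more failure scenarios: one common way to pick a stable slice of users is deterministic hashing, so each user stays in or out of the cohort across requests. This is a hypothetical sketch; the cohort size, salt, and function names are made up for illustration.

```python
import hashlib

def in_shadow_cohort(user_id: str, percent: float, salt: str = "shadow-v1") -> bool:
    """Deterministically place `percent`% of users into the shadow release.
    Hashing (rather than random choice) keeps each user's experience stable
    across requests; changing the salt re-shuffles cohorts per experiment."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10_000      # 0..9999
    return bucket < percent * 100              # e.g. 5.0% -> buckets 0..499

# Route a small, diverse slice of real traffic to the new code path.
def handle_request(user_id: str) -> str:
    if in_shadow_cohort(user_id, percent=5.0):
        return "new-feature"
    return "stable"
```

Because the cohort is a random-looking cross-section of real users, it delivers both the load and the diversity the talk asks for, without any announcement.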
Or your cloud provider is having issues, or your caching cluster is unavailable. It's really crucial and super helpful to decide how to react to these before they actually happen, instead of at 2:00 a.m. on a Saturday night, with one person on call, woken up from sleep, making the decision alone. You should make this decision as a team, beforehand, then perhaps build the tools around it so you can overcome these problems as well as possible, and document it. Have a runbook. Other good patterns I can mention about building for resilience are auto-scaling and warm-up. Most infrastructure providers support some kind of auto-scaling; if you know how to do this, it can be really helpful if used consciously. And also warming up: for example, if you know that every Monday morning, or let's say every Saturday night, you have a huge load, you can warm up your infrastructure beforehand so that you stay resilient under high load. The next two things I think are really important, and we should be opting in for these concepts whenever we can. The first of them is immutability, a concept that's been getting highly popular in the last ten years with data storage, as data analytics and data processing work has become more common. One important thing immutability gives us is the ability to retry parts of our flow. If part of your flow is immutable and an error happens, or there's high load or some other issue, you can just retry it later, and that gives you a lot of power to overcome issues when they happen. And most of the time it's not a matter of if, it's a matter of when. The next one is compartmentalizing. Compartmentalizing lets us deprioritize less important, non-time-sensitive tasks, for example. And the other thing is that you can scale compartments separately.
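To make the immutability point concrete before the compartmentalizing example continues: immutable, replayable updates make retries safe, because applying the same update twice cannot corrupt state. A hypothetical sketch, loosely inspired by order status updates (the class and field names are my own):

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen = immutable: fields cannot change after creation
class StatusUpdate:
    order_id: str
    status: str
    sequence: int  # monotonically increasing per order

class OrderProjection:
    """Because updates are immutable values, replaying them after a failure
    is safe: applying the same update twice leaves the same final state."""

    def __init__(self):
        self.latest: dict[str, StatusUpdate] = {}

    def apply(self, update: StatusUpdate) -> None:
        current = self.latest.get(update.order_id)
        # Idempotent: ignore duplicates and out-of-order replays.
        if current is None or update.sequence > current.sequence:
            self.latest[update.order_id] = update
```

With this shape, "just retry it later" is genuinely safe: a crashed consumer can re-read the same immutable updates from the start without double-applying anything.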
For example, if your system reads a message and then does some job with it, and you compartmentalize reading the message from the different jobs you need to do, such as notifying your users, then at a high-load time when you need to send 50 million notifications, you can scale that part separately without touching your whole app. It is really powerful. And the last one is runtime configuration management. This is a more or less well-adopted and quite straightforward approach, but it really helps to have configuration management that your system can read without the need to deploy new code. For example, you can have kill switches, you can change your infrastructure, even the number of replicas, or you can change certain thresholds during runtime, depending on the state of the system: high load, low load, whether you have an issue or not. I think this is a really good tool during tough situations as well. So the next one is observability. I'm going to move myself just here. Perfect. The first thing about observability is: add it while you build. This is really crucial. Again, don't add it while productifying your product in the last week; add it while you build, because otherwise it's really easy to miss things. I think it's really straightforward to add it while you build, instead of, with respect, being lazy and adding it later. Because later, in the last week, if you're trying to add metrics to a flow that was implemented a few months ago, it's far easier to miss things, and it's really painful to realize during an incident that you're missing a metric you wish was there and could just go back and add. So my first recommendation would be: add it while you build. And the other good recommendation, I have to start with a question.
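The runtime configuration idea above, kill switches and thresholds changed without a deploy, might be sketched like this. In reality the values would come from a config service or a watched key-value store; here a plain in-memory store stands in for that backend, and all names are hypothetical.

```python
import threading

class RuntimeConfig:
    """Configuration the running system re-reads without a redeploy.
    In production this would be backed by a config service or watched
    store; a locked dict stands in for that backend here."""

    def __init__(self, initial: dict):
        self._values = dict(initial)
        self._lock = threading.Lock()

    def update(self, key: str, value) -> None:
        # Called when an operator flips a value in the backend.
        with self._lock:
            self._values[key] = value

    def get(self, key: str, default=None):
        with self._lock:
            return self._values.get(key, default)

config = RuntimeConfig({"notifications_enabled": True, "max_batch_size": 500})

def send_notifications(batch: list) -> int:
    if not config.get("notifications_enabled"):   # kill switch
        return 0                                  # shed the work entirely
    limit = config.get("max_batch_size")
    return len(batch[:limit])                     # returns how many were sent
```

During an incident, an operator flips `notifications_enabled` off or lowers `max_batch_size`, and the running process picks up the change on its next read, with no deploy in the critical path.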
Start with the question: how do we know if our product is working well? What could be the data that shows us, proves to us, that our product is working well? This could be, for example: the success rate of an API call is above this level; the p95 response time for any given user request is below this level; the number of requests per second is around this level. It could be things like this. You start with a question, but also know that you may not know all the questions you'll need. So emit data generously, really generously; there are efficient libraries for this. I think that's a good heuristic. And start alerting from day one. Start getting those alerts so that you can tune them, test them, and know how to react to them. Next, I'm going to talk about modes of systems, where systems behave differently under different conditions. Sometimes those modes are really hard to predict because of the complexity these systems contain. So it's really good to put our system under harsh conditions before those conditions happen when we're not expecting them, right? Performance testing is great for this: put your system under high load, high stress, so that you see how it behaves. And if performance is critical for you, start testing early. Put it in as a gateway, so that you can make sure that after every change to your logic or your infrastructure, you're not regressing. And another superpower that I cannot recommend enough is running game days: put your system under harsh conditions, simulate that your cloud provider is unreachable, that your cache cluster is gone, so that you face these things before they actually happen in the real world. So let's recap. What have I said? Set your priorities clearly, broadcast them, make them known. Your architecture will evolve; let's adopt evolutionary architecture.
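The "start with a question" metrics just described (success rate above a level, p95 response time below a level) can be turned into a simple alert rule. The thresholds and function names below are illustrative only, not recommendations from the talk.

```python
def percentile(values: list, pct: float):
    """Nearest-rank percentile (pct in 0..100) over a non-empty list."""
    ordered = sorted(values)
    rank = max(0, min(len(ordered) - 1, round(pct / 100 * len(ordered)) - 1))
    return ordered[rank]

def evaluate_alerts(requests: list) -> list[str]:
    """`requests` is a list of (status_code, latency_ms) tuples.
    Returns the alert conditions currently firing."""
    success = sum(1 for code, _ in requests if code < 500) / len(requests)
    p95 = percentile([ms for _, ms in requests], 95)
    alerts = []
    if success < 0.99:
        alerts.append("success-rate below 99%")
    if p95 > 300:
        alerts.append("p95 latency above 300 ms")
    return alerts
```

Wiring a rule like this up on day one, even with rough thresholds, is what lets you tune the alerts before an incident forces you to.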
And a good heuristic is optimizing initially for less complexity. Have a walking skeleton from week one. Face tough tasks first. We need to secure integrations; they're important parts of our system. We should map out our incident scenarios and create a runbook with a detailed explanation of what to do during tough times, during problems. Build optimizing for resilience: immutability and compartmentalizing are our friends. Observability: we should have it from day one, with alerting included. And performance testing is probably a really good idea, even if performance is not a highly critical priority of yours. Thank you so much. I tried to go a bit fast, and I may have skipped some bits. Feel free to reach out to me by email or via LinkedIn. I hope you enjoyed it, and I hope you got some things out of this that you can apply to your day-to-day job. Really happy for the chance, and hope to hear from you.

Sinan Kucukkoseler

Senior Software Engineer @ Shopify



