Conf42 Chaos Engineering 2020 - Online

- premiere 5PM GMT

Shipping Quality Software in Hostile Environments


Abstract

Everyone loves features, right? Product loves features. Management loves features. The board loves features. Features are what make the users use and the investors invest, right? They certainly make the media pay attention.

Summary

  • Luka Kladaric is the chaos manager at Sekura Collective. He mostly works as an architecture, infrastructure and security consultant. Also a former startup founder and a remote work evangelist. Catch him here if you want to talk about remote engineering teams.
  • Tech debt is the implied cost of an easy solution over a slower and better approach accumulated over time. It's insufficient upfront definition, tight coupling of components, lack of attention to the foundations. Unaddressed tech debt breeds more tech debt.
  • There's no concept of stable. Touching anything triggers a rollout of everything. One commit takes an hour and a half to deploy. Rollbacks take just as long because there's no rollbacks. Outages become a daily occurrence.
  • Tech debt is a pain index with no upper bound, and it can double by the next time you check your balance. It's incredibly difficult to schedule work to address tech debt because nobody explicitly asked for it. A new approach suggests that software development is neither a sprint nor a marathon.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
You might be wondering, who is this guy? So, my name is Luka Kladaric. I am the chaos manager at Sekura Collective. We deal with chaos. We don't do greenfield development, we don't do features, we don't put buttons on websites, we won't build your mobile app. If you are drowning in tech debt, performance, or security issues, we'll put you on the right track and then hand off either to an in-house team or to a different agency for long-term maintenance. I've been a recovering web developer for over ten years. These days I mostly work as an architecture, infrastructure and security consultant. I'm also a former startup founder and a remote work evangelist. I've been working remotely for over ten years, so catch me here if you want to talk about remote engineering teams, remote hiring, et cetera.
So what is a hostile environment? In this context, I consider it any software development shop where engineering is seen and treated as the implementing workforce for ideas that come from outside, with little to no autonomy. Usually product or a similar team owns 100% of engineering time, and you have to claw back time to work on core stuff, which quickly leads to tech debt. I'm sure everyone knows what tech debt is, but just to get the definition out, the one I like is: the implied cost of an easy solution over a slower and better approach, accumulated over time.
So how about a quick example? Tech debt is an API that returns a list of results with no pagination. You start with 20 results and go, yeah, I'll never grow beyond 100 elements. And then three years later you have thousands of elements in the response, the response is ten megabytes each, and so on. It's fragile code that everything runs through, things you're afraid to touch, afraid of breaking things or introducing unexpected behaviors. It's the parts of the code base that nobody wants to touch: if you get a ticket saying something's wrong with user messaging and everyone backs away from their desk and starts staring at the floor, that's a measurable cue that it needs focused effort. It's entire systems that have become too complex to change or deprecate, and you're stuck with them forever because there's no reasonable amount of effort that would fix or replace them. It's also broken tools and processes, but even worse, it's lack of confidence in the build and deploy process. Breakages should not roll out, and deploys should always successfully ship good builds. It's what you wish you could change but can't afford to, for various reasons.
So where does tech debt even come from? It's insufficient upfront definition, tight coupling of components, lack of attention to the foundations, and evolution over time. It's requirements that are still being worked on during development. Sometimes they don't even exist; sometimes you get the requirements when you're halfway done, or 90% done. So, do any of these sound familiar? "We'll get back to that later." No, you won't. "Why does X have to be clean when Y isn't?" You see a pull request with a hard-coded credential or an API key or something, and you say, no credentials in the code base, please, and you get pushback: but look at this cron job from four years ago with the hard-coded SQL connection string and password. Having to answer these questions every couple of weeks is draining. That's tech debt. People pushing back on good solutions is tech debt. "We're a tiny startup, we can't afford perfect code." A: nobody's talking about perfect code. B: no such thing exists.
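To put the earlier pagination example in concrete terms, here is a minimal sketch, assuming a hypothetical Flask-style API; the endpoint and parameter names are invented for illustration and are not from the talk:

```python
# Hypothetical illustration: an endpoint that returns everything versus one
# that caps the page size so the response stays bounded as the data grows.
from flask import Flask, jsonify, request

app = Flask(__name__)

# Stand-in data; imagine this growing from 20 rows to thousands over three years.
RESULTS = [{"id": i, "name": f"item-{i}"} for i in range(10_000)]

@app.route("/results/all")
def all_results():
    # The "easy" version: fine at 20 elements, a ten-megabyte response later.
    return jsonify(RESULTS)

@app.route("/results")
def paged_results():
    # The slower-but-better version: bounded pages, stable response size.
    page = max(int(request.args.get("page", 1)), 1)
    per_page = min(int(request.args.get("per_page", 20)), 100)
    start = (page - 1) * per_page
    return jsonify({
        "items": RESULTS[start:start + per_page],
        "page": page,
        "per_page": per_page,
        "total": len(RESULTS),
    })
```

The framework doesn't matter; the point is that the bounded version costs a little more thought up front and never turns into a ten-megabyte response.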
"We're still trying to prove product-market fit." Once you do, you will need a product that somewhat works and the ability to run at 100%. You don't want your golden moment, when the business side finally works, to be the complete technical breakdown of your stuff. "When we make some money, things will be different." No, they won't. I see whoever's on the left side here has lived through this or is currently living through it. I love it.
It's not just a matter of bad code, bad processes, bad tools, or whatever. It's people learning the bad behavior and then teaching it to the new hires as your team grows. If you have five people with a bad process, just a bad approach to things, and you bring in ten fresh people, there's a chance it works out. But if your company grows to 50 people and you're still stuck in this state, you're going to be bringing people in one by one or five by five. The people trained in the bad ways will outnumber the good folks by an order of magnitude, and the good folks will just be trained in the bad ways too; as the new person, they have no chance of swaying you towards the light.
So what's the actual harm? Unaddressed tech debt breeds more tech debt. Once some amount of it is allowed to survive and thrive, people are more likely to contribute lower-quality solutions and be more offended when asked to improve them. So over time, productivity drops, deadlines slip, and the reasons are manifold. There are objective reasons: it's difficult to achieve results with confidence, and there's a high cognitive load of carefully treading through the existing jungle to make changes. Quality takes a nosedive, and you start to see a clear separation between the new and clean stuff, which was just shipped, and the older code, six or more months old, which is instantly terrible. You'll have the web team ship a brand new application like a beacon of light, and then six months later say, that's no longer how we do things, that should not be looked at as a guide. And it's like, what happened in those six months? How did it become the wrong way of doing things? You end up with no clear root cause for issues or outages; there's nothing you can point to and say, this is what we should have done six months ago to avoid this situation. It's just miserable. You have tanking morale among tech staff, because they are demotivated by working on things that are horrible, and people start avoiding unnecessary improvements. They abandon the Boy Scout rule: leave things better than you found them. It's a death-by-a-thousand-cuts situation, one big pile of sadness for people who are supposed to be championing new and exciting work.
To this you might say, developers are just spoiled, right? False. It's a natural human reaction: when you see a nice street with no garbage, no gum stuck to the pavement, you're not going to spit your gum there. That's just how humans act. You bring anyone from anywhere to somewhere nice, and that's how they're going to act. In a bad neighborhood, you chuck your gum or cigarette butt on the floor because it's already littered, it's already terrible. You shouldn't allow your code base or your systems to become the bad neighborhood, because if you're surrounded by bad neighbors, why should you be any different? It becomes a vicious, self-perpetuating cycle.
So, just a quick case study. I'd like to give you an example of unchecked tech debt over a few years in a regular startup looking for the right product to sell to the right people.
And nobody here is particularly clueless, nobody is particularly evil. It's just the way things are done, and depending on your mixture of founders and initial employees, some processes are set up better, some worse. But eventually you end up with something like this. I get a call one day and they say, we'd love for you to come on as a senior engineer. We've seen your talks, we like your stuff. If you could rotate through the teams and find the place where you can contribute the most; we have a bunch of really great, enthusiastic engineers, but we could use some senior experience to guide that and raise the skill level as a whole in the company. So I go, all right, I join. And within days my alarms start going off. Very nice people all around, clearly talented engineers, hiring from the best, from the top schools. But the tech stack and the tooling around it was really weird; I couldn't wrap my head around it.
There's a massive monolithic git repository. You go, where's the code? It's in the repo. Okay, but which repo? The repo. All right. It's a ten-gigabyte repository hosted in the office. It's very fast for office folks, it's the local network, but there are remote people in the company. Some dude on an island on a shitty DSL halfway around the world is going to have a very bad day if they buy a new laptop or need to re-clone or something. There's no concept of stable. Touching anything triggers a rollout of everything, because there's no way for the build server to know what's affected by any change, so the safest thing to do is just roll out the universe. One commit takes an hour and a half to deploy. Four commits is your deploy pipeline for the workday. Rollbacks take just as long, because there are no rollbacks, of course; you just commit a fix, and it uses the same queue. So if at 9:00 a.m., or 10:00 a.m. let's be honest, you have four commits, that's your pipeline for the day. And if at noon something breaks and you commit a fix immediately, it's not going to go out until the end of the day, unless you go into the build server and manually kill the, I don't know, many dozen jobs in the queue, one for each little system, so that your fix gets deployed. And then you don't know what else you're supposed to deploy, because you just killed the deploy jobs for the entire universe and you don't know which changes should go up.
There's a handcrafted build server. It's a Jenkins box, of course, hosted in the office. There's no record of how it's provisioned or configured; if something were to happen to it, the way you build software is just lost. You've forgotten how you build software. And each job on it is subtly different, even for the same tech. You have Android source code that you build three white-label instances out of, but each of them builds in a different way, so you'll have a commit that breaks one of the builds even though it didn't touch anything related to that specific thing. Stuff like that. There are no local dev environments, so everyone works directly on production systems. This is a great way to ensure people don't experiment, because they'll get into trouble for working on legitimate stuff; it's going to beat experimentation and improvement out of them. People have to use the VPN for everything, even non-technical stuff, so I'm talking about support and product, and a VPN failure is a long coffee break for the entire company. And just-written code is hitting the master database.
There's no database schema versioning. Changes are done directly on the master database with honor-system accounting, so half the changes just don't get recorded because people forget, and there's no way to tell what the database looked like a month ago, no way to have a test environment that is consistently the same as production. Half the servers are not deployable from scratch. This almost guarantees that servers that should be the same, like three instances of a Python service, are different, and you don't know what the difference is, because you have no way to enforce them being the same. Or, even worse, their deployability is unknown or hasn't been tested, so assume it doesn't work. The code review tool is some self-hosted abandonware: bug-ridden, unsupported, very limited. The very thing people use to develop software is constantly imposing limits on them. Outages become a daily occurrence, and the list of causes is too long to mention. You have the main production API server falling down because a PM, a product manager, committed an iOS translation string change. Of course they did it at 7:00 p.m. on a Friday, after hours, on the other side of the planet, so it's 2:00 a.m. where I am. And individual outages are just not worthy of a post-mortem, because there's no reasonable expectation of uptime and everyone is focused on shipping features. You get that "let's just get this one done and then we can talk about refactoring" every single time. You can't refactor eight years of bad decisions; you start approaching the point of rewrites, which are almost always a bad idea to begin with. And every time you say "just this once," it's another step in the wrong direction.
So how do you even begin to fix this? This is just a terrible situation. This is what I walked into, and this isn't six different places, this is one place, and I've left out the more esoteric stuff. I like to call this state not tech debt but tech bankruptcy. It's a point where people don't even know how to move forward, where you get a request to do something and people go, I don't even know how to do that. Every task is big, because it takes hours to get into the context of how careful you need to be.
Now, at the time, the infrastructure team was staffed with rebels, and they were happy to work in the shadows with the blessing of a small part of leadership. So, to stop further deterioration and incrementally improve things, I kind of wiggled my way into the infrastructure team as a developer, and we started working on things. It took over a year and a half to get to the point where I didn't think it was just completely terrible. We wrote it all down: every terrible thing became a ticket. We had a hidden project in our task tracking system called Monsters Under the Bed, and whenever you had a few minutes, say two meetings with a half hour in between, you opened the monsters, thought about one of them, and found a novel way to kill it. The team worked tirelessly to unblock software developers and empower them to ship quality software, and most of the work was done in the shadows, with double accounting for time spent.
How? The build server was rebuilt from scratch with Ansible, in the cloud, so it can easily be scaled up or migrated or whatever. We now have a recipe for the build server, which knows how our software is built and deployed. Build and deploy jobs are defined in code, with no editing via web UI whatsoever.
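The talk doesn't say which tool the jobs were defined in (a Jenkins job DSL is the usual route). As a rough, tool-agnostic sketch of the "defined in code, with inheritance" idea described just after this, the Android white-label case might be modeled something like the following; all names here are hypothetical:

```python
# Hypothetical sketch of "build jobs defined in code, with inheritance".
# A real setup would likely use a Jenkins job DSL, but the shape is the same:
# one base definition, and derived jobs declare only their differences.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class BuildJob:
    name: str
    repo: str
    build_steps: tuple[str, ...] = ()
    deploy_steps: tuple[str, ...] = ()

    def extend(self, **overrides) -> "BuildJob":
        # A derived job is the parent plus explicitly stated differences.
        return replace(self, **overrides)

# One base definition of "here's how we build Android".
android_base = BuildJob(
    name="android-base",
    repo="git@example.invalid:android.git",
    build_steps=("./gradlew assembleRelease",),
    deploy_steps=("upload-artifacts",),
)

# Each white-label instance states only what differs from the base.
android_brand_a = android_base.extend(
    name="android-brand-a",
    build_steps=("./gradlew assembleBrandARelease",),
)
android_brand_b = android_base.extend(
    name="android-brand-b",
    build_steps=("./gradlew assembleBrandBRelease",),
)

if __name__ == "__main__":
    for job in (android_base, android_brand_a, android_brand_b):
        print(job.name, "->", job.build_steps)
```

With a single source of truth like this, the "each job is subtly different and nobody knows why" failure mode described earlier has nowhere to hide.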
And because they're defined in code, you have inheritance. So Android is just, here's how we build Android, and if there are differences between the instances, you extend that job and define only the differences. The monolithic repo was slain. We split it up into 40 smaller ones, and even the first iteration was then split again into even smaller repos. While preparing this, we found three other proposals for killing that repo, and all of them had an all-or-nothing approach: either pause development for a week to revamp the repo, or cut your losses, lose all history, and start fresh, things like that. We built an incremental approach: split out a tiny chunk, pause development for an hour for just one team, and move them to a new repo with their history, with everything. Infrastructure went first and showed the path towards the light to the others. We set up a system where changes trigger a build and deploy only for the affected project, and commit-to-live is measured in seconds, not hours. Some teams opted out; some said no, we like the old approach. They were allowed to stay in the monorepo. A few months later they joined the party: they saw what the other teams were doing and said, we want that.
All servers were rebuilt and redeployed with Ansible. This used to be some 80 machines across 20 different roles. It was all done under the guise of upgrading the fleet to Ubuntu 16. Nobody understood what that was, nobody asked how long it would take, but whenever someone asked about a server whose name had changed, we would just say, oh, it's the new Ubuntu 16 box. What actually happened is that in the background we wrote fresh Ansible to deploy a server that did kind of that thing, iterated on it until it actually could do that thing, and then killed the old hand-woven nonsense and replaced it with the Ansible version. We migrated to modern code review software, away from self-hosted git hosting and self-hosted code review, to GitHub. The VPN is no longer needed for day-to-day work, and not just dev work; the only thing you connect to the VPN for now is the master database, and nobody has write access anyway, even on the VPN, so it's not really useful for dev work, it's useful for fishing out weird things in the data. Local dev environments: no more reviewing stuff that doesn't even build because people are afraid to touch it, and no more running untested code against production. And versioned database schema changes through code-reviewed SQL scripts: there's a code review process for the SQL scripts that actually affect the master database, and with that you can have a test database, because you know which changes are applied and you can keep them in sync, et cetera.
The moral of the story is, don't wait for permission to do your job right. It's always easier to beg for forgiveness anyway. So if you see something broken, fix it. If you don't have time to fix it, write it down, but do come back to it when you can steal a minute. And even if it takes months to make progress, it's worth doing. The team here was well aware of how broken things were, but they thought that was the best they could do. And it wasn't.
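As a deliberately toy sketch of the "versioned database schema changes through code-reviewed SQL scripts" fix described above: assume numbered .sql files living in the repo and a tracking table, all hypothetical; a real team would more likely reach for an off-the-shelf migration tool.

```python
# Toy sketch of versioned schema changes: numbered SQL files live in the repo,
# go through normal code review, and are applied in order exactly once.
# Hypothetical layout; real setups usually use an existing migration tool.
import pathlib
import sqlite3

MIGRATIONS_DIR = pathlib.Path("migrations")  # e.g. 001_create_users.sql, 002_add_index.sql

def apply_migrations(conn: sqlite3.Connection) -> None:
    # Track which scripts have already run on this database.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations (filename TEXT PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT filename FROM schema_migrations")}
    for path in sorted(MIGRATIONS_DIR.glob("*.sql")):
        if path.name in applied:
            continue  # already applied; keeps prod and test databases in sync
        conn.executescript(path.read_text())
        conn.execute(
            "INSERT INTO schema_migrations (filename) VALUES (?)", (path.name,)
        )
        conn.commit()
        print(f"applied {path.name}")

if __name__ == "__main__":
    apply_migrations(sqlite3.connect("app.db"))
```

Because the applied set is recorded, you can stand up a test database that matches production and tell exactly what the schema looked like at any point in history, the two things the honor-system approach couldn't do.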
But if we had pushed for this to be a single massive project, saying okay, from now on we do things right, it's going to take a year to have any measurable impact, it's going to take this many engineers full time for however long, it never would have happened. It would have gotten shot down at the highest level instantly, because "we can't afford that." Instead, we turned a small team into a red team and just went and did it. Job well done, take the rest of the week off, right? Except that's not how things should be. It's horrible. Let's admit that it's horrible. You have a team of people directly subverting the established processes because those processes are failing them, and you have managers going, that's fine, you go do that, don't tell anyone, I'll protect you, but pretend I didn't know. It's terrible.
So how do we do better? Tech debt work is very difficult to sell. It's an unmeasurable amount of pain that increases in unmeasurable ways, and if you put in some amount of effort, you get unmeasurable gains. It's terrible, right? How do you convince someone to go for that when they have the clearly measurable impact of feature work on the other side? And the name itself, tech debt, implies that we have control over it, that we can choose to pay it down when it suits us. But it's not like paying off your credit card. With a credit card, you get a number every month, and you go, okay, this month I'm going to party a lot because it's my birthday, so I'll pay off 10% of the card, and next month I'll stay in all month and pay down 20%, and pretty soon we'll be out of debt. With tech debt, there's no number. It's just a pain index with no upper bound, and it can double by the next time you check your balance. It's incredibly difficult to schedule work to address tech debt, because nobody explicitly asked for it. Nobody asked you to improve the code base; they want something visible, something measurable to be achieved, and this directly takes time away from a measurable goal, which is shipping features. There's a quote I love from a cleaning equipment manufacturer: if you don't schedule time for maintenance, your equipment will schedule it for you. And things will break at the most inopportune time possible.
So I think it's time for a new approach. I recently came across an article that changed the way I think about this, called "Sprints, Marathons and Root Canals" by Gojko Adzic. It suggests that software development is neither a sprint nor a marathon, the standard comparison we often hear, because both of those have clearly defined end states: you finish a sprint, you finish a marathon. Software development is never done. You just stop working on it because of a higher-priority project, or the company shuts down, or whatever, but it's never done, it never reaches an end state. You shouldn't need to put basic hygiene like showering or brushing your teeth on your calendar; it should just happen. You can skip one, but you can't skip a dozen. Having to schedule it means something went horribly wrong, like having to go in for a root canal instead of just brushing your teeth every morning.
So, a new name I've fallen in love with: sustainability work. It's not paying down tech debt, it's making software development sustainable. It means you can keep delivering a healthy amount of software regularly, instead of pushing for tech debt sprints or tech debt days, which every time you do them feel like you're taking something away from the other side. It needs to become a first-tier work item. And it's okay to skip it here and there; you can delay it for a little bit, but the more you do, the more painful the lesson will be.
Agree on a regular budget for sustainability work, either for every team or for every engineer, some percent of time, and tweak it over time. It's a balance, and there's no magic number; it depends on where you are in the lifespan of the company and the software and so on. But work on sustainability stuff all the time, and there's no need to discuss it with people outside the engineering team. Engineers will know which things they keep stumbling over; they will know what their pain points are. So just allocate a budget for sustainability work and let them figure out what to do with that time. Over time you can discuss what the effects were and whether they need more, or whether they can give back some time. And it's fine if a deadline is coming and you say, let's pause sustainability for a week or two, for a sprint or two. But it needs to be there first, and then you give it back to product when there's crunch time. It also helps with morale: there's nothing worse than being told you're not allowed to address the thing that's making your life miserable. So: the term tech debt sends the wrong message. Call it sustainability work. Make sure it's scheduled as part of the regular dev process, not something that needs to push out feature work to end up on the schedule. That's it for me.
...

Luka Kladaric

Founder & Chaos Management @ Sekura



