Conf42 Cloud Native 2021 - Online

The Need for Speed (in your CI/CD setup)

Abstract

What’s the longest CI/CD build you’ve worked with? Mine once took a whole day. This is a tale of creating and taming that monster build.

When developing software and maintaining CI/CD and testing pipelines, we are often compelled to increase our test coverage by adding more tests, and thereby improve our apps’ quality. After all, more automation equals better software, right?

There’s a flip side to this equation, however, and a point at which we start seeing diminishing returns from each test we add. Taken to an extreme, these diminishing returns begin to actively harm our ability to deliver working software.

In this talk we will look at a tale of creating and taming a monster of an all-day build (one that really happened to me once), and cover tips, tricks, and tools to help you avoid that scenario in the future - from obvious suggestions such as adding resources to your build machines, to less obvious ones like removing tests altogether.

This talk will cover tips, tricks, and tools that help you speed up builds - from adding resources to test splitting to bringing a hatchet to your test suite.

Summary

  • Zan: What's a fast build to you? Or, on the other hand, what's a slow build for you? Have you ever had to deal with something so agonizingly slow that you thought about turning the CI/CD off and just going manual again? My talk is about how we could make this better.
  • When do you know you have a problem with slow builds? The more you measure, and the clearer you can present it to yourself, the better. With CircleCI, we have the Insights feature to drill down into your builds.
  • The first thing to do to increase horsepower is to add more resources to your builds. The next thing is to think about whether you can go parallel; by that I mean speeding up your work by running it in different streams. We can also utilize the cache to essentially deduplicate parts of our work.
  • What is the right build time for your team? Time to recovery tends to be more important than the actual build times. Your signal, your CI/CD's cargo, if you will, is the most precious cargo. Please reach out to Zan Markan if you have any further questions.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Good morning, good afternoon, good evening, wherever you are. Thank you for having me over at Conf42 to present my talk, titled The Need for Speed. First off, I'd like you to have a think for yourself: what's a fast build to you? How long does it take? Or, on the other hand, what's a slow build? What's a frustratingly slow build for you? And to continue from there, have you ever had to deal with something so agonizingly slow that you thought about just turning the CI/CD off and going manual again? I certainly have. And the story of my worst ever build went something like this.

I wake up one morning and the first thing I do is obviously check my Twitter, and the next thing I do is check my email. It just naturally comes like that. And at the top of my list of emails is an alert from my CI tool, saying that the build has failed. The build from yesterday. I mean, that's fine, right? We make some coffee, open up the laptop, make some changes, think, okay, this should do it. Commit, push, forget about it, go to work. Spend all day at work doing all that developer stuff that developers do.

Anyway, the story is some six years old, from 2015 or so. I was working on a mobile team, and we were really following the best engineering practices that we knew at the time. Essentially everything was written according to the SOLID principles, we were adhering almost religiously to clean architecture, and we were also writing everything test first. Not only unit tests, so TDD, but also functional tests first. What this meant is that for every single feature, every single part of the application, before we would start developing, we had a functional test agreed on with the product manager describing what that feature should actually do. For example, if you were doing email verification, that's the functional test that you had, and those functional tests were testing the UI. So as far as we were concerned, everything was 100% tested. We had tests before we actually started development, and this was the best practice we could think of.

Anyway, I'm at work, just finishing up my day, when I get another email. And yes, you may see where this is going. It is from the friendly neighborhood CI. Guess what it's telling me? It's telling me that the build has failed again. Still. I mean, it took a bit of time, it's the same build that I kicked off in the morning, but that's still fine, right? It's just two failing builds in a row. What could ever go wrong? This happens. I must have missed something. So I'll just make some more changes, kick off another build, go home, and tomorrow morning everything is going to be better. Right?

Well, this would have been a short talk if it was. It wasn't better. And the day after, it also wasn't better. It still wasn't better. It went on for several days, maybe even weeks. Before you know it, we are facing this long, long queue of features, bug fixes, improvements, refactorings, everything. We have all this new code ready to be released, but unable to be released, because we were stuck sitting at this junction, looking at the red signal from the CI. We couldn't really release anything because, yeah, we were just stuck with a failing build. And this is the point where you start to think: do you just turn around and go manual? Just like that ship that blocked the Suez Canal a couple of weeks ago?
I imagine the captains of the ships behind it, waiting to go through the Suez Canal, were thinking, okay, do we wait for this to clear up, or do we turn around and go around Africa to get to our destination, whether that's Europe or Asia? That's the DevOps equivalent of waiting on a build to clear and considering going manual. But the thing is, every red build, even if it's super agonizing, even if you're really frustrated by it, is still sending you a signal. And that signal is why we have CI systems in place in the first place. Instead of starting to think about how we could bypass it, we should start to think about how we make it better.

My name is Zan. I'm a developer advocate at CircleCI, and I have broken many, many builds, definitely more than I would wish to admit. If you want to get in touch with me, I'm quite active on Twitter, and you can also email me at zan@circleci.com. But yeah, this talk is not about giving up. This talk is not about turning CI off and going manual. This talk is about making your builds faster, and less frustrating in the process.

I mentioned I work for CircleCI. We're a CI/CD platform. While all the examples are going to be agnostic of the technologies and tools we'll be talking about, I will be sharing concrete examples of how to do things with CircleCI. So I'll take a couple of minutes to talk you through how CircleCI looks, works, and how it's configured, so you have better context for how everything falls into place.

The core of everything is the pipeline, which is what gets triggered when you commit something: you push some code, that triggers the pipeline, and it runs everything that you want it to do, from running tests to deploying things to wherever they need to go. The pipeline is defined in the .circleci folder at the top of your repository, in the config.yml file. In this file, as I said, everything is defined, including jobs, orbs and workflows.

Jobs are your idea of what needs to happen and where, "where" being the environment. That could be a Docker container or a virtual machine; we call it an executor. In this environment we also specify a set of steps. Steps are the instructions that we want to run in order to carry out that build or that job. They can be command line instructions like npm run test, or things that tell CircleCI to do something, like checkout, which will check out the code at that particular commit. Or they can be more complex commands that are stored in these things we call orbs, which are reusable sets of config that you can pull in from our central repository to get access to common commands, for example Kubernetes commands or Docker commands. You can either write it all manually or use an orb to do it for you.

The last thing to do, after we've defined all our jobs and our steps, is to put them into the context of a workflow. A workflow essentially lets you order and arrange those jobs so that they get run however you wish to run them. For example, you can say, okay, first run all these jobs, and only when these pass, run something else. So that's the configuration; it's only one YAML file.
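To make that concrete before we move on to the dashboard, here is a minimal sketch of the kind of config.yml just described. The orb, image tags, commands, and job names are illustrative assumptions, not the project from the talk:

    version: 2.1

    orbs:
      node: circleci/node@5          # a reusable set of config pulled in from the orb registry

    jobs:
      test:                          # a job: what needs to happen, and where
        docker:
          - image: cimg/node:16.13   # the executor environment; image tag is illustrative
        steps:
          - checkout                 # built-in step: check out the code at this commit
          - node/install-packages    # orb command: install npm dependencies (with caching)
          - run: npm run test        # a plain command line step

      deploy:
        docker:
          - image: cimg/base:stable
        steps:
          - checkout
          - run: echo "deployment command goes here"   # placeholder, not a real deployment

    workflows:
      build-and-deploy:              # the workflow orders and arranges the jobs
        jobs:
          - test
          - deploy:
              requires:
                - test               # deploy only runs once test has passed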
The other part is the dashboard, which gives you a view of what's going on in your entire organization. Here I'm looking at my own GitHub organization, but you could have your company's organization, anyone's. In there you will see all the projects. You can go into a project, for example, let's go back to the one that I was looking at earlier. For this API you can see all the pipelines that have been run; for example, we've run it 96 times. Inside a pipeline you can see, okay, these are the jobs that have run, in this order, and so on. And by going into a job, you see all the steps, so you have a visual representation of what happens and by whom. You can even go to the commit and so on. So that's the idea of how CircleCI looks and what it does, so that when I go ahead with this talk and you see examples, you'll have a clearer picture of what's happening and how.

Anyway, back to the talk. Slow builds: when do you know you have a problem? Obviously I've talked about one example, which is a very acute, gapingly obvious one: when you're considering turning off your CI because it's not going anywhere, then you definitely have a problem. On the other hand, you might notice that your builds are just becoming slower and slower. That's very hard to see quickly, because builds don't become slower from today to tomorrow. They become slower with weeks and months of work, adding features, adding commits, growing the code base. That's how builds become slow: you add all that complexity, but you do it so gradually.

Or maybe you're looking at how often builds break, or how often builds stay broken, which is probably an even more important metric. You may be looking at how a month ago it took you five minutes to fix a build and now it takes 20 minutes. Where did that complexity emerge from? You can trace it from there. But my favorite example doesn't take any scientific tools to detect: it's basically asking your team. Your team will tell you that they're struggling with the CI. They're going to complain about it in every single retrospective or every one-on-one you have with them. Maybe you hear it from just one person, but if you hear it from multiple people on the team repeatedly, then you know you have something to work with.

Okay, so what do we do when we know that we have a slow build? First we need to figure out what's slow, and to do that we need to actually measure something. Measuring comes in many shapes and forms, from very rudimentary measurements, like how long the entire build or pipeline run takes, to, for example, how long an individual job stays red, or how long it took you to take it from red to green, or how often it goes red. All of these things are pretty easy to measure. You can essentially track them in a spreadsheet if you have no other tools, but usually you have tools at your disposal that are much better suited for that. In a lot of cases, you'll be able to drill down to an individual job and say, okay, these are my functional tests, these are my integration tests, these are my unit tests, this is my deployment, and see on a per-job basis what's taking so long and why, so you can pinpoint the problem a lot more easily. And in some cases you can even drill down into an individual test suite within a job, and you'll see, okay, these tests, this subset of all my tests, is taking the longest, or is the flakiest and is causing most of the builds to fail in the process.
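For that kind of per-test drill-down (and for the timing-based test splitting we'll get to later) to have data to work with, the test job needs to publish its results. Here is a minimal sketch, assuming the node orb from the earlier sketch is declared and the suite uses Jest with the jest-junit reporter installed; the flags, paths, and job name are assumptions, not from the talk:

    jobs:
      functional-tests:
        docker:
          - image: cimg/node:16.13
        steps:
          - checkout
          - node/install-packages             # node orb command, assumes the orb is declared
          - run:
              name: Run tests with a JUnit reporter
              command: npm test -- --reporters=default --reporters=jest-junit   # assumes Jest plus jest-junit
              environment:
                JEST_JUNIT_OUTPUT_DIR: test-results
          - store_test_results:               # surfaces per-test names, timings and failures in the UI
              path: test-results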
So depending on the tools you have at your disposal, you can measure different things. But yeah, the more you measure and the clearer you can present it to yourself, the better. With CircleCI, we have the Insights feature, which lives in the dashboard and gives you everything from a bird's-eye view down to a drill-down view of your builds. You have historic views of how things have fared: have your builds increased in duration over the last 90 or so days? Which builds have the greatest or lowest success rate? We also recently rolled out a test drill-down feature in preview, which lets you see how your individual tests or test components are faring, which is pretty cool.

So, yeah, now we've hopefully identified what's going on, and we know enough to actually start improving our builds so that they're faster and less agonizing for us. The first thing to do is increase horsepower, or its computing equivalent: add more resources to your builds. That's a very easy, very straightforward thing to do. But how do you know that your jobs are running on underpowered executor machines? You'll see out of memory errors; that's a pretty clear example. You have an out of memory error, add more memory, you'll be good. Sometimes you know that the tools you're using to build and run your tests can benefit from more threads or more CPU cores, so you can experiment there. My personal favorite example is when something runs faster on your local developer laptop than it does on your CI. That's usually because your developer laptop is, I don't know, an i7 with 16 gigs of RAM, so ample horsepower to work with. If that runs your build in three to five minutes, and your CI runs it in 20 minutes, and it turns out that your CI has two or four gigs of RAM at its disposal and far fewer CPU cores, that's probably why your builds are slow. Take it closer to what your laptop is working with and you've got an instant improvement. And it's a very, very easy thing to do, because with CircleCI, for example, you just specify a resource class and job done. It's as easy as that.

The next thing to do is start thinking about whether you can go parallel. By that I mean speeding up all your work by running it in different streams, as opposed to running it in a single stream. The first thing is to make sure that your jobs run in parallel themselves. By default, that's what happens in CircleCI, so you don't have to worry about that, but depending on what tool you're using, your jobs may not run in parallel by default. So you make sure, okay, we're running everything, unit tests, security scans, functional tests, all at the same time, and only when they're all green do we continue with the latter stages of the build, which might be creating a production build or deploying it somewhere. The other one, which is more useful, I would say, is splitting tests within an individual test suite. If you can imagine, our functional test suite had hundreds of functional tests that took most of the build time to complete, so you can actually split those and make sure that they run in multiple parallel streams. The first thing to do is specify the parallelism, and then use the CircleCI CLI to tell it how to split or arrange those tests. My favorite way is by time: you run it once, which gives it an idea of how long each test takes, and then it will figure out a way to split your test job into chunks that run in roughly the same time, which is pretty cool.
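Putting the resource class, the parallelism, and the timing-based splitting together, here is a hedged sketch of what such a test job might look like. The resource class, glob pattern, and test command are illustrative assumptions about a JavaScript project, not the setup from the talk:

    jobs:
      functional-tests:
        docker:
          - image: cimg/node:16.13
        resource_class: large                # e.g. 4 vCPUs / 8 GB RAM instead of the default medium
        parallelism: 4                       # run this job as 4 identical containers
        steps:
          - checkout
          - node/install-packages
          - run:
              name: Run this container's share of the tests
              command: |
                TESTS=$(circleci tests glob "test/**/*.spec.js" | circleci tests split --split-by=timings)
                npm test -- $TESTS
          - store_test_results:              # the timing data that future splits are based on
              path: test-results             # assumes the JUnit reporter from the earlier sketch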
Now we've used the right-sized machines and we're running everything nicely in parallel. What else can we do? We can utilize the cache to essentially deduplicate parts of our work, because that's what caching really is. I like to think of caching in two types: what you need, and what you build. What you need is all the dependencies you need before you can actually start doing a build or running tests. The git cache is one example, but you usually don't have to worry about that, because it's cached by default in most places anyway. The second one is dependencies. If you've ever run create-react-app, you can see how long it takes to install all the npm dependencies, and that's just a React application, just JavaScript. Different languages and different frameworks will have different tools for managing dependencies, so all of that depends, but essentially, once you've loaded them once, you can cache them, and they can be reused across subsequent builds. As a matter of fact, caching usually comes for free when you're using some of our orbs, because the node orb, for example, has a command called install-packages which will utilize the cache by default. So it's just one command for you to run, and under the hood it's actually doing five or six steps to restore the cache and save it, so that only when your package-lock.json changes does it reload things. Otherwise it just keeps using whatever it has in the cache, for the duration of that cache anyway.

Lastly, if you are running your builds inside Docker containers and you're relying on a lot of installation, so if you need to install other tools that aren't built into the image you get, like the CircleCI Python image or Node image, you can think about creating your own Docker image that has all those tools bundled already. That saves you the installation time, especially if you need to compile some of the dependencies each time; you can imagine that can take quite a lot of time, so this speeds it up.

The second part of caching is what you actually build. I'm talking about build artifacts: from finished applications that you want to run functional tests against, to intermediate artifacts that are used by different parts of the build. If you're building Android applications, for example, the build generates a lot of intermediate dex files and modules that you can easily reuse, which is pretty cool. Lastly, if you are building Docker images, so previously we mentioned using Docker images, but if you're building Docker images to deploy to Kubernetes or somewhere, you can actually turn on Docker layer caching. If you're thinking about how a Dockerfile is written, you have RUN commands, COPY commands, all of those different commands, and they will be automatically cached up to the point where you have made any changes, which is pretty cool as well.
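To make the two kinds of caching concrete, here are two hedged sketches. The first is roughly what a hand-written dependency cache looks like, and approximately what the node orb's install-packages command wraps up for you; the cache key prefix and paths are illustrative assumptions:

    jobs:
      build-and-test:
        docker:
          - image: cimg/node:16.13
        steps:
          - checkout
          - restore_cache:
              keys:
                - deps-v1-{{ checksum "package-lock.json" }}   # exact match: lockfile unchanged
                - deps-v1-                                     # fallback: most recent partial match
          - run: npm ci
          - save_cache:
              key: deps-v1-{{ checksum "package-lock.json" }}
              paths:
                - ~/.npm                                       # the npm download cache

And in a job that builds Docker images, Docker layer caching is a one-line addition to the remote Docker setup, where your plan supports it (the image name is illustrative):

          - setup_remote_docker:
              docker_layer_caching: true     # reuse unchanged image layers between builds
          - run: docker build -t my-app:latest .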
Beyond caching and beyond parallelism, we have to get a bit cleverer. We are out of pure technical solutions for our problem. We really need to think about how we can reduce the scope of our work to make everything go faster, because when fewer things run, things go faster. In our case, that was where we could get a lot of gains, because we had a lot of tests. We had a lot of tests for every single feature, and if you can imagine, some of those features were very critical to test. For example, in a shopping application, if someone can't register or make purchases, that's a pretty critical aspect of the app. But on the other hand, if they can't change their profile picture, or change their preferences for how the homepage should look, that might not be the most critical feature. So you really have to think about which tests you really need, especially if you have that many, and which ones don't need to run at every single build. You can always run those later, or asynchronously. For most of the builds, you really want to make sure they run as fast as possible: you run the most valuable tests all the time, and the less valuable ones less frequently.

You can also make sure that your workflows reflect what your actual development workflow is. For example, if I'm working on a feature branch, I don't want to run all the tests on every single commit; only when I'm merging to main do I actually want to run a more substantial chunk of tests. And maybe only when we're planning a release and pushing a release tag do we run the entire scope of the tests. So you really think about what needs to run, and how, and in which case, and that really speeds up your entire process.

And lastly, get your team on board. DevOps is a cultural movement first and foremost, and culture starts with a team. You can't be successful in implementing CI/CD if you're the only person implementing or maintaining the CI/CD pipelines. You have to get others on board: for maintenance, for writing improvements to your pipelines, for making sure that you fix builds effectively. Really, it's crucial that everyone knows where things happen. And it's also very valuable because, if you think about it, if your entire CI/CD pipeline is defined in a single file which tells you the whole story of how tests are run, which tests are run, and what happens when you need to deploy, then adding a new member to the team is basically as easy as telling them to look at your CI/CD build script, and they'll see what the moving parts are and what they might need to look into to start being more productive. And if the entire team is on board, then they can all think about how they could contribute to making this a smooth and positive experience, as opposed to the tedious, frustrating piece of work that we started off with, perhaps.
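Circling back to the point about workflows reflecting your development flow, here is a hedged sketch of branch and tag filters: quick checks on every branch, the heavier suite only on main, and the full suite only on a release tag. The job names and tag pattern are illustrative assumptions:

    workflows:
      commit:
        jobs:
          - unit-tests                 # fast feedback on every push, on every branch
          - functional-tests:
              filters:
                branches:
                  only: main           # heavier suite only when changes land on main
      release:
        jobs:
          - full-test-suite:
              filters:
                tags:
                  only: /^v.*/         # the entire suite only for release tags like v1.2.3
                branches:
                  ignore: /.*/         # and not for ordinary branch pushes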
So, yeah. Now I want to ask you another question, and it's this: what is the right build time for your team? I can talk about build times, I can talk about speeding up your builds, but I don't know what your team does. I don't know how many developers you have on your team, what their skill levels are, or what kind of application or software you're building. So it really falls to your own team to decide what limits or constraints you want to operate within, and to think about where you want to go with your CI/CD pipelines. What you can do, however, is look at the State of Software Delivery report, which we released late last year. Essentially, we looked at an obscene number of projects and organizations that build on CircleCI and how they compare when it comes to build speeds and time to recovery. And yes, time to recovery tends to be more important than the actual build times. You can read about this in those benchmarks and see how you and your team fare compared to others.

So yeah, I alluded to this earlier, but I don't think that the need for speed is the right thing to be looking at, especially not with CI/CD, because it's not a race. It much more closely resembles an ambulance. Yes, it should go fast; an ambulance does go very, very fast if it needs to, but it also needs to go reliably, because it's carrying that signal that we talked about, a signal that is only useful if it emerges at the end as intact as possible. You can have a CI build run in 5 seconds, but that's probably not going to run all the tests that you actually want or need to run for success. Your signal, your CI/CD's cargo, if you will, is the most precious cargo. And if you see the lights on the ambulance, or the CI goes red, you really should be all hands on deck fixing it, because you don't want to be lingering on with your builds in an unshippable state.

In any case, that was The Need for Speed. My name is Zan Markan. I've greatly enjoyed giving this talk. Please do reach out to me either on Twitter or via email if you have any further questions. Thank you very much.
...

Zan Markan

Developer Advocate @ CircleCI



