Conf42 Incident Management 2022 - Online

Automate merging to keep builds healthy at scale


Abstract

Code-submission processes can significantly impact developer productivity, especially as engineering teams scale and codebase complexity grows.

Often, teams that work on a monorepo struggle with keeping their main branch stable, especially as the number of engineers merging changes (and consequently, the number of code submissions per day) grows. This happens because incompatibilities emerge when multiple changes are combined, causing builds to break frequently. This in turn causes costly rollbacks, blocked deployments, and hours of lost engineering time.

Poly-repo setups present their own challenges: synchronizing merges when changes span multiple repositories, rolling back related changes across repos, and testing across multiple build/test pipelines can become coordination time-sinks for developers.

This talk will feature a distillation of various merge strategies that help teams scale, and their associated developer-productivity trade-offs.

Summary

  • Ankit is co-founder and CEO of Aviator, which is building a developer workflow automation platform. He has previously been an engineer at several companies, including Google, Adobe and Shippo. Today's talk is about how to automate merges to keep your builds healthy.
  • There are several reasons why your mainline builds may fail. For instance, there could be stale dependencies between different changes that are merging. These kinds of failures increase exponentially as your team size grows. The solution is merge automation; many companies have launched their own merge queues.
  • Instead of thinking of merges as happening in a serial world, think of them as parallel universes. One way to improve on this is combining some of the strategies discussed before.
  • At Aviator we do automatic merging as well as many other capabilities and automated workflows for developers. We also have a system to manage flaky tests. If you have any questions, you can always reach out to me.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, my name is Ankit and today we will be talking about how to automate merges to keep your builds healthy. Before we go into the details, a little bit about my background: I'm the co-founder and CEO of Aviator. We are building a developer workflow automation platform. Previously I've been an engineer at several companies including Google, Adobe, Shippo, Home, Enjoy and Sunshine. You can also find me on Twitter at ankitxG.

So the first question you might ask is: merging, how hard can it be? All you have to do is press the green button and your changes are merged. Well, let's take a deeper look. Before we go into details, let's think a little bit about the different repository setups that teams use. We have the monorepo, where the entire engineering team is working in a single repository, and we have polyrepos, where every team within a company may be using a separate repository. With a monorepo, the advantages are that it's typically easier to manage dependencies, you can identify vulnerabilities and fix them everywhere at once, and it's easier to do refactoring, especially across projects. You get standardization of tools, and code sharing becomes easy because you can share the same libraries across different projects. With polyrepos, some of the advantages are simpler CI/CD management: every team has its own separate CI, so there are fewer build failures, you have independent build pipelines, and build failures are typically localized within the team. For this conversation, we will primarily be focusing on the monorepo and identifying the challenges a large team faces when working on one.

So the first question you would probably want to ask is: how often do your mainline builds fail, and is your current CI system enough? There are several reasons why your mainline builds may fail. For instance, there could be stale dependencies between different changes that are merging. There could be implicit conflicts, with two developers working on a similar part of the codebase. You can have infrastructure issues, there could be issues with timeouts, and both internal and third-party dependencies can cause failures. And obviously there are race conditions and shared state that cause flaky tests. In this conversation, we will primarily be talking about stale dependencies and implicit conflicts, and why those are important to think about.

To give you an example of why this matters in a monorepo, let's say there are two pull requests being merged, based off the main branch at two different times, and both of them have passing CI. But when you go and merge those changes, the mainline fails. This can happen because both of them may be modifying the same pieces of code in ways that are not compatible with each other. The challenge is that as your team grows, these kinds of issues become more and more common. Eventually teams end up setting up their own build police to make sure that whenever a build fails, somebody is responsible for actually fixing it. People do rollbacks, releases get delayed, and there can be chain reactions where people base their work off of failing branches and then have to figure out how to resolve those issues.
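To make that implicit-conflict scenario concrete, here is a small Python toy model (not from the talk; the function names and the miniature "CI" check are invented for illustration). Each PR passes CI against the main it was based on, yet combining them breaks the build:

```python
# A toy model of the implicit-conflict scenario above; all names are hypothetical.
# The "codebase" is a dict mapping each function to the functions it calls,
# and "CI" simply checks that every call target still exists.

def apply_change(codebase, rename=None, add=None):
    """Return a copy of the codebase with an optional rename and/or new function."""
    code = {name: list(calls) for name, calls in codebase.items()}
    if rename:
        old, new = rename
        code[new] = code.pop(old)
        for calls in code.values():
            calls[:] = [new if c == old else c for c in calls]
    if add:
        name, calls = add
        code[name] = list(calls)
    return code

def ci_passes(codebase):
    """CI fails if any function calls something that no longer exists."""
    return all(callee in codebase for calls in codebase.values() for callee in calls)

base = {"get_user": [], "handle_request": ["get_user"]}
pr1 = {"rename": ("get_user", "fetch_user")}     # renames and fixes existing callers
pr2 = {"add": ("render_profile", ["get_user"])}  # new caller, based on the old main

print(ci_passes(apply_change(base, **pr1)))                        # True
print(ci_passes(apply_change(base, **pr2)))                        # True
print(ci_passes(apply_change(apply_change(base, **pr1), **pr2)))   # False: main breaks
```

Neither PR's CI could catch this on its own, because each branch was tested against a main that no longer reflects what will actually be merged.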
These kinds of failures increase exponentially as your team size grows. That's why, at a certain point, it becomes critical to take care of this so that developer productivity is not significantly impacted. You don't want a situation where everyone is just waiting for the builds to become green, nobody can merge any changes because the builds are always broken, and you're losing important developer hours. So what's the solution? The solution is merge automation. Many companies have built this: you may have heard of GitHub launching its own merge queue, GitLab has something similar called merge trains, and there are open source versions, like ours, that provide similar capabilities. We'll dive more into how this merge automation works and how you can adopt it internally.

To give you a very simple example, let's look at this pull request, PR 1. Instead of the developer manually merging the change, they inform the system that it's ready. At that point, instead of merging the change immediately, the system merges the latest main into the pull request and then runs the CI. The advantage is that you're always validating the changes against the most recent main. If the CI passes, the PR gets merged. Meanwhile, if a second PR comes in while that CI is running, it waits for the first PR to merge before it picks up the latest main, and then the system runs the same process; once the second PR passes, it is merged the same way.

Let's look at some stats. Assume for the sake of this conversation that the CI time is about 30 minutes and that a small team is merging about ten PRs a day. If you run this serially, it will take about 5 hours, because you're waiting for each PR's CI to pass before running the next one, and the total amount of CI you will run is about 50. The challenge is that in a big team merging 100 PRs a day, with the CI taking the same amount of time, you're now looking at about 50 hours of merge time. That's unrealistic; no system can merge changes that slowly. So can we do better?

One way is by batching changes. Instead of merging one PR at a time, the system waits for a few PRs to collect before running the CI. The advantage is that these batches reduce the number of CI runs and also help reduce the wait time. In this case, if the CI passes, it merges all four of the PRs together, assuming all of them eventually pass the build. If there's a failure, we bisect the batch so that we can identify which PR is causing the failure and merge the rest. You can imagine this causes a bit of slowdown in the system. So let's look at the stats when there is no failure: in the best case, with a batch size of four, your total merge time drops from 50 hours to twelve and a half hours. That's a significant improvement, and the total number of CI runs is also smaller.
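As a rough sketch of the batching-with-bisection idea just described (a minimal illustration under my own assumptions, not any particular product's implementation; run_ci and merge stand in for your CI trigger and your merge-to-main call):

```python
from typing import Callable, List

def process_batch(prs: List[str],
                  run_ci: Callable[[List[str]], bool],
                  merge: Callable[[List[str]], None]) -> None:
    """Validate a batch of PRs together against the latest main.

    One green CI run merges the whole batch. On failure, bisect the batch so
    the offending PR(s) can be isolated while the compatible PRs still merge.
    """
    if not prs:
        return
    if run_ci(prs):              # one CI run covers latest main + the whole batch
        merge(prs)
        return
    if len(prs) == 1:            # a single failing PR is rejected, not merged
        print(f"rejecting {prs[0]}: it breaks the build against latest main")
        return
    mid = len(prs) // 2          # split and retry each half independently
    process_batch(prs[:mid], run_ci, merge)
    process_batch(prs[mid:], run_ci, merge)
```

With a 30-minute CI and batches of four, the happy path is one CI run per four PRs, which is where the drop from roughly 50 hours to about 12.5 hours for 100 PRs a day comes from; the extra bisection runs are the price you pay when a batch fails.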
But in a real scenario you're going to have failures. If there is even a 10% failure rate, you can see the merge time increases significantly: you're waiting something like 24 hours for all your PRs to merge, and the total number of CI runs also increases significantly. So the question is, can we still do better?

For that, we're going to think about merges slightly differently. Instead of thinking of merges as happening in a serial world, think about them as parallel universes. By parallel universes, what I mean is that instead of thinking of main as a linear path, you think of it as several potential futures that main could represent. To give an example, let's think about optimistic queues. Say your main is at a particular point and a new PR comes in that's ready to merge. We do something similar to before: we pull the latest mainline and create an alternate main branch where we run the CI, and once the CI passes, we eventually merge the changes. But now imagine that while that CI is running, a second PR comes in as well. Instead of waiting for the first CI to pass, we optimistically assume that the first PR is going to pass, and on this alternate main we start a new CI with the second PR included. Once the CI for the first one passes, it merges, and likewise, as soon as the CI for the second one passes, it merges as well. Now, obviously, we have to consider what happens if the CI for the first one fails. In that case, we reject this alternate main, create a new alternate main where we run the rest of the changes, and follow the same pattern, making sure that PR 1 does not merge because of the build failure. In the best case, given that we're no longer waiting for each CI to finish before starting the next, you can technically merge all 100 PRs in less than an hour. In the median case, where we expect about 10% of the PRs to fail, your merge time is still very reasonable: you're merging in 6 hours instead of the twelve and a half hours we were seeing before. Your CI runs are slightly higher in this case, because you can be running multiple CIs at the same time.

So the question is, can we still do better? One way to improve on this is by combining some of the strategies we've discussed. Let's say we combine the optimistic queue with batching: instead of running a CI on every PR, we combine them, so you're running these batches of PRs, and again, as they pass, you merge them; if they fail, you split them up and identify what caused the failure. Let's look at the stats again. The total merge time is still less than 1 hour, but we have reduced the total number of CI runs to 25 instead of 100. Even in the median case, the merge time comes down from 6 hours to 4 hours, and your total number of CI runs is still lower. So can we still do better?
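Before moving on, here is a simplified model of the optimistic queue just described (my own sketch, not Aviator's or GitHub's implementation; start_ci here only records which commits each speculative run would build):

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class OptimisticQueue:
    queue: List[str] = field(default_factory=list)   # PRs awaiting merge, in order
    green: Set[str] = field(default_factory=set)     # PRs whose speculative CI passed

    def enqueue(self, pr: str) -> None:
        """Start CI for `pr` on top of main plus every PR queued ahead of it,
        optimistically assuming those earlier PRs will pass."""
        self.queue.append(pr)
        self.start_ci(pr, ["main"] + self.queue)      # queue already includes pr

    def on_ci_result(self, pr: str, passed: bool) -> None:
        if passed:
            self.green.add(pr)
            # PRs merge strictly in queue order: pop every green PR at the head.
            while self.queue and self.queue[0] in self.green:
                print(f"merging {self.queue.pop(0)}")
            return
        # The optimistic assumption was wrong: evict the failing PR and restart
        # speculative CI for everything that was queued behind it.
        idx = self.queue.index(pr)
        self.queue.pop(idx)
        for i in range(idx, len(self.queue)):
            self.start_ci(self.queue[i], ["main"] + self.queue[:i + 1])

    def start_ci(self, pr: str, commits: List[str]) -> None:
        print(f"CI for {pr}: building {' + '.join(commits)}")
```

For example, after enqueue("PR1") and enqueue("PR2"), a failure reported for PR1 evicts it and restarts PR2's CI on main plus PR2 alone.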
Now let's think about some more concepts. One of them is predictive modeling. Let's think about what happens if we assume all possible scenarios of what main could look like, depending on whether each PR is going to pass or fail. In this case, we have represented three PRs and all possible scenarios where all three of them merge, two of them merge, or one of them merges. If you run CI in this way, then we never have to worry about failures, because we are already running all possible scenarios and we know one of them is going to be successful. The challenge, of course, is that this runs a lot of CI, and we don't want to be running too much CI. This is where it gets interesting: instead of running all of them, we can calculate a score and, based on that, identify which paths are worth pursuing. You can optimize based on the lines of code in a PR, the types of files being modified, tests added or removed in a particular PR, or the number of dependencies. In this case we have specified the cutoff as 0.5, and as you can see, we are running only a few of these build paths, and that's how we reduce the number of CI runs.

So now you're obviously asking, can we still do better? Now we're going to think about the concept of multi-queues. This is applicable in certain specific cases where we can understand the different builds. Instead of thinking of this as a single queue, we think of it as many different paths and many disjoint queues that we can run. We use a concept called affected targets; there are systems like Bazel that actually produce these affected targets. Essentially, if you can identify which builds within your primary repository a particular change impacts, you can create disjoint queues that can be run independently, while making sure they don't impact each other's builds. So let's assume there are four different kinds of builds that your system produces: A, B, C and D, and this is the order the PRs came in. You can potentially create four different queues out of it: the PRs that impact A are in the first queue, the PRs that impact B are in the second queue, and so on. One thing to note here is that a PR can be in more than one queue if it impacts more than one target, and that's totally fine. So for PR 4 or PR 5 to pass, in this case, it needs to wait for PR 2 to pass or fail. But at the same time, in the worst case where a PR fails, we are not impacting all the PRs in the queue, only the ones behind it in that particular queue. This definitely increases the velocity at which you're merging changes, because it localizes failures to a particular affected target. A great example is two separate queues where one target is the backend and one target is the frontend, and there are multiple PRs queued: they can be independently compiled and run while making sure one change is not impacting the other, and that way you can run them in parallel as well as merge them without impacting the builds. So now, can we still do better?
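Here is a rough sketch of that affected-target routing (the PR-to-target mapping below is invented, and in practice the affected targets would come from something like a Bazel query; this is a toy model of the bookkeeping, not a real scheduler):

```python
from collections import defaultdict
from typing import Dict, List, Set

def build_queues(prs_in_order: List[str],
                 affected_targets: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Fan incoming PRs out into one queue per affected build target.

    A PR that touches several targets appears in several queues; a failure
    only blocks PRs queued behind it for the same targets.
    """
    queues: Dict[str, List[str]] = defaultdict(list)
    for pr in prs_in_order:
        for target in affected_targets[pr]:
            queues[target].append(pr)
    return dict(queues)

def ready_to_merge(pr: str, queues: Dict[str, List[str]]) -> bool:
    """A PR is ready when it is at the head of every queue that contains it."""
    owning = [q for q in queues.values() if pr in q]
    return bool(owning) and all(q[0] == pr for q in owning)

# Four hypothetical build targets A-D; the PR-to-target mapping is invented.
targets = {"PR1": {"A"}, "PR2": {"B"}, "PR3": {"A", "C"},
           "PR4": {"B"}, "PR5": {"B", "D"}}
queues = build_queues(["PR1", "PR2", "PR3", "PR4", "PR5"], targets)
print(queues)                         # {'A': ['PR1', 'PR3'], 'B': ['PR2', 'PR4', 'PR5'], ...}
print(ready_to_merge("PR4", queues))  # False: PR4 must wait for PR2 in queue B
```

Note how PR4 sits behind PR2 in queue B, while a failure in queue B never blocks the PRs that only touch A, C or D.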
There are a few other concepts we can use to further optimize these workflows. One of them is reordering changes. For instance, you can select the high-priority changes, or the changes with lower failure risk, and put them ahead in the queue. The advantage is that this avoids a possible chain reaction of failures and reduces the number of failures you can have. You can also order by priority: something that's a really big change can be given lower priority and picked up later. There are other concepts like fail fast, where you reorder the test execution so that the tests which typically fail more often run sooner. That way, as soon as a PR gets queued, you identify these failures early and are able to fail fast. You can also split the test execution. This is what many companies do: they run many of the fast tests before merging, making sure these are the more critical ones, or the ones which fail more often, and then run the smoke tests, or the tests which are typically stable but slower, after the merge. Obviously you expect the tests you run after the merge to fail very rarely, but if they do fail, you can just roll back. So essentially you're trying to get the best of both worlds: your builds are generally passing, and in the rare case where a post-merge test fails, you have a way of automatically rolling back.

That's about it for the strategies. I'm going to share a little bit about some of the other things we at Aviator work on. We also support polyrepo scenarios. To give you a very quick example, let's say you have different repositories for different types of projects and you're merging a change, say you're modifying an API. That's also going to impact how that API interacts with the web, iOS and Android clients. Now the issue is, let's say you make this change, which requires all of them to be modified, but you merge three of them and one of them fails. This can cause eventual inconsistency. Our system has a workflow, which we call chain sets, that defines these dependencies of changes across repositories so you can consider them as a single atomic unit: either all of these PRs merge together, or none of them merge at all.

We also have a system to manage flaky tests. As you can imagine, as your system and teams grow, everyone becomes very familiar with flaky tests: your system becomes unstable, you have shared state or inconsistent test behaviors, and your tests may fail even though you may not have made any change related to them. What our system does is identify all the flaky tests in your runs and provide a customized check that you can use to automatically merge changes. What we typically do here is identify whether a particular flaky test failure is related to your changes or not. If it is not, then we suppress that test in the test report so that you get a clean build health and are able to merge the changes. This works very well when you're thinking about automatic merging, because you can use this check as a validation to make sure your systems are still healthy.
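As a toy illustration of that flaky-test-aware check (a minimal sketch of the idea, not Aviator's actual implementation; inputs such as known_flaky and the test-to-path mapping are hypothetical):

```python
from typing import Dict, Set

def effective_ci_status(test_results: Dict[str, bool],
                        known_flaky: Set[str],
                        test_paths: Dict[str, Set[str]],
                        changed_paths: Set[str]) -> bool:
    """Return True if the PR can be considered green for merging.

    A failing test is suppressed only if it is a known flaky test AND none of
    the paths it covers were touched by this PR; any other failure still blocks.
    """
    for test, passed in test_results.items():
        if passed:
            continue
        unrelated_flake = (test in known_flaky and
                           not (test_paths.get(test, set()) & changed_paths))
        if not unrelated_flake:
            return False   # a real (or possibly related) failure blocks the merge
    return True

# Example: one genuine pass, one known-flaky failure in code this PR never touched.
results = {"test_checkout": True, "test_email_retry": False}
print(effective_ci_status(results,
                          known_flaky={"test_email_retry"},
                          test_paths={"test_email_retry": {"notifications/"}},
                          changed_paths={"payments/api.py"}))   # True -> safe to merge
```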
We also have capabilities to manage stacked PRs. If you have PRs which are stacked on top of each other, we can automatically merge these changes, and there's a CLI to sync all your stacked PRs together as well as queue all of the PRs to be merged together. Again, this follows a similar pattern to the chain sets we described: we automatically identify the dependencies between all these stacked PRs and consider them as an atomic unit, so all of them merge together or none of them merge at all. I think that's about it, so thank you very much. These are some of the references that we used for this talk, and I really appreciate you spending time listening to this. If you have any questions, you can always reach out to me; my email is ankit at aviator Co. And yeah, check us out: at Aviator we do automatic merging as well as many other capabilities and automated workflows for developers. If you have any questions, as I said, you can reach out to me, or you can send me a DM on Twitter. Thank you and have a great day.
...

Ankit Jain

Co-Founder & CEO @ Aviator Technologies



