Conf42 JavaScript 2023 - Online

Road to Zero Lint Failures: Tackling Code Quality Challenges at Scale

Abstract

Lint rules improve developer productivity and minimize errors. But what if your large-scale application contains thousands of lint failures? Discover effective strategies to resolve them so that we can once again rely on lint rules to ensure consistent code quality and a more maintainable codebase.

Summary

  • Chris Ng: For today's talk, we'll go over why lint rules, code quality at scale, and the challenges around that. Lint rules are perfect tools for automating code analysis and catching and even preventing errors. Lint rules are amazing tools for making sure everyone's code is consistent.
  • All new rules must have existing failures resolved. Incentives, incentives, incentives. To get to zero lint failures, everyone must chip in. Identify the owners, identify the errors, keep it up to date.
  • The road to zero lint failures took almost a year. The last 20% was really, really hard to get to zero. Sustainability is key. It's really important that it's not a chore.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, and thank you for attending my talk on the road to zero lint failures. I'm Chris Ng, and I'm a senior staff software engineer at LinkedIn. So let's jump right into it. For today's talk, we'll go over why lint rules, code quality at scale and the challenges around that, how to fix it, or at least the path that we took to zero lint failures, and finally any key results and next steps afterwards. So why lint rules? Why do we care about lint rules? Back in 2017, we always had a saying that if it's not in the docs, it's not a real thing. You can't be commenting on people's pull requests about certain best practices or rules of thumb; if it's not in the docs, then you can't flag it as something that the other person must fix. This became a thing because so many different people had different opinions on code quality and what best practices are, and we really needed some standardization. So lint rules are perfect tools for automating code analysis, catching and even preventing errors, and keeping your code consistent. Common tools you're probably familiar with include ESLint, and there's also Stylelint for your CSS or SCSS files, and a bunch of other tools in the ecosystem for more niche use cases. Lint rules are really amazing tools for us because they're automated. It can start right in your IDE: when you're typing your my-component file, as up here, you can see that ESLint is already catching an unused variable that you can remove before you check this code into your repository. We can run pre-commit scripts, using tools like Husky, to run ESLint on your files, either all of your files or just the changed files, to flag this error again. And then you have another layer at your source control of choice; here it's a screenshot of GitHub, where we run a GitHub Action testing your lint rules on the files that are being changed. As you can see here, there are three chances of catching it: not everyone will have their IDE configured, and not everyone will have their local pre-commit scripts running, so you have different layers, but the idea is to catch it as early as possible. And then we also use Prettier, so there are no more arguments about formatting, no more arguments about tabs and spaces or anything like that. Prettier really removed all the bikeshedding for us and saved a lot of time for our engineers. So lint rules are more than just style issues: Prettier is what we use for style issues; lint rules are what we use for making sure our code is consistent. They even prevent bugs in a lot of cases, and I'll show you an example in a little bit. And more importantly, if you're doing any migrations for your code, and anyone who has code running in production for years and years is always going to need migrations, lint rules are amazing tools for not only identifying which areas you need to migrate, but also making sure that, for the most part, everyone's code is consistent, so it's easier to migrate with either a find-and-replace or a codemod from a v1 to a v2, or even from one framework to a different framework. But yeah, here's a very quick example of a lint rule in action, where it's catching something that might not at first be super obvious.
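For illustration, here is a minimal sketch of the pattern being described; the function name and the count value of 10 are made up for the example.

```js
// A sketch of the pattern described in the talk: the paging object from the
// API may be missing, so the optional chain can short-circuit to undefined.
function getNextStart(data, count) {
  // Flagged by ESLint's no-unsafe-optional-chaining rule (with its
  // disallowArithmeticOperators option): if `data.paging` is absent,
  // this evaluates to undefined + count.
  return data?.paging?.start + count;
}

getNextStart({ paging: { start: 20 } }, 10); // 30, looks fine
getNextStart({}, 10);                        // NaN: undefined + 10
```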
So you have this paging object, and let's say you're updating this paging object that you get from an API, but you want to be a little bit safe, and if your data doesn't come back with a paging object, then you use optional chaining. But if it does come back, then you do some arithmetic: adding the start and the count, which will be the new start for this paging object. Seems pretty good, right? ESLint will actually catch this with the lint rule no-unsafe-optional-chaining, where it would say that this result could potentially become a non-number. Why is that, you might ask? Unfortunately, in this case, if your argument is an empty object, you would come back with undefined plus ten, which equals NaN, and that just blows up your code. Edge cases like this are very easy mistakes to make if you don't have TypeScript in your code. These edge cases will always happen in a large application, where it's just not easy or even possible to ask everyone to check for a condition, or to rely on code reviews, where there might be new hires or more junior engineers reviewing each other's code. And you just can't have that much control. The ship must keep sailing, and there's no way you can single-handedly review everyone's code; you'll also go on vacation and not be around. So yeah, here's a very simple example of a lint rule catching and saving a production issue. So really what we're trying to do is expand the well-lit path so you don't need that constant attention from every single person. Just make it easy to do the right thing, and make sure that the right thing is easy to do and gets followed. So what's the problem? Let's add all the lint rules, just keep adding all the rules, and everyone will have perfect code and achieve total nirvana and we can all go home, right? Unfortunately, with code quality at scale we face a lot of different problems with this approach. But first, what does at scale mean for our application? It meant an application that's been in production for eight years. We had over 24k files, including application and test files; I believe there are over 100k files in the repository, a lot of internationalization files for sure, but it's kind of all over the place. We also had maybe over 100k commits throughout the lifetime of the application, and year over year it's growing in lines of code by 21%. There are over 67 PRs per day written by humans, not automated by any kind of dependency upgrades or anything like that. And most interesting for me is that we had over 800 unique contributors to the repository with at least five commits. So that's a non-trivial amount; that's not someone changing some sort of string or anything like that. Five commits is real work, at the very least. 800 is quite a big number. I don't think that's the number of people working on the product every day, but throughout its life that's quite a lot, and you can't enforce standards on that many people with a small team. So really, the problem with code quality at scale is that we needed lint rules to enforce all of these standards, but we were just increasing the number of failures throughout the application, and they were never getting fixed. I think at our peak we had over 7,000 errors in the application, just hanging out there in various files, and no one was really fixing them.
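To give a sense of how a number like that gets measured, here is a small, hypothetical sketch using ESLint's Node API; the entry point and file pattern are illustrative, and the actual nightly job described later in the talk used Checkup.

```js
// count-failures.js: sum lint errors and warnings across the whole repo.
const { ESLint } = require('eslint');

(async () => {
  const eslint = new ESLint();
  const results = await eslint.lintFiles(['.']); // lint everything the config covers

  // Each result carries per-file errorCount and warningCount totals.
  const errors = results.reduce((sum, r) => sum + r.errorCount, 0);
  const warnings = results.reduce((sum, r) => sum + r.warningCount, 0);
  console.log(`errors: ${errors}, warnings: ${warnings}`);
})();
```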
It was just ever increasing. So ideally we'd follow the campsite analogy, where you leave the campsite better than you found it. If you write a PR, you see a file, you see a lint failure, you fix it as part of your PR, and then over time we get to a place where the number of lint failures goes down. What if your campsite already looks like this when you come to it, though? Are you still going to clean up all of the lint failures, or just maybe a couple? So this was our first foray into limiting these lint failures. We set up a couple of rules. Some of them worked really well; some of them did not work and had to be rolled back. First of all is blocking commits if there's a lint failure in the file. This was really important for us because it really stopped the bleeding. Granted, it was not a popular decision, and you can override it for sure, but it emails everyone. Blocking commits is the best way to do this. The second thing we tried was setting limits, so maybe you can have five instances of no-unused-vars. This kind of worked, but there was so much added work in the accounting and making sure the numbers were right, when really you just want the number to go down. We don't really care that there are five right now, and if it becomes six, fine, there's some sort of special case, but really we just want to keep making the number of failures go down. And then finally, the third way we tried to solve this was a platform review system, or what you can think of as a release-owner model, where someone on your particular team reviews your code, and then someone from a more experienced group does some sort of standardization, so everyone is following the same guidelines. The last two didn't work super well; they worked sometimes, but sometimes they didn't work so well for our use cases. But blocking commits definitely worked. The only problem with blocking commits: revisit the campsite analogy, but add in the case where your manager, or maybe your manager's manager, or maybe even the CEO says, get this in today. What if you need to land something really fast? What we saw was that most people would just move around, and even code around, files with a lot of lint failures and then ship it, done, by overriding or avoiding. Great, right? Let's get this to production to serve customers; that's what we're here for, to serve customers. It's really great, I think, and you can't really fault them, but it's not really fixing our lint failures problem. So to really stop the bleeding even more, we introduced two new things. One was a lint proposal process, and one was a task force to own the linting framework. The point of the task force is to have accountable engineers who will maintain this framework over time until we get to zero lint failures, zero warnings, et cetera. And the point of the lint proposal process was to make sure that when we add new rules, we don't add 1,000 more errors, which was another case where we would suddenly get a bunch of new failures because a new rule was enabled and no one bothered to do even a find-and-replace or something like that, which really made it difficult to get to zero lint failures.
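In ESLint configuration terms, the proposal process described next amounts to a staged rollout along these lines; this is a hypothetical sketch, and the rule name is just an example.

```js
// .eslintrc.js: a sketch of rolling out a new rule without adding new failures.
module.exports = {
  extends: ['eslint:recommended'],
  rules: {
    // Stage 1: ship the new rule as a warning while existing failures are
    // burned down and feedback is collected.
    'no-unsafe-optional-chaining': ['warn', { disallowArithmeticOperators: true }],

    // Stage 2 (only after all existing failures are resolved): promote it.
    // 'no-unsafe-optional-chaining': ['error', { disallowArithmeticOperators: true }],
  },
};
```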
So the proposal process made sure that all existing failures must be resolved before enabling a lint rule as an error. You can enable it as a warning, you can solicit feedback and things like that, but before it becomes an error, all existing failures must be resolved, unless you had a very good reason, usually something that we've seen time and time again take production down; at that point, let's push the rule out and then make sure production is stable. So yeah, this was one of the key turning points for us: all new rules must have existing failures resolved. This was not super popular either, because lint rule authors were questioning things like: what if there are 1,000 errors I need to solve? What if I don't know how to fix the existing errors? What if I don't have the time to do this? What if I break something? What if I break production by fixing a lint failure, and there are just other existing failures in the files I'm cleaning up? But this all revolves around the same theme: in order for us to get to zero lint failures, everyone must chip in, and all of these problems the lint rule authors were facing, the feature teams themselves were facing too. They could be hitting lint errors while trying to ship a different, unrelated feature and cause something production-breaking. What we've seen is that the lint rule author is a lot more familiar with what needs to be fixed, and so they're less likely to break something in production at the end of it. What we've really learned is that not all problems are software engineering problems. Just be a human, talk to people. You can get really far by just talking to people face to face or through any chat app, explaining why you're trying to do this, understanding why it's difficult for the other person, and helping each other get through it together, instead of throwing a package over the wall and not bothering to communicate with the other person. So, on to our road to zero lint failures: we had other initiatives to get it down to zero; everything so far was just to stop the bleeding. It really starts with incentives. Incentives, incentives, incentives. We really need to understand why someone would want to clean up a lint failure in their codebase. Is it just because they need to, because their commit is blocked otherwise? Or is there any other reason for them to clean up lint failures in their team's ownership? So we had these goals of having accountable code owners, prioritizing lint failures, having a business case for getting prioritization, getting some sort of actionable insight for engineers so they know what to actually fix, and tooling to tie it all together. So really we provided a lot of technical support: if someone was trying to clean up a lint failure, we would guide them step by step. The developer experience of the red squiggly lines, for example, sometimes breaks, and our team was always there to help fix those issues. We provided lots of shout-outs to engineers, and very specific shout-outs to engineers who fixed lint failures, identifying exactly what they fixed in their PRs and giving them visibility for all their hard work and adequate recognition for taking that extra time to clean up the lint failure. So, tooling. We kept it really simple.
Identify the owners, identify the errors, keep it up to date, and that's it. Very easy. There are a lot of more sophisticated tools out there, which we're looking into migrating to right about now, but at the very start, make it easy for people to just do their job and clean up the failures. So initially, a single owner per file was the thing we needed to nail down: for every single file, there's only one owner. It doesn't have to be one person; it could be one team, but there's a particular team who owns every single file. Otherwise, you're just playing whack-a-mole, going around teams figuring out who owns a file, if it's a dispersed ownership model. We also created a very basic visualizer tool. As you can see here, we didn't even pick a font; we were just using the default serif font with an HTML table. But we broke it down by team, we broke it down by lint rule, and we also broke it down by team and lint rule. So if you're part of the architects team and you want to find all the no-unused-vars failures, because you want to just delete all of that, you can drill down by that and understand how much time you're committing to resolving those lint failures. For our MVP, it updated nightly via a cron job that just ran ESLint. It ran on someone's VM, which was interesting during COVID, and we counted errors, warnings, and disables. In this presentation, we'll really just focus on errors. Here's a drill-down view of a single lint rule and the files that have this particular lint failure; in this case, it's no-unused-vars. We use Checkup, a tool built by LinkedIn as well, every night. It's a Node task runner that runs ESLint and a bunch of other tooling, and then you can format the ESLint output to a particular format. In our case, there's a Checkup formatter for ESLint that translates it to SARIF, which is a standard format. It's not super digestible when you have thousands of failures, but with few failures, you can click into the file and it'll jump to the particular line of code in your file. This is an example of a SARIF file; your IDE usually has a SARIF-friendly viewer to give you that jump-and-click functionality. After that, once we have the SARIF file, we can create a dashboard. We ended up migrating the dashboard to the company's documentation stack, I believe Docusaurus-based, just because someone's VM environment kept getting shut off, and it was hard to share ownership of it if someone was on vacation or something like that. So yeah, a much prettier UI, easier to use, but it's the same data as the original SARIF version of the dashboard. And then we had a weekly scorecard that we would email to all the teams, where every team got a green, yellow, or red: green if you had a negative week-over-week change in lint failures, meaning you've decreased the number of lint failures under your team's ownership, or you're already at zero; yellow if there's been no increase or decrease in lint failures, so nothing really happened that week; and red if there's a positive increase week over week.
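The scorecard coloring is simple enough to sketch in a few lines; this function is hypothetical, but it mirrors the thresholds just described.

```js
// Weekly scorecard color from last week's and this week's failure counts.
function scorecardColor(lastWeekCount, thisWeekCount) {
  const delta = thisWeekCount - lastWeekCount;
  if (thisWeekCount === 0 || delta < 0) return 'green'; // at zero, or fewer failures
  if (delta === 0) return 'yellow';                     // no change this week
  return 'red';                                         // failures went up
}

scorecardColor(12, 8); // 'green'
scorecardColor(5, 5);  // 'yellow'
scorecardColor(0, 3);  // 'red'
```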
What we've seen is that red usually happens when some code has changed ownership, or something was incorrectly updated in the configs or the dependencies, or something like that. All of these are really good to know early on, rather than someone figuring it out later when they're trying to fix their lint failures and there are a lot of failures they've never even seen before. So actually it's been quite useful to have these reds, although it was never a rogue engineer who just pushed in all these lint failures. This is a sample of a scorecard that we would send out, with the failure count and the green, yellow, or red coloring per team. This was a manual email; we could have automated it for sure, but it was not something we wanted to invest that much time in. What we did invest time in was identifying maybe two or three engineers who fixed big issues that week and giving them a personalized shout-out and a link to their PR, to inspire other people and also to acknowledge that they went above and beyond their duties and helped out the overall community. But at the end of it, while we were trying to burn down the number of lint failures, what we ended up seeing was the 80/20 rule playing out in real life. The first 80% was quite fast, but the last 20%, the last mile, was really, really hard to get to zero. There were a lot of stragglers. At the end, it was a team effort: we asked other teams for help, and our team of volunteers in the task force also just took on these files and cleaned them up ourselves. You might ask, does it really matter? You got 80% down, that seems pretty good, and the last mile was really long. If you look at this chart, almost half of it is the first 80%, and the last mile is the rest, maybe almost another year. Granted, I did go on paternity leave around July 2022, which is right in the middle of this, so I was not able to push the teams as much or run some of these emails, but it still took a long time and a lot of hard work to get through the last mile. But yeah, the last mile does matter. Sustainability is key. If you see a file in your repository like this, it's much easier for someone to say, oh, it's fine for me to add a lint failure, or, if they're touching this file, to say, I'm just going to override, because there are a lot of lint failures that I didn't cause and that were just there. Compare a file like that to a file with no lint failures, where everything's good. It's like having code that's already tested: it's much easier to add a new test than to add in all the test scaffolding for your existing component. So for us, sustainability, making sure we stay at zero, meant that everyone actually had to get to zero lint failures. So in summary, for the road to zero lint failures: we had a rule of no new errors, so when enabling lint rules as errors, there should be no new failures. We had a single owner per file, so we knew which team was accountable for each file. We used a cron job every night, with a tool called Checkup, to calculate how many failures each team had; we also did that for warnings and disables, et cetera.
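As a rough illustration of that nightly per-team tally, here is a hypothetical sketch that reads ESLint's JSON output and groups errors by a made-up ownership map; the actual pipeline described in the talk used Checkup and SARIF.

```js
// tally-by-team.js: count lint errors per owning team from an ESLint report
// produced with `eslint . --format json --output-file eslint-report.json`.
const fs = require('fs');

// Hypothetical single-owner-per-file mapping: path prefix -> owning team.
const owners = {
  'src/profile/': 'profile-team',
  'src/feed/': 'feed-team',
};

function ownerOf(filePath) {
  const prefix = Object.keys(owners).find((p) => filePath.includes(p));
  return prefix ? owners[prefix] : 'unowned';
}

const report = JSON.parse(fs.readFileSync('eslint-report.json', 'utf8'));
const tally = {};

for (const result of report) {
  for (const message of result.messages) {
    // In ESLint's JSON output, severity 2 is an error and 1 is a warning.
    if (message.severity === 2) {
      const team = ownerOf(result.filePath);
      tally[team] = (tally[team] || 0) + 1;
    }
  }
}

console.table(tally);
```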
We had a homegrown dashboard identifying the issues, with a quick link to the GitHub line itself, which was invaluable for engineers who had, say, five minutes in their day and wanted to quickly bang out a PR fixing something; they could quickly identify what to fix. We did a weekly scorecard to keep teams and individuals accountable, to congratulate teams and individuals, and for some healthy competition as well. Recognition, same thing as what I was trying to say: it's really important that it's not a chore; it's more of a team sport, where we as a community try to figure out how to get to a clean campsite that everyone enjoys. It's much easier to work in, there's no unexpected work that arises when you open a file and find a lot of failures, and it's much more maintainable too. And finally, get your hands dirty. As a task force, which is volunteer-run, we went through different teams and tried to fix their failures for them as well if they were having a hard time, just because someone has to do it, and as someone leading the effort, you should definitely try to fix it if no one else is. So, for the results of our initiative: we learned that the individual developer experience was very important. The actual engineer working on the fixes should get all the attention, not some sort of 100,000-foot view of the landscape; the per-individual view is very important. The same goes for shout-outs: very individualized shout-outs were important. And no regressions was also very important, because regressions keep popping up, and having some sort of table was very useful for that. So between November 2021 and January 2023, we removed close to 6,000 lint failures. It took a little over a year to complete, with 55 engineers contributing to the effort. We saw about a 30% increase in satisfaction in our quarterly surveys for code quality, or perceived code quality, for individuals working on the repository, which was amazing. This was also coupled with continued changes in the linting ecosystem: we've had over 80 rule changes or dependency changes in the past year, I believe, and over 45 unique contributors to those as well. So the learning process has kept going, even as we got the number of failures down to zero. So if you remember our quote from 2017, if it's not in the docs, it's not a real thing; I think our new quote nowadays is that if it's not a lint rule, it's not a real thing. And so, thank you for attending my talk. Feel free to reach out if you have any questions or comments. Thank you.

Chris Ng

Senior Staff Software Engineer @ LinkedIn



