Conf42 Site Reliability Engineering 2022 - Online

How Static Code Analysis Prevents You From Waking Up at 3AM With Production on Fire


Abstract

Computer programming is a powerful field. You can tell the computer to do just about anything you want as long as you can describe it. The real problem comes when your intentions and what the computer understands from them differ. This talk covers ways that static analysis tooling can prevent bad code from being sent into production, with a particular focus on Go because that is the language the speaker is most experienced with.

Waking up at 3 AM because an obviously wrong bit of code is hitting a weird failure case and causing downstream issues is uniquely frustrating, enough that it deserves to be categorically eliminated as much as possible. Static code analysis is an important part of reliability that makes it easier to build reliable systems, because code that can’t be put into production can’t fail at 3 AM while you are trying to sleep.

Summary

  • Static analysis helps you engineer more reliable systems. This will help you make it harder for incorrect code to blow up production at 3:00 a.m. As a disclaimer, this talk may contain opinions. None of these opinions are my employer's.
  • Static analysis lets you step closer to correctness without going the maximalist route. It's a balance between pragmatism and correctness. Here are some patterns for things that can be solved with static analysis in Go.
  • We have two different variables named x, and they are different types declared at different places. In a type assertion like this, the red (outer) variable is not an int, but the yellow (inner) variable is an int. The correct fix here is to rename the yellow x to xint.
  • Sometimes you need to write your own error types with Go interfaces. Static analysis will help ensure that a nil pointer boxed into an error interface never enters production. Adding static analysis to your continuous integration step can allow you to walk down a new middle path. Trivial errors will be blocked from going into production.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, I'm Xe Iaso, and today I'm going to talk about static analysis and how it helps you engineer more reliable systems. This will help you make it harder for incorrect code to blow up production at 3:00 a.m. There are a lot of tools out there that can do this for a variety of languages. However, I'm going to focus on Go because that's what I am an expert in. In this talk, I'll cover the problem space, some solutions you can apply today, and how you can work with people to engineer more reliable systems. As I said, I'm Xe. I'm the Archmage of Infrastructure at Tailscale. I've been an SRE for several years and I'm moving over into developer relations. As a disclaimer, this talk may contain opinions. None of these opinions are my employer's. I'll have a recording of this talk, slides, speaker notes, and a transcript of it up in a day or two after the conference. The QR code in the corner of the screen will take you to my blog. When starting to think about a problem, I find it helps to start by thinking about the problem space. This usually means thinking about the problem at an incredibly high level and all of its related parts. So let's think about the problem space of compilers. At the highest possible level, a compiler can take literally anything as input and maybe produce an output. A compiler's job is to take this anything, see if it matches a set of rules, and then produce an output of some kind. In this case, with the Go compiler, this means that the input needs to match the rules that the Go language has defined in its specification. This human-readable specification outlines the core rules of the Go language. These include things like every Go file needing to be in a package, the need to declare variables before using them, what core types are in the language, how to deal with slices, and more. However, the specification doesn't define what correct Go code is, it only defines what valid Go code is. This is normal for specifications of this kind. 
Ensuring correctness is an active field of research in computer science that small, scrappy startups like Google, Microsoft, and Apple struggle with. As a result of this, though, you can't rely on the compiler to stop all incorrect code from being deployed into production. There's a wide range of errors that will be stopped in the process, but there are more subtle errors that can squeak by. This is an example of the kind of error that the Go compiler can catch by itself. If you declare a value as a string, you can't go put an integer in it. They are different types, and the compiler will reject it. I know one of you is out there probably thinking something like: what about Rust? What about Haskell? Don't those compilers have a reputation for making very correct code? And you know what? That's a good point. There are other languages that have stricter rules, like linear types or explicitly marking when you poke the outside world. However, the kinds of errors that are brought up in this talk can still happen in those languages, even if it's more difficult to do it by accident. Static analysis on top of your existing compiler lets you step closer to correctness without going the maximalist route, like when you port everything to Rust. It's a balance between pragmatism and correctness. The pragmatic solution and the correct solution are always in conflict, so you need to find a compromise down the middle. This is because, in general, proving everything is correct with static analysis is literally impossible. It takes a theoretically infinite amount of time to tell if absolutely every facet of the code is correct in every single way. But we don't have to be perfect, we have to be good. And perfect is the enemy of the good. Static analysis is more about moving you towards perfect while being good. So here are some patterns for things that can be solved with static analysis in Go. 
They are: not releasing resources that you acquire; making typos that the compiler can't catch at compile time (usually this happens with struct tags); invalid constants such as time format strings, URLs, and regular expressions; and a wide range of predictable crashes or very unintended behavior. These kinds of checks are easy to prove and are enabled by default in go vet and Staticcheck. Also, for the record, incorrect code won't explode instantly upon being run. The devil is in the details of how it is incorrect and how those things can pile up to create issues downstream. Incorrect code can also confuse you while trying to debug it, which can make you waste time you could spend doing anything else. This is an example of Go code that will compile. It'll likely do what you want, but it is incorrect. It is incorrect because the HTTP response body is read from, but it's never closed. In Go, when you don't close the response body, you will leak the resources associated with that HTTP connection. When you close the response body, it will release the connection so that you can use it for other HTTP actions. If you don't do this, you can easily run into a state where your server application will run out of available sockets at 3:00 a.m. In this case, you may be tempted to fix it like this. However, this is incorrect too. Look at where the defer is called. Let's think about how the program flow will work. I'm going to translate this into a diagram of how the computer is going to execute this code. This flowchart is another way to think about how this program is being executed. It starts with the http.Get call on the left side and flows to either crashing or the code finishing on the right. In this case, we start with the http.Get call and then defer closing the response body to the end of the function. Then we check to see if there was an error or not. 
If there was no error, we can use the response and do something useful, and then the response body is closed automatically due to the deferred close. Everything works like you'd expect. However, if there was an error, something different happens. The error is returned and then the scheduled close call runs. The close call assumes that the response is valid, but it's not. This results in the program panicking, which can be a crash at 3:00 a.m. This is the kind of place where static analysis comes in to save you. Let's take a look at what go vet says about this code: httpget.go, line 16: using response before checking for errors. It caught the error. To fix this, we need to move the defer call to after the error check, like this. This way the response body is closed after we know that it's usable. This will work as we expect in production. This is an example of how trivial errors can be fixed with a little extra tooling without having to rewrite everything in Rust. If you use go test, then a large number of these go vet checks are run by default. This covers a wide variety of common issues that have trivial fixes that help move your code towards the corresponding Go idioms. It's limited to the subset of checks that aren't known to have false positives. So if you want more assurance, you will need to run go vet or other tools in your continuous integration step. And some of you might be thinking: well, if these are so easy to detect, why doesn't go build do this? This is a good question. I'm personally on the side that the compiler should reject code as aggressively as possible, but the reason why this isn't done in Go is a matter of philosophy. Go is not a language that wants to make it impossible to write buggy code. Go just wants to give you tools to make your life easier. In the Go team's view, they would rather you be able to compile buggy code than have the compiler reject your code by accident. 
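
To make the response-body example above concrete, here is a minimal sketch of the corrected ordering: check the error first, then defer the close. The helper name fetch, the test server, and the bogus URL are assumptions made for illustration; they are not the code shown on the talk's slides.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// fetch is a hypothetical helper that GETs url and returns the body as a string.
func fetch(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		// On error resp is nil, so there is nothing to close yet.
		return "", err
	}
	// Defer the close only after we know resp is usable.
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	// A local test server so the sketch does not depend on the network.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "hello")
	}))
	defer srv.Close()

	body, err := fetch(srv.URL)
	fmt.Println(body, err)

	// An unsupported scheme makes http.Get fail without touching the network.
	_, err = fetch("bogus://example.invalid")
	fmt.Println("got error:", err != nil)
}
```

If the defer were moved above the error check, the failing call at the end of main would dereference a nil response inside the deferred close, which is the panic described above.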
This is a result of a philosophy of trusting that there are gaps between the programmer and production. In those gaps, there is testing. There are tools like Staticcheck and go vet, but most importantly, there's also human review to catch other trivial errors. In addition to using go vet checks, you can also use Staticcheck with this GitHub Action. This will automatically download, install, and run Staticcheck on your code. Staticcheck catches a wide variety of errors that go vet considers out of scope. Here's an example of a more complicated problem that Staticcheck can catch but go vet can't. The reason why there's a problem here is that Go lets you make variables that are scoped to if statements. This lets you write code like this. This is shorthand for writing out something like this. This does the same thing, but it looks a bit more ugly. Either way, the error value isn't in scope at the end of it, so it'll be dropped by the garbage collector. However, let's also consider the other important part of this snippet: variable shadowing. We have two different variables named x, and they are different types declared at different places. To help you tell them apart, I've colored the inner one yellow and the outer one red. In a type assertion like this, the red variable is not an int, but the yellow variable is an int that might have failed to assert down. If it fails to assert down, then the yellow x variable will always be an int that will have the value zero. This is probably not what you want, given that the log call with the %T format specifier was supposed to let you know what type the red x variable was. As a result, when you run this code you will get an error message that looks like this: unexpected type int. This will confuse the living hell out of you. The correct fix here is to rename the int version of x. You could do this in a few ways, but here's a valid approach. Change the name of the yellow x to xint. This will get you the correct result. 
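
As a rough sketch of the shadowing problem and the rename described above: the function names describeShadowed and describeFixed are hypothetical, the inner variable is spelled xInt here, and the messages are returned as strings rather than logged so that the behavior is easy to inspect.

```go
package main

import "fmt"

// describeShadowed reproduces the bug: the inner x shadows the outer x.
func describeShadowed(x interface{}) string {
	if x, ok := x.(int); !ok {
		// This x is the inner int (zero value on failure), so %T reports int.
		return fmt.Sprintf("unexpected type %T", x)
	}
	return fmt.Sprintf("got int %v", x)
}

// describeFixed renames the asserted value so the outer x stays visible.
func describeFixed(x interface{}) string {
	xInt, ok := x.(int)
	if !ok {
		// This x is still the original interface value, so %T is accurate.
		return fmt.Sprintf("unexpected type %T", x)
	}
	return fmt.Sprintf("got int %d", xInt)
}

func main() {
	fmt.Println(describeShadowed("hello")) // unexpected type int
	fmt.Println(describeFixed("hello"))    // unexpected type string
	fmt.Println(describeFixed(42))         // got int 42
}
```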
You would also need to change the ok branch of the if statement to use xint instead of x, but this is a fairly easy thing to fix. There are a bunch of other checks that Staticcheck runs by default. I could easily talk about them for a few hours, but I'm going to focus on one of the more interestingly subtle checks. In Go, sometimes you need to write your own error types. With Go interfaces and their duck typing, anything that matches the definition of the error interface is able to be used as an error value. I put the definition of the error interface type over to the side and gave you a link to the Go documentation for it. In this case, our type failure has an Error method, which means that the Go compiler can treat it as an error. Given that the Error method returns a string, that means that our failure type is an error. However, something else to keep in mind is that the receiver of the function is a pointer value. Normally this means a few things, but in this case it means that the receiver may be nil and, as a result, the reason may not exist. Because of this, we can return a nil value of failure, and then when you try to use it from Go, it will explode at runtime: panic: runtime error: invalid memory address or nil pointer dereference. Boom, it crashed. Segfault. This happens because under the hood each interface value is a box. The box contains the type of the value in the box and a pointer to the actual value itself. But this box will always exist even if the underlying value is nil. This means that the if err != nil check will always return true. So you will always try to read from the value, which will always explode because the underlying value is nil. This is always frustrating when you run into it, but let's see what Staticcheck says when we run it against this code: errorbomb.go, line 11: doWork never returns a nil interface value. Huzzah, Staticcheck rejects it. 
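
The boxing behavior described above can be sketched as follows. The failure type and the two doWork variants are illustrative reconstructions rather than the exact code from the slides; the names doWorkBroken and doWorkFixed are assumptions.

```go
package main

import "fmt"

// failure is a hypothetical custom error type with a pointer receiver.
type failure struct {
	reason string
}

// Error dereferences f, so calling it on a nil *failure panics.
func (f *failure) Error() string {
	return "failure: " + f.reason
}

// doWorkBroken returns a nil *failure. Converting it to error boxes the
// nil pointer together with its type, so the interface value is not nil.
func doWorkBroken() error {
	var f *failure
	return f
}

// doWorkFixed returns an untyped nil, so the interface value itself is nil.
func doWorkFixed() error {
	return nil
}

func main() {
	fmt.Println(doWorkBroken() != nil) // true: the box exists even though the pointer is nil
	fmt.Println(doWorkFixed() != nil)  // false
}
```

Calling Error() on the value returned by doWorkBroken would trigger the nil pointer dereference panic quoted above, which is why the if err != nil check gives no protection in that case.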
If this code was checked into source control and Staticcheck was run in CI, tests would fail and this would never be allowed to be deployed to production. The correct version of doWork should look something like this. Note how I changed the failure case to use an untyped nil. This prevents the nil value from being boxed into an interface. This will do the right thing. This will help you ensure that this kind of code never enters production, so it cannot fail at untold hours of the night while you are sleeping. As SREs, we tend to sleep very little as it is. Statistically, we have higher rates of burnout, mind fog, fatigue, and likelihood of turning into angry, sad people as we do this job longer and longer, especially if the culture of a company is broken enough that you end up being on call during sleeping hours. This is not healthy. It is not sustainable for us to be woken up at obscene hours of the night because of trivial and preventable errors. If we get woken up in the middle of the night, it should be for things that are measurably novel and not caused by errors that should have never been allowed to be deployed in the first place. I don't think I've heard my pager sound in years by this point. But the last time I heard it, I almost had a full-blown panic attack. I have been in the kind of place where burnout from my pager severely affected my health. I'm still recovering from the aftereffects of that tour of SRE duty, and this has resulted in me making permanent career changes so that I am never put in that kind of position again. I don't wish the hell that I've experienced on anyone. Normally, when you're an SRE put into the line of pager fire, it kind of feels like both options suck. Fixing production seems like it'll be impossible. Being able to get more sleep during on-call hours seems impossible because things aren't getting fixed, and with an SLA for responding to the pager within half an hour, it just feels impossible. 
Adding static analysis to your continuous integration step can allow you to walk down a new middle path between these two extremes. It is not going to be perfect. However, gradually things will get better. Trivial errors will be blocked from going into production and you will be able to sleep easier. The benefits of being able to rest easier like this are numerous and difficult to summarize. It could save your relationship with your loved ones. It could prevent people near you from resenting you. It could be the difference between a long and happy career, or having to drop out of tech at 25, burnt out to a crisp and unable to do much of anything. It could be the difference between life and an early, untimely death from a preventable heart attack. In talks like these, it's easy to ignore the fact that the people who are responsible for making sure services are reliable are human. Company culture may get in the way, and there may be a lack of people who are willing or able to take the pager rotation. However, when the machines come to take our jobs, I hope that this is one of the first that they take. In the meantime, all we can do is work towards a more sustainable future. The best thing we can do is to make sure people sleep well without having to worry about being woken up because of preventable errors that tools like Staticcheck can block from getting into production. If you use Go in production, I highly suggest using Staticcheck. If you find it useful, sponsor Dominik on GitHub. Software like this is complicated to develop, and the best way to ensure Dominik can keep developing it is to pay him for his efforts. The better he sleeps, the better you sleep as an SRE. As for other languages, I'm going to be totally honest: I don't know what the best practices are. You will have to do research on this. You may have to work together with other coworkers to find out what would be the best option for your team. I will say, though, it is worth the effort. 
This helps you make a better product for everyone, and it's worth the teething pains at first. You can do it. I'm almost at the end, but I wanted to give a special shout-out to all these people who helped make this talk a reality. I also want to give a special shout-out to my coworkers at Tailscale who let me load shed super hard so that I could focus on making this talk shine. Thanks for watching. I'll stick around in the chat for questions, but if I miss your question and you really want an answer to it, please email it to code 42 SRE 2022 at Zserve us. I'm happy to answer questions and I enjoy writing up responses. Have a good rest of the conference, everyone. Be well.

Xe Iaso

Archmage of Infrastructure @ Tailscale



