Conf42 Python 2021 - Online

Stop Writing Tests!

Abstract

Under-tested financial code is a very bad idea - just ask Knight Capital how they lost $460 million in less than an hour. More often, bugs expose you to a little more risk or a little less value than you expected… but what can we do differently?

We often think of manual testing as slower and less effective than automated testing, but most test suites haven’t automated that much!

Computers can execute all our pre-defined tests very quickly - and this is definitely a good thing, especially for regression tests - but the tricky parts are still done by humans. We select test cases (inputs) and check that the corresponding outputs make sense; we write functions that “arrange, act, and assert” for our tests; and we decide - or script via CI systems - which tests to execute and when.

So let's explore some next-generation tools that we could use to automate these remaining parts of a testing workflow!

Property-based testing helps you to write more powerful tests by automating selection of test cases: instead of listing input-output pairs, you describe the kind of data you want and write a test that passes for all X…. We’ll see a live demo, and learn something about the Python builtins in the process!

Code Introspection can help write tests for you. Do you need to know any more than which code to test, and what properties should hold?

Adaptive Fuzzing takes CI to its logical conclusion: instead of running a fixed set of tests on each push, fuzzers sit on a server and run tests full-time… fine-tuning themselves to find bugs in your project and pulling each new commit as it lands!

By the end of this talk, you’ll know what these three kinds of tools can do - and how to get started with automating the rest of your testing tomorrow.

Is it really automated testing when you still have to write all the tests? What if your tools:

  1. wrote test code for you (‘ghostwriting’)
  2. chose example inputs (property-based testing)
  3. decided which tests to run (adaptive fuzzing)

Now that’s automated - and it really does work!

Summary

  • Zac Hatfield-Dodds is giving a talk called Stop Writing Tests for Conf42. He acknowledges that the land he's giving this talk from in Canberra, Australia, was originally and still is land of the Ngunnawal people. This land was never ceded; it was settled and colonized.
  • Testing is the activity where you run your code to see what it does. This excludes a number of other useful techniques to make sure your code does the right thing. Is this actually automated? Let's see what else we could automate.
  • Migrating a test for something that looks a lot like git. First thing we can do is just refactor our test a little to pass in the name of the branch as an argument to the function. And as it turns out, git has a complicated spec for branch names.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
My name is Zac, Zac Hatfield-Dodds, and I'm giving a talk called Stop Writing Tests for Conf42. Now, this might be provocative, given that I do in fact want you to continue testing your code, but it's provocative for a reason. Before we get into it, I want to start with an Australian tradition that we call an acknowledgment of country. In particular, that means I want to acknowledge that the land I'm giving this talk from in Canberra, Australia, was originally and still is land of the Ngunnawal people, who have been living here for tens of thousands of years, working and learning on this land, and to acknowledge that the land I'm living on was never actually ceded. It was settled and colonized. But back to testing. I'm giving a talk about testing, and as part of that, I should probably tell you what I mean by testing. I mean the activity where you run your code to see what it does. And importantly, this excludes a number of other useful techniques to make sure your code does the right thing, like linting or auto-formatting, like code review or getting enough sleep, or perhaps even coffee. Specifically, the activity that is testing usually means we choose some inputs to run our code on, we run the code, we check that it did the right thing, and then we repeat as needed. In the very old days that was all manual, and for some problems it still is. Now, we at least automate "repeat as needed": you do it all manually the first time, then you record that in a script, and you can run it with something like unittest or pytest. But all of the other parts of this process are usually totally manual. We choose inputs by hand, we decide what to run by hand, and we write assertions that a particular input gives a particular output. So let's see what else we could automate. And we're going to use the example, thanks to my friend David, not of reversing a list twice but of sorting a list. Sorting is a classic algorithm, and you've probably all sorted things a few times yourself. So our classic sorting tests might look something like this. We say that if we sort a list of integers, one, two, three, we get one, two, three - so we're checking that sorted things stay sorted. Or if we sort a list of floating-point numbers, we get the same sorted list, but the elements are floats this time, because we haven't actually changed the elements. And we'll check that we can sort things that aren't numeric as well. In order to avoid repeating ourselves, we might use a parameterized test. This makes it much easier to add more examples later as they come up in our regression testing, or if bugs are reported by customers. It's a little uglier, but it does help scale our test suite out to more examples. My real question, though, is: is this actually automated? We've had to think of every input and every output, and in particular we've had to come up with the outputs pretty much by hand. We've just written down what we know the right answer should be. But what if we don't know what the right answer should be? Well, one option would be to compare our sort function to a trusted sort function. Maybe we have the one from before the refactoring, or the single-threaded version, or a very simple bubble sort, for example, that we're confident is correct but is too slow to use in production. If we don't even have that, though, all is not lost.
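For reference, the example-based tests described a moment ago might look something like this. The code below is a reconstruction rather than the speaker's slides, and my_sort is a hypothetical stand-in for the sort function under test (bound to the builtin sorted so the snippet runs as-is):

    import pytest

    my_sort = sorted  # hypothetical stand-in for the sort function under test


    def test_sort_integers():
        assert my_sort([1, 2, 3]) == [1, 2, 3]  # sorted input stays sorted


    def test_sort_floats():
        assert my_sort([1.0, 2.0, 3.0]) == [1.0, 2.0, 3.0]


    def test_sort_strings():
        assert my_sort(["b", "a", "c"]) == ["a", "b", "c"]  # non-numeric elements


    # The same examples as one parametrized test: a little uglier, but much
    # easier to extend with regression cases or customer-reported bugs later.
    @pytest.mark.parametrize(
        "data, expected",
        [
            ([1, 2, 3], [1, 2, 3]),
            ([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]),
            (["b", "a", "c"], ["a", "b", "c"]),
        ],
    )
    def test_sort_examples(data, expected):
        assert my_sort(data) == expected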
We don't have the known-good version, but we can still check for particular errors, and this test will raise an exception if we ever return a list which is not sorted. The problem is that it just checks that the output is sorted, not that it's the correct sorted list. As an example, I would point out that the empty list is always in order. So if we don't want to allow returning the empty list as a super-fast performance optimization, we might want to check that we have the same size of output as we had of input, and additionally we'll check that we have the right elements by checking that we have the same set of elements in the output as we did in the input. Now, this isn't quite perfect. First, it only works for lists where the elements are hashable - that is, we can put them in a set - which is basically fine for now. But it also permits an evil implementation where, if I had the list one, two, one, I could "sort" it by replacing it with the list one, two, two. I've actually changed one of the elements, but because it was a duplicate of one element and it's now a duplicate of a different element, the test would still pass. To deal with that, we could check that, by the mathematical definition, the output is a permutation of the input. Now this is a complete test. The only problem is that it's super slow for large lists, and so our final enhancement is to use the collections.Counter class. So we're not just checking that we have the same number and the same set of elements, but that we have the same number of each element in the output as in the input. And so we've just invented what's called property-based testing. The two properties of the function that we want to test are that when you sort a thing, the output is in order, and that the output has the same elements as the input list. These are the two properties of the sorting function, and if we test them, this actually is the complete definition of sorting: if we take an input list and we return an output with the same elements in ascending, or at least non-descending, order, then we've sorted it. I don't want to go too far though - this is a fine test, but it's actually pretty rare to have a complete specification where you can list out and test every single property. And unless someone is deliberately trying to sneak something past your test suite, which code review should catch, this kind of test is going to do really well too. But in this example we've still got one last problem, which is that we still have to come up with the arguments, the inputs to our test, somehow. And that means that however carefully we think of our inputs, we're not going to think of anything for our tests that we didn't think of when we wrote the code in the first place. So what we need is some way to have the computer, or a random number generator, come up with examples for us, and then we can use our existing property-based tests. And that's exactly what my library Hypothesis is for. It lets you specify what kind of inputs the test function should have, and then you write the test function that should pass for every input. So using that exact same test body that we've had here, we can say that if our argument, that is our input, is either a list of some mix of integers and floating-point numbers, or a list of strings - we can't sort a list of mixed strings and numbers because we can't compare those in Python, but we can sort either kind of list - then run the same tests. If you do run this though, the test will actually fail.
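A minimal sketch of that property-based test, using the strategies just described (again, my_sort is a hypothetical stand-in for the function under test):

    from collections import Counter

    from hypothesis import given, strategies as st

    my_sort = sorted  # hypothetical stand-in for the sort function under test


    # Inputs: either a list mixing integers and floats, or a list of strings -
    # Python can't compare strings with numbers, but either kind of list sorts.
    @given(st.lists(st.integers() | st.floats()) | st.lists(st.text()))
    def test_sort_properties(data):
        result = my_sort(data)
        # Property 1: the output is in ascending (well, non-descending) order.
        assert all(a <= b for a, b in zip(result, result[1:]))
        # Property 2: the output has the same elements as the input, with the
        # same multiplicity - Counter keeps this check fast even for big lists.
        assert Counter(result) == Counter(data)
        # As described next, this version fails once Hypothesis generates a
        # NaN; st.floats(allow_nan=False) opts out of that case.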
And it will fail, because NaN, not-a-number, compares unequal to itself. The comparison is always false, no matter what the order should be. And in fact, if you try sorting lists with NaNs in them, you'll discover that things get very complicated very quickly. But for this kind of demo, it's perfectly fine just to say, actually, we don't care about NaN, that's just not part of the property of sorted that we're testing. So I think property-based testing is great, and I want you to get started. In order to do that, I've got a foolproof three-point plan for you. The first is just to pip install hypothesis - it works on any supported version of Python 3 and it's super stable these days. The second is to have a skim of the documentation, and the third is to find lots of bugs and hopefully profit. To make it easier to get started, though, I've actually developed a tool I call the Hypothesis ghostwriter, where you can get Hypothesis to write your tests for you. Let's have a look at that. First of all, of course, you can see the help text if you ask for it. We've got various options and flags that you can see, as well as a few suggested things. So let's start by getting the ghostwriter to produce a test for a sort function for us. Of course there's no sort builtin, so let's look at sorted instead. The first thing you can see here is that Hypothesis has already noticed two arguments that we forgot to test in our earlier demo: the key function, and whether or not to sort in reverse order. But the other thing to note is that it's just said sorted(...), so it's just called the function, without any assertions in the body of the test. This is surprisingly useful, but we can do better. Hypothesis knows about idempotence - that is, if you sort a thing a second time, it shouldn't change anything beyond sorting the first time. And if we ask Hypothesis to test that, you can see it does indeed check that the result and the repeated result are equal. That's not the only test we can write, though. We could check that two functions are equivalent, and this one is actually pretty useful if, for example, you have a multithreaded version compared to a single-threaded version, the code before and after refactoring, or a simple slow version like perhaps bubble sort compared to a more complicated but faster version used in production. The classic properties also work, so if you have commutative or associative properties, you can write tests for those. I'll admit, though, these don't tend to come up as often as what we call round-trip properties, which just about everyone has. If you save data and then load it, and you want the original data back, you can write a test like this that asserts that if you compress the data and then decompress it, or save it and load it, you should get exactly the same data you started with back. This one's crucial because input, output and data persistence tend to cross many abstraction layers, so they're surprisingly error-prone - but they're also surprisingly easy to write really powerful tests for. So for pretty much everyone, I would recommend writing these round-trip tests. Let's look at a more complicated example with JSON encoding. With JSON, the input is more complicated because it's recursive, and frankly the encoding options are also kind of scarily complicated - just look at how many arguments there are here. Fortunately, we don't actually need to look at all of these, so I've just trimmed it down, and that's going to look like this.
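A sketch of that trimmed-down test - roughly the shape of what the ghostwriter emits for something like "hypothesis write --roundtrip json.dumps json.loads", cut down here to the four interesting options, and reconstructed rather than copied from the demo:

    import json

    from hypothesis import given, strategies as st

    # Recursive strategy for JSON-compatible objects: None, booleans, floats,
    # or strings at the leaves, then lists of JSON or string-keyed dicts of JSON.
    json_objects = st.recursive(
        st.none() | st.booleans() | st.floats() | st.text(),
        lambda children: st.lists(children) | st.dictionaries(st.text(), children),
    )


    @given(
        obj=json_objects,
        allow_nan=st.booleans(),       # these four options *shouldn't* affect
        check_circular=st.booleans(),  # the round trip, so we just let them vary
        ensure_ascii=st.booleans(),
        sort_keys=st.booleans(),
    )
    def test_json_roundtrip(obj, allow_nan, check_circular, ensure_ascii, sort_keys):
        encoded = json.dumps(
            obj,
            allow_nan=allow_nan,
            check_circular=check_circular,
            ensure_ascii=ensure_ascii,
            sort_keys=sort_keys,
        )
        assert json.loads(encoded) == obj
        # Running this surfaces the two failures described next (NaN != NaN, and
        # infinity rejected when allow_nan is False); the eventual fix is
        # st.floats(allow_nan=False) in the strategy and allow_nan=True in dumps.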
So we say, well, our object is recursive: we have None, or booleans, or floats, or strings - that's JSON - or we have lists of JSON values, including nested lists, or dictionaries of string keys to JSON values, including maybe nested lists and dictionaries. But we've still preserved these other options, so we may or may not disallow NaN, we might or might not check whether we have circular objects, we might or might not ensure that the output is ASCII instead of UTF-8, and we might or might not sort the keys in all of our objects. These are nice to just let vary, because we're claiming they should have no impact on the actual body of the test. Let's see if pytest agrees with us. This is pretty simple: we have a test function, we just run it, and we've been given two distinct failures by Hypothesis. In the first one we've discovered that, of course, the floating-point not-a-number value is unequal to itself. Yay for NaN - we'll see more of it later. And as our second distinct failure, we've discovered that if allow_nan is false and we pass infinity, then encoding actually fails, because this is a violation of the strict JSON spec. So I'm going to fix that. In this case, we'll just say, well, we will always allow non-finite values just for the purpose of this test, and we'll use assume - that is, we'll tell Hypothesis to reject the input if it's not equal to itself. That's like an extra-powerful assert. And then if we run this version, what do you think we're going to see? We see that Hypothesis finds another failing example: if you have a list containing NaN, then it actually compares equal to itself. This, it turns out, is thanks to a performance optimization in CPython for list equality. It compares the elements of the two lists by identity first, which allows you to skip the cost of doing deep comparisons when you don't need to. The only problem is that it kind of breaks the object model. So I'll instead do the correct fix, which is to just pass allow_nan=False to our input generator, and this ensures that we'll just never generate NaN. And with allow_nan still true in the encoder, we'll also allow non-finite examples, and this test finally passes. All right, back to the talk. If you can't ghostwrite your tests - because, for example, you already have a test suite that you don't just want to throw out and start over with - then of course we could migrate some of our tests incrementally. I'm going to walk you through migrating a test for something that looks a lot like git. We say that if we create an empty repository and check out a new branch, then that new branch is the name of our active branch. The idea here is that I want to show you that you can do this for kind of business-logic-y things, and I'm going to say that the details of how git works are pretty much business logic rather than pure algorithmic stuff. But this test also leaves me with a bunch of questions, like: what exactly are valid names for branches? And does it work for non-empty repositories? So the first thing we can do is just refactor our test a little to pass in the name of the branch as an argument to the function. This just says that semantically it should work for any branch name, not just for "new-branch". Then we can refactor that again to use Hypothesis and say that, well, for any branch name - and it happens that we'll just generate "new-branch" for now - this test should pass. And then we could share that logic between multiple tests.
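Those migration steps, sketched in code - Repo here is a hypothetical stand-in for whatever repository object the real test suite constructs, included only so the snippet is self-contained:

    from hypothesis import given, strategies as st


    class Repo:
        """Hypothetical stand-in for the git-like repository under test."""

        def __init__(self):
            self.active_branch = "main"

        @classmethod
        def init(cls):
            return cls()

        def checkout(self, name, create=False):
            self.active_branch = name


    # Step 0: the original example-based test.
    def test_checkout_new_branch():
        repo = Repo.init()
        repo.checkout("new-branch", create=True)
        assert repo.active_branch == "new-branch"


    # Step 1: pass the branch name in as an argument - semantically the
    # property should hold for any valid branch name, not just "new-branch".
    def check_checkout_new_branch(branch_name):
        repo = Repo.init()
        repo.checkout(branch_name, create=True)
        assert repo.active_branch == branch_name


    # Step 2: let Hypothesis supply the branch name.  Generating only
    # "new-branch" for now means no semantic change yet; swapping in a real
    # branch-name strategy (ASCII letters of reasonable length, as discussed
    # next) comes later.
    @given(branch_name=st.just("new-branch"))
    def test_checkout_any_branch(branch_name):
        check_checkout_new_branch(branch_name)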
Again, so far we've made no semantic changes at all to this test function, but the meaning is already a little clearer to me: given any valid branch name, this test should pass. Now it's time to try to improve our branch-name strategy. As it turns out, git has a pretty complicated spec for branch names, and the various hosting services also put length limits on them; there are rules about printable characters, you can't start or end with a dash, you can't contain whitespace - except maybe sometimes you can. But for simplicity we're just going to say: actually, if your branch name consists of ASCII letters only and it's of a reasonable length, then the test should pass, and we'll come back and refactor that later if we decide it's worth it. Now, looking at the body of the test, this is a decent test, but if we want to clarify that it works for non-empty repositories as well, we might want to end up with something like this. We say: given any valid branch name and any git repository, so long as the branch name isn't already a branch, when we check out that branch name and create the branch, it becomes the active branch. So there we are - that's how I'd refactor it. You can run these, of course, in your CI suite or locally, just as you would for unit tests. But that's not the only thing you can do with property-based testing. You can also use coverage-guided fuzzing as a way to save yourself from having to decide which tests to run, and let the computer work out how to search for things over a much longer time. Google has a tool called Atheris, which is a wrapper around libFuzzer, and it's designed to run a single function for hours or even days. This is super powerful. If you have C extensions, for example, it's a great way to find memory leaks, address errors, or undefined behavior using the sanitizers, and Hypothesis integrates with that really well. So you can generate really complex inputs or behavior using Hypothesis and then drive that with a fuzzer. Or, if you want to do that for an entire test suite, I have a tool called HypoFuzz, which you can find at hypofuzz.com. It's pure Python, so it works on any operating system, not just on Linux, and it runs all of your tests simultaneously, trying to work out adaptively which ones are making the fastest progress. Let's have a look at that. Now, I started this running just before the talk, and so you can see I've got pretty much the whole test suite here for a tool of mine called Hypothesmith, for generating Python source code. We can see the number of branches, or the coverage, generated by each separate test. You can also see that they've been running different numbers of examples based on which ones are fastest and discovering new inputs or new coverage the quickest. If we go down here, we can also see that we've actually discovered a couple of bugs. This one, testing ast.unparse, failed just because ast.unparse is a new function and it doesn't exist on this version of Python. But if we skip that and go to the test of the Black auto-formatter, it seems to raise an invalid-input error on this particular, admittedly pretty weird, input. This is genuinely a new bug to me, and so I'm going to have to go report that after the talk. You can also see about how long it took, in both the number of inputs and the time, as well as a diverse sample of the kinds of inputs that HypoFuzz fed to your function. All right, so that's pretty much my talk.
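For reference, a minimal sketch of the Hypothesis-plus-Atheris combination mentioned above, assuming atheris is installed (Hypothesis exposes a fuzz_one_input hook on @given tests for exactly this kind of external fuzzer; the sort test here is the earlier illustrative example, not code from the demo):

    import sys

    import atheris
    from hypothesis import given, strategies as st


    @given(st.lists(st.integers() | st.floats(allow_nan=False)) | st.lists(st.text()))
    def test_sort_properties(data):
        result = sorted(data)
        assert all(a <= b for a, b in zip(result, result[1:]))


    if __name__ == "__main__":
        # Instrument already-imported Python code so the fuzzer gets coverage
        # feedback, then hand Hypothesis' bytes-in entry point to Atheris and
        # fuzz until interrupted - hours or days, not a fixed set of examples.
        atheris.instrument_all()
        atheris.Setup(sys.argv, test_sort_properties.hypothesis.fuzz_one_input)
        atheris.Fuzz()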
I want you to stop writing tests by hand and instead use Hypothesis and property-based testing to come up with the inputs for you, to automatically explore your code, to write your test code, and ultimately even to decide which tests to run. These tools together can make testing both easier and more powerful, and I hope you enjoy using them as much as I have.
...

Zac Hatfield-Dodds

Researcher @ Australian National University



