Conf42 Python 2023 - Online

Run Fast! Catch Performance Regressions in Python


For the same reasons that unit tests are run in CI to prevent feature regressions, benchmarks should also be run in CI to prevent performance regressions. During this presentation, you’ll learn how to implement continuous benchmarking in your Python project.


  • In order to catch a performance regression, you have to first detect it. More often than not, that ends up happening in production. So that's what we're going to start off talking about today: how to detect those performance regressions using benchmarks.
  • FizzBuzzFibonacci is a new notification feature for the v0 calendar app. It adds the Fibonacci sequence to the fun notification function. But three weeks later, production was on fire and the feature wasn't working properly.
  • pytest-benchmark is a very popular benchmarking suite within the Python ecosystem. We're going to be working with pytest-benchmark here because it integrates so well with pytest. A little note on micro versus macro benchmarks.
  • It would be great to be able to catch your performance regressions before they make it to production and impact your customers. In CI, continuous benchmarking is the thing that we're going to talk about next. How does this happen?


This transcript was autogenerated. To make changes, submit a PR.
Hello and welcome to Run Fast! Catch Performance Regressions in Python. I'm Everett Pompeii. I'm the founder and maintainer of a tool called Bencher, and today we're going to be talking about how to catch performance regressions. Now, in order to catch a performance regression, you have to first detect it. Detection is a prerequisite to prevention. So when are we able to detect performance regressions? Well, we can do that in development, in CI, or in production, and more often than not, that ends up being in production. That's unfortunate, because it means the regression is already impacting our users. Whether or not we have an observability tool that lets us see it before anyone complains, they're nonetheless probably experiencing it. So as developers, we want to shift the detection of those performance regressions left as much as we can. So that's what we're going to start off talking about today: how to detect those performance regressions using benchmarks.
But before we get into that, I'm going to tell a little tale that may or may not be similar to some personal experiences, but for all intents and purposes, it's fictitious. So I've got an app, v0 of the app. It's a basic calendar API, right? It allows people to schedule things and create events and things like that. This is created in Flask, but it could very well have been in Django or FastAPI. Take your pick. So I've got this calendar app, and it's working great. I've got the v0 app, a minimal lovable product, right?
And then I decide, hey, I want to add an additional feature to this: a fun notification feature. So every few days it gives our users a fun notification, just kind of out of the blue, to help keep engagement. We're calling this the Fizz feature. With the Fizz feature, it returns Fizz if the day is divisible by three; otherwise it returns None.
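The slides aren't part of this transcript, so here is a minimal sketch of what that first version of the Fizz logic might look like. The function name `fun_notification` and its `day` parameter are assumptions made for illustration.

```python
def fun_notification(day):
    """Version 1, the Fizz feature (hypothetical name and signature).

    Returns "Fizz" when the day of the month is divisible by three,
    otherwise None.
    """
    if day % 3 == 0:
        return "Fizz"
    return None
```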
And it's a pretty simple feature to implement. The business logic looks like this. It's the fun notification function. It just takes the day modulo three, and if that's zero, it returns Fizz; otherwise it returns None. So that's great. I release it to the customers, and they love our fun notification feature. And so I'm super happy and I'm like, hey, I'm going to make this even better.
And I think you guys might kind of know where this is going, but I decide to improve the fun notification feature, right? And I add Buzz: return Fizz if the day is divisible by three, return Buzz if it's divisible by five, or FizzBuzz if it's divisible by both. Otherwise, still the same, return None. And again, this business logic is pretty simple, right? It's just that same modulus operator, but this time we've got Fizz, Buzz, or FizzBuzz. So I ship that to my customers, and they also love it. I've got version two out and things are going great.
They love it so much, I'm like, hey, you know what? I think I'm going to add something else to it. And I do my full desired implementation of the fun notification feature, which I call FizzBuzzFibonacci. FizzBuzzFibonacci is quite a mouthful, though. It starts the same as the good old FizzBuzz feature, with the three, the five, or both. Except if the day is divisible by seven, then it returns the nth step of the Fibonacci sequence. Otherwise, return None. And still that business logic looks pretty simple. I just have those extra two lines up top where I'm checking for modulo seven, and then I just do the Fibonacci sequence. And I moved on with my day and shipped it out to customers, and they loved it.
And things were going great until three weeks later, when all of a sudden production was on fire and I was like, what's going on? What's happening? Right? I had shipped a bunch of code over those three weeks, right?
And so I had to come in and spend all day trying to figure out what was going on, before I figured out it was this darn Fibonacci feature that I had shipped three weeks prior. And so I started looking at this and said, oh, I should investigate a little bit more. And that's what I did. I went and looked at my Fibonacci sequence function, and I had done a very naive implementation of it. I think you guys are probably smarter than I am and know that I shouldn't have done this to begin with. But before we dig into all that, we're going to do an aside and look at benchmarking in Python, and how I could take this as a learning experience and go and benchmark my naive implementation as I try to find a better solution.
So pytest-benchmark is a very popular benchmarking suite within the Python ecosystem. There is also another one called airspeed velocity, which isn't quite as popular, but is also pretty well known. We're going to be working with pytest-benchmark here because it integrates so well with pytest. Installing pytest-benchmark is super easy: just a pipenv shell, and you just install pytest-benchmark.
So I have my naive Fibonacci implementation here in my fun notification .py file, and I'm adding a benchmark to it that basically cycles through every seventh day of the month and checks how long this takes to run. Now, the key part of how pytest-benchmark works is that you're passing in this benchmark argument, which expects to take a function, and then it basically just times that function. So however long it takes that function to run is your benchmarking time, essentially. And here we're going through every seventh day of the month just to get a feel for the full scope of the time that it's going to use. In order to run this, you just run pytest and then your file with your functions, just like you normally do with pytest.
And this is the output that I got for this naive version, right? It's pretty high. It takes over a tenth of a second to run, which at scale is not a good thing when I had a lot of people using my calendar app. Then, and this is going to be important later, we can save our benchmark output. This will save it to a directory which we can add to git, which then means that over time we can track these benchmarks, even just running them locally here. So we've got my pytest benchmarks for the naive implementation here.
And so now I'm going to go and add some memoization, which will hopefully help improve the performance of my function, and do the exact same test. Notice the test has not changed, the benchmark has not changed, but just the Fibonacci implementation has. And so I run that again, same exact call to pytest, and I get this output, which takes a fraction of the time to run, because memoization helps. So that is a definite, huge performance improvement. Now if we wanted to compare those, we could copy and paste or whatever, but pytest-benchmark does actually give us a really nice way to compare. You can just pass the number that it keeps for the previous version to compare against those saved benchmarks that we just did. So we run that and we get this output, which lets us see our version now versus that previous naive version and how drastically improved things are, which is pretty great. That is a great example of how to run and compare with pytest.
Now, a little note on micro versus macro benchmarks. So far we've been doing micro benchmarks. I think of these as analogous to unit tests: unit-level benchmarking. A micro benchmark is really about just a single function, versus what are called macro benchmarks, which are much more like integration tests. They're kind of full end to end. So with my Flask app that I'm using for my calendar API here, here's my endpoint, right?
And this is the fun notification endpoint. It gets the date time, it gets the day from there, and then it calls my fun notification function, jsonifies things, and sends the result out. And the thing is, I will be benchmarking all of that. So if there are any changes outside of my code, it's great, because I can also detect regressions in the libraries I use and things like that. But it is a bit more noisy because of that. And you're also just going to have larger values, which is just a thing to know. But they work pretty much exactly the same way, in the same way that unit and integration tests are very similar.
So back to our FizzBuzzFibonacci feature. I have implemented my memoization, and I was very silly before, when I implemented it originally, very naively. So now that we have that fix in, things should be good to go, right? And so I'm able to come in, play firefighter, and put out the fire that I caused in production, which is good. Things aren't on fire anymore. But why do I have to play firefighter? It'd be preferable if I didn't, in the same way that it's preferable to catch your feature regressions before they make it to production and impact your customers. It would be great to be able to catch your performance regressions before they make it to production and impact your customers.
And so you could have observability tools, and they can help. But still, that is too late, right? You're still impacting your customers and users, and so was I in my calendar app here. So production is just too late to catch things, and then development is local only. You've got those saved tests and things like that, but they're very much tied to the one environment they're running in. And that makes it very hard to share them across a code base with multiple users in a development team. So it's great for local benchmark comparison, both in pytest-benchmark and airspeed velocity, but it is local only, really.
And then in CI, continuous benchmarking is the thing that we're going to talk about next, which is what allows you to detect and prevent these performance regressions. In CI, we are going to talk about Bencher, but I will note that airspeed velocity also has some rudimentary, basic continuous benchmarking functionality, which might be worth checking out if you're looking for a simpler tool to use. But Bencher definitely has many more features and is much more robust in this category. So we can go forward and take a look at Bencher.
So what if I had had continuous benchmarking when I was doing my calendar API, right? So rule number one, let's go time travel back. But yeah, don't set it to 2020. And so this is Bencher. It's the GitHub repo of the open source tool. As we time travel back here with my calendar app, and I'm at that first version with the Fizz feature, what would this have looked like? I would have gone ahead and written a benchmark at that point in time, proactively, as opposed to trying to do that after the fact; I think that's the best way to put it. It's a very similar sort of test function to what we wrote before, but it goes over every single day of the month, right, and tests to see how our function performs for our business logic. And so it records that.
And then in order to run that in CI, we would need to download the Bencher CLI and install it, which is a simple Debian package and super quick and easy. And then we'd run that as part of our CI process. Here we're keeping with the pytest-benchmark example, and we are going to output our results to a JSON file. And then the Bencher CLI will read that in, take that information, and store our results. So that's great. We've got our version one instrumented. So then as we move on to version two, we don't really have to do anything. It just picks it up and runs it in CI.
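To make that JSON hand-off a bit more concrete, here is a small sketch of pulling the mean timing for each benchmark out of a pytest-benchmark style JSON report. The sample below is a trimmed, made-up stand-in (real reports carry many more fields, and the numbers are purely illustrative), and `mean_latencies` is a hypothetical helper, not part of Bencher or pytest-benchmark.

```python
import json

# Trimmed, made-up sample in the general shape of a pytest-benchmark
# JSON report. The values are illustrative, not measured results.
SAMPLE_RESULTS = """
{
  "benchmarks": [
    {
      "name": "test_fun_notification",
      "stats": {"mean": 0.00012, "stddev": 0.00001, "rounds": 1000}
    }
  ]
}
"""


def mean_latencies(results_json):
    """Map each benchmark name to its mean run time in seconds."""
    report = json.loads(results_json)
    return {
        bench["name"]: bench["stats"]["mean"]
        for bench in report["benchmarks"]
    }


print(mean_latencies(SAMPLE_RESULTS))
```

A continuous benchmarking tool stores values like these for every CI run, which is what makes comparing version one against later versions possible.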
And we don't have to manually test anything locally or do any work. It's just automatically picking it up and doing the work for us. So that's great, and things seem to be going well. So then as we move on to version three, right, with that naive Fibonacci implementation there, we'll get an alert, which is great, because things are running incredibly slowly compared to how they used to.
But how does this happen? Let's take a look at that. As you track your benchmarks with Bencher, you'll have that first version, right? And then in the second version you'll add a little bit more functionality. And that third version is when you get a huge performance spike, right? And so that is what triggers the alert. Don't worry, we're not going to get too much into the statistics here, but this is just a probability distribution, your average kind of distribution of what you'd expect. Even if you reran the results multiple times, there'd be some variation, right? And so with that, the first test will be right there in the middle, and then as you run the second version of the code, those results are going to be clustered. The averages of those two are going to be very close together. But that third one, with Fibonacci, is going to be all the way out in the extreme. You set a threshold in Bencher, and if anything's outside of that threshold, which that naive Fibonacci implementation more than likely would have been, then that is what triggers the alert. So that helps you catch performance regressions in CI. Instead of having to do it manually yourself, you're able to have CI do it for you and rely on that to catch it.
So we looked at and talked about trying to catch things retroactively in production. And with all of the work that that takes, putting out the fires and things like that, it's just simply too late. We took a look and learned how to run benchmarks locally, with the very useful tools that the Python ecosystem offers: pytest-benchmark.
And there's also airspeed velocity. And then we took a look at using continuous benchmarking with Bencher and how that could have helped prevent all of that anguish and pain before. And yeah, that's just awesome. So in review: detection is required for prevention, production is too late, development is local only, and continuous benchmarking can save us a lot of pain.
So with that, thank you all so much. This has been Run Fast! Catch Performance Regressions in Python. That's the GitHub repository for Bencher if you want to check it out. And if you wouldn't mind, please give us a star. It really does help the project. And if that GitHub link is too long to type out, then you can just go to bencher.dev/repo, and it'll redirect you right there. All right. Thank you all so much.
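As a footnote to the alerting discussion in the talk, here is one simple way a threshold check of that kind could be sketched: flag a new result when it sits more than a chosen number of standard deviations above the historical mean. This is an illustration under stated assumptions, not necessarily how Bencher computes its thresholds; the function name, the `z_limit` parameter, and the sample timings are all made up.

```python
from statistics import mean, stdev


def exceeds_threshold(history, latest, z_limit=3.0):
    """Return True if `latest` is more than `z_limit` standard deviations
    above the mean of `history` (hypothetical helper)."""
    if len(history) < 2:
        return False  # not enough data to estimate the spread
    center = mean(history)
    spread = stdev(history)
    if spread == 0:
        return latest > center
    return (latest - center) / spread > z_limit


# Illustrative timings in seconds (made up): earlier versions cluster
# closely together, while a naive Fibonacci run lands far outside.
history = [1.00e-4, 1.05e-4, 0.98e-4, 1.02e-4]
print(exceeds_threshold(history, 1.03e-4))  # within normal variation
print(exceeds_threshold(history, 0.12))     # far beyond the threshold
```

The more history there is, the better the estimate of normal run-to-run variation, which is why tracking every CI run matters.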

Everett Pompeii

Founder & Maintainer @ Bencher
