Conf42 Python 2022 - Online

Low Overhead Python Application Profiling using eBPF


In this talk, we will demonstrate low-overhead profiling tools for user-land applications, specifically those written in high-level languages. eBPF is a good basis for profiling tools in general; PyPerf, a BCC-based open-source tool of that kind, provides low-overhead profiling of Python applications.

This talk will walk through CPython internals and will then dive into PyPerf. It will then present a comparison to traditional profiling methods, review the benefits of eBPF-based profilers over user-land, system-call-based ones, and show how eBPF provides unique seamlessness and full transparency for Python applications.


  • Yonatan Goldschmidt talks about low-overhead Python application profiling with eBPF. Even in production environments, profiling gives you visibility into which parts of your code consume the most resources, which helps you expose interesting performance improvement opportunities.
  • eBPF is a technology that has evolved from the old Berkeley Packet Filter. eBPF programs can be attached to virtually any logical point in the kernel, which makes eBPF the most capable tracing or observability infrastructure on Linux. PyPerf is a low-overhead sampling profiler that can sample at high frequency.
  • PyPerf is a part of gProfiler, which is our system-wide continuous profiler for production environments. It supports numerous runtimes: not only Python, but also Java, Go, and Rust. Feel free to DM or connect on LinkedIn, GitHub, whatever, and please try it.


This transcript was autogenerated. To make changes, submit a PR.
Welcome to my talk about low-overhead Python application profiling with eBPF. Let's begin.
A bit about myself: my name is Yonatan Goldschmidt. I have six years of experience as an R&D specialist in the IDF. I like everything about computers and software, and today I'm a team lead at Granulate's performance research department. About Granulate: we enable companies to optimize their workloads, improve performance, and leverage that to reduce costs. And I also like wine, especially in Italy.
So why is profiling amazing? It's not a new concept, but it's definitely on the rise lately, and it's getting easier and easier to apply and use. Even in production environments, you gain visibility into which parts of your code consume the most resources, and this helps you expose interesting performance improvement opportunities.
Let's talk about profiler types and focus on Python profilers. We start with deterministic profilers, or tracing profilers. They track your program's execution in a deterministic way, for example by instrumenting all code paths, to give deterministic results. They are very common and many types exist. Probably the most well-known one is cProfile, which is included in the Python standard library, and to the right of this slide we can see example output from it. Now, deterministic profilers are very useful during development, as they are very versatile and can give accurate metrics at the function and line-of-code level. However, their intrusive design, the need to insert instrumentation in the code or in the interpreter, means they can introduce high overhead to the code execution. They might also require code changes, for example to enable or disable the profiler, or deployment changes: for example, you need to start your application with the profiler's script.
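As a rough illustration of the code changes a deterministic profiler can require, here is a minimal sketch that wraps a call in cProfile from the standard library. The workload `slow_sum` and the helper `profile_call` are hypothetical names made up for this example; they are not from the talk.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Hypothetical workload: sum of squares computed in a Python loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_call(func, *args):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    profiler.enable()        # instrumentation is switched on in code
    try:
        result = func(*args)
    finally:
        profiler.disable()   # and switched off again
    report = io.StringIO()
    stats = pstats.Stats(profiler, stream=report)
    # Show the five entries with the highest cumulative time.
    stats.sort_stats("cumulative").print_stats(5)
    return result, report.getvalue()


if __name__ == "__main__":
    result, report = profile_call(slow_sum, 100_000)
    print(result)
    print(report)
```

The deployment-change variant is to leave the code untouched and start the program through the profiler instead, for example `python -m cProfile -s cumulative app.py`, where `app.py` stands in for your own script.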
These reasons make deterministic profilers less suitable for production use, because you do not want to introduce any overhead, and you preferably do not want to make any changes just for the sake of profiling.
Now, another profiler type is statistical profilers. These work by taking snapshots, or samples, of your application at a set interval, for example every millisecond or every microsecond, instead of continuously tracking everything that's happening. Over enough time, the accumulated samples portray an accurate image of your application. One common example is py-spy, which is a sampling profiler written in Rust. This image also shows one way to visualize the output of py-spy. It's called a flame graph, and it tells us the relative execution time of different functions and flows in your application.
Now, since these samples can be taken externally, these profilers can be made external to the application, as in non-intrusive, so to some extent they do not introduce any overhead to the application itself. The profiler itself is a program running on the system, so it does introduce some overhead to the system, and we'll talk about that overhead when we finally reach eBPF. Since it's non-intrusive, we do not need to make any changes to the code or deployment. For example, py-spy can start profiling any running CPython process just by being given the process ID, which is very convenient. For these reasons, statistical profilers are much more suitable and safe to use in production. Now, deterministic profilers are generally more versatile in their abilities, so for development environments, when you want to accurately measure a specific function or module, you might still want to use them.
That's all for the pre-eBPF era. Now let's see what eBPF brings to the table for profilers. A primer on eBPF.
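As an aside on the statistical profilers described above, the sampling idea can be sketched in a few lines of Python. This is only an in-process illustration under stated assumptions: it uses the CPython-specific `sys._current_frames()` to snapshot another thread's stack, whereas an external profiler such as py-spy reads the target process's memory from outside. The names `sample_stacks` and `busy_work` are hypothetical.

```python
import collections
import sys
import threading
import time


def sample_stacks(target_thread_id, interval_s, duration_s):
    """Periodically snapshot one thread's stack and count identical stacks."""
    counts = collections.Counter()
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        # Top frame of the target thread, or None if it is not running.
        frame = sys._current_frames().get(target_thread_id)
        stack = []
        while frame is not None:
            stack.append(frame.f_code.co_name)
            frame = frame.f_back
        if stack:
            # Root first, leaf last, joined with ';' (the "folded" stack
            # form that flame graph tools commonly consume).
            counts[";".join(reversed(stack))] += 1
        time.sleep(interval_s)
    return counts


def busy_work(stop_event):
    """Hypothetical workload that burns CPU until asked to stop."""
    x = 0
    while not stop_event.is_set():
        x = (x + 1) % 1000003


if __name__ == "__main__":
    stop = threading.Event()
    worker = threading.Thread(target=busy_work, args=(stop,))
    worker.start()
    samples = sample_stacks(worker.ident, interval_s=0.001, duration_s=0.5)
    stop.set()
    worker.join()
    for stack, count in samples.most_common(3):
        print(count, stack)
```

The more samples a stack accumulates, the larger its share of the execution time, which is the quantity a flame graph draws as width.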
eBPF is a technology that has evolved from the old Berkeley Packet Filter (BPF), which is a mechanism in the kernel that allows the user to define filter programs for sockets, like the one displayed on the screen. It was used mostly by sniffing programs such as tcpdump. The filter mechanism is essentially a small virtual machine with a set of opcodes and operations that it can perform on packet data. For example, the program displayed here checks if the packet's source or destination IP address is localhost, and if the source port or destination port is 80, and you can see the BPF assembly instructions for that program.
Now, years later, this simple interpreter for user programs has been enhanced with many more APIs that are no longer limited to packet inspection. Also, the programs can now be attached to virtually any logical point in the kernel, not just to the entry of packets. Together, this makes eBPF the most capable tracing or observability infrastructure on Linux.
Here's a short example. To the right we have the code of an eBPF program called opensnoop. It's written in a language called bpftrace, which is later compiled to the same BPF assembly we saw earlier. You can read about bpftrace online. This program hooks onto the open system call and thus intercepts all open calls throughout the system. To the left you can see sample output from running it on my box: all sorts of different PIDs and programs opening different files. You can see how relatively easy it is to write this simple code that attaches to a syscall and traces all calls throughout the system. And I haven't even mentioned the negligible performance effect, which is something that we just didn't have before eBPF.
This table describes the differences between standard user code, kernel code, and eBPF. The core thing you need to take from it is that eBPF is safe.
By design, a verifier mechanism exists which ensures that only safe programs execute. It also means that eBPF programs are not entitled to do anything they please. For example, they are not able to call arbitrary system calls or perform arbitrary writes to memory. On the other hand, eBPF programs have fast access to resources such as memory. For example, they can access the memory of the currently running Python application much faster than py-spy, which is an external application that has to make system calls in order to read the memory of the Python application.
Now let's get back to Python. We needed a low-overhead sampling profiler that can sample at high frequency and can easily profile all Python applications running on the system. Plus, we wanted it to be able to extract native stacks and kernel stacks. py-spy, while not introducing overhead on the application itself, does have some overhead on the system. As I said, it needs to access the Python process's memory in order to extract stack traces, and it makes a lot of syscalls to do that, which take time. py-spy simply wasn't fast enough when we needed to profile a large Python application with hundreds of threads at high frequency. So we started looking into the eBPF approach and quickly found PyPerf, which was posted to BCC as a PoC of an eBPF-based Python profiler. By the way, we also found a project called rbperf, which is like PyPerf for Ruby, but that's a different story.
So we spent a while and added many new features to PyPerf, trying to make it the best Python sampling profiler. First of all, we made sure it supports all currently available Python versions. We made it a system-wide profiler; that is, it profiles all running Python applications on the system, unlike py-spy, which works on a per-process basis. If I want to profile 50 Python applications, I need to invoke 50 different py-spy instances, which then introduce more overhead. With PyPerf, I need to invoke it just once and it profiles the entire system.
Additionally, we have added logic to extract native stacks, such as Python extensions (for example json, pickle, NumPy), interpreter code, and native libraries. And we also extract kernel stacks, which can show, for example, the system calls your application is making. These features were relatively easy to add because PyPerf is eBPF-based; it would have been much harder, if not impossible, had it not been eBPF-based.
So here's an example of how it looks. This is a simple, uniform application. The yellow rectangles are the Python frames from the Python application, the purple frames denote native code, and the orange frames denote kernel code. Together, the combination of those three portrays a very accurate image of the application's execution.
Now, I will be speaking a lot about native code, which is something that many profilers intentionally overlook, saying that developers should care only about the Python code because they do not have control over the native code anyway, so the native frames and stacks are unwanted noise. However, from our experience, we know that taking the native profile into account is very important when you want to truly understand what's going on and which operations at the CPython level are taking the most CPU and time. Therefore, we have invested in making this feature work perfectly in PyPerf.
So now we'll do a small exercise. I have this function written here. Can you read its code and guess which operations take the most time? I'll give you a minute to think, and then we will check out the results. Actually, it's recorded, so you can just pause and continue when you're ready. I'll continue now. So here I've cut out the relevant native profile of this function.
The bottommost frame is the Python function itself, and all frames above it are the native functions that our Python function, func as I've named it, is calling. I've added some arrows to explain what is coming from where, and we can see some things that, after I wrote this, I did not expect the profile to show. For example, I did expect the string concatenation, which we can see to the right, to take a relatively large part of the profile, and it does. However, I did not expect the chr calls to take a large part of the profile. Also, the modulo operator takes a relatively large part of the profile, and I only realized that once I looked at the native profile.
What I'm trying to tell you is that once we observe the native profile, even of a simple Python function, we can quickly devise ideas on how to improve its Python code. For example, after viewing this profile, I now know that the most important optimization is to switch from string concatenation to StringIO. After doing that, the next thing I would do is probably cache the results of chr, and after that I would try to avoid the modulo operator. Now, the comparison, which I thought would take a large part of the profile, actually takes almost nothing. You can see it in the middle; it's rather small. So you need to profile, and you need to look at the native profile, in order to truly understand how even a simple Python function divides its execution time.
So that's it on PyPerf. I hope the last part was interesting. Now, PyPerf is a part of gProfiler, which is our system-wide continuous profiler for production environments, and it supports numerous runtimes: not only Python, but also Java, Go, Rust, and Ruby. So check it out, it's open source. Thank you. Feel free to DM or connect on LinkedIn, GitHub, whatever, and please try it. It's fun.
Try gProfiler at gprofiler.io. Thank you.
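The function from the exercise slide is not reproduced in this transcript, so the sketch below uses a hypothetical stand-in, `func_concat`, that merely contains the operations named in the talk: a chr call, the modulo operator, a comparison, and string concatenation. `func_stringio` applies two of the ideas mentioned, assuming `io.StringIO` as the replacement for repeated concatenation and a precomputed table as the cache for chr; the modulo is left in place for brevity. Neither function is the speaker's code, and no speedup is claimed here.

```python
import io
import timeit

# Cached results of chr for the 26 lowercase letters (97 is ord("a")).
LETTERS = [chr(97 + k) for k in range(26)]


def func_concat(n):
    """Hypothetical stand-in for the exercise function."""
    out = ""
    for i in range(n):
        c = chr(97 + i % 26)   # chr call and modulo operator
        if c != "q":           # comparison
            out += c           # string concatenation
    return out


def func_stringio(n):
    """Same result, writing to a StringIO buffer and reusing cached letters."""
    buf = io.StringIO()
    for i in range(n):
        c = LETTERS[i % 26]    # table lookup instead of calling chr
        if c != "q":
            buf.write(c)
    return buf.getvalue()


if __name__ == "__main__":
    n = 100_000
    # Both variants must produce the same string before timing them.
    assert func_concat(n) == func_stringio(n)
    for fn in (func_concat, func_stringio):
        seconds = timeit.timeit(lambda: fn(n), number=20)
        print(f"{fn.__name__}: {seconds:.3f}s")
```

Timings depend on the interpreter version and build, so it is worth measuring on your own setup, ideally with a native-stack view like the one discussed in the talk, before settling on a rewrite.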

Yonatan Goldschmidt

Principal Engineer @ Granulate

