Conf42 Golang 2024 - Online

From Slow to Go: Boosting your code with Profile-Guided Optimization

Abstract

This talk uncovers the performance boosts PGO offers to Go apps, from streamlining hot paths to minimizing resource consumption. It is not only a theoretical explainer: we practically instrument Go binaries to show the real power of PGO, while also touching on its industrial implications.

Summary

  • Yash is a software engineer at Red Hat and is pursuing a master's at the University of Waterloo in Canada. Today he will talk about profile-guided optimization and how it can boost your code. In his free time he likes to do long-distance running and build stuff.
  • The magic happens inside the compiler, behind the scenes. There it performs a bunch of steps to make code more efficient. With the help of this power of the compiler and its optimizations, you can write a very clean piece of code in an abstracted manner.
  • Inlining originates from a problem: the act of calling a function is a slow operation at runtime. But if a function has too many invocations, blindly inlining it produces too much new code. This can lead to a bloated binary. The solution is that compilers could learn more from application code.
  • The kernel has a beautiful component called Linux perf, and Linux perf schedules a bunch of interrupt events with the CPU hardware. PGO means compiler optimizations which are guided by profiles of your code collected during its runtime.
  • The decision not to inline a function is a consequence of multiple factors. The overhead associated with calling a function like io.ReadAll, and it not getting inlined, can be huge. It would be quite nice to inline the io.ReadAll function and leverage the runtime benefits accordingly.
  • The new binary has more inlining in it for the sake of better runtime benefits. With PGO, the average runtime actually reduced, or improved, by roughly 2% (1.88%). Behind the scenes, a lot of other internal functions might be getting further inlined because of this profile-guided optimization.
  • Let's play with profile-guided optimization. Let's first build our code without PGO, calling the binary main-no-pgo, and finally execute main-with-pgo. This time let's actually perform benchmarks.
  • So we explored the process of compilation and how compilation can be made more effective in a feedback-driven way. You can find all of these slides and the associated content at the link below, under my GitHub profile, along with the references. Finally, feel free to connect with me.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, welcome to Conf42 Golang. I hope you had a great time with all the talks until now. I am Yash, and today I am extremely excited to be talking about profile-guided optimization and how it can boost your code and your compiler. So moving on, let me first of all introduce myself. I am Yash and currently I am working as a software engineer at Red Hat. I am also pursuing a master's at the University of Waterloo in Canada, and around my work I deal with OpenShift, Kubernetes, Golang, containers, Docker, cloud-native stuff. Apart from that, in my free time, whenever I get some time, I do open source dev, I love to do long-distance running and I like to just build stuff. So moving on, let's first of all talk about compilation, right? So computers, as you all know, don't understand Golang, Python, Java, all that stuff. Computers only know zeros and ones, that's the binary language. So there should be an entity which translates Golang code to zeros and ones so that the computer can execute it. And that entity, as most of you would know, is called the compiler, as shown in the image below. You can see that on the left-hand side there's Golang code printing hello world, and there's some compiler magic happening which produces the machine code, which is the binary or assembly, and that is fed to the linker to produce the actual binary or executable which you execute to print hello world. Across this talk, we won't talk much about the linker; we will care and talk more about the compiler. So let's move on and dive a little bit into the magic inside the compiler. The compiler is an amazing piece of tech, I mean, it's doing so many things so gracefully and beautifully. Behind the scenes it does a bunch of steps. Firstly, it breaks down the code into a bunch of tokens through the means of a lexer; that process is tokenization. Then that set of tokens is fed forward and a tree is created out of it, called the abstract syntax tree, which is a representation of your code. That tree then goes through a phase of semantic analysis to determine whether the type checking is happening correctly, whether the code is correct, all that stuff. After that, the code is converted into a platform-independent representation called IR, which is intermediate representation, and that is fed to the backend of the compiler, where a bunch of optimizations happen and a lot of magic happens, which, by the way, we will dive into further in this talk. And finally, this optimized IR is fed to the machine code generator, which actually produces the machine code and, yeah, feeds it to the linker to generate the executable. So let's talk about the magic we care about, which is the compiler optimizations. The thing is that the code which you write, the code which I write, might not be the most efficient piece of code behind the scenes, meaning that there can be a bunch of optimizations which can be made to it, and the compiler takes care of that for us. So if I give you an example, imagine your code has an if statement which will never be executed. For example, it says: if false, print hello world. Now, you and I both know that this if statement is absolutely useless. And the compiler is also smart enough to go through these lines of code and see that, hey, this if statement would never execute, so let's not even include it during the compilation.
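To make that concrete, here is a minimal, hypothetical sketch (not the talk's demo code) of the kind of branch the compiler can prove dead and drop at compile time:

```go
package main

import "fmt"

func main() {
	// The compiler can prove this condition is always false, so the
	// whole branch can be dropped from the generated binary.
	if false {
		fmt.Println("hello world") // dead code: never reached, never emitted
	}
	fmt.Println("hello world")
}
```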
So in this manner, the compiler actually ends up ignoring this small chunk of code and ends up producing a much lighter executable or binary. Now, this was one trivial example, but the compiler actually applies a bunch of crazy optimizations, and the ultimate benefit is that you end up with a smaller executable. There are fewer instructions and fewer code jumps happening at runtime because of the lower number of instructions. And also, the compiler can optimize your code during compilation on the basis of the underlying hardware on top of which your code will run. For example, if your code is doing a lot of GPU programming and the hardware on top of which the compiled code runs also has a lot of GPU hardware, then the compiler can really make use of this information and optimize your code accordingly. With the help of this power of the compiler and its optimizations, you can actually write a very clean, beautiful and readable piece of code in an abstracted manner, and you won't have to bear the performance overhead, because the compiler will deconstruct these abstractions behind the scenes. So let me give you some examples of these optimizations. One optimization is pre-calculation of constants, where if you have an expression like a = 2 * 3 + 5, the compiler, during compile time itself, can calculate this and say that, hey, there is no need to compute this again and again at runtime; let me just compute it and straight away feed it in during the compilation phase itself. Similarly, there's a process of loop unrolling, where a for loop is unrolled further so that at runtime, when the for loop runs, it does not have to check the loop condition as many times. That also saves a bunch of CPU cycles. And finally, there's an optimization called dead store elimination, where a bunch of useless code is ignored. For example, in the right-hand image you can see that the variable is constantly getting updated, but all of those updates don't matter because the last update is going to overwrite all of the previous updates. So the compiler is smart enough to notice this, and it ignores the two lines above and only considers the last line during the compilation. So the compiler does a lot of these cool optimizations, but we are interested in one optimization, and that is inlining, and this talk is going to be centered around this optimization. So inlining originates from a problem. The problem is the act of calling a function. All of us write code in a pretty functional manner: we define functions, we call those functions multiple times. The problem is that at runtime, the act of calling a function is a slow operation, because when you just invoke the function, a bunch of stuff happens behind the scenes which adds up to the runtime overhead. For example, a new stack frame is dedicated for the scope of the function, the parameters are pushed onto that stack, and whenever the function completes its execution, the return values are passed back to the caller. So all of these things add up to the performance overhead at runtime. So what the compiler does is perform this optimization called inlining, where the compiler takes the body and implementation of the function and just places it wherever that function is invoked.
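In code terms, here is a small sketch in the spirit of the slide's example of what that looks like (the sum function below is illustrative, not the talk's demo code):

```go
package main

import "fmt"

// sum is a tiny leaf function, so the compiler is happy to inline it.
func sum(x, y int) int {
	return x + y
}

func main() {
	// After inlining, these calls effectively become res1 := 1 + 2 and
	// res2 := 2 + 4, which constant folding then reduces to 3 and 6 at
	// compile time, so no function-call overhead remains at runtime.
	res1 := sum(1, 2)
	res2 := sum(2, 4)
	fmt.Println(res1, res2)
}
```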
So if you look at the image below, we have a function called sum, which sums up two parameters, x and y, and we have a main function where we are invoking this function twice. So if we perform inlining here, the compiler can actually just strip off the definition of the sum function and just replace the invocations of the sum function with their actual implementation. So sum(1, 2) becomes 1 + 2, and sum(2, 4) becomes 2 + 4. And the beautiful part is that this newly inlined piece of code can be further optimized through, say, pre-calculation of constants, to actually know in the compilation phase itself that res1 equals 3 and res2 equals 6. So from a runtime perspective, things become much more efficient, and that ends up eliminating the function calls as well. But we have a problem. Nothing is perfect in this world. What if you have defined a function and it has too many invocations? If there are too many invocations, then the inlining will be too much if we perform it blindly, and that would lead to too many new lines of code. This means that the code size would increase massively, and that would lead to a bloated binary, and that ends up having some bad consequences at runtime. For example, if I really go into the depth, then a bloated binary basically means, behind the scenes, a larger number of static instructions, and the higher the number of static instructions in an executable, the bigger the pressure on the instruction cache, and that leads to a bad cache hit ratio, or cache hit rate, and that leads to instruction cache misses, which is quite bad at runtime. Moreover, a bloated binary causes page faults and whatnot. And if you are running your binary on a small device like a Raspberry Pi or something, even thrashing could happen. So blindly inlining a piece of code can be a bad situation. I have shown that as an example on the right-hand image as well: inlining 500-odd lines of code can result in a 1000-line result. So we have a weird problem here. Now, if we do less inlining, we have bad runtime performance because of the overhead introduced by those function calls. And if we do more inlining, then we have bad runtime performance due to the bigger executable, the page faults, the instruction cache misses and whatnot. So what to do then? That is frustrating. Well, we just need the right amount of inlining. So what does that mean? Ideally, we want to inline only the hot functions and not inline the cold functions. The hot functions are the ones which happen to execute a lot more frequently at runtime, while the cold functions are the ones which don't execute that often at runtime. The benefit of inlining hot functions is that these functions will execute a lot of times at runtime. This means that if you inline these functions, then you end up avoiding the function call overhead at runtime. So this gives you the runtime benefit of inlining. You don't need to inline the cold functions because they are not executing enough times, so there's no benefit in inlining them. So by avoiding inlining the cold functions, you actually end up saving on the binary size and the instruction count, and save yourself from all that bad stuff associated with excessive inlining. So ultimately we just need a sprinkle of inlining. And the problem is that compilers don't know a lot.
When a compiler is compiling your code, it's literally just reading your code; it does not know much more than that. And just your written code is not enough to tell how frequently a function will execute at runtime. Clearly, compilers need more info. And what if we have a solution? The solution is: what if the compilers could look at the application at runtime and learn from that application which functions are hot and which functions are cold? Or in other words, from a more implementation-oriented perspective: what if your application runs, you somehow collect various numbers and metrics about its behavior at runtime, and finally feed those numbers and that behavior as information to the compiler the next time you compile your code? With that, the next time you compile your code, the compiler would know that, hey, this function is hot and that function is cold, so let me just inline this function. So this kind of looks like a feedback loop, right? And that brings us to feedback-driven optimization. As the name suggests, it just teaches the compilers how and where to optimize the code on the basis of feedback. Now, that feedback can be benchmarks, user traffic, profiles, all that kind of stuff. That feedback ultimately tells the compiler which function is hot and which function is cold, and ultimately the compiler just does the right amount of inlining. Now, the early days of feedback-driven optimization were kind of governed by an instrumentation-based process. So what is instrumentation? It's very simple. With instrumentation, what happens is that whenever your code is compiled, the compiler actually injects extra lines of code within your code. And what are these extra lines of code? Some timers, some call counters, all that kind of stuff. And the purpose of these additional lines of code is to track and instrument the behavior of your code at runtime. So once your code is compiled with these extra lines of code, a bunch of benchmarks are run against your application, and through the means of these extra lines of code, while these benchmarks are running, a bunch of information gets instrumented and collected, and this information becomes the feedback for the next build by the compiler. This information ends up representing how your application is performing at runtime. Now, this sounds good, but is it really that efficient or that reliable? Because the thing is that now, with this whole process, the code becomes much more bloated with all that compiler-introduced instrumentation. And we have an extra benchmarking step, which just makes the entire process of compilation and building quite slow and tedious. And the biggest problem is: what if the benchmarks don't even resemble the reality of how your code runs in production? What if it's actually quite the opposite? The benchmarks might end up introducing some wrongfully assumed optimizations, and that can cause some really bad performance degradation. A potentially hot function might be perceived as a cold function by the benchmark, and your compiler would not optimize it rightfully. So what do we want? Let's talk first principles. We want faster build times. We want realistic runtime data instead of benchmarks pretending to be real. And finally, we want light and small executables with no extra lines of code due to instrumentation. If you hate benchmarks, don't use them.
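To make the "extra lines of code" idea above concrete, here is a rough, purely illustrative sketch of the kind of call counter an instrumenting build would inject; the Go toolchain does not literally emit this, and all the names are made up:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// renderCalls is the kind of counter an instrumenting compiler would inject
// for each function it wants to track (hypothetical, for illustration only).
var renderCalls int64

func render(md []byte) []byte {
	atomic.AddInt64(&renderCalls, 1) // injected call counter
	// ... the actual rendering work would go here ...
	return md
}

func main() {
	for i := 0; i < 1000; i++ {
		render([]byte("# hello"))
	}
	// The collected counts become the feedback for the next compilation.
	fmt.Println("render was called", renderCalls, "times")
}
```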
So what we can do is that instead of benchmarks, we can actually use the actual behavior of your code as the feedback to your compiler. And the beauty is that if you do that, you have now become independent of the benchmarking process as well, with which you can just compile your code and that's it, you are done. You don't have to execute the benchmarks, and this gives you the faster build times. Also, this is more realistic, because you're not relying on any pretentious benchmarks pretending to be real users. So how do you do that? To be specific, it's sample-based profiling. The beauty of profiling is that it tracks the runtime behavior of your code without needing those extra lines of code inserted for the sake of instrumentation. And when I say profiling, I mean sample-based profiling, because if we go really technical, then instrumentation is also a type of profiling. So what I'm talking about here is actually sample-based profiling, but across this talk, whenever I say profiling, just hear sample-based profiling. Alright, so how does it work? It's pretty simple. Now, since we don't have those additional extra lines of code added to our code during the compilation phase, we need some external entity to poke, monitor and profile our application. And this is where the kernel enters the picture. So imagine an application is running in user space. How does it get profiled? Well, the kernel has a beautiful component called Linux perf, and Linux perf schedules a bunch of interrupt events with the CPU hardware, and the hardware inside the CPU ends up scheduling and triggering those interrupts. And whenever those interrupts are triggered, the interrupt handler captures them inside the kernel, and the interrupt handler correspondingly pokes your application in user space and gets the runtime data at that point in time. Now, that runtime data includes the instruction pointer and call stack, which is enough to tell which function is executing at that time, which parameters are allocated in that function, what the memory footprint is, what the resource footprint of that function is. So a bunch of good profileable data, right? And that's it. Once all of this data has been collected, it is stored somewhere to ultimately be used for whatever purpose. And when we leverage that in the feedback-driven optimization situation, we get profile-guided optimization. PGO means compiler optimizations which are guided by the profiles of your code collected during its runtime. So again, the same thing: the compiler goes through a feedback loop where the application runs, you collect its runtime behavior and feed that to the compiler, and the compiler optimizes accordingly. Here the feedback is just the profiles, sample-based profiles collected during runtime, right? So enough talk, let's walk through the code. Let's take an example. Let's say we have a very simple server which exposes a POST endpoint called /render. And whenever you call the POST endpoint with a markdown file as the body, you get rendered markdown in response, so an HTML-rendered markdown. Let's say this server is running, serving a bunch of users out there, and you want the compiler to optimize it just rightfully, depending on the usage patterns. So first of all, let's just compile this code without PGO, without profile-guided optimization, and see what inlining decisions it takes.
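The transcript doesn't reproduce the demo's source, so here is a minimal sketch of what such a /render server could look like; the renderMarkdown placeholder and the net/http/pprof profiling endpoint are assumptions for illustration, not the talk's exact code:

```go
package main

import (
	"io"
	"log"
	"net/http"
	_ "net/http/pprof" // exposes /debug/pprof on the default mux so the live server can be profiled
)

// renderMarkdown is a stand-in for a real markdown-to-HTML renderer;
// the actual demo would call into a markdown library here.
func renderMarkdown(src []byte) []byte {
	return src
}

func main() {
	http.HandleFunc("/render", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "POST only", http.StatusMethodNotAllowed)
			return
		}
		body, err := io.ReadAll(r.Body) // the io.ReadAll call discussed below
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		w.Header().Set("Content-Type", "text/html")
		w.Write(renderMarkdown(body))
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```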
So after compiling this code, you can see that it's performing a bunch of inlining. I mean, it's quite readable here, but if I expand it further, specifically these method calls, these function calls, are getting inlined. But if you notice carefully, there are two places where this inlining is not happening. Now, the decision not to inline a function is a consequence of multiple factors. One reason is that a function may be a non-leaf function: imagine a tree where each function call leads to a child node. If a function is a non-leaf function, there's a good chance it won't get inlined. Not necessarily, but a good chance. So probably that might be the reason why these two functions are not getting inlined. But if you really think about it, inlining could have been useful here, because these functions happen to be a part of the render function, which always gets executed whenever that endpoint is hit. And imagine thousands of users hitting that API again and again and again; the function call overhead, the overhead associated with calling the function, for example the io.ReadAll function not getting inlined, can be huge. So it would be quite nice to inline the io.ReadAll function, for example, and accordingly leverage the runtime benefits of it. But again, this is a piece of information which can only be known if you run the code. Actually, if there are barely any users, then it doesn't make sense to inline this function, right? So let's proceed and let's run and profile the program. In the first image I built the program: I built the binary without PGO and exported it as main-no-pgo. I ran the no-PGO binary, so the server was running via this binary, and then I ran a load test generator against this markdown server. And once I started executing the load, behind the scenes I opened up a new terminal, a new session, and I just started profiling the server for 30 seconds to collect the data associated with it. And finally, when the new set of profiles was generated after the 30 seconds of profiling, I stored it in a file called default.pgo. So ultimately we ended up with these files: main-no-pgo, which is the binary, roughly 8.5 megabytes large, and the CPU profile, the cpu-nopgo.pprof or default.pgo file; both of these are the same file. They represent the profile of our application running at runtime, facing simulated user load. Now let's compile our program with PGO, and you'll notice the -pgo=auto flag here. This basically tells the Go toolchain to use the default.pgo profile as feedback to compile the code. So this default.pgo actually told the Go toolchain how frequently the load was hitting the render function and io.ReadAll. And as you can see, the compiler now finally decided that it should inline the io.ReadAll function as well, which wasn't getting inlined previously, by the way. And to be honest, this is not all of it; behind the scenes, a lot of other internal functions, which might be getting called by internal libraries, might be getting further inlined because of this profile-guided optimization. And we can actually confirm that: if you list both the binaries, before and after PGO, you can see that the previous binary was smaller and this newer one is larger in size, roughly 0.2 megabytes larger.
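The next step is to benchmark both binaries and compare the results with benchstat. The talk's benchmark code isn't shown in the transcript, but a benchmark in that spirit might look roughly like the following (hypothetical names; it assumes the server under test is already running on localhost:8080):

```go
package main

import (
	"bytes"
	"net/http"
	"testing"
)

// BenchmarkRender fires POST /render requests at an already-running server
// binary; run it once against main-no-pgo and once against main-with-pgo,
// piping the output to no-pgo.txt and with-pgo.txt for benchstat to compare.
func BenchmarkRender(b *testing.B) {
	body := []byte("# hello\n\nsome *markdown* to render")
	for i := 0; i < b.N; i++ {
		resp, err := http.Post("http://localhost:8080/render", "text/markdown", bytes.NewReader(body))
		if err != nil {
			b.Fatal(err)
		}
		resp.Body.Close()
	}
}
```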
That's because the new binary has more inlining in it for the sake of better runtime benefits. So of course its size is going to be larger as well, because with inlining the code size increases, right? Because in your binary, instead of function invocations, you would actually have the definitions of the functions wherever they are inlined. Alright, so now let's load test the old and new binaries. Again, as the first step, I ran the old binary, which is main-no-pgo, ran a bunch of benchmarks, and dumped those benchmark results into a text file, which is the no-pgo text file. I did a similar thing with the binary with PGO and dumped its benchmarks into a separate text file called with-pgo.txt. So ultimately we have two text files: one file contains the results of the benchmarks of the old binary, and the other one contains the benchmarks of the new binary. And one thing to be noted here: there is no change of code in either of these binaries. No change of code, I did not change anything. So it is just the magic of profile-guided optimization which we are about to witness. As we proceed further, if we use benchstat to compare both of these text files, you would actually notice that with PGO, the average runtime actually reduced, or the runtime performance improved, by roughly 2%, 1.88% to be precise. Now, I know it's not much, but the thing is that we just simulated a very slight amount of load, and this was a fairly simple application. But imagine a fully fledged server with thousands and millions of users. In that case, some serious optimizations can be performed, and this small difference of 2% can stretch further to even 5 or 10%. You never know, right? And the best part is that the amount of effort involved with this improvement was nothing, it was negligible. I did not have to change a single line of code and I got this benefit out of the box, right? So let's do one thing, let's get our hands dirty. I mean, slides and all are fine, but let's actually play with profile-guided optimization, right on the terminal. Alright, so I pop open my terminal and you can see that I have a bunch of files here. So let me do one thing, let me clean up this stuff. I'll first of all clean up all the text files, we don't care about those results anymore. I'll clean up the old binaries, main-with-pgo and so on; I'm just cleaning up all the old binaries. And let me clean up all the old profiles as well. And of course, let me remove the default.pgo file as well. So we are at scratch now. Okay, we are at scratch, we don't have anything. Now let's first build our code without PGO; let's call the binary main-no-pgo, and we are compiling the main.go file. Simple. Now with this we have compiled the code, and if I run this... alright, this is running. Now let's go to a separate terminal again, into a separate session, and actually let's run the load test. Okay, the load test is now running behind the scenes against the application. Alright. Now if I go here and start the profiling process, let's say for 30 seconds. Now for 30 seconds I've started the profiling process.
And with this profiling process, what's happening is that behind the scenes, my binary, my server, is getting hit by requests from this load. This means that it is actually facing what simulates real-time user traffic, and the profiling process is capturing all of the information at each and every point in time: what is the memory footprint, what is the CPU footprint, what is the resource footprint of the binary execution. So as you can see, we are done and we have the profile ready here. Let's just rename our profile to default.pgo, because that's the name recommended by the Go toolchain. Of course you can have a custom name as well, but just for the sake of convenience, I put it as default.pgo here. Now let's do one thing. As you can see, we have the default.pgo here. Now let's compile again. If I do the compilation process again, this time let's call the output binary main-with-pgo; we're compiling main.go and we will introduce the -pgo=auto flag. This basically tells the compiler to read the default.pgo file as feedback for performing the compilation. And let's see, it's definitely taking some more time as compared to last time, because now it's scanning through the profiles and accordingly taking the right optimization decisions. And yeah, it's just waiting, it's building a lot of suspense. But as you can see, it's doing a lot of compilation and whatnot, and it's taking a lot of time behind the scenes. Yep, we have stopped all the stuff. So yeah, it's now built, and if I do a go version -m on main-with-pgo, you can actually see that it did consider this default.pgo file while doing the compilation. Alright, now let's run this main-with-pgo binary. But before that, let me show you something interesting as well. Let's look at the sizes. I'll do an ls to list; yeah, just notice the sizes, and actually, let's do a grep main as well. Notice how the size of the one with PGO is actually larger and the one without PGO is smaller, because the profile actually told the compiler to perform more inlining, which led to the increase in the binary size. And finally, let's execute main-with-pgo. This time let's actually perform benchmarks. Let's do some benchmarking here. So I'll start the benchmarks here, and while these benchmarks are happening, let me tell you what's happening. The benchmarks are running and they're basically firing a bunch of load against the binary with PGO, and they're storing the results inside this file called with-pgo.txt. So right now we are doing this with PGO. Once these benchmarks are done, we will do the same without PGO, and then we can use a tool called benchstat to compare our results. Alright, so let's wait for a few seconds, this should be done anytime soon. And there's nothing running in the second terminal; the way we generated the load previously was only for the sake of generating profiles, nothing else. Right now we are doing benchmarking, so we are doing it in a different manner. And as you can see, the benchmarks have executed. Now let's close this binary and run the one without PGO. And as we run here, let me just name this results file no-pgo. And again, this time as well we are running the benchmarks, but this time we will snapshot these benchmarks in a separate file called no-pgo.txt.
And as I said, we will separate out these files and then run benchstat against them to compare the real numbers, to see how things happen at runtime. They might be slightly different from what I showed in the slides, because again, the load is kind of non-deterministic in that manner. But yeah, let's see how it goes. It should be done anytime soon. We have nothing here and the no-PGO binary is running here. Yep, it's done. Now if you do benchstat no-pgo.txt with-pgo.txt, you can see what a beautiful difference it is. The one with no PGO, your average time was 309 microseconds; with PGO it was 301. This is an even more significant difference with PGO, around 2.33%, as compared to the slides. So let's go back to the slides and make some concluding notes. So that's pretty much it. I guess we learned quite a lot and saw how profile-guided optimization can actually benefit us a lot. We explored the process of compilation, how compilation can be made more effective in a feedback-driven way by feeding it some runtime data, and how sampling-profile-based profile-guided optimization works even more effectively in our favor. And we actually got a very practical perspective by getting our hands dirty, by playing with profile-guided optimization with server-like code resembling a real-life scenario. You can find all of these slides and all of the associated content at the link below, under my GitHub profile, and let me share the references. There are a bunch of references I used to learn about this topic and to build the content of these slides; of course, I just stood on the shoulders of these giants who implemented all of this cool stuff. Finally, feel free to connect with me. All of my handles are given here. And that's all, folks. Thanks a lot for your time. I appreciate you giving me your time and attention for attending this talk, and feel absolutely free to raise any questions or reach out to me whenever. Hope you have a great day.
...

Yashvardhan Kukreja

Software Engineer (SRE) @ Red Hat



