Conf42 Golang 2024 - Online

Go Performance Unleashed - Memory Model, Profiling and Optimization for your Go apps

Abstract

Have you ever asked yourself why Go is fast and how to squeeze more speed from your Go apps? This talk dives deep into the Go memory model and runtime scheduler, and into the best practices to use to make your Go code lightning fast.

Summary

  • Marco is a software engineer at ION. He is a master's degree student in AI at the University of Pisa, and he is into everything related to cloud native and Kubernetes. As a first topic, he wants to go deep down through the Go runtime scheduler.
  • Go is a fast language, but why? What makes Go stand out in the gorgeous realm of programming languages when it comes to performance? Two fundamental aspects of Go are the memory model and the runtime scheduler.
  • How can you measure the performance of your Go applications in a systematic way? How do you write a benchmark? It is totally up to the Go runtime to decide how many times to iterate your benchmark. To compare benchmarks with benchstat, the benchmark function must be the same.
  • Profiling is the process of keeping track of all the inner functions that your main function runs. It allows you to track the CPU and memory usage of all instructions. Any benchmark should be careful to avoid compiler optimizations. Always keep track of memory usage, as allocations can actually cause garbage collection.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everybody, it's Marco here. This is my first time as a speaker and I'm glad to be here at Conf42 to bring some goodness in Go. So let me introduce myself. I am a software engineer at ION, I'm a master's degree student in AI at the University of Pisa, and I'm into everything that is related to cloud native and Kubernetes. I also participated at the CLOSER 2023 cloud conference and fortunately I won the best paper award; the work I proposed in that conference was about how to solve smells in a Kubernetes microservices environment.

So let's go ahead, and I'm going to introduce some topics that you might already know, that the more expert among you might already know, but I think they are still interesting to catch. As a first topic, I want to go deep down through the Go runtime scheduler and the memory model and how they impact the performance of Go applications, then how to measure your applications and find bottlenecks, if there are any, and then some best practices.

But first things first. Go is a fast language, but why? What makes Go stand out in the gorgeous realm of programming languages when it comes to performance? To answer that question, we need to look under the hood and examine two fundamental aspects of Go: the memory model and the runtime scheduler.

But let's switch from Go a little bit to Java. Some of you already know that Java uses the native threads of the operating system. That means that every Java thread is mapped to one kernel thread. In this way, Java cannot determine which thread will occupy the core; this is completely up to the OS scheduler, so it's completely dependent on how many threads you have. A problem could be, for example: if I am executing a Java thread inside a certain OS kernel thread, I save the state, and then the Java thread is scheduled onto another OS thread. Then I would suffer from context switching. And we will see an example.

Let's do a Java example here. I have a function that does something, as we see, for x times, and we have a number of threads that are executing this function this number of times. Now, if I run that code with 100 threads and then with 1,000 threads, we get different results, as you can see in the picture. Basically, when the number of threads is set to 100, about 51% of the CPU time is spent in the doSomething function, so our real function. But when you increase the number of threads to 1,000, the CPU time spent on the actual function goes down to about 27%. All of these metrics are basically telling us that the cost of threads in the Java threading model suffers in highly concurrent scenarios.

In standard operating systems, threads are scheduled after a certain amount of time. When a hardware timer interrupts the processor, the OS kernel suspends the currently executing thread, saves its state in the registers, so when it has to resume it, it doesn't lose anything, and then it finds, among all the threads available for being executed, the next one to run. And as I said, this is called the context switching process. And this process is kind of slow, even because, as I said a couple of slides before, there could be a cache miss. This is the main reason why the Go founders created the Go runtime scheduler. Go doesn't totally rely on the OS scheduler; it has its own runtime scheduler.
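As a rough sketch of the Go counterpart to that experiment (this is not the code from the slides; the doSomething body, the worker count and the use of a WaitGroup are assumptions for illustration), the same workload in Go would be spawned as goroutines, and the runtime would multiplex all of them onto a handful of OS threads:

    package main

    import (
        "runtime"
        "sync"
    )

    // doSomething stands in for the CPU-bound work of the experiment.
    func doSomething(x int) int {
        sum := 0
        for i := 0; i < x; i++ {
            sum += i
        }
        return sum
    }

    func main() {
        const workers = 1000 // same order of magnitude as the Java test
        var wg sync.WaitGroup
        wg.Add(workers)
        for i := 0; i < workers; i++ {
            go func() {
                defer wg.Done()
                doSomething(1_000_000)
            }()
        }
        wg.Wait()
        // All of these goroutines are multiplexed onto at most GOMAXPROCS
        // OS threads, plus a few service threads owned by the runtime.
        println("threads available to Go:", runtime.GOMAXPROCS(0))
    }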
And it uses a threading model called the M:N threading model, where basically M goroutines are scheduled among N OS threads. So, as we can see here, there are kind of, let's say, interfaces between the goroutines and the actual kernel threads. The kernel threads in this picture are the white triangles, and the goroutines are the colored ones: a goroutine can be green, which means it is actually running, or red, which means it is in the queue. And the yellow boxes act as an interface; those yellow boxes are the contexts. They are fundamental. Once a context has run a goroutine until a scheduling point, it pops a goroutine off the queue, sets the stack and instruction pointer, and begins running that goroutine. And what can happen is that a yellow box, in this case a context, runs out of goroutines, and automatically, without calling any kind of interrupt, it steals work, it steals goroutines, from other contexts. This makes sure that there is always work to do on each of the contexts, which in turn makes sure that all the threads are working at their maximum capacity.

Taking advantage of this runtime scheduler, Go was built upon the CSP model. CSP stands for communicating sequential processes: basically we have goroutines, and these goroutines can communicate with each other via channels that can be buffered or unbuffered. The cool thing is that we as developers don't have to care about accessing shared data, so we don't have to set up mutexes or semaphores or locks or queues, sync variables and other things like that. This is really, really cool, and it's something that I really appreciate because I love writing pipelines, for example. For pipelines this kind of model is really cool, even because the way two goroutines communicate, as they are not threads, is really, really fast.

Let's go back to Java, and this is a comparison. In this code here we are doing a sleep of ten minutes and we're using 1,000 OS-native Java threads. At the end of the execution we can see that there are actually 1,000 threads plus 18, and that 18 is the number of threads occupied for handling the JVM. And the real power that we see when we execute the equivalent Go code is especially in the number of threads used, because the same code in Go actually needs only two threads. And this is really cool.

We said a couple of slides before that goroutines save their state, right? So Go saves the goroutine state. But let's do a comparison between how OS threads save their state and how goroutines save their state. OS threads have a fixed-size stack for saving the state. But this is kind of a problem, because it is a huge waste of space if we imagine a single goroutine, but at the same time it can be too strict for the hundreds of thousands of goroutines that can be created, which is not a rare event in a Go-based application. Therefore the Go memory model works in another way, because we want to have a good amount of goroutines: in contrast to the approach that we saw before, Go creates very small stacks for each goroutine, around 2 KB. And the surprising thing is that each of these stacks grows and shrinks as needed. This is possible because of the capability of these stacks to borrow memory from the heap. And the fact that these stacks are dynamic allows us to have better memory management.
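To make the CSP idea concrete, here is a minimal channel-pipeline sketch (produce and toLower are assumed names, not the pipelines benchmarked later in the talk): one goroutine produces strings, another lowers them, and the two communicate only through channels, with no locks or shared state:

    package main

    import (
        "fmt"
        "strings"
    )

    // produce emits the given strings on a channel and closes it when done.
    func produce(words []string) <-chan string {
        out := make(chan string)
        go func() {
            defer close(out)
            for _, w := range words {
                out <- w
            }
        }()
        return out
    }

    // toLower is a pipeline stage: it reads from in, lowers each string,
    // and forwards the result downstream.
    func toLower(in <-chan string) <-chan string {
        out := make(chan string)
        go func() {
            defer close(out)
            for w := range in {
                out <- strings.ToLower(w)
            }
        }()
        return out
    }

    func main() {
        for w := range toLower(produce([]string{"Hello", "Conf42", "Gophers"})) {
            fmt.Println(w)
        }
    }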
And the memory overhead is reduced when context switching happens, and that's because of the small state we have to save for each goroutine.

So this was a sort of introductory topic for the new gophers. But now let's assume that you know these concepts very well and you have started taking all the goodness of Go for your applications. So now, how can you measure the performance of your Go applications in a systematic way? Before starting to explain how to benchmark the applications, I want to state some preconditions. The precondition is that every time we run our benchmarks, we want to always keep the same environment; we don't want to be affected by the external environment. Another thing that is crucial to do is to isolate the code that is being benchmarked from the rest of the program.

So how to write a benchmark? This is a practical slide. We create a _test.go file where we put all the benchmark functions, and a benchmark function has a specific signature: we have to specify the b parameter, which has type *testing.B. And as we will see later on, b.N represents the number of iterations that the Go runtime dynamically decides for your function to run. You don't have to touch that variable at all; it's totally up to the Go runtime to decide how many times to iterate your benchmark.

So let's take two functions. Here we have two pipelines, and these two pipelines have the same structure: there is a producer that is producing some strings, then there are some stages where basically we take all the strings and we lower them, then we merge the results from all the goroutines and send them concurrently into the next stage, which basically takes each string, capitalizes the first character, and then merges all the results. So they have the same structure, but they are different under the hood.

We want to benchmark these functions, and we can now do a smart thing, since we want to use a tool called benchstat. What benchstat does is compare two benchmark results, but in order to compare these benchmarks, the benchmark function must be the same. So a cool strategy could be to first create your benchmark calling the first function, for example run pipeline one, produce the benchmark output as we do in the second image, and save that output as the "before" bench file. Then we can go back to the same benchmark function and change the inner function: instead of run pipeline one, we run run pipeline two. The result is in the code snippet number one. So now we have two benchmarks.

But before doing the benchstat comparison, we can open one of them, for example the before bench file. So how do we read a benchmark? We have four values. And let me switch back, because an important thing is that we have to look at those flags here. The -run=X flag means the following: the test engine sees the run pipeline functions referenced in the benchmark functions, so it would also run all the tests related to those functions, and I'm lazy, I don't want to do that. So with -run=X you avoid that.
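For reference, a benchmark with that signature might look as follows (a minimal sketch: the package name and runPipelineOne are placeholder names standing in for the pipelines of the talk, with a stub body so the snippet compiles):

    package pipeline

    import (
        "strings"
        "testing"
    )

    // runPipelineOne is a stub standing in for the first pipeline of the talk.
    func runPipelineOne() []string {
        return []string{strings.ToLower("Hello"), strings.ToLower("Conf42")}
    }

    // The Go runtime chooses b.N; the benchmark body just repeats the
    // function under test that many times.
    func BenchmarkRunPipeline(b *testing.B) {
        for i := 0; i < b.N; i++ {
            runPipelineOne() // swap in the second pipeline for the "after" run
        }
    }

A typical workflow is then to run something like go test -bench=RunPipeline -run=X -benchmem, save the output to a "before" file, swap in the second pipeline, save an "after" file, and feed both files to benchstat.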
Another important flag is -benchmem, which allows you to keep track of the memory used by your functions. Now we can go ahead. With that command we get this result. We can see in the red circle the number of iterations, so the b.N that we mentioned before. The blue circle is one of the most important, because it tells you on average how fast your function is in terms of nanoseconds per operation, then the number of bytes per operation, and the number of allocations per operation. The last two results come from the fact that we used the -benchmem flag.

So let's go ahead and use benchstat, which is a really great tool to compare two benchmarks. As we can see, the before bench was running run pipeline one, and the after bench was running run pipeline two. We can see three rows: the first row is the speed, the second row is how much memory your function used, and the third row is how many allocations your function did during the benchmark. And we can see that run pipeline two is faster by 40%, it wastes less memory, around 40% less, and it allocates less than run pipeline one does, almost 86% less. So this is kind of a systematic way to measure your applications.

But now maybe you are wondering why the run pipeline one function is slow, since the structure is the same. Here comes profiling. Profiling is the process of keeping track of all the inner functions that your main function runs, and it allows you to track the CPU and memory usage of all instructions. For doing that we use pprof. pprof is a tool for visualizing profiling data; it's available for free, of course, as a Go tool, and it's based on protocol buffers. In order to exploit pprof, we now want to add two more flags, the CPU profile and the memory profile flags, and we load the results of the profiling, one for the CPU and one for the memory, into two different .prof files, which are basically protobuf files.

What if we run pprof? As we can see in the line above, there is go tool pprof with the CPU profile file of pipeline one. Now, if we go back, unlike before, there is no "1" at the end of the -bench flag. That means that all the benchmark functions starting with that prefix are run. In that case I basically ran the benchmark for pipeline one and, separately, the benchmark for pipeline two, so that we can compare the two functions using the profiling.

Using this command, I'm inside pprof and I can run the top 100 command. The top 100 command gives me the 100 most CPU-expensive functions. I can see that, for example, there is a sleep, and we have the strings-lowering function. Another important thing is that in the top list there are basically only stages that come from the run pipeline one function, and we're not seeing any function from run pipeline two. That's because, of course, as we saw before with the benchmark, run pipeline one actually runs slower, and this list is in descending order of how much time has been spent. If we scroll down instead, we can see our fast stages. So we are a bit suspicious here. What we can do is dive into the code of run pipeline one. So we run pprof again and we can use the list command of pprof, where you can basically say which function you want to analyze. In this case, I want to analyze the first stage of run pipeline one, which is the transform-to-lower stage.
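As a hedged illustration of what such a slow stage tends to look like next to the standard-library approach (the talk's actual stage code is not reproduced here; slowLower and fastLower are assumed names):

    package pipeline

    import (
        "strings"
        "unicode"
    )

    // slowLower lowers the string rune by rune and grows the result by
    // string concatenation, which re-allocates on almost every iteration.
    func slowLower(s string) string {
        out := ""
        for _, r := range s {
            out += string(unicode.ToLower(r))
        }
        return out
    }

    // fastLower delegates to the standard library, which converts the
    // whole string at once with far fewer allocations.
    func fastLower(s string) string {
        return strings.ToLower(s)
    }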
And if I go inside, leaving aside the code at the top, we have to focus on the red box, and it seems that we are losing time here, because we see that we are doing a loop where we are lowering char by char and then returning the result. This is kind of a problem, because we already know that there is a function coming from the strings module that lowers our strings in a faster way. So let's analyze the faster one. As we can see, we have the strings-lowering function that comes from the module, and it takes 26 seconds. So overall we saved almost 5 seconds, because here we have to take into account this amount of time, so 17 seconds, but we also have to take into account the time that has been spent to send the data to the channel, which is almost 13 seconds. So here's why there's the difference: it's 13 plus 17, rather than 26 overall. We found out why our function is slower.

So far we analyzed the CPU profile, but we also produced a memory profile, and as you can see, in the slower function we have a waste of memory and an enormous quantity of allocations. This is what I wanted to show you: a systematic way to find out where your program is slow and where your program is fast.

But before concluding, a suggestion: to be completely accurate, any benchmark should be careful to avoid any kind of optimization that the compiler does. For example, if we don't save the result of the run pipeline, sometimes it happens that the compiler eliminates the function under test, and that artificially lowers the runtime of the benchmark, and we don't want that. When utilizing the profiler, it's important to consider that it samples both CPU usage and memory at a specified frequency. However, this sampling may not always be 100% representative, particularly if the sample time is very low. So to enhance accuracy, I recommend increasing the benchtime parameter accordingly. In this way you allow the profiler to collect more samples over an extended period, and then we can obtain more precise insights into how our application performs.

So, of course, try to design your applications as a pipeline of goroutines. Always keep track of the memory usage, as allocations can actually cause garbage collection to run, and that means we are potentially wasting time. And of course, as I said in the preconditions, try to execute benchmarks on a stable machine, without spikes during the test.

Here I left some study references that I really found interesting: how goroutines work, how the memory model and the scheduling work, and how you can use benchmarking and profiling to improve your functions' performance. So that's it. Thank you for your attention. Hopefully this was an important moment for you to learn something more about Go. I'm really excited to hear from you in the comments, so that I can improve my speaking skills, as this is my first time speaking in public. So, really speaking from the bottom of my heart, thanks. And I hope that this is the first of many talks. So thank you guys and have a great day.
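Picking up the earlier suggestion about compiler optimizations, one common way to keep the compiler from eliminating the function under test is to store its result in a package-level variable (a minimal sketch; runPipelineOne and the sink variable are assumed names, not code from the talk):

    package pipeline

    import "testing"

    // runPipelineOne is a stub standing in for the pipeline under test.
    func runPipelineOne() []string { return []string{"hello", "conf42"} }

    // Writing the result to a package-level sink keeps the call observable,
    // so the compiler cannot drop the work and deflate the measurement.
    var sink []string

    func BenchmarkRunPipelineKept(b *testing.B) {
        for i := 0; i < b.N; i++ {
            sink = runPipelineOne()
        }
    }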
...

Marco Marino

Software Engineer - Core Automation @ ION



