Conf42 Golang 2023 - Online

Heap Optimizations for Go Systems


Abstract

Go programs are susceptible to severe performance regressions at large scale due to garbage collection (GC), resulting in degraded user experience. Learn how the Go GC works, and how to lower its impact on your program's performance!

Summary

  • Nishant Roy is the engineering manager for the ad serving platform team at Pinterest. He will talk about heap optimizations for Go systems. After this session, you should have a good idea of how to triage whether your application is being plagued by memory issues.
  • Go does not require users to perform any manual memory management. The garbage collector is able to run concurrently with your main program without using a stop-the-world pause. As memory pressure starts to increase, the garbage collector suddenly needs a lot more resources, which can really hinder the performance of your program itself.
  • Typically, if garbage collection is the reason for your application's performance suffering, you'll see really high tail latency. The next step is to confirm your hypothesis. Go has quite a few built-in tools to study heap usage.
  • The second package discussed is pprof. It allows us to visualize several different system profiles: CPU usage, memory usage, heap, et cetera. pprof puts you in an interactive command line tool to start visualizing this data.
  • Lower the number of objects in your heap. This will reduce the amount of time it takes the garbage collector to scan your heap and therefore lower its impact. The second approach is to reduce the rate of object allocation. The third is to optimize your data structures to minimize how much memory they use.
  • The third point is about how we organize and represent our data to reduce the amount of memory it uses. One way to do this is to clean up any unused data fields. In the example, removing those fields takes the struct from 64 bytes to 40 bytes.
  • The Go memory allocator does not optimize for data structure alignment. If we simply reorder the fields, as in the good object on the right, the memory allocation is much better aligned. This simple method can really reduce your memory usage and improve your system's performance.
  • I hope this helped you understand how garbage collection works in Go, and how you can go about optimizing your system to minimize the impact of the garbage collector. If you have any questions, feel free to reach out. Have a great day.

Transcript

This transcript was autogenerated.
Hi everyone. My name is Nishant Roy and I'm excited to be here today at Conf42 Golang 2023 to talk to you about heap optimizations for Go systems. After this session, you should have a good idea of how to triage whether your application is being plagued by memory issues, how to track down hotspots in your code, and how to go about optimizing your application's performance. Before we dive in, here's a little about myself. I'm the engineering manager for the ad serving platform team at Pinterest, and our team owns multiple critical systems that help power Pinterest's over $2 billion a year ad delivery business. Our central ad serving platform itself is implemented in Go and has really high performance requirements, which is why we spend a lot of time thinking about how to scale our systems efficiently. One of the areas in particular that we've spent a lot of time on is taming the impact of the Go garbage collector to improve our system's performance, so I'm here to talk about what I've learned from that experience.

So let's start with a really quick intro to memory management and how it works in Go. Memory management at a high level refers to allocating memory for an application upon request and then releasing it for use by other applications once it's no longer needed. The great part about Go is that it does not require users to perform any manual memory management, so users do not need to manually allocate and clear memory. Both of these functionalities are abstracted away from them, and the benefit of this is that it minimizes the chance of memory leaks.

The Go garbage collector, in order to run, basically has a threshold: every time the heap hits a certain target size, which by default is whenever the heap grows by 100% since the last time the garbage collector ran, the garbage collector is going to run one more time. This setting is configurable through a config flag, and there are more config flags that have been rolled out in recent versions to make this tunable at a more granular level.
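That trigger threshold is the GOGC setting. As a minimal, hedged sketch (not code from the talk), it can be set either as an environment variable when launching the binary or at runtime via the standard runtime/debug package; the newer, more granular knobs include the soft memory limit the talk comes back to later.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// GOGC defaults to 100: a GC cycle is triggered roughly every time the
	// live heap grows by 100% since the previous cycle. Raising it trades
	// higher memory usage for fewer GC cycles; this call is equivalent to
	// launching the process with GOGC=200.
	previous := debug.SetGCPercent(200)
	fmt.Println("previous GOGC value:", previous)
}
```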
So the Go garbage collector uses what is known as a tri-color algorithm for marking objects, which means it divides objects into three different sets. Objects that are marked white are collectible, since that means they're not in use in memory. Objects marked black are not collectible, since they are definitely in use in memory. And then objects that are marked gray, which is the third color, may be collectible, but that hasn't been determined yet. By using this tri-color algorithm, the Go garbage collector is able to run concurrently with your main program without using a stop-the-world pause, as some other languages like Java famously used to, which minimizes the impact of garbage collection on your main program itself.

So then the question is, how does garbage collection actually impact your application's performance? The Go garbage collector aims to use no more than 25% of the available CPU resources, which ideally minimizes the impact on your program's performance, latency, et cetera. However, as memory pressure starts to increase, which means the heap size is really large, the garbage collector suddenly needs a lot more CPU resources, so it starts to steal resources from your main program, which can then really start to hinder the performance of your program itself.

For instance, if the rate of memory allocation is really high, the Go garbage collector is going to start stealing goroutines, or threads, from your main program to assist with the marking phase, in order to quickly and efficiently scan all the objects in the heap and determine what can be cleared up. This does two things. Firstly, it ensures that the rate of memory allocation is not greater than the rate of memory cleanup, preventing the heap from growing to be very large. Secondly, it slows down your main program itself, which also reduces the rate of memory growth.

So what causes GC to actually run slower? What does memory pressure mean? In order to determine what memory is ready to be cleaned up, the garbage collector needs to scan every single object in the heap to see if it is still in use or not. So as the number of objects in the heap grows, so does the amount of time spent scanning the entire heap.

Then the next question is, what is actually on the heap in the first place? The heap is one of two areas that a computer system uses for memory allocation. The first one is known as the stack, which is a special area of the computer's memory that stores any temporary variables or memory allocations created by a function or method. Since each function's stack is cleared once it's done executing, if the variables within that function were not moved elsewhere, we would have no way of accessing those variables later on. So that's where the heap comes in. The heap is a more free-floating memory region used to store global variables or variables that are referenced outside the scope of a function, shared between functions, between packages, et cetera.

So how does Go determine what needs to go on the heap? There's a process called escape analysis, which is beyond the scope of this talk, but at a high level the way you can think about it is: if an object is only referenced within the scope of a certain function call, then we can allocate it on the stack just for that function. The stack will be cleared once that function is complete, and we'll lose that object forever, so we don't need to worry about scanning it and cleaning it up later. But if an object is accessed outside that function, then it needs to be allocated on the heap in order for it to be accessible later on. That is the essence of escape analysis.
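You can ask the compiler to show its escape-analysis decisions. Here's a toy example (not from the talk); building it with `go build -gcflags=-m` prints which allocations stay on the stack and which escape to the heap:

```go
// Build with: go build -gcflags=-m
// The compiler prints its escape-analysis decisions for each allocation.
package main

import "fmt"

type person struct {
	name string
	age  int
}

// onStack only uses the person inside the function, so the compiler can keep
// it on the stack; it disappears when the function returns and the GC never
// has to scan it.
func onStack() int {
	p := person{name: "ada", age: 36}
	return p.age
}

// escapes returns a pointer that outlives the call, so the compiler reports
// that the person escapes to the heap, and the GC has to track it.
func escapes() *person {
	return &person{name: "ada", age: 36}
}

func main() {
	fmt.Println(onStack(), escapes().name)
}
```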
So then, how does one go about determining if garbage collection is actually the problem for your application? Typically the way this conversation starts is you see that your application is suffering from really high latency. So that's your symptom, that's what you observe. Intuition is really the first step towards figuring out if GC is the problem. Typically, if garbage collection is the reason for your application's performance suffering, you'll see really high tail latency. What that means is that a small percentage of requests to a system result in really slow responses. Again, I'm talking about large-scale distributed systems with really high volumes of traffic, enough to get a decent percentile breakdown of latency, which is what Pinterest systems are like, of course. We often talk about latency as percentiles, so high tail latency here might refer to really high values for p99 latency, or even p90 latency.

Typically for GC, what we've seen is that the p99 latency is what really gets affected, because the garbage collector runs infrequently enough that it only really affects that last 1% of requests. So if you're also observing really high p99 latency in your systems, then there's a good chance that garbage collection pressure could be the root cause, especially if you already know that your program has pretty high memory usage, which you can tell by observing various system metrics: how much memory is being used on the host that is running your application, et cetera.

The next step is to confirm your hypothesis. You can use a runtime environment variable that Go makes available, called GODEBUG. By setting GODEBUG=gctrace=1, as you can see on the slide here, you'll force your program to output debug logs for every single GC cycle, and this will also include a detailed printout of the time spent in the various phases of garbage collection. And then the last step is to take what you measured and align it with your system metrics. The way we did this was we looked at the logs from gctrace, and if we noticed that spikes in latency aligned with when the GC cycles were occurring, that's a great way to conclude that there's a good chance that GC is the cause of your performance regression.

So here's an example of what gctrace output looks like, with an explanation and a detailed breakdown of every single component in there. Credit to Ardan Labs here; if you want to find the blog post, you can just look up "gctrace Ardan Labs", which is how I found this screenshot. Taking a quick look at this, we see that gctrace gives us a lot of information. It shows us how many GC cycles we've had so far since our application started, how much of our program's total CPU has been spent on garbage collection, how much wall-clock and CPU time was spent in the various phases of GC, what our memory usage looks like before and after garbage collection runs, et cetera. I'm not going to go too deep into these aspects, but check out the blog post if you're looking for a detailed breakdown of all of these GC components. What I found helpful is really just to let gctrace run in the background. I also added a separate background thread to print out certain key system metrics, things like the p90 and p99 latency observed over a 30 second to 1 minute window, print these out at a regular interval, and look for correlations between GC cycles occurring and latency degradations.
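The talk doesn't show the code for that background logger, but a minimal sketch of the idea might look like the following; all names here are hypothetical. The point is simply to run the process with GODEBUG=gctrace=1 and line the gctrace lines up against these periodic percentile printouts:

```go
package main

import (
	"log"
	"math/rand"
	"sort"
	"sync"
	"time"
)

// latencyRecorder is a hypothetical helper in the spirit of the background
// thread described above: it collects request latencies and prints p90/p99 on
// a fixed interval, so the output can be correlated with gctrace lines when
// the binary is run with GODEBUG=gctrace=1.
type latencyRecorder struct {
	mu      sync.Mutex
	samples []time.Duration
}

func (r *latencyRecorder) Observe(d time.Duration) {
	r.mu.Lock()
	r.samples = append(r.samples, d)
	r.mu.Unlock()
}

func (r *latencyRecorder) logEvery(interval time.Duration) {
	for range time.Tick(interval) {
		r.mu.Lock()
		if len(r.samples) == 0 {
			r.mu.Unlock()
			continue
		}
		sort.Slice(r.samples, func(i, j int) bool { return r.samples[i] < r.samples[j] })
		p90 := r.samples[len(r.samples)*90/100]
		p99 := r.samples[len(r.samples)*99/100]
		r.samples = r.samples[:0] // reset the window after each snapshot
		r.mu.Unlock()
		log.Printf("latency p90=%v p99=%v", p90, p99)
	}
}

func main() {
	rec := &latencyRecorder{}
	go rec.logEvery(30 * time.Second)

	// Simulated request loop; in a real service, Observe would be called from
	// the request handler with the measured handling time.
	for {
		rec.Observe(time.Duration(rand.Intn(50)) * time.Millisecond)
		time.Sleep(time.Millisecond)
	}
}
```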
So let's assume now that we have a reasonable amount of confidence that garbage collection is the root cause of our application's poor performance. How do we then go about profiling our heap usage? Go has quite a few built-in tools to study heap usage, and I'm going to talk about the two main ones that I found really helpful. The first one is the MemStats API, and the second one is the pprof package. MemStats is essentially built into the Go runtime and provides you with statistics about the memory allocator itself: things like how much memory has been allocated, how much memory has been requested from the system, how much memory has been freed, GC metrics, et cetera. I'll dive into that a little bit more in a second. The second one is pprof, which is a system profile visualizer, and we'll talk about that in a little bit more detail as well. Both are really helpful for understanding how your application is managing memory and for visually inspecting your system's CPU usage, heap usage, et cetera.

So here's just a really short glimpse into what MemStats gives you. These are some stats that I found helpful. Like I said, it essentially exposes these stats about the system's memory usage, garbage collector performance, et cetera, so we can use it to monitor a few different things. What I found helpful is to monitor the total number of objects in the heap. We discussed this earlier, but as the number of objects in the heap increases, it takes much longer for the garbage collector to scan and mark the entire heap. So if we notice this metric going up, there's a good chance that GC pressure is going to increase. Similarly, if that metric is going down, we made some good optimizations and the impact of GC should be decreasing. I used this metric as one of my indicators of success: as I rolled out new optimizations, this metric dropped, and I noticed that the system's performance started to improve.

The MemStats docs provide a really clear explanation of all the various statistics; I think there are close to 20. These are the three that I used. HeapObjects is the number of allocated heap objects. HeapAlloc is the actual number of bytes allocated to the heap; this is helpful because it is how the Go runtime determines when to actually trigger GC, which, like we said before, by default happens whenever your heap grows by 100% since the last cycle. And lastly, HeapSys is the total bytes of memory obtained from the OS. Actually requesting memory from the operating system is a slightly heavyweight process because it's essentially blocking, so if you see that this number is also continuously going up, there's a good chance that you're continuously having to request a lot of memory, which is also blocking threads and impacting your system's performance.

I don't have slides on this, but one cool feature that Go has rolled out since I originally made these slides is another runtime flag, which allows you to set a soft memory limit. So rather than the default behavior of GC triggering whenever your heap grows by 100%, you can set a target saying: only trigger GC when my heap size hits X megabytes or X gigabytes, whatever it is, which lowers the number of times GC needs to run and therefore lowers the impact of GC on your application's performance. That's one way to go about it, and it can be a quick and dirty way to tame the impact. However, some of the steps we'll talk about here will really help you tune your actual heap usage itself, which is, one, a good practice, and two, likely to give you more consistent and perhaps more significant wins as well.

So here's a quick program that I put together on how to use MemStats. I wrote this little method, on the right here, to read MemStats however frequently you need it and print out the number of heap objects allocated, the number of bytes allocated to the heap, et cetera, as well as the number of GC cycles that have been triggered, since this can be really helpful to see how frequently GC is getting triggered. The example here essentially allocates an array of int slices, and I'll show you in the next slide how the stats change as it runs.
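The slide itself isn't reproduced in this transcript, but a minimal sketch of that kind of helper, with assumed names, might look like this. As an aside, the soft memory limit mentioned above is available from Go 1.19 onward via the GOMEMLIMIT environment variable or runtime/debug.SetMemoryLimit.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// printMemStats samples runtime.MemStats and prints the fields discussed
// above: heap objects, heap bytes, bytes obtained from the OS, and the number
// of completed GC cycles.
func printMemStats() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("HeapObjects=%d HeapAlloc=%d HeapSys=%d NumGC=%d\n",
		m.HeapObjects, m.HeapAlloc, m.HeapSys, m.NumGC)
}

func main() {
	// Allocate an array of int slices, roughly mirroring the example in the
	// talk, and watch the stats change as the heap grows and GC runs.
	data := make([][]int, 0, 1024)
	for i := 0; i < 1024; i++ {
		data = append(data, make([]int, 10_000))
		if i%256 == 0 {
			printMemStats()
		}
	}
	_ = data
	printMemStats()
	time.Sleep(time.Second) // give any in-flight GC a moment, then sample again
	printMemStats()
}
```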
You can essentially see how the number of heap objects and heap-allocated bytes changes, as well as how the GC counter increments. So here's what we got when we ran it. You can see that the heap objects drop whenever we run GC, which is basically the penultimate line in this slide; otherwise, heap objects continue to increase. You can see that on the last line, NumGC incremented to one, and that's where heap objects dropped. It's a clear indicator that things worked as expected. You can also see that HeapAlloc dropped very significantly, to almost ten or eleven percent of what it used to be. So GC did its job, and we freed up a lot of space on the heap. This is a really simple program, but you can use something very similar to understand the memory behavior of even more complex systems. So this is how MemStats can be really helpful.

The second package that I talked about is pprof. It's a built-in package as well, and it allows us to visualize several different system profiles: CPU usage, memory usage, heap, et cetera. Here we're going to talk specifically about the heap profile. The tool comes with a bunch of options to investigate specific aspects of the heap, and those are the ones listed here. If you were concerned about out-of-memory issues, you might be interested in inspecting the actual amount of memory used rather than the number of objects, for instance, so you can use the right option accordingly. In our case, we know that GC pressure is what we're investigating, and it's tied very closely to the number of objects in the heap, so the inuse_objects and alloc_objects options are more useful to us here.

So the first command shown here is go tool pprof: input your options, then pass in the URL of wherever your application is running along with the endpoint that you want to hit, which is the debug/pprof heap endpoint. pprof is going to essentially download that profile data to your machine and put you in an interactive command line tool to start visualizing this data, and it's really helpful. One thing I forgot to mention is that in order to generate this profile, you do need to register this HTTP endpoint upon application startup. I don't have a slide for that either, but you can quickly look up the pprof docs on Go's main doc site, and it's essentially one line to register this HTTP endpoint and generate your heap profiles. So like I said, when you run this, it'll put you in a command line interface to start playing around with the data. You can run help in that command line interface and it'll show you all the available options to slice and dice the data. What I really like is to run a second command, the last one shown here, which is go tool pprof again: pass in the port that you want to run the web UI on, and then the path to the actual profile data itself, and it'll open up an interactive web view, which I find much easier and more helpful for inspecting heap usage.
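For reference, the one-line registration mentioned above is a blank import of net/http/pprof plus an HTTP listener; here's a minimal hedged sketch, where the port numbers and paths are only examples:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Serve the profiling endpoints; in a real service this usually listens
	// on an internal-only port alongside the main application.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the rest of the application would run here ...
	select {}
}

// With the endpoint registered, commands along these lines fetch a heap
// profile into the interactive CLI, and serve the web UI for a saved profile:
//
//	go tool pprof -inuse_objects http://localhost:6060/debug/pprof/heap
//	go tool pprof -http=:8080 /path/to/saved/profile
```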
So to jump ahead and show you what that looks like, here is one of the visualizations that pprof gives you. It lets you see the number of objects in use by various call stacks, which can be really helpful in narrowing down problematic code. Here it's showing you the entire call stack, and the size of each box is roughly proportional to the number of objects it allocates, so it really helps you narrow things down. In this case, you can see that bufio.NewReaderSize accounts for about 45% of our heap allocations, so we can conclude that it is one of the reasons for the number of objects in our heap being so high. Then we can trace through that stack and try to figure out what we can do to optimize this. Some options are not creating a new reader every single time we need to use one, perhaps reusing one, pooling them, et cetera.

This is another visualization that pprof offers, and one that I actually use really heavily. It lets you visualize heap usage as a flame graph, and this flame graph is also interactive, so you can click on any bar to focus in on it and the call stack below it. The depth of the call stack doesn't really matter here, but the width is what represents the number of heap objects that are allocated. So essentially, the wider call stacks were using a higher number of heap objects, at least when this profile was captured. This makes it really easy to jump to certain hotspots and dig deeper to find the lowest-hanging fruit and the biggest possible optimizations.

I'm also going to show you what the CLI can be used for. From the previous slide, we can try to figure out which method or call stack is allocating a large number of objects. Then, through the CLI, you can use the list command, which is really cool: pass in a function name and see, line by line, which lines of that method are allocating how many objects. In this one, this is a fake method, but let's say we have a method called createCatalogMap that is essentially creating this map of products that a particular seller has. We know that this method creates a large number of objects, so we can go in and see, line by line, exactly how many objects are allocated by each line in the method, and figure out where to focus our efforts. Here you can see that lines 233 through 237 create a lot of new objects, which results in a large number of heap allocations. And then line 241, surprisingly, is not actually creating new objects, but it's adding all those objects to a map, which is also causing a large number of heap allocations. So that looks a little suspicious; we'll come back to that in a second.

Let's first talk about how to lower, or limit, the impact of garbage collection on your system. The first one we've been talking about for a while: lower the number of objects in your heap. This is going to reduce the amount of time it takes the garbage collector to scan your heap and therefore lower its impact. The second one is to reduce the rate of object allocation. And the third one is to optimize our data structures to minimize how much memory they use, which will therefore reduce the need for frequent GC triggers. These are three ways to mitigate the impact of garbage collection, make our application more lightweight, and free up more resources for our program to operate efficiently.

So let's dive a little bit into the first one: how do we reduce objects in the heap? Really, the question is, how do you reduce long-living heap objects? These are objects that are essentially living in the heap for a long time, and we expect them to keep living there, which means that every single time the garbage collector runs, it needs to scan these objects, determine that they're still in use and can't be cleaned up, et cetera. So rather than having these objects live on the heap, they can be created as values, rather than references, on demand.
For instance, let's take the Pinterest ad system as an example. Say that every single time we're determining which ads to show a user, we need some data for each item in that user request; so every potential ad candidate has some data associated with it. Rather than pre-computing that data and storing it in a long-lived map, we could just compute it on a per-request basis to reduce the number of objects in the heap. What that is going to do is increase the amount of computation for each average request. However, it is going to reduce the tail latency problem, because you now have a very reliable measure of how much compute is being used per request, and it's easier to optimize a particular request than to optimize a long tail of latency. So that's one way to do it: create your objects on demand rather than storing them in a long-lived map on the heap.

The second and third are very related, but be mindful of where you're using pointers. Go makes it really easy to create and reference pointers. However, if we have a reference to an object and that object itself contains further pointers or references within it, these are all going to be considered individual objects in the heap, even though they may be nested together. The reason for this is, if you think about it: say I have a pointer to some object of type person, and each person has a name, an age, et cetera. If I have a pointer to the person's name and it's referenced somewhere, there's a good chance that the name may be used even after the main person object ceases to exist. So the Go memory allocator needs to store that object separately in memory, which means it's a whole second object that needs to be scanned by the garbage collector later on. Reducing the number of pointers that we use, and the number of nested pointers, is going to reduce the number of objects that your garbage collector needs to scan.

The third one is just sort of a gotcha: strings and binaries are treated as pointers under the hood, so each one is going to be an object in the heap. So wherever possible, try to represent these as non-pointer values. Strings, perhaps, you could represent as integers or floats where possible, hashing them for instance, or representing dates as actual time.Time objects, and so on and so forth. Those are ways to reduce the number of strings you're using, and therefore reduce the number of pointers.

So going back to our example: if we look at lines 233 through 237, we're creating a new catalog listing each time, and then on line 241 we're assigning it to a map, keyed by this catalog listing key, which we create by encoding the product ID and seller ID together. Let's say this catalog listing key is actually a string. If we then change how we're creating the key to use a struct instead (lines 239 to 241 here show that we are starting to use a struct for the key rather than a string, as previously), we can see that we reduce the number of heap objects by 26 million between these slides, which is around 20% of our heap usage. So we didn't actually change that much; we just changed how we're representing the exact same data, and we were able to significantly reduce the amount of work that our garbage collector needs to do.
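The before/after code from the slides isn't reproduced in the transcript, but a hedged sketch of the same idea, with stand-in type and field names, looks something like this. Map keys only need to be comparable, so a small struct of value fields works directly and avoids allocating a string per entry:

```go
package main

import "fmt"

// CatalogListing, productID, and sellerID are stand-ins; the real types and
// field names from the talk aren't shown here.
type CatalogListing struct {
	Price int64
}

// Before: string keys. Each key is a heap-allocated string, so a map with
// millions of entries adds millions of extra objects for the GC to scan.
type stringKeyedCatalog map[string]CatalogListing

// After: a small comparable struct key carries the same information with no
// per-key heap allocation.
type catalogKey struct {
	ProductID uint64
	SellerID  uint64
}
type structKeyedCatalog map[catalogKey]CatalogListing

func main() {
	listing := CatalogListing{Price: 1999}
	productID, sellerID := uint64(42), uint64(7)

	bad := stringKeyedCatalog{}
	bad[fmt.Sprintf("%d:%d", productID, sellerID)] = listing

	good := structKeyedCatalog{}
	good[catalogKey{ProductID: productID, SellerID: sellerID}] = listing

	fmt.Println(len(bad), len(good))
}
```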
So here's one example of how a simple thing like removing strings can actually have a very significant impact on your application's heap usage, and therefore its performance.

The other thing you can think about is reducing the rate of allocation. If your program tends to create a large number of short-lived objects in bursts, object pooling is something that might benefit you, because object pools can essentially be used to allocate and free memory blocks manually and reduce the amount of work that your garbage collector needs to do. Because object pools are expected to be retained for a longer scope, we don't need to keep allocating and clearing up these objects, and the GC doesn't scan them over and over again. However, I will put out a warning here: because the garbage collector is not going to scan and clear up your object pool for you, it can lead to memory leaks if not used properly. So I'd only recommend using this if you know what you're doing and if you've exhausted all other options. For instance, if you're continuously allocating new objects rather than reusing objects from the pool, this could lead to a memory leak and cause your application to crash due to out-of-memory errors. A second potential problem is that if you're not properly sanitizing your objects before returning them to the pool, data may persist beyond its intended scope and could potentially leak into other scopes. If we're storing some sensitive, personally identifiable information on a per-request basis for each user, and we're using pools to represent the user object, then if we don't sanitize that data, there's a chance we could leak data from one user's profile to another user's, which would obviously have really disastrous consequences, not only in terms of our application itself, but in terms of the user's privacy. So those are the risks of object pooling, but it can be a really powerful tool to reduce the amount of work that your garbage collector needs to do and give you some more control over memory management yourself.

The third thing that we talked about is thinking about how we organize and represent our data, to reduce the amount of memory that it's using. One way to do this is to clean up any unused data fields. Basic types in Go have default values: for example, a boolean defaults to false, an integer defaults to zero, et cetera. So even if you're not using these fields, the Go memory allocator still needs to allocate space on the heap for them, and they're therefore consuming memory. Fields i through l here are unused, but they're still taking on their default values. If we remove those, we go from 64 bytes to 40 bytes, which is a pretty significant win if you think about the number of objects you might be storing on the heap in a very large scale application. The other side benefit is that you're simplifying your code, making it easier to understand, and reducing the errors that might come up from someone misunderstanding what a field is for in the future.

This one may be a little familiar to folks coming from a C/C++ background, but the ordering of your fields can actually really impact your memory usage as well. The Go memory allocator does not optimize for data structure alignment. So in this case we have two objects with completely identical fields; they're just ordered differently.
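These aren't the exact structs from the slides, so the byte counts differ from the talk's 64/40/24 numbers, but here's a small runnable sketch of the same two effects, trimming unused fields and ordering the rest largest-first, that you can verify with unsafe.Sizeof:

```go
package main

import (
	"fmt"
	"unsafe"
)

// BadLayout interleaves small and large fields, forcing the allocator to add
// padding for alignment: 32 bytes on a typical 64-bit platform.
type BadLayout struct {
	A bool  // 1 byte + 7 bytes padding before B
	B int64 // 8 bytes
	C bool  // 1 byte + 3 bytes padding before D
	D int32 // 4 bytes
	E bool  // 1 byte + 7 bytes trailing padding
}

// GoodLayout holds the same fields ordered from largest to smallest, so the
// same data fits in 16 bytes.
type GoodLayout struct {
	B int64
	D int32
	A bool
	C bool
	E bool // followed by 1 byte of trailing padding
}

func main() {
	// Prints 32 and 16 on a typical 64-bit platform.
	fmt.Println(unsafe.Sizeof(BadLayout{}), unsafe.Sizeof(GoodLayout{}))
}
```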
The way the memory allocator works is that it goes down the fields and allocates them one at a time, and in order to respect word alignment, it might need to add padding to the data in memory. So going through the bad object here, starting with field a: it's a boolean, which is one byte, so the allocator allocates one byte in memory. Then it needs to allocate eight bytes for field b, which is an int64. Now, if it allocated those eight bytes right after, it would break the system's word alignment, so it needs to pad seven bytes first, and then allocate the next eight bytes for field b. And you can see this goes on: for field c, it allocates one byte, and then field d is an int32, which means it needs four bytes, so it pads three bytes and then adds in field d, and so on and so forth. If we simply reorder these, as we did in the good object on the right, you can see the memory allocation is much better aligned, and we went from an object that consumes 40 bytes to an object that consumes 24 bytes. So we did two things here: we removed unused fields, which is great, and then we rearranged the remaining fields that we actually need, and we went from 64 bytes to 24 bytes, which is a roughly 62% drop in the amount of memory used per object. Think about, again, a large-scale system with thousands, millions, or even billions of such objects in use. This simple method can really reduce your memory usage and improve your system's performance.

So, to conclude: the Go garbage collector is highly optimized for most use cases. It's a fantastic piece of technology, and most developers do not need to worry about how it's implemented or about its performance. However, for some heavy, very large scale use cases, the garbage collector can have a pretty significant impact on your program's performance. In that case, having an understanding of how the GC works, how memory management works, and then understanding some of the built-in tools that the Go team provides, can be really important for understanding and reducing the problem. From there, we have a lot of options to actually optimize our system, improve performance, and have much happier users and much happier engineers.

So, three steps. Start with observing: we have some ways of knowing intuitively that there are certain symptoms, like really high tail latency, that might be caused by GC. From there, we go in and add some measurement: we can look at heap usage, we can look at gctrace output, et cetera, to try to narrow down whether GC actually is the problem. And then from there, we talked about a few different ways in which we can start to optimize our system.

That's all I have for you today. Thank you all for listening. I hope this helped you understand how garbage collection works in Go, and how you can go about optimizing your system to minimize the impact of the garbage collector. Thank you, and if you have any questions, feel free to reach out to me. Have a great day.

Nishant Roy

Engineering Manager @ Pinterest



