Conf42 Enterprise Software 2021 - Online

Demystifying Garbage Collection in Java

Abstract

Are you curious to know what's happening under the hood of a Java application in terms of memory management? How objects are managed (allocated and deallocated), and how Java abstracts memory management away, giving the illusion that no explicit allocator is needed? Well, this talk is for you.

Summary

  • I'm going to start with a garbage collection overview, going slowly at first and then into more detail with the OpenJ9 GC algorithms. I'll finish off with an advanced technique called double mapping.
  • Garbage collection is a form of automatic memory management: the garbage collector attempts to reclaim memory occupied by objects that are no longer used by the application. In the optthruput policy, all GC operations are stop-the-world and completed in parallel, meaning by multiple threads.
  • Instead of being stop-the-world, the scavenger is now concurrent. Gencon concurrent scavenger is an improvement upon Gencon: the bulk of the work is done concurrently with the application. That creates a problem, because GC threads are trying to move objects while mutators are trying to access those same objects.
  • The single stop-the-world pause is divided into little increments, and only 30% of each time window is dedicated to the GC: a little bit of GC work here, here, and here, until it's done.
  • The write barrier mentioned is snapshot-at-the-beginning. It causes a lot of floating garbage because the barrier is performed before the store. Without a barrier, the resulting object graph would lead us to erroneously collect object D.
  • The last GC policy is balanced. PGC stands for partial garbage collection; GMP is a global marking phase that updates PGC's information so that PGC can do its work more effectively. Balanced provides a significant reduction in maximum stop-the-world pause times.
  • Arraylets are a clever way to store large arrays in a region-based GC. An array divided across regions cannot work with APIs that require a contiguous representation of the array; that's where double mapping comes in. What's next? Off-heap, where large objects will be stored in a secondary heap.
  • This is a quick GC policy guide. Different policies have different throughput and different GC pauses. There's no perfect fit, no perfect GC that suits all workloads. If you want a GC with maximum throughput and you don't care about pauses at all, go with a stop-the-world collector.
  • The verbose GC option lets you visualize how your memory behaves under an application with a tool called GCMV. If you're running OpenJ9, you can capture the GC log and visualize it with GCMV. I'll stop here and take any questions.

Transcript

Good morning, good afternoon, good evening everyone. Welcome to Conf42. Today I'm going to be sharing some really cool stuff about garbage collection, so let's get on to it. These are some important disclaimers saying that this presentation is for information purposes only.

This is the rough outline of what I'm going to go through with you today. I'm going to start with a garbage collection overview, start slowly and then go into more detail with the OpenJ9 GC algorithms, and I'm going to finish off with an advanced technique called double mapping. A little bit about me: I'm a software developer here at IBM, at IBM for almost two years now. I graduated from the University of Waterloo with my master's in 2019, and my interests are systems, compilers, machine learning and AI. A fun fact about me is that I'm a tennis addict.

All right. If you want to get the builds for either HotSpot or OpenJ9, I recommend going to AdoptOpenJDK. There you can get both builds and make sure you have the best JVM and garbage collection, which is the main part of this talk. OpenJ9 is a fairly mature project; it went open source in 2017. A fun fact about it is that it started as an embedded JVM, if you're familiar with Java Micro Edition. That's how OpenJ9 started, which is why it has some really good performance when it comes to memory. And it's an open community: we welcome anyone who's interested and wants to contribute to the project.

All right, garbage collection. So what is it? This is a more by-the-book definition: garbage collection is a form of automatic memory management. The garbage collector attempts to reclaim memory occupied by objects that are no longer used by the application. That's opposed to other programming languages like C and C++, where you have to manually allocate memory with new and malloc and free that memory with free and delete. So what are the responsibilities of garbage collection? They are: allocate memory, identify the liveness of data, and reclaim everything else, because everything that isn't live is considered garbage. These are the most important responsibilities of the garbage collector.

The positives of GCs: they are automatic memory management, so they free the developer from the burden of manually dealing with memory, and they help eliminate certain categories of bugs like dangling pointers and double frees. On the other hand, there is the negative part, right? A GC requires additional resources like extra CPU cycles and more memory, it can cause unpredictable pauses, known as stop-the-world pauses, it may introduce runtime costs, and the application has little control over when memory is reclaimed.

Here are some of the usual algorithms. I'm going to be explaining each of these as I go through the presentation. One that I won't be talking about is reference counting, but in simple terms: imagine I'm an object and I have two objects pointing to me. My reference count is going to be two. As soon as those objects delete their references to me, my reference count goes to zero. So I'm pretty much counting how many references I have, and once that count reaches zero, I'm considered garbage, therefore I can be reclaimed. That's reference counting in 30 seconds.
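Since reference counting only gets 30 seconds here, a minimal Java sketch of the idea may help. This is purely illustrative: OpenJ9's collectors are tracing collectors, not reference counters, and every name below is made up for the example.

    // Toy reference counting: each object tracks how many references point at
    // it; when the count drops to zero, it is garbage and can be reclaimed.
    final class RefCounted {
        private int refCount = 0;
        private final Runnable reclaim;   // what "reclaiming" means is up to the caller

        RefCounted(Runnable reclaim) { this.reclaim = reclaim; }

        void addRef() { refCount++; }     // a new reference to me was created

        void release() {                  // a reference to me was deleted
            if (--refCount == 0) {
                reclaim.run();            // count hit zero: I'm garbage, reclaim me
            }
        }
    }

One classic weakness, and a reason tracing GCs dominate in Java, is that objects in a reference cycle never reach a count of zero.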
All right, these are the different OpenJ9 GC policies I'll be going through with you today. The first one is optthruput, a stop-the-world parallel collector. The second is Gencon, a generational copying collector. Gencon CS is an improvement on Gencon, a pauseless collector. Metronome is an incremental, soft real-time collector, and balanced is the region-based collector. Balanced is really close to G1, and Gencon is really close to CMS from HotSpot.

Optthruput. So optthruput is a parallel garbage collector. GC operations are completely stop-the-world pauses, meaning all application threads are paused. It's a mark-sweep collector with an optional compaction phase; I'll explain that in more detail in a little bit. And all GC operations are completed in parallel, meaning by multiple threads. Beyond the heap, the GC has some native memory overhead for the mark map and work packets; these are metadata needed by the GC to collect garbage.

So whenever you start an application, we allocate objects in what we call the heap. Normally it's a contiguous block of memory. It doesn't necessarily have to be, but over 90% of the time it's going to be a contiguous block of memory. So we have a flat heap, but for optthruput we further divide it into two logical spaces, the small object area and the large object area. Most objects are allocated in the small object area.

This is how an application behaves under this policy. The white arrows represent the application threads and the red arrows represent the GC threads. So the application starts, and as soon as we hit an allocation failure, when we deplete the memory of the heap, we start a GC cycle. That's the red arrows there. The GC collects the heap while all application threads are stopped, which is why it's called stop-the-world, and then the application threads can continue from there.

The GC is divided into three phases: mark, sweep, and compact. Marking is where we find all the live objects: we just go through the object graph, finding all the live objects. Then we sweep the dead objects, meaning all the objects that are not marked are dead, therefore we sweep them. And then there's a compaction phase, because the heap might get fragmented. So what does it mean for a heap to get fragmented? Imagine that through the lifetime of an application, after many GC cycles, our heap might get like a Swiss cheese, meaning it has lots of little free spaces, but maybe not enough contiguous free space to allocate an object. Therefore we need to compact the heap. That's what this slide is trying to explain: we take all the live objects, put them in one part of the heap, and now we have all that contiguous memory to work with.

Gencon. If you're familiar with CMS from HotSpot, it's really similar. It's a generational copying collector, where objects have ages. It provides a significant reduction in stop-the-world pause times. It introduces a write barrier for a remembered set, and we have a concurrent global marking phase, where we mark objects concurrently while the application threads are running. So here, instead of dividing the heap into a small object area and a large object area, we divide the heap into nursery and tenure; in other words, nursery is the new space, tenure is the old space. Furthermore, we divide the nursery into allocate and survivor spaces, and all objects are allocated in the allocate space.
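For reference, these policies are selected with OpenJ9's -Xgcpolicy command-line option. A rough cheat sheet follows; the flag spellings are taken from the OpenJ9 documentation, and MyApp is a placeholder for your main class, so double-check against the version you're running:

    java -Xgcpolicy:optthruput MyApp    # parallel stop-the-world collector
    java -Xgcpolicy:gencon MyApp        # generational copying collector (the default)
    java -Xgcpolicy:gencon -Xgc:concurrentScavenge MyApp   # Gencon with concurrent scavenger
    java -Xgcpolicy:metronome MyApp     # incremental soft real-time collector
    java -Xgcpolicy:balanced MyApp      # region-based generational collector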
This is how Gencon behaves, though. Don't panic. Again, the white arrows represent the application threads and the red ones the GC threads. Scavenge here represents the copy collector. So what is a copy collector? A copy collector copies objects from one space to the next; here, it's copying objects from allocate to survivor. So when we hit the limit of the allocate space, we start the scavenge, which is a stop-the-world copy collector, and we start copying objects from allocate to survivor. That's what scavenge is doing. The global mark phase is a concurrent mark phase, and it marks the tenure space, because scavenge only deals with the new space. So how do we deal with the old space? That's what the global mark phase is doing: marking the tenure space concurrently. And at the end, at global end, we sweep the tenure space to collect the garbage in tenure.

Because of this concurrency, we need what we call write barriers. Imagine we are trying to store a reference into an object field, and imagine we have the following object graph. We are in the middle of marking objects while the application is running. The GC has already processed the roots, and it has already processed object A, as you see with the green tick. The GC is in the middle of processing object B and hasn't seen C yet. Then this happens: a mutator thread comes along, deletes the pointer from B to C, and adds a reference from A to C. Then the GC thread comes along and says, okay, B doesn't have any pointers, so I'm done marking. Therefore this is my resulting graph: I found the roots, object A, object B. Object C is not marked, therefore it's considered garbage, so I'm going to collect it erroneously. That's a problem. So how can we remember object C? We do that through write barriers.

Here's how this is implemented. We simply check: if the concurrent GC is active, we remember that object. We put it in a card table, so that at the end of the GC cycle we go through the card table to find the references we might have missed. But we do need an extra barrier, because remember, we have a new space and an old space, and we might have references going from the old space to the new space, and the scavenger is not aware of those. So in order to capture references coming from old space to new space, we need another write barrier, and it looks like this: all we do is check whether A is tenured and C is not, meaning A is in the old space and C is in the new space. Then we also remember A. Since we have two barriers here, we can fold them into the same check to save some CPU cycles. And that's the resulting write barrier for Gencon.
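To make the shape of that barrier concrete, here is a minimal Java sketch of the combined check just described. It is a sketch only: the real barrier lives inside the VM and works on raw heap words, and every helper name here (concurrentMarkActive, isTenured, cardTable) is an illustrative stand-in, not OpenJ9 internals.

    import java.util.Collections;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    final class GenconWriteBarrierSketch {
        static volatile boolean concurrentMarkActive;        // toggled by the GC
        static final Set<Object> cardTable =                 // objects to revisit
                Collections.newSetFromMap(new ConcurrentHashMap<>());

        static boolean isTenured(Object o) { return false; } // stub: ask the heap

        // Perform the store a[field] = value, then apply the combined barrier.
        static void storeField(Object[] a, int field, Object value) {
            a[field] = value;                                // the actual store
            // Remember 'a' if a concurrent mark is running, or if this store
            // created an old-to-new reference the scavenger can't see on its own.
            if (concurrentMarkActive || (isTenured(a) && !isTenured(value))) {
                cardTable.add(a);
            }
        }
    }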
Cool. Gencon concurrent scavenger is an improvement upon Gencon. This is a pauseless collector: if you're familiar with Shenandoah and ZGC, those are also pauseless collectors. There are some differences, but concurrent scavenger is in the same family. Here the generational copy collector introduces a read barrier, because now we're going to be moving objects concurrently, not just marking but also moving objects. The heap structure is the same, right? We have the new space, the old space, allocate, survivor; nothing new there. The biggest difference is the scavenger: instead of being stop-the-world, it's now going to be concurrent. The CS there stands for concurrent scavenger. So the application starts, and instead of having one stop-the-world scavenge, we have two small pauses, one at the beginning and one at the end, for synchronization and marking the roots. And the bulk of the work is done concurrently with the application, which is where you see the yellow arrows, which represent GC threads running concurrently with the application. The global marking phase here is still the same; the only thing is that the copy collector now runs concurrently with the application.

Now we have a problem here, because a GC thread may be trying to move objects while a mutator is trying to access those same objects. So let's step back a little bit and think about the parallel scavenger. The parallel scavenger is a stop-the-world scavenger, so I can have multiple GC threads trying to move the same object, and we can deal with that with an atomic operation, CAS, compare-and-swap. Let's see what happens in this case. We have two GC threads trying to move an object, the green and the brownish thread here. They both try to copy the object, so they race to copy it. In this case the brown one wins the race, copies the object, and installs the forwarding pointer. The green thread goes: okay, I lost, all good, I'll just continue my work and copy some other objects. So far so good.

Now, what if we had a mutator and a GC thread trying to do that? Let's keep the same CAS operation and see what happens. We have the same CAS there, but instead of two GC threads we have a GC thread, the green one, and a mutator thread. Let's assume the GC thread won the race and is copying the object. The mutator thread comes along and says: okay, I lost the race, but I see a forwarding pointer, so I'll follow it to access the object in to-space. But since the GC thread is not done copying the object, the mutator thread is going to access garbage memory, and that's a problem.

So how do we fix that? Like this: we unconditionally copy the object. Both threads copy the object, and then they race to install the forwarding pointer. How does this actually work? Both threads copy the object, so at the beginning both have their own copy. At that point they race to install the forwarding pointer in the from-space copy. In this case the GC thread won the race to install the forwarding pointer. The mutator thread goes: okay, I lost, but no matter, I'll just follow the forwarding pointer to access the object. And since the GC thread has already fully copied the object, everything is fine, because the object is there. Everybody's happy. That's what we call a read barrier. If you're familiar with Shenandoah version two, it uses the same kind of barrier.
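Here is a simplified Java rendering of that "copy first, race on the forwarding pointer" protocol. In the real VM this is done with a CAS on a heap word; the AtomicReference below just stands in for the forwarding-pointer slot, and all names are illustrative.

    import java.util.concurrent.atomic.AtomicReference;

    final class ForwardingSketch {
        static final class FromSpaceObject {
            final AtomicReference<Object> forwardingPointer = new AtomicReference<>();
            Object payload;                    // the data to evacuate
        }

        // Called by GC threads and (via the read barrier) by mutator threads.
        static Object copyOrFollow(FromSpaceObject o) {
            Object winner = o.forwardingPointer.get();
            if (winner != null) {
                return winner;                 // already forwarded: just follow it
            }
            Object myCopy = copyToSurvivor(o); // unconditionally make a FULL copy first
            // Race to publish. Whoever wins, the installed pointer always refers
            // to a completely copied object, so following it is always safe.
            if (o.forwardingPointer.compareAndSet(null, myCopy)) {
                return myCopy;                 // we won the race
            }
            return o.forwardingPointer.get();  // we lost: abandon our copy, follow winner
        }

        static Object copyToSurvivor(FromSpaceObject o) {
            return new Object();               // stand-in for the real evacuation copy
        }
    }

The loser's copy is simply abandoned; wasting a little copying work is the price for never exposing a half-copied object.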
Okay, metronome. Metronome is an incremental, soft real-time collector. It provides an upper bound on GC pause times. What I mean by upper bound is this: imagine a window of ten milliseconds. Within that window, we guarantee that 70% of the time is dedicated to the application, while the other 30% is for the GC. It's up to the GC what to do with those 30%: it can do some GC work or give some time back to the application. Metronome uses a different write barrier, called snapshot-at-the-beginning, and because of this barrier it has a high percentage of floating garbage; you're going to see why.

The metronome heap is not divided into new and old space; we actually divide the heap into fixed-size regions of 64 KB, so the bigger the heap, the more regions we have. Regions are assigned a cell size, and a very nice characteristic is that objects in the same region have the same size. Normally, if we have an object that's larger than a region, we're going to trigger a stop-the-world GC. For arrays, though, we can divide the array into chunks, put one chunk here and another chunk there, and have the array header point to those different chunks. That's what we call arraylets, and I'll explain a little more about what arraylets are at the end of the presentation.

This is how metronome behaves, and it's very simple. If you remember optthruput from the beginning of the presentation, it had one huge stop-the-world pause, right? Imagine that huge stop-the-world pause divided into little increments. That's what metronome is. It does a little bit of work here, a little bit of work there, but never goes over that 30% threshold. Remember, in each ten-millisecond window, only 30% is dedicated to the GC, and that's what these increments are: a little bit of GC work here, here, and here, until it's done.

So, the write barrier that I mentioned is snapshot-at-the-beginning, and it causes a lot of floating garbage because the barrier is performed before the store. Imagine we have the following object graph. We are marking objects concurrently, and the GC has already processed the roots and object A; it is in the middle of processing object B, and it hasn't visited object C or object D. A mutator thread comes along and does this: deletes the reference from B to C, and adds a reference from A to D. If we didn't have a barrier, this would be the resulting object graph, and we would erroneously collect object D.

So how does snapshot-at-the-beginning actually work? It's implemented this way: we remember the overwritten reference before we actually do the store. We save it in a temporary, and remember it if the barrier is active. If you come back to the object graph here, the barrier is actually triggered when we delete the reference from B to C; that's when the barrier fires. So instead of the resulting object graph looking like this, we get something like this: we also remember object C, which is actually garbage, because we don't really need object C; it just happened to be on the path to object D. That's why we have a lot of floating garbage with this type of barrier.
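As a contrast with the Gencon barrier sketched earlier, here is a minimal Java sketch of a snapshot-at-the-beginning barrier: the old value is captured before the store, so everything reachable when marking began stays visible to the marker. Again, all names are illustrative, and a real VM would use per-thread buffers rather than one shared queue.

    import java.util.ArrayDeque;
    import java.util.Deque;

    final class SatbBarrierSketch {
        static volatile boolean markingActive;        // toggled by the GC
        static final Deque<Object> satbQueue = new ArrayDeque<>(); // per-thread in a real VM

        static void storeField(Object[] a, int field, Object value) {
            if (markingActive) {
                Object old = a[field];                // snapshot the reference being overwritten
                if (old != null) {
                    satbQueue.add(old);               // the marker will revisit it
                }
            }
            a[field] = value;                         // the actual store happens after the barrier
        }
    }

This is exactly why object C ends up as floating garbage in the example above: deleting B-to-C pushes C into the queue, even though nothing needs C itself anymore.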
Now, our last but not least GC policy is balanced. If you're familiar with G1 from HotSpot, it's very similar: both are generational, both are region-based. One difference that I see is that G1 from HotSpot has only three ages, and we have 23 to 27 ages, so we have a lot more ages. Balanced provides a significant reduction in maximum stop-the-world pauses. We introduce write barriers similar to the one from Gencon, because you might have references from different regions coming into my region. And we have incremental heap defragmentation. Similar to metronome, the heap is divided into regions, but here, the bigger the heap, the larger the region, because we try to have between 1,000 and 2,000 regions. In this policy, objects are allocated in what we call eden regions, which are our new regions. And if an object is larger than a region, we just throw an out-of-memory error, with the exception of arrays; same as metronome, we divide arrays into chunks and put them in different regions.

This is how an application behaves under the balanced policy. PGC stands for partial garbage collection. It's a copy collector, so what it's doing is copying the live objects from one region to another, and it tries to find the regions with the highest return on investment. What does that mean? Imagine we have a region with only one live object. All we have to do is copy that one object over, and there we go, we have an entire free region to work with. So PGC tries to find the regions with the most garbage, where it does the least amount of work to free the region.

GMP is a global marking phase. It doesn't collect objects; it actually helps PGC. One problem with PGC is that it only has local knowledge about each region, and as time goes on, that local knowledge goes stale. GMP is there to refresh it: at the end of a GMP, it updates PGC's information so that PGC can do its work more effectively. GMP does not reclaim memory; it performs a marking phase only. It's scheduled to run between PGCs, and it produces an accurate mark map of the entire heap, which is used to update the PGC.

This is a more visual representation of how PGC actually looks. On the x axis we have time, and on the y axis we have free heap. Lines going up are the PGC doing its work, and lines going down are the application consuming memory. You can see this downward trend: that's the heap getting fragmented, and it happens because PGC loses its local information. GMP is smart enough to trigger at some point so that it finishes in a timely manner and updates PGC, so PGC can defragment the heap. You can see around there where the defragmentation becomes effective; that's where we just updated PGC's information, and you see the trend going up. That's when PGC has the most recent information on every region, so it can collect and defragment better. With time, the trend starts going down again, and GMP kicks in again to update PGC. That's how the balanced policy behaves.

In balanced we do have a write barrier. It's very similar to the Gencon barrier, because we might have multiple references coming from different regions into this region, and PGC doesn't otherwise know how to keep track of that, so we need a write barrier. Here we actually use an unconditional barrier, if you think about it: we just dirty the card whenever we set a field of an object, and then at the beginning of the PGC we go through the card table to find inter-region references. Whenever we think about read and write barriers, it's always a matter of who we should put the burden on: should we put the burden on the mutator threads or on the GC threads? Here we chose to put the burden on the GC threads, so that we don't penalize mutator threads as much.
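A minimal sketch of that unconditional card-marking idea, with the usual caveats: real card tables index by heap address, which plain Java can't express, so the address is passed in explicitly here, and all names and sizes are illustrative.

    final class CardMarkSketch {
        static final int CARD_SHIFT = 9;                     // 512-byte cards (illustrative)
        static final byte[] cardTable = new byte[1 << 20];   // one byte per card

        // Unconditional barrier: every reference store dirties the card that
        // covers the stored-into object; PGC scans dirty cards later to find
        // inter-region references, putting the burden on the GC, not the mutator.
        static void storeField(Object[] a, int field, Object value, long addressOfA) {
            a[field] = value;                                     // the actual store
            cardTable[(int) (addressOfA >>> CARD_SHIFT)] = 1;     // dirty the card
        }
    }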
Now, arraylets. You've probably heard me mention arraylets a few times. Arraylets are a clever way to store large arrays in a region-based GC. These arrays have a spine, and the spine points to the data associated with the array. Each leaf of the array, which is one of its chunks, consumes an entire region. So how does it look if we visualize it? We have the heap here, and at the bottom we have the array header. Arrayoids are the pointers to the array data. Imagine you want to access element one of this array: we follow the first arrayoid, and you can see we go to the last blue region and access the element that way. It's a really nice way to store large arrays in a heap, as opposed to having the array span multiple regions.

But there's a problem: an array that's divided across regions cannot work with APIs that require a contiguous representation of the array. This is the case with the JNI critical APIs, the Java Native Interface critical functions; they need a contiguous view of the array, which is not what arraylets give us. So what did we do before? We made a copy of the array: copy it element by element into a temporary contiguous array, pass that copy to the JNI API, the JNI code does whatever it does with the array, maybe modifies some of its elements, and then we copy everything back. Imagine we have large arrays, say 10 or 50 megabytes: as you can see, that's really expensive.

So how can we deal with that? That's where double mapping comes in. We can make these large discontiguous arrays look contiguous. We take advantage of the fact that physical memory is limited, but virtual memory effectively is not; on the contrary, the virtual address space on 64-bit systems is very, very large, 2 to the power of 64 in fact. What we do is map two virtual memory addresses to the same physical address. The nice thing about this is that any modification through one virtual address is reflected in the other. ZGC does a similar trick to keep track of the different phases of the GC: every object effectively has multiple addresses in the virtual address space, but they all point to the same physical memory. Here, we just do it for arrays.

How would this look if we visualized it? In the bottom part we have the actual arraylet leaves, the different regions of the array, and then we map them to a second range of virtual addresses, in such a way that the second virtual mapping looks contiguous in the virtual address space. All we have to do now is pass the address of that second virtual mapping to the JNI API, and every modification through that mapping is reflected in the original virtual addresses. That way we don't have to copy everything to a temporary array and copy everything back. It's a really nice trick in our region-based GC, and with it we actually sped up these array operations by around 30 times, which is pretty cool.

So what's next? Double mapping has some downsides; it's only available on Linux and 64-bit systems. Can we do better? Some platforms don't support this double-mapping trick, so what we're working on right now is a technology called off-heap, where we'll store large objects in a secondary heap. That way we can do some more tricks and optimize these large objects even further. So stay tuned.
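A toy Java model of arraylet indexing as just described: a spine of arrayoid pointers, each referencing one leaf. The field names, the leaf size, and the int element type are illustrative choices, not OpenJ9's actual layout.

    final class ArrayletSketch {
        static final int LEAF_SIZE = 1024;   // elements per leaf; one leaf fills one region
        private final int[][] arrayoids;     // the spine: one pointer per leaf

        ArrayletSketch(int length) {
            int leaves = (length + LEAF_SIZE - 1) / LEAF_SIZE;
            arrayoids = new int[leaves][];
            for (int i = 0; i < leaves; i++) {
                arrayoids[i] = new int[LEAF_SIZE];   // each chunk lives in its own region
            }
        }

        // Element access costs two hops: pick the leaf, then the slot inside it.
        int get(int index) {
            return arrayoids[index / LEAF_SIZE][index % LEAF_SIZE];
        }

        void set(int index, int value) {
            arrayoids[index / LEAF_SIZE][index % LEAF_SIZE] = value;
        }
    }

The double-mapping trick itself can't be shown in pure Java: it happens below the JVM, where the physical pages backing the leaves are mapped a second time at consecutive virtual addresses (on Linux, think of a shared memory object mapped twice with mmap), so the contiguous view handed to JNI and the leaf view alias the same memory.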
This is a quick GC policy guide. On the left we have the GC policy and on the right some characteristics. For instance, some of these GCs are generational: Gencon, Gencon CS and balanced. Some of them have concurrent phases; the only one with no concurrent phase at all is optthruput, but optthruput has the highest throughput, and that's because it doesn't have to deal with read or write barriers. And these different policies have different heap layouts: Gencon has the new/old separation, balanced and metronome are region-based, and optthruput has a small object area and a large object area.

This is another table showing which policy tends to be better for which domain. You shouldn't follow this to the letter; you should actually test the policies, because even if you have, say, a desktop application, depending on your workload another policy might work best. But as a general tip: Gencon is good for web servers and desktop applications, balanced and G1 are good for very large heaps, metronome is good for systems with soft real-time constraints, and optthruput is very good for small heaps, or if you don't care about GC pauses at all.

And you can see here that different policies have different throughput and different GC pauses. Gencon CS doesn't have the best throughput, but it has the fewest and shortest GC pauses. Optthruput, on the other hand, has the highest throughput but the largest GC pauses. That's because of concurrency and how we deal with it: read barriers and write barriers put overhead on the application. And you can see that all policies carry some additional memory overhead; they may have a mark map or similar structures, because the GC needs these extra data structures to help collect objects. That's what the middle column is telling you.

This is a diagram showing the similarities between the most common GC policies out there. At the top, in yellow, we see the most common GCs. Gencon is very similar to CMS from HotSpot; I think I've been saying CS, I apologize. So Gencon is similar to CMS from HotSpot: both are generational, both have a new and old space. Balanced and G1 are very similar to each other: both are region-based, both are generational. Then in the blue section we have the pauseless GCs, whose objective is to have the fewest and shortest pauses. You can see the lines there are thinner, because even though they share some similarities, they're not as similar as the ones above. ZGC and Shenandoah are both region-based, but they are not generational. Gencon CS is not region-based, but it is generational. Metronome is region-based, but it's not generational. And remember, ZGC uses that multi-mapping kind of technique. Shenandoah version two uses the same read barrier as Gencon CS; Shenandoah version one used Brooks pointers, which is a different technique. And on the green side there is Azul C4, which is also a pauseless GC, but it requires special hardware or special software support.

Now, this is the summary of the talk. If there's one thing that I want you to take away, it's that there's no perfect fit, no perfect GC that fits all workloads.
If you want a GC that has perfect throughput and you don't care about pauses at all, you should go with a perfect stop-the-world GC, because no matter what, that's always going to have higher throughput than a perfect pauseless GC. And that's because of the read and write barrier overhead: even though the barriers are essential for the correctness of these pauseless GCs, there is a cost, because for every object read or write you need to make sure the GC is aware whenever references change while the application runs. So if you don't care about pauses, go for a perfect stop-the-world GC. If you do care about pauses, go for a pauseless GC. And if you care about both, or want a balance between the two, you can go for balanced or G1, or Gencon and CMS. So you have options: both extremes, perfect stop-the-world and perfect pauseless, and something in the middle. These different policies also deal with heap fragmentation differently; they may have different heap configurations, so they collect objects differently. And remember the barriers: the snapshot-at-the-beginning barrier versus a regular write barrier. There are a lot of things in play when it comes to a perfect GC policy, but again, there's not going to be a perfect GC for every workload. Whatever workload you have, you should try three or four different policies and see what works best for it.

Here are some links. OpenJ9 is a very cool project, and we recommend people come check it out. The second link there, the command-line migration guide, is for people who are familiar with HotSpot and want to try out OpenJ9; I really recommend looking at that. AdoptOpenJDK is the website where you can get the binaries. And Eclipse OMR is another project that has many different components for building a runtime; OpenJ9 uses a lot of those components, for instance the GC component and the compiler component. It's a really cool open source project as well, and we welcome contributors there too.

So that's it from me. I'd be very happy if you connect with me on LinkedIn or Twitter. Before I open it up for questions, I want to show this book here: if you want to know everything there is to know about garbage collection, this is the book; everything you'd want to know is in there. And these are some of the common options to look at if you're interested in tweaking your application; they are, let's say, the simple ones. The first one there is only available on OpenJ9, and the nice thing about the verbose GC option is that you can actually visualize how your memory behaves under an application with a tool called GCMV. You can only visualize this if you're running your application on the OpenJ9 VM, because HotSpot's verbose output has a different format. So if you're running with OpenJ9, you can use verbose GC to capture a log of the GC and visualize it with GCMV, which is really cool. If you remember that balanced-policy graph I showed you, I used GCMV for that; I added some colors to it, but that's pretty much what you get with GCMV. So I think that's it from me. I'll stop here and check for any questions. Thank you so much.
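For reference, the verbose GC capture the speaker describes looks roughly like this; the flag spellings below follow the OpenJ9 documentation (MyApp is a placeholder), so check them against your version:

    java -Xgcpolicy:gencon -verbose:gc -Xverbosegclog:gc.log MyApp

The resulting gc.log can then be loaded into the Garbage Collection and Memory Visualizer (GCMV) to plot pause times and heap occupancy, like the balanced-policy graph shown in the talk.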
...

Igor Braga

Software Developer @ IBM



