Conf42 Enterprise Software 2021 - Online

Debugging JVM performance issues

Abstract

With rapid advancements in Java, the number and scale of deployments are ever increasing. With such growth, understanding how to debug memory and CPU performance issues related to the Java application and the JVM is critical to ensure smooth operation.

This presentation will cover performance issue scenarios and provide an overview of tools, scripts, and processes to triage and eventually understand the problem to a level where the next step would be to fix it. We will go through heap dumps, thread dumps, process crash logs, etc.

Summary

  • Today we'll be presenting a session on debugging JVM performance issues. It's not meant to be a comprehensive guide, at least not yet. What it does not cover are detailed instructions on how to use the tools involved. We would appreciate questions, feedback, or any other comments.
  • JVM stands for Java virtual machine. It's a lightweight virtual machine that runs Java applications. Its memory is divided into five sections. The garbage collector is a tool that collects unused Java objects. There are three main ways in which the garbage collector can do this.
  • The serial garbage collector creates a stop-the-world pause when marking and collecting. Another implementation of the garbage collector is concurrent mark and sweep, a multi-threaded, concurrent marking and sweeping GC. The next implementation, garbage first (G1), prevents memory fragmentation.
  • We're going to look at some performance issues that we think are very common. The first is when there's a thread that consumes more CPU compared to other threads in the same application. One way to debug this scenario is by using jstack. Another way is by using Java Flight Recorder and Mission Control.
  • The second scenario that we're going to look into is high GC activity, or high garbage collection activity. We want to see the difference between the minor GC and the major GC. Then I will give the presentation back to Ranjan so he can present the next scenarios.
  • High memory usage in a Java application can lead to an out of memory error. There are several ways to go about triaging this; one such way is to use a tool called jmap. Heap dump generation through any means should be done with a bit of caution.
  • Using Flight Recorder and Mission Control, we can take a look at memory usage, thread usage, and all the other instrumentation data for that application over that duration. A word of caution again: generating a heap dump halts the application momentarily, so it should be used with caution.
  • Another aspect of information that we can get using JVide is the memory usage trend. We can see how the committed, used, and maximum memory values have fluctuated over time. This is also useful in observing what kind of fluctuations are happening.
  • Whenever a Java application crashes, it generates a gist of the crash. It contains four key things: the reason for the crash, the Java version that the application was running on when it crashed, whether a core dump was generated and where it can be found, and last but not least, where the crash log or error report can be found.
  • There are three primary symptoms: high memory usage, high CPU usage, or an application crash. The steps to triage high memory usage would be to first look at the Xmx value for the maximum allocatable heap size, then look at those parts of the code that are key contributors to the memory usage.
  • High garbage collection activity often manifests itself as one of two symptoms, either high memory usage or high CPU usage. The third and final symptom is an application crash, which is accompanied by a gist of the crash, a core dump, and/or a crash log.
  • JVide is an open source library that lets us get all the metrics and diagnostic information we need. The third milestone, which we'll be embarking on, is fine tuning the JVM and looking into actually fixing the performance issues. After this, we also want to investigate other reasons for application crashes.
  • Now that that's done, we are in the questions section of our presentation. Since this is a recorded video, if you have any questions or feedback, we would greatly appreciate you reaching out to us using the email addresses provided in the description below. We would love to hear from you.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, welcome to our presentation today. I'm Ranjan, a software developer, and I'm Silvia, a student at Carleton University. Today we'll be presenting a session on debugging JVM performance issues. The goal of this presentation is to provide an overview of performance issues that we have encountered over the past few years in the context of Java applications and how to go about debugging them. A few disclaimers before we proceed: this is a fragment of our experience. It's not meant to be a comprehensive guide, at least not yet. It covers a brief overview of the memory layout and the garbage collection process, followed by the triage process for the performance issues and aspects of the tools that are used in the triage process. What it does not cover are detailed instructions on how to use those tools; links are provided for that purpose. It also does not cover fine tuning JVM performance or fixing the performance issues. Those are topics for another day. The agenda of the presentation can be broadly classified into three portions: the first part is a JVM internals recap, followed by the performance issues that we'll be talking about and how to go about triaging them, and then we'll be wrapping it up and concluding. In the JVM internals section, Silvia will be talking about the memory layout followed by the garbage collection process, and then she'll move on to the performance issue section where she'll be talking about a high CPU usage and a high garbage collection activity scenario. Then I'll take over and we'll be talking about a high memory usage and an application crash scenario. Then we'll summarize the triage process for the aforementioned performance issues, followed by the future work we have on our mind. There's a questions slide which contains links to wiki pages that talk about all the topics mentioned in the slides in much more detail, as well as the open source library that we'll be talking about in this presentation. So for any questions or concerns at that point of time, feel free to reach out to us. Our email addresses will be provided in the description of the video, so we would appreciate questions, feedback, or any other comments that you may have. Now, I'll hand it over to Silvia to start with the JVM internals recap.
So thank you Ranjan for the introduction. I know many of you already know what the JVM is and how its memory is divided. For those who don't know, I'll give you a brief explanation of what it is. JVM stands for Java virtual machine. It's a lightweight virtual machine that runs Java applications. Its memory is divided into five sections. The first section is the program counter, or PC register, and this is where we store the address of the current instruction being executed. Next we have the stack, and in the stack we have the stack frames. When a method is called, a new stack frame will be created for that method, and each stack frame contains the local variables. The local variables are stored in an ordered set of slots, and depending on its type, a variable is going to be stored in one entry or two entries. So for example, if it's an integer, it's going to be stored in one entry, but if it's a long or a double, it's going to be stored in two entries. Inside the stack frame we also have the operand stack, and as its name suggests it's a stack, and this is where intermediate operation results are going to be stored. Finally, we have the frame data, and the frame data contains the references to the constant pool and the exception table.
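As a rough illustration of the layout just described, consider this small sketch; the class and variable names are invented for the example and are not from the talk.

    public class LayoutDemo {
        static int classCounter = 0;      // static field: class-level data kept in the method area / metaspace

        public static void main(String[] args) {
            int smallLocal = 1;           // an int local occupies one entry in main's stack frame
            long wideLocal = 2L;          // a long local occupies two entries in the stack frame
            Object ref = new Object();    // the reference lives in the stack frame; the object itself is allocated on the heap
        }
    }
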
So the constant pool is a table that contains information about the values of the constants as well as their types. And the exception table is a table that allows the JVM to know where an exception can be thrown and where it can be handled. Then we also have the native method stack, and the native method stack is similar to the stack, but it will behave differently depending on the native language. So if the native language is C, then the native method stack will behave like a C stack. Inside the JVM memory we also have the method area, and the method area contains class-level information, so that includes the static fields as well as the method code. And this is where the constant pool is going to be located. The method area is logically part of the heap, and the heap is where we store the Java application objects. It is divided into three sections. The first section is the young generation, and the young generation contains objects that are newly created. It has two parts. The first part is Eden, and Eden is where the newly created objects are going to be located. And then we have the survivor spaces, and the survivor spaces contain objects that have survived at least one GC cycle, or garbage collection cycle. Inside the heap we also have the old generation, and the old generation is where the objects that have survived multiple GC cycles are stored. Then we also have the metaspace, and the metaspace is where we store class-level information; this is where the method area is located. When the heap runs out of memory we need to have a process that would deallocate unused objects, and this is where the garbage collector comes into play.
So the garbage collector is a tool that collects unused Java objects. It works by first marking and then collecting. When the garbage collector marks, it will try to find all the referenced objects and then mark them as alive. It will start looking from the GC roots, and the GC roots are objects that are referenced by the JVM at that specific moment, so that includes the static objects. Once the garbage collector finds all the referenced objects, it will try to collect all the unreferenced objects. And there are three main ways in which the garbage collector can do it. The first way is by sweeping. When the garbage collector sweeps, it just deallocates the memory, and that can cause heap fragmentation. Heap fragmentation occurs when we deallocate memory and are left with small gaps of free memory in between used memory, and this is a huge issue for the performance of the application. As a way to prevent this, the garbage collector can also sweep and compact. So after deallocating the unused objects, it can compact the referenced objects into a contiguous block. Another way to prevent heap fragmentation is by copying. When the garbage collector copies, it will copy all the referenced objects into a new region and then it will delete the old region. So the garbage collector can process different parts of the heap, and depending on where it processes, it can be called by different names. The collector is called a minor GC if it processes the young generation. The minor GC is triggered whenever the young generation runs out of memory, and this is the most frequently called GC. That comes from the idea that most objects have a short lifespan, so it is more effective for the application that the minor GC runs more frequently compared to the other GCs. Then we have the major GC, and the major GC occurs in the old generation.
This can be triggered when the old generation runs out of memory, or it can also be triggered by the minor GC. Then finally we have the full GC, and this occurs when the garbage collector processes the entire heap. Before I continue with the different implementations of the garbage collector, I have to talk about what a stop-the-world pause is. A stop-the-world pause is triggered by the GC, and when that occurs, all the Java application threads will be frozen so they won't run, so the garbage collector can process the heap without worrying about any changes made by the Java application.
Now we're going to discuss the different implementations of the garbage collector; here we're going to discuss the four most important ones. The first one that we're going to look into is the serial garbage collector. This is a single-threaded garbage collector that creates a stop-the-world pause when marking and collecting. Looking at this example, we have three Java application threads that are running, and once the heap runs out of memory the serial garbage collector will start and this will trigger a stop-the-world pause, and after the GC thread terminates, the Java application threads will resume their execution. Then we have the parallel GC, and this is similar to the serial GC, but it is different in the sense that instead of being single-threaded, it's going to be a multi-threaded GC. Looking at this example, we have three Java application threads that are running. Once the heap runs out of memory, the parallel GC is going to start and this will trigger a stop-the-world pause. And once all the GC threads finish, the Java application threads will resume their execution. Another implementation of the garbage collector is concurrent mark and sweep, or CMS for short. This is a multi-threaded, concurrent marking and sweeping GC. Looking at this example, we have three Java application threads that are running, and when the CMS is triggered it will trigger a stop-the-world pause for initial marking. During the initial marking, the garbage collector will mark all the GC roots. As I explained before, the GC roots are objects that are referenced by the JVM at a specific moment, and that includes the static objects. Once the initial marking is done, the application threads will resume, while at the same time the GC threads for marking are also executed. After a period of time, the CMS will trigger a stop-the-world pause for the final marking, and during the final marking the GC will ensure that all the changes made during the concurrent marking are also detected. Once the final marking is done, the application threads will resume their execution, but at the same time the GC threads will sweep the heap. One of the main downsides of the CMS is that it's not very good at preventing heap fragmentation. When that occurs, the CMS needs to trigger a stop-the-world pause for compacting, and during compacting the CMS will compact all the referenced objects into a contiguous block, and after compacting, the Java application threads will resume their execution. So as I explained before, concurrent mark and sweep is not very good at preventing fragmentation, so this next implementation of the garbage collector will prevent memory fragmentation. Garbage first (G1) is a multi-threaded, concurrent marking and copying GC that collects the garbage first. What this means is that garbage first will process the regions that have the most garbage, or unreferenced objects, first.
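The collectors described in this section are chosen with standard HotSpot launch flags; the short allocation loop below is an editor's sketch (not the speakers' code) that can be run under each collector and observed with GC logging.

    // Run the same program under different collectors, for example:
    //   java -XX:+UseSerialGC GcDemo          (serial: single-threaded, stop-the-world)
    //   java -XX:+UseParallelGC GcDemo        (parallel: multi-threaded, stop-the-world)
    //   java -XX:+UseConcMarkSweepGC GcDemo   (CMS: deprecated in JDK 9, removed in JDK 14)
    //   java -XX:+UseG1GC GcDemo              (G1: the default collector since JDK 9)
    // Adding -verbose:gc (or -Xlog:gc on JDK 9+) prints one line per collection.
    public class GcDemo {
        public static void main(String[] args) {
            for (int i = 0; i < 5_000_000; i++) {
                byte[] garbage = new byte[1024];   // short-lived allocations that trigger minor GCs
            }
        }
    }
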
One difference between this GC compared to the previous ones is that it divides the heap into regions of equal sizes while also maintaining the old generation and the young generation. Looking at this picture, this is how garbage first looks at the heap. In this example, we want to process these three highlighted regions. What garbage first will do is copy all the referenced objects in those regions and then move them to where the arrow is pointing, to the blue region, and once all the referenced objects are copied, the old regions will be deleted. Looking at this other example, we have three Java application threads that are running, and when garbage first starts it will trigger a stop-the-world pause for the initial marking, and after the initial marking the Java application threads will resume their execution while at the same time the garbage first threads start concurrently marking. After a period of time garbage first will trigger a stop-the-world pause for the final marking, and once the final marking is done, garbage first will copy while at the same time the Java application threads run.
So now that I've talked about the different implementations of the garbage collector as well as how the JVM memory is divided, we're going to look at some performance issues that we think are very common. The first scenario that we're going to look into is when there's a thread that consumes more CPU compared to other threads in the same application. For this scenario, we have created a multi-threaded program with a thread that consumes a lot of CPU compared to other threads in the same application. In this code sample, we have the thread that will consume the most CPU, and this thread will increase a counter continuously. In this other sample code we have the normal thread, and its behavior is similar to the CPU hog thread, but instead of increasing the counter continuously, it will sleep in between iterations. We will be using the normal threads as a baseline to compare to the CPU hog thread. For this scenario, we have created three normal threads and one CPU hog thread. One way to debug this scenario is by using jstack. jstack is a command line tool that gets information about the running threads; it's included in Oracle JDK and OpenJDK. We have retrieved three samples using jstack, and each sample was retrieved every 30 seconds. The first sample was collected at the 11-second mark. We then calculated the normal threads' average CPU time and the CPU hog's CPU time. Just by looking at this data, the CPU hog's CPU time is much higher than the normal threads' average. The second sample was collected at the 42-second mark, and looking at the data from the second sample we have calculated the normal threads' average CPU time as well as the CPU hog's CPU time, and the CPU hog's CPU time is still much higher than the normal threads' average. The third sample was collected at the 72-second mark, and just by looking at the CPU time for a normal thread compared to the CPU hog, the CPU hog is still much higher than a normal thread. So by looking at the data collected throughout the samples, we can clearly see that the CPU hog is the thread that consumes the most CPU. Another way to triage this scenario is by using VisualVM. VisualVM is a visual tool that allows you to get information about the Java applications that are running.
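The slide code isn't reproduced in the transcript, but a rough reconstruction of the CPU hog and normal threads described above might look like this; the class name, thread names, and sleep interval are assumptions.

    public class CpuScenario {
        // The CPU hog: increments a counter continuously, never yielding the CPU.
        static Runnable cpuHog = () -> {
            long counter = 0;
            while (true) {
                counter++;
            }
        };

        // A normal thread: same work, but sleeps between iterations.
        static Runnable normal = () -> {
            long counter = 0;
            while (true) {
                counter++;
                try {
                    Thread.sleep(100);   // assumed sleep interval
                } catch (InterruptedException e) {
                    return;
                }
            }
        };

        public static void main(String[] args) {
            for (int i = 0; i < 3; i++) {
                new Thread(normal, "normal-" + i).start();
            }
            new Thread(cpuHog, "cpu-hog").start();
            // Sample the running threads with:  jstack <pid>   (ships with Oracle JDK and OpenJDK)
            // Alternatively, record with JFR (JDK 11+):
            //   java -XX:StartFlightRecording=duration=60s,filename=rec.jfr CpuScenario
        }
    }
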
Looking at this screenshot, we can see the running time of the CPU hog thread and the normal threads. The CPU hog thread has a longer running time than the normal threads, and this can be explained by the state of the threads throughout the sampling: the CPU hog thread has been running continuously, which is denoted by the green bar, while the normal threads have been sleeping throughout the whole sampling, which is denoted by the purple bar. Another way to triage this scenario is by using Java Flight Recorder and Mission Control. Java Flight Recorder is a tool that allows you to get information about the Java applications running in the JVM, and Mission Control is a visual tool that allows you to see the output from Java Flight Recorder. We have used Java Flight Recorder to record the first 60 seconds of the application, and on the right side of the screenshot we can see the state of the normal threads and also of the CPU hog throughout the first 60 seconds. While the normal threads have been sleeping throughout most of the sampling, the CPU hog has been running continuously. Another way to triage this scenario is by using JVide. JVide is an open source library that gets information about the JVM, and in this sample code we're getting information from JVide about the threads, and then we print that information as shown in the sample output. I also need to mention that JVide can get more information about the threads, but for the purposes of this presentation, we're only showing a portion of the data. If you want to know what other thread information JVide can get, you can look at the link at the end of this presentation or in the description of this video.
The second scenario that we're going to look into is high GC activity, or high garbage collection activity. For this scenario we have created a program that will create a large number of objects; a portion of those objects are going to be referenced and the other portion is going to be unreferenced. The object that we're going to create is called Number and is around 16 bytes in size. In this sample code we're creating a large number of Number objects and then we're only adding one fifth of them to a list. The reason that we're doing this is because we want to see the difference between the minor GC versus the major GC. For this scenario we have created 500 million Number objects. We have also used jstack to get information about the GC activity. We have collected a total of three samples, and each sample was collected every 30 seconds. The first sample was collected at the 25-second mark, and if we focus on only one GC thread, in this case GC thread number zero, we can calculate its CPU time as well as its CPU usage. The second sample was collected around 32 seconds after the last retrieval, and if we focus on GC thread number zero, we can see that the CPU time has increased as well as the CPU usage. The third sample was collected at the 89-second mark, around 31 seconds since the last retrieval, and if we focus on GC thread number zero, the CPU time and the CPU usage have increased, but the CPU usage is slowly starting to settle at around 96%. We can also use VisualVM to get information about the GC activity. On the left side of the screenshot we can see the CPU usage as well as the GC activity, and on the right side we can see the memory allocation.
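Again, the slide code isn't in the transcript; a plausible reconstruction of the Number allocation program described above is sketched below, with the class shape and retention logic assumed from the description.

    import java.util.ArrayList;
    import java.util.List;

    public class GcActivityScenario {
        // A tiny object of roughly 16 bytes (object header plus one long field).
        static class Number {
            final long value;
            Number(long value) { this.value = value; }
        }

        public static void main(String[] args) {
            List<Number> retained = new ArrayList<>();
            for (long i = 0; i < 500_000_000L; i++) {
                Number n = new Number(i);
                if (i % 5 == 0) {
                    retained.add(n);    // one fifth of the objects stay referenced...
                }                       // ...the rest immediately become garbage
            }
            System.out.println("Retained: " + retained.size());
        }
    }
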
So using these two graphs we can see that as more memory is used, the GC activity will be steady, but it will lower after completing the GC cycle. We also used Java Flight Recorder and Mission Control to get information about the GC activity. We have recorded the first 60 seconds of the application, and looking at this screenshot from Java Flight Recorder, we can see the memory allocation, which is denoted by the purple lines, and we can see the GC activity, which is denoted by the orange regions. By looking at this data we can see that as more memory is allocated, the longer the GC activity is, but the less frequent it is. We can also use JVide to get information about the GC activity. We have used this code sample to get information about the GC threads using JVide, and we have divided the information based on whether it belongs to the young generation or the old generation. Using this data, we have plotted this graph of collection count versus timestamp, and by comparing the minor GC versus the major GC, we can see that while the minor GC count has increased continuously, the major GC count stays at around zero during the first half but also increases in the second half. We have also plotted the time spent in GC versus timestamp using the data retrieved from JVide, and by comparing the minor GC versus the major GC we can see that the behavior for both of them is very different: the minor GC time has increased continuously until it reaches a maximum of 5 seconds, while the major GC time stays at around 0 seconds during the first half and starts to increase in the second half until reaching a maximum of 43 seconds. The reason for this behavior of the major GC, in both the previous graph and this one, is mostly that the old generation is running out of memory, so the major GC is triggered in this region. So now I will give the presentation back to Ranjan so he can present the next scenarios.
Thank you Silvia. Moving on to scenario three, a high memory usage. Please take a look at this snippet of code; try to pause and see if you can find out what's wrong, or what is causing the issue, in this piece of code. There are two key aspects over here. One is we have a method that generates a random string of a specified length, followed by a main method which calls it and generates a random string of 32 million characters. In Java, since we know that each character takes four bytes, we can make an estimation saying that, okay, if we were to generate 32 million characters, it'll take at least 32 million times four, which is 128 megabytes. So that's the minimum amount of memory it would take. We still haven't considered all the local variables or helper variables or other memory locations used for calculation as a part of this. So we can say with confidence that at least 128 MB will be needed. What happens if we run the same program with an Xmx value of 64 MB? We're saying that the maximum allocatable heap space is 64 MB, while based on our calculation we know for a fact that at least 128 megabytes are needed. So what do you think will happen? The JVM will just go boom. It'll raise an out of memory error and it will say, hey, I'm not able to allocate the necessary memory to do what you asked me to do. There are several ways to go about triaging this. One such way is to use a tool called jmap which, like the tools mentioned by Silvia, is included with the JDKs.
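A hedged reconstruction of the string-generating program described above, assuming a simple StringBuilder-based implementation; the jmap invocation in the trailing comment is one way to take the heap snapshot discussed next.

    import java.util.Random;

    public class RandomStringDemo {
        static String generateRandomString(int length) {
            Random random = new Random();
            StringBuilder sb = new StringBuilder(length);
            for (int i = 0; i < length; i++) {
                sb.append((char) ('a' + random.nextInt(26)));   // append one random character
            }
            return sb.toString();
        }

        public static void main(String[] args) {
            // Run with:  java -Xmx64m RandomStringDemo
            // With only 64 MB of heap this is expected to fail with
            // java.lang.OutOfMemoryError: Java heap space
            String s = generateRandomString(32_000_000);
            System.out.println(s.length());
        }
    }
    // To snapshot the heap of a running Java process (pid found via jps), jmap can be used:
    //   jmap -dump:live,format=b,file=heap.hprof <pid>
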
And as long as we know the process id of the Java application that has the high memory usage, we can use the jmap command and generate a heap dump from it, which is nothing but a snapshot of the heap memory at that point in time. A word of caution: whenever a heap dump is generated, the JVM will be halted, as in all the threads will be suspended so that the state of the heap memory can be copied over to a binary file. So this causes a slight lapse in application availability, and sometimes it may take a bit longer than anticipated. So heap dump generation, through any means or any tools, should be done with a bit of caution. Once the heap dump is generated, you will see an .hprof file, which denotes the generated heap dump, and that can be opened with tools like the Eclipse Memory Analyzer to see what it contains and what the heap memory looked like at that point in time. So you could see what objects occupied most of the memory, what threads are referencing those objects, and then go to those parts of the code and see if you could make any optimizations to make it use less memory. Another approach is to use a tool called VisualVM. This is a tool that connects to the Java process and gives you live thread and memory usage information. What you could do is, when the memory usage is high, try connecting to the process, see what threads are running at that point, and try to correlate. When you know what threads are running at that point, you could look at the stack traces of the threads, what code path each thread is following, and investigate that to see what could have caused the high memory usage over there. And if you want to go a bit deeper where the memory usage is high, you could also generate a heap dump with a click of a button in the tool and analyze that heap dump to look at objects that are hogging the memory, or threads that are referencing the objects that are hogging the memory. Another option over here for triaging this particular scenario is using Flight Recorder and Mission Control. As Silvia had mentioned before, the flight recorder can be enabled in the JVM by adding a few VM options, by specifying the duration and things like that. Once it finishes recording the additional instrumentation or debug information, it generates a JFR file where it dumps all the information it collected, and that JFR file can be opened in tools like Mission Control, where we can take a look at the memory usage, thread usage, and so on, all the instrumentation data for that application in that duration.
So consider a scenario where we generate one random string of 32 million characters, and we maintain the same string throughout the lifetime of the program. We don't dispose of it, we don't create any other objects, we just have a single string that we keep referencing till the end of the program. In such a scenario, how do you think the memory usage trend would look? It would basically be a gradual rise in memory usage that denotes the portion of the code where we generate the random string, as random characters are being generated. The peak of that slope indicates the point where the string has been generated, and then we see a dip over there. That dip basically denotes the time frame when, the string having been generated, all the additional memory that was used to generate the string is no longer referenced.
So the garbage collector kicks in, it cleans up all of that, and then throughout the rest of the duration of the program we see the memory usage stay stagnant. That denotes the memory needed for the program to not only run, but also for the generated string to be stored throughout the duration of the program. And by looking at the garbage collection activity in that time frame, it is evident that that dip in memory usage is signified by an orange block, which denotes that garbage collection activity took place at that point in time. Now a slight variation on that scenario, where we're not generating just one string and maintaining it throughout the duration of the program. Instead, we're constantly generating strings and not using them anywhere. In such a scenario, strings are generated and dereferenced, generated and dereferenced. So the garbage collection should kick in and dispose of all the dereferenced, or unreferenced, strings. Let's see how the graph would look. We see a gradual increase denoting the string generation process, and then we see a sharp drop from the peak, which denotes that garbage collection kicked in, realized that the string that was generated recently is not referenced anywhere and can be cleaned up, and cleaned it up. That is what we see as a sharp drop. Since we constantly keep generating and then dereferencing, generating and dereferencing, after which the garbage collector sweeps it, we see a sawtooth-like waveform. And in this scenario, if you notice, just like in the Mission Control graphs earlier, the peaks aren't of the same size. This indicates that garbage collection doesn't kick in immediately as soon as an object is dereferenced, and that tells us a lot of things, right? It's not necessary that the garbage collection is periodic, or immediate to act, in all scenarios. There may be a lot of things preventing it from acting immediately when an object is dereferenced. It could be sheer CPU thread contention, as in a lot of other threads are running, so the garbage collection threads don't get CPU time. Or it's just that the garbage collection mechanism is like that, where it takes a bit of time to kick in in some scenarios. If you look at the garbage collection activity over here, we can see thick orange lines corresponding to the drops in memory usage. This indicates that garbage collection kicked in at those points and cleaned up all the dereferenced strings. Now, moving on to triaging with JVide for this particular scenario: with just about four or five lines of code, we can generate a heap dump, and this is at the runtime of the program. You could add code to your application to generate a heap dump if the memory usage exceeds a particular threshold, or if the user provides a particular input. This is pretty useful as you don't need any other application to connect to it, or to attach to it and generate a heap dump through that. You could just use your application as is to generate heap dumps when needed. This will also end up generating an .hprof file, which can be analyzed using the Eclipse Memory Analyzer tool. A word of caution again: generating the heap dump halts the JVM and causes a momentary lapse in the availability of the application, so heap dumps should be generated with caution. Another aspect of information that we can get using JVide is the memory usage trend.
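JVide's own helper API isn't shown in the transcript; as a rough equivalent, the JDK's standard management beans can produce the same two things mentioned here, a programmatic heap dump and the committed/used/max memory values. A minimal sketch:

    import com.sun.management.HotSpotDiagnosticMXBean;
    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import java.lang.management.MemoryUsage;

    public class DiagnosticsSketch {
        // Write a heap dump (an .hprof file) from inside the running application.
        // Note: like any heap dump, this pauses the JVM while the snapshot is written.
        static void dumpHeap(String path) throws Exception {
            HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                    ManagementFactory.getPlatformMBeanServer(),
                    "com.sun.management:type=HotSpotDiagnostic",
                    HotSpotDiagnosticMXBean.class);
            bean.dumpHeap(path, true);   // true = dump only live (reachable) objects
        }

        // Sample the committed / used / max values for heap and non-heap memory.
        static void printMemoryTrendSample() {
            MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
            MemoryUsage heap = memory.getHeapMemoryUsage();
            MemoryUsage nonHeap = memory.getNonHeapMemoryUsage();
            System.out.printf("heap used=%d committed=%d max=%d%n",
                    heap.getUsed(), heap.getCommitted(), heap.getMax());
            System.out.printf("non-heap used=%d committed=%d max=%d%n",
                    nonHeap.getUsed(), nonHeap.getCommitted(), nonHeap.getMax());
        }

        public static void main(String[] args) throws Exception {
            printMemoryTrendSample();
            dumpHeap("app-heap.hprof");
        }
    }
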
We can see how the committed, used, and maximum memory values have fluctuated over time, not only for the heap memory, but also for the non-heap memory. This is also useful in observing what kind of fluctuations are happening and at what time it has hit the peaks, and then we could look at other logs in that time frame to see what has been running at that point, in order to see what would have contributed to such a high memory usage. So for the scenario where we are generating one string and maintaining it throughout the lifetime of the program, as we can see over here, the memory usage is constant, which denotes that the single string is occupying a constant amount of memory throughout. No other memory is allocated or dereferenced for the garbage collector to clean, so the memory usage is constant, compared to the graph in Mission Control where we saw a gradual rise, a small bump, and then a constant memory usage. We don't see that slope or the small bump here simply because the time frame of this graph is much larger than what we saw in Mission Control; because of that, the rapid rise in memory and the small drop are not as evident over here. Now moving on to the second variation of that scenario, where we're not generating one string and maintaining it throughout; instead we're generating and dereferencing, generating and dereferencing. The graph is very, very similar to what we saw in Mission Control. Keep in mind all these graphs are generated from values taken from JVide, and in this scenario we can see it hit peaks and then drop down when garbage collection kicks in. And as I mentioned before, all the peaks aren't of the same height, simply because of the fact that garbage collection doesn't always kick in the moment the heap memory has an unreferenced object; sometimes it may take a bit more time to kick in.
Moving on to scenario number four, the application crash. Please take a look at this snippet of code and try to pause and see if you can identify the reason for the crash. It does two main things. One is it tries to access a private field within the Unsafe class called theUnsafe, and this field can be accessed through reflection. That is the exact reason why we call f.setAccessible(true), which means that, okay, even if it's a private entity, let me access it through reflection. Once we do, we basically use that and we try to write to an invalid or inaccessible memory location. And that is what causes the JVM to go boom. It says, hey, I'm not able to access this memory location and I'm not able to do what you're asking me to do. It doesn't know how to proceed, so it produces a segmentation fault and it crashes. Whenever a Java application crashes like this, it generates a gist of the crash, and that gist is written to stdout or stderr; if you have redirected those streams to any file, you would find it in those files. A gist of a crash would look like this. It contains four key things: one, the reason for the crash, which in this case is a segmentation fault, or SIGSEGV, followed by the Java version that the application was running on when it crashed, followed by whether a core dump was generated and, if so, where it can be found, and last but not least, where the crash log or the error report can be found. Now, to dig in further to understand what part of the code might have caused the crash, we can look at the crash log, and we are interested in a very, very particular portion of the crash log.
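A reconstruction of the crash snippet described above, based on the call chain reported in the crash log (Crash.main calling Crash.crash calling Unsafe.putAddress); the exact original code may differ.

    import sun.misc.Unsafe;
    import java.lang.reflect.Field;

    public class Crash {
        static void crash() throws Exception {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);                 // allow access to the private field via reflection
            Unsafe unsafe = (Unsafe) f.get(null);
            unsafe.putAddress(0, 0);               // write to an invalid address -> SIGSEGV, JVM crash
        }

        public static void main(String[] args) throws Exception {
            crash();
        }
    }
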
The crash log contains a lot of information, not only the JVM or thread related information, but also system information, information on the system in which the application crashed. We're going to look at two key aspects over here. The first aspect is the stack trace of the thread that caused the crash, followed by what threads were running at the time of the crash. When we look at the stack trace of the thread that caused the crash, we can clearly see that, hey, Crash.main was called, which ended up calling Crash.crash, which ended up calling a method within Unsafe, to be specific putAddress, after which it ended up calling a sequence of native methods which caused the crash. Considering this, it is easier for us to go and dig into the code to see, hey, in this particular call stack, what could have gone wrong. Let's see what code was there, what would have happened before that code got executed, and what values could have caused this crash. So that is a very vital piece of information over here. Besides that, we also have the list of running threads at the time of the crash. This is very, very useful because sometimes it's not necessarily the thread that made the call that initiated the crash that was responsible for it. Some other threads that might have been running concurrently could have caused a state change or some other value change which led this thread to initiate the crash. So it's very useful to also know what threads were running at the same time, or concurrently, at the time of the crash. Using all this information, we investigate the code base and we see, hey, okay, this is the path that triggered the crash, and what are the values that could have triggered this? And then we go about failing gracefully, or writing code that would end up avoiding the crash, or we make a conscious decision saying, hey, okay, we want the application to crash in such scenarios.
Now that we are done with this, let's summarize the triage process for all the previously discussed performance issues. It can be done as follows. There are three primary symptoms: a high memory usage, a high CPU usage, or an application crash where the application abruptly stops. In a high memory usage scenario, we are interested in the memory usage trend along with the thread information, information on the threads that have been running and the trend of that. The steps to triage that would be to first look at the Xmx value for the maximum allocatable heap size. Based on our application design, we also have to understand whether, for the load that the user is putting on it, or the sheer data that we feed into the application, that Xmx value is a reasonable amount or not. If it is not a reasonable amount, and we know for a fact that, hey, the load is very high and we are expecting this much memory to be used for this kind of an input load, then we go ahead and increase the amount of heap that's allocatable. If that's not the case and we say, hey, it shouldn't be occupying so much memory for this load, then we go into investigation mode. We try to correlate to see what threads are running when the memory usage is high, or, if we get a heap dump, even better, we look at what objects are occupying most of the memory and then we try to correlate them to the threads that are referencing those objects. Once we are able to do that, then it's only a matter of code investigation.
We have the paths of the code that are key contributors to the memory usage, so we look at those parts of the code to say, hey, how do we reduce the memory usage of this area, and come up with a possible fix. The other symptom is a high CPU usage, where we are primarily interested in the thread information trend. One of the first things we check is whether the high CPU usage is caused by a garbage collection thread, one or more garbage collection threads. If it is, then we take the high memory usage route, simply because an increase in memory usage is what triggers the garbage collection threads. And if no luck is yielded from that route, then we look at other potential garbage collection mechanisms. For example, if the application is using concurrent mark and sweep, then we could think of possibly moving it to G1 to get a much, much better garbage collection approach and lower the CPU usage. If it's not a garbage collection thread, then what we do is look at the threads that have been causing the high CPU usage and what their stack traces were at the point in time when they were causing the high CPU usage. We investigate that code path and we try to optimize it so that it doesn't take up as much CPU time, or it is bumped to a lower priority if it is not meant to hog the CPU. Now that that's done, let's move to the third and final symptom, which is an application crash. The symptom of this is that the application abruptly stops, and it is accompanied by a gist of the crash, a core dump, and/or a crash log. In the crash log, when we take a look, we have so much information: not only the thread that caused the crash, but all the threads that were running at the time of the crash, information on the system in which it crashed, and so on. We have provided a link that you could use in order to understand more as to how to use the crash log to triage this particular symptom. One other important thing I would like to mention over here is that high garbage collection activity often manifests itself as one of two symptoms, either a high memory usage or a high CPU usage, because only when the memory usage is increasing does garbage collection kick in and try to clean. If for some reason it's not able to clean, because of CPU contention or because all the memory in use is referenced, then it'll try to be a bit more aggressive: it will spawn more threads and try to garbage collect more frequently. That will cause a high garbage collection activity, which will show symptoms of high memory usage or high CPU usage. That's the reason why it doesn't have a separate row over here, as it falls into one of those two rows.
Now that we have summarized the triage process for the previously discussed performance issues, let's talk a bit about the current work that we have done so far, followed by the future work that we have planned. Right now we have summarized all the performance issues that we have encountered, identified root causes for them, and also documented the root causing process. Once that was done, we went ahead and created an open source library that lets us get all the information, or metrics, or diagnostic information we need to make that root causing process much easier, and that library is called JVide. The third milestone, which we'll be embarking on, is fine tuning the JVM and looking into actually fixing the performance issues.
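For reference, a few standard HotSpot options can surface the kind of extra diagnostics alluded to here; this is an editor's selection under stated assumptions, not the speakers' specific configuration.

    // A few standard HotSpot options that add diagnostic output of the kind discussed above:
    //   -Xlog:gc*                                 detailed GC logging (JDK 9+; -verbose:gc on older JDKs)
    //   -XX:+HeapDumpOnOutOfMemoryError           write an .hprof file automatically on OutOfMemoryError
    //   -XX:HeapDumpPath=/var/dumps               where that heap dump should be written
    //   -XX:ErrorFile=/var/logs/hs_err_%p.log     where the crash log (error report) should be written
    //   -XX:StartFlightRecording=duration=60s,filename=rec.jfr   start a JFR recording at launch (JDK 11+)
    public class DiagnosticFlagsDemo {
        public static void main(String[] args) {
            System.out.println("Launch with the flags above to collect additional diagnostics.");
        }
    }
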
Not only fine tuning, to be very specific, but also adding JVM options so that we get additional debug logging or more information that will make milestone one much easier and quicker. After this, we also want to investigate other reasons for the application crash. Some people may debate that application crashes aren't a performance issue, but it can also be counterargued that a crash causes a lapse in availability when the application goes down, so in a way you are degrading the performance of your application cluster. That's the reason why we have included that as a scenario for this presentation. So investigating other reasons for a crash, such as an out of memory error, or a GC overhead limit exceeded error, or the variety of other reasons for which an application could crash, would be our next milestone. Last but not least, we would want to make our open source library a bit smarter. The sheer amount of diagnostic information that this library can generate can easily surpass several hundred megabytes in a few minutes, so continuously transmitting that information over a network or storing it on disk can become very expensive very quickly. We want to make the library a bit smarter so that it only generates diagnostic information when it's needed. For example, when the CPU usage of certain threads exceeds a threshold, say 70%, then we would want to store all the information on those threads in a log or a file, or transmit it over a network. Or if the memory usage exceeds a particular threshold, then we want to look at the memory usage trend, or the garbage collection trend, or even generate a heap dump. By making it a bit smarter in this aspect, we avoid collecting quite a bit of unnecessary information, as some of the detailed information is useful only when we want to triage performance issues.
Now that that's done, we are in the questions section of our presentation. Since this is a recorded video, if you have any questions or feedback, we would greatly appreciate you reaching out to us using the email addresses provided in the description below. We have also provided links in this presentation to the wiki pages, which talk in much more detail about the topics that we have mentioned in this presentation, along with the open source library JVide, which we have created with the intention of collecting diagnostic data. So now that's done, we are officially finished with the presentation. Thank you so much for your time and patience. Now that you've seen the presentation, we would love to hear from you if you have any questions or queries. Thank you.
...

Ranjan Mohan

Open Source & Java Enthusiast

Ranjan Mohan's LinkedIn account

Silvia Siu Luo

Computer Science Student @ Carleton University

Silvia Siu Luo's LinkedIn account


