Conf42 Enterprise Software 2021 - Online

Benchmarking the Warm-Up Performance of HotSpot VM, GraalVM and OpenJ9 -- A Learner's Journey

Abstract

Are you new to the JVM? Did you just run your Java programs but never cared what the JVM does with your code under the hood? Want to learn about JVM internals, how to (not) write a Java benchmark test or are you simply curious about JVM performance? Then this talk is for you!

Summary

  • Frank Kriegl talks about benchmarking the warm-up performance of HotSpot VM, GraalVM and OpenJ9. The goal of the talk is to motivate you to start with Java microbenchmarking on your own.
  • Warm-up is defined as the number of iterations the JVM needs to increase the speed of method execution between the first and the nth invocation. There is a difference between JIT compilers and AOT compilers and the code they produce, and the talk shows what it looks like.
  • The Java heap is separated into several parts; the young generation consists of Eden space and two survivor spaces. There are seven garbage collection algorithms as of Java 11. Garbage collection can have unwanted side effects in performance testing. Luckily, there's JDK Enhancement Proposal (JEP) 318: Epsilon, a no-op garbage collector.
  • Next, it's about JIT versus AOT compilation. Java code is pre-compiled to Java bytecode, which will then run on any JVM. An ahead-of-time compiler directly compiles all the bytecode to machine code when the JVM starts up. For the JIT compiler, there have been five levels of compilation since JDK 8.
  • Writing a good benchmark is not easy; there are conceptual flaws when designing a microbenchmark. But there is JMH to the rescue, so these flaws can mostly be avoided. I tried several different approaches to write good benchmark tests.
  • The JMH runs use a configuration of 5 GB of heap and the HeapDumpOnOutOfMemoryError flag. The number of iterations is 21,000 per fork, with 20 forks, and the timeout is set to 360 minutes.
  • Both GraalVM and HotSpot show a sudden decline at around 100 executions, where the execution time significantly drops. The scatter shade tightly follows the median curve and narrows over time. GraalVM also shows an unexplained bump; if you have any guess what this bump is about, please let me know.
  • There are spikes in the execution time of single iterations on OpenJ9. I thought these spikes might be caused by the missing pre-touch option, which is not available in OpenJ9 for Linux. But in my test setup, fetching actual memory from the operating system had only a minor or even no effect on the measurement series.
  • OpenJ9, HotSpot and GraalVM are compared, all in JIT compiler mode. One interesting thing to observe is the number of iterations it takes to speed up the method execution from 5 milliseconds to 0.5 milliseconds. Why is the warm-up as it is, and what's causing it?
  • The benchmark measurements were done with JDK version 11 for all the mentioned JVMs. OpenJ9's benefit is definitely its AOT mode. Many other JVMs also deserve to be benchmarked on warm-up performance.
  • It was the first talk I ever held in public. You can find many more details in my blog post on that topic. If you have any questions about my measurements, my talk, my learnings, or want to discuss something, just send me an email.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, and thanks for tuning in to my talk at this year's Conf42 Java conference. Today I will talk about benchmarking the warm-up performance of HotSpot VM, GraalVM and OpenJ9. Let's get started. So, my name is Frank Kriegl. I live in Heidelberg, Germany, and I've been working as a Java developer for almost four years. Currently I'm also finishing my Master of Science in parallel to my regular employment, which is also where this talk originates from. Recently I had to write a research paper, and I was simply curious to learn more about JVMs in general and saw this as a chance to deepen my knowledge. That's why this talk's subtitle is also "a learner's journey". So here's today's agenda. First, I will start with a brief introduction. Then I will set the baseline with some information bits about JVM internals. Next I will talk about my learnings from trying to compare the warm-up performance of the three different JVMs, so I will spend some words on the pitfalls of creating good or bad benchmark tests. Next I will describe my test setup for benchmarking the warm-up performance, and also mention some configurations I made for the JVMs under test. Finally, I'd like to present my test results and for sure also give some interpretation of them. In the end, I will draw a short conclusion. The goal of my talk is actually to motivate you to start with Java microbenchmarking on your own. So I hope that at the end of this presentation you will have a basic understanding of JVMs and are ready to get started with your own Java or JVM benchmark measurements. So let's talk about the what and why. What is warm-up actually? Usually, warm-up is defined as the number of iterations the JVM needs to increase the speed of method execution between the first and the nth invocation by applying JIT compiler optimizations to the bytecode. Okay, that's the definition, and now let me show you what it actually means. In this chart you see the time per operation on the vertical axis and the number of iterations on the horizontal axis, and warm-up is the part here: for the first iteration it takes the JVM quite long to complete one operation, one method execution, and over time it's getting faster. After 200 iterations it's much faster than in the first iteration, and this decline in execution time is called warm-up; even a small hand-rolled timing loop like the sketch below makes it visible. Next, I'd like to answer the question why I would like to compare this. Mainly out of curiosity, to be honest. But there are for sure also some actual reasons. I was searching on the Internet and only found little research in this area. The most interesting article I found was "Don't get caught in the cold, warm up your JVM", which is about HotTub, a new JVM implementation that reuses pre-warmed JVMs to avoid the warm-up overhead. And so I thought, okay, why not do my own research? I just wanted to see how the JVMs under test differ in method warm-up speed, and eventually whether there is a difference at all. Well, there is a difference between JIT compilers and AOT compilers and the code they produce, and I will show you what it looks like.
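To make that per-iteration view concrete, here is a minimal sketch of such a hand-rolled timing loop, assuming a simple CPU-bound stand-in workload (the talk itself uses a Sudoku solver and JMH instead):

    // WarmupProbe.java -- a minimal sketch, assuming a CPU-bound stand-in workload.
    // Each iteration is timed individually so the warm-up decline becomes visible.
    public class WarmupProbe {

        // Hypothetical stand-in workload; the talk uses a Sudoku solver instead.
        static long workload() {
            long sum = 0;
            for (int i = 0; i < 100_000; i++) {
                sum += (i * 31) % 7;
            }
            return sum;
        }

        public static void main(String[] args) {
            for (int iteration = 1; iteration <= 200; iteration++) {
                long start = System.nanoTime();
                long result = workload();
                long elapsed = System.nanoTime() - start;
                // Printing the result keeps it alive, so the JIT cannot remove the call.
                System.out.println(iteration + ";" + elapsed + ";" + result);
            }
        }
    }
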
Okay, next it's about setting the baseline. On this slide you see a picture of the Java heap structure. The example is for the HotSpot VM, and the Java heap is actually separated into several parts. You see here there is the young generation and the old generation, and the young generation itself consists of Eden space and two survivor spaces. So how does memory allocation happen in the JVM? New memory is always allocated in the Eden space, and just before the Eden space fills up, a minor garbage collection occurs and the objects are transferred into the survivor spaces. One survivor space is always free and the other one is occupied, and if one survivor space is running full, minor garbage collection will just swap these spaces, clear out unused or unreferenced objects, and the survivors will stay there. If there are long-living objects that survive several minor garbage collection cycles, it can actually happen that a major garbage collection occurs and these objects are then transferred into the old generation space, also called tenured space. Next, and I already briefly touched the topic: garbage collection. Meanwhile there exist, I think, seven garbage collection algorithms, at least as of Java 11. The thing is that garbage collection can have unwanted side effects in performance testing, so you better try to eliminate that. Luckily, there's JDK Enhancement Proposal (JEP) 318, which is about Epsilon, a no-op garbage collector. I linked it here; you can read the details if you like. That's actually a garbage collection algorithm which will always allocate memory but never free it up again (a tiny demonstration follows below). Next, it's about JIT versus AOT compilation. As you might know, Java code is pre-compiled to Java bytecode, which will then run on any JVM. The JIT (just-in-time) compiler first does some profiling on the bytecode, and then it will apply optimizations like method inlining, branch prediction, loop unrolling, dead code elimination, and many more. It will also only compile parts of the bytecode to machine code, because it has to decide which parts of the code need to be optimized. On the other hand, there's the ahead-of-time (AOT) compiler, which will just directly compile all the bytecode to machine code when the JVM starts up. For the JIT compiler, since JDK 8 there are actually five levels of compilation, at least that's what applies for the HotSpot VM. The first level is just about interpreting bytecode: it's level 0, and the JVM will not compile anything at all, but just run as an interpreter. After a few iterations, the JIT compiler will make use of its first compiler, the C1 compiler, also called the client compiler, and produce some simple C1-compiled code. These are the level 1, 2 and 3 compilations, which are all done by this C1 compiler. After about 10,000 invocations, code will eventually become marked as hot, and then it will become subject to level 4 compilation, which is done by the C2 compiler. This one is called the server compiler, and it will do much better optimization of your Java bytecode.
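As a hedged illustration of Epsilon's no-op behaviour, the following sketch allocates short-lived garbage until the fixed heap is exhausted (the flags are the real HotSpot ones; the heap size and class name are just for this example):

    // EpsilonDemo.java -- a sketch of the no-op behaviour described in JEP 318.
    // Run with: java -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC -Xmx256m EpsilonDemo
    // Epsilon allocates but never reclaims, so even short-lived garbage
    // eventually exhausts the fixed heap and ends the run with OutOfMemoryError.
    public class EpsilonDemo {
        public static void main(String[] args) {
            long allocated = 0;
            while (true) {
                byte[] garbage = new byte[1_000_000]; // becomes unreachable right away
                allocated += garbage.length;
                if (allocated % 100_000_000 == 0) {
                    System.out.println("allocated " + allocated + " bytes so far");
                }
            }
        }
    }
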
Okay, next we continue with Java microbenchmarking and my lessons learned. I was not sure in the beginning how to start my learner's journey. I searched on the Internet and found that there are existing benchmark suites like SPECjvm2008, which is from 2008, and the DaCapo benchmark suite, which was first released in 2009. The last maintenance release of the DaCapo benchmark suite is almost two years old, eight months before the release of JDK 11, so for me they felt quite outdated and I didn't want to use them for that reason. Also, not all benchmark tests were working with the targeted Java version 11, so I was actually trying to use them, but failed. And finally, the output format and the measurement units did not suit me and were not in a format I could use for further analysis of the collected data. So simply using some out-of-the-box benchmark suites did not work for me, and I came up with the idea of writing my own benchmark. You have to know, writing a good benchmark is not easy. There are two fault categories. On the one hand, there are conceptual flaws when designing a microbenchmark, of which I will show you an example in a minute, and on the other hand, there are contextual effects when running it. Here is an example of a conceptual flaw. On the left-hand side we have the method createArrayUpTo and the method deadCodeElimination, which invokes the first method to create an array with a length of 21,000, containing the values from 1 to 21,000. The array is then processed and all the values are accumulated into the result variable, but this variable is actually never returned, so the calculation result is not used at all. If we then execute this for, like, 18,000 times and invoke System.currentTimeMillis() before and after the method invocation, we can calculate the duration it takes to execute the deadCodeElimination method by subtracting the start value from the end value. But here's the issue: when running the code, the JVM will first just interpret your method, eventually collect some profiling data on it, and figure out that the result of the method is actually never used, because it's never returned. So at some point in time the JVM will just eliminate this invocation, and you'll see in your output that the execution time suddenly drops to almost zero milliseconds, because what you measure is just the time between invoking System.currentTimeMillis() the first time and the second time (a reconstruction of this flawed example follows below). But there is JMH to the rescue, so conceptual flaws can mostly be avoided by using frameworks like JMH. JMH is the Java Microbenchmark Harness, a tool that was created with the intention of helping developers avoid common pitfalls when writing and executing Java benchmarks. So it's actually quite handy to use, but you still have to be careful what you're doing. Here you can see one of my first tries where I was using JMH to write my own benchmark. I actually asked for some feedback on Twitter and got none, but that didn't stop me from continuing my learning journey. There are two things I'd like to point out here. One is that JMH provides you with Blackhole objects, which you can use to consume objects in your benchmark. This will make sure that the code is not eliminated by the JVM. You could also just return the value or print it to System.out, which would have the same effect, but for that there are Blackholes. The second is that you should also consider warm-up: there's the @Fork annotation where you can specify the number of forks you want to execute, so how often the benchmark test should be executed in standalone JVMs, and also the number of warm-up iterations, to actually exclude warm-up when benchmarking your code. But in my case I wanted to measure warm-up, so I set this to zero to get some observations. I tried several different approaches to write some good benchmark tests. I tried to reuse existing benchmark tests from the DaCapo or SPECjvm suite, but that all didn't work out for me. In the end I ended up with a Sudoku backtracking algorithm, which turned out to work quite well for my case. You can find that code on GitHub. I will not go into details there, but this is the code I used to benchmark the JVM warm-up performance.
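For illustration, a hedged reconstruction of the flawed example described above might look like this (the method names are taken from the spoken description, so treat them as approximations of the original slide code):

    // FlawedBenchmark.java -- hedged reconstruction of the flawed example above.
    // The accumulated result is never used, so the JIT may eventually remove the
    // whole computation (dead code elimination) and the measured time collapses.
    public class FlawedBenchmark {

        static int[] createArrayUpTo(int n) {
            int[] values = new int[n];
            for (int i = 0; i < n; i++) {
                values[i] = i + 1;
            }
            return values;
        }

        static void deadCodeElimination() {
            int[] values = createArrayUpTo(21_000);
            long result = 0;
            for (int value : values) {
                result += value; // never returned, never printed -- dead code
            }
        }

        public static void main(String[] args) {
            for (int i = 0; i < 18_000; i++) {
                long start = System.currentTimeMillis();
                deadCodeElimination();
                long duration = System.currentTimeMillis() - start;
                System.out.println(i + ";" + duration); // drops to ~0 once the call is eliminated
            }
        }
    }

And a minimal JMH-style counterpart, assuming the same workload; Blackhole.consume keeps the result alive, and @Fork and @Warmup mirror the settings discussed above (zero warm-up iterations, because warm-up itself is the measurement target):

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.Fork;
    import org.openjdk.jmh.annotations.Warmup;
    import org.openjdk.jmh.infra.Blackhole;

    public class ArraySumBenchmark {

        @Benchmark
        @Fork(20)               // 20 independent JVMs, matching the setup in this talk
        @Warmup(iterations = 0) // no warm-up phase: warm-up itself is what we measure
        public void sumArray(Blackhole blackhole) {
            int[] values = new int[21_000];
            long result = 0;
            for (int i = 0; i < values.length; i++) {
                values[i] = i + 1;
                result += values[i];
            }
            blackhole.consume(result); // a consumed result cannot be eliminated as dead code
        }
    }
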
So here's my test environment. I did all the benchmarking on a virtual machine, which is not optimal, but I tried to compensate for that with multiple test runs; you'll see that in a minute. The operating system is Ubuntu version 20, 64-bit, and I had eight virtual CPU cores based on an AMD Opteron processor. There were 8 GB of RAM available, no swap configured, and storage of an 8 GB hard disk. For my test setup, I decided to execute my benchmark tests with 21,000 iterations, to also see some effect when a method gets marked as hot. Every run consisted of 20 forks, which means that JMH will spawn 20 independent JVMs, so as not to accidentally make use of already compiled code. Then I executed twelve runs at different days and daytimes, to eliminate the contextual effects I would face in a virtual environment. When you multiply all these numbers, 21,000 iterations in 20 forks and twelve runs, you get about 5 million Sudokus solved per JVM (21,000 x 20 x 12 = 5,040,000). Always the same Sudoku, though. The JVM parameters I did not touch much, because I wanted to take the approach of simulating a daily user who would just throw code at the JVM and run it, with two exceptions: one is that I was using the no-operation garbage collector Epsilon, or respective workarounds for the other JVMs, and the other is the pre-touch memory option, which I will explain in a minute. So here are my JVMs under test. I decided for the tried and trusted HotSpot VM, where I used an OpenJDK 64-bit build from AdoptOpenJDK. As you can see, I also configured an alias for every JVM, which I could use later to shorten the amount of text on my slides. Secondly, I went for GraalVM, which is a polyglot VM. I used the Community Edition version 20 for my benchmark testing. And last but not least, OpenJ9 as an enterprise JVM, whose website actually promises better performance than HotSpot VM; we'll talk about that later. With this test setup, I started my measurements. So let's take a look at the runtime flags which I used to execute my benchmark. This one is for HotSpot VM; let's go through the lines step by step. Here I specify the benchmark target, which is my backtracking algorithm; this is just some syntax given by JMH. In the next line I provide some JVM arguments that JMH will use for every fork it spawns to execute the benchmark. I used a configuration of 5 GB of heap and also provided the HeapDumpOnOutOfMemoryError flag to just show me if my JVM crashes. Next you see some -XX flags like UnlockExperimentalVMOptions, which I need to make use of the Epsilon garbage collector, mentioned earlier, to avoid garbage collection interrupting my measurements. Then there's also the AlwaysPreTouch option, which will claim physical memory from the operating system right at the beginning rather than on the fly; this also eliminates some interference from the JVM when it finds out it needs more memory. The next flag just tells JMH where to store the measurement output and in which format; it can output things in JSON format, among others. In the next line I specify the number of iterations, which is 21,000 per fork; I run 20 forks, and the timeout is set to 360 minutes, which is very high, but I just didn't want to let JMH time out and abort my measurements. And in the last line I just collect the output of my program into a log file. An equivalent configuration via the JMH Java API is sketched below.
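For readers who prefer the JMH Java API over the command line, a hedged equivalent of this configuration could look like the following (the benchmark class name, output path, and exact heap-sizing flags are assumptions):

    import org.openjdk.jmh.results.format.ResultFormatType;
    import org.openjdk.jmh.runner.Runner;
    import org.openjdk.jmh.runner.RunnerException;
    import org.openjdk.jmh.runner.options.Options;
    import org.openjdk.jmh.runner.options.OptionsBuilder;
    import org.openjdk.jmh.runner.options.TimeValue;

    public class WarmupBenchmarkRunner {
        public static void main(String[] args) throws RunnerException {
            Options options = new OptionsBuilder()
                    .include("SudokuBacktrackingBenchmark") // placeholder benchmark class name
                    .jvmArgs("-Xms5g", "-Xmx5g",            // 5 GB heap; exact sizing flags assumed
                             "-XX:+HeapDumpOnOutOfMemoryError",
                             "-XX:+UnlockExperimentalVMOptions",
                             "-XX:+UseEpsilonGC",           // JEP 318 no-op collector
                             "-XX:+AlwaysPreTouch")         // claim physical memory up front
                    .warmupIterations(0)                    // warm-up is the measurement target
                    .measurementIterations(21_000)
                    .forks(20)
                    .timeout(TimeValue.minutes(360))
                    .resultFormat(ResultFormatType.JSON)
                    .result("hotspot-warmup.json")          // placeholder output path
                    .build();
            new Runner(options).run();
        }
    }
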
The runtime flags for GraalVM look quite similar, with one small exception: I did not find any no-operation garbage collection algorithm for GraalVM in this version, so I made use of a workaround. I set the max-new-size parameter to a number which is higher than the memory actually available for the heap, which makes the JVM create a huge young generation, but no old generation space in the heap. So what happens here is that no garbage collection can occur, or rather, before it would occur, the JVM would run out of memory. It's therefore important to have enough memory available for your benchmark tests. OpenJ9 also has a slight difference here: I unfortunately found that the Linux version of OpenJ9 does not offer a pre-touch option, so this one will claim memory on the fly if it needs more from the operating system. Okay, that was the setup, and now I would already like to share some test results. Here you see the overall chart which I generated from the data collected when benchmarking HotSpot VM. On the vertical axis is, again, the time per operation in nanoseconds, and on the horizontal axis the number of iterations, up to 21,000. If we now zoom in a little, you can see that there is a light red colored background behind the warm-up graph. I call this light colored area the scatter shade, because it represents the scattering of the different forks' individual data points at any given time slice: it covers the interquartile range, Q1 to Q3, and the red line is the median value of the execution time (a small sketch of how these statistics can be computed follows at the end of this passage). On this slide, I again zoomed in to the first thousand executions, and here you can see that there's already a significant drop in the execution time at the very beginning. There are several things we can observe here. First of all, we see that the scatter shade tightly follows the median curve and also narrows over time, which shows that the execution time is generally declining. Next, the median curve also tends to sit at the lower bound of the interquartile range of the scatter shade, which allows the conclusion that the data points between the median and the Q3 quartile are more spread out than those in the range from Q1 to the median. That makes absolute sense, because there is a physical lower bound on execution time. This behavior of the scatter shade can also be observed in the charts for the other JVMs, GraalVM and OpenJ9. Here we have the chart for GraalVM for the first thousand benchmark iterations. Both GraalVM and HotSpot actually have this sudden decline at around 100 executions, where the execution time significantly drops. This significant decline shows not only in the median curve, but also in the scatter shade. We also see that at this point the Q3 boundary, the upper part of the scatter shade, eventually falls below the Q1 boundary of previous data points. I tried to visualize that with this red bar: you see that here the upper bound of the scatter shade is below the earlier lower bound. Here is another view of the GraalVM warm-up chart: between iterations 6,400 and 6,800, GraalVM actually shows this bump. I did not dig into details here, because I didn't have a good profiler at hand. However, I think it would definitely be interesting to investigate this anomaly. If you have any guess what this bump is about, please let me know.
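As promised above, here is a small, hedged sketch of how the median and quartile band for one iteration index could be computed across forks (the data layout is an assumption, not the talk's actual analysis code):

    import java.util.Arrays;

    // ScatterShade.java -- a sketch of the statistics behind the scatter shade,
    // assuming the samples are laid out as samples[fork][iteration] in nanoseconds.
    public class ScatterShade {

        // Returns {Q1, median, Q3} across all forks for one iteration index.
        static double[] quartiles(long[][] samples, int iteration) {
            long[] column = new long[samples.length];
            for (int fork = 0; fork < samples.length; fork++) {
                column[fork] = samples[fork][iteration];
            }
            Arrays.sort(column);
            return new double[] {
                percentile(column, 25),
                percentile(column, 50),
                percentile(column, 75)
            };
        }

        // Simple nearest-rank percentile, good enough for plotting.
        static double percentile(long[] sorted, int p) {
            int index = (int) Math.ceil(p / 100.0 * sorted.length) - 1;
            return sorted[Math.max(0, index)];
        }
    }
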
So the blue chart is for OpenJ9; again, we look at the first thousand iterations of this benchmark. You can already see that the warm-up chart of OpenJ9 looks somewhat different from the others. First of all, there is no sudden decline at the mark of 100 iterations, but instead there are spikes in the execution time of single iterations. You see these spikes here, and also later on; they get rarer over time, but they are always present. I was thinking, okay, maybe these spikes could be caused by the missing pre-touch option, which is not available in OpenJ9 for Linux. To find out whether the spikes could be attributed to the missing pre-touch option, I would have expected to observe similar behavior in the other two JVMs when I disabled the AlwaysPreTouch option for them. I therefore made another measurement series with GraalVM and HotSpot having the AlwaysPreTouch option disabled, but the warm-up charts looked the same: there were no spikes for GraalVM or HotSpot. There were no hints supporting my suspicion, which leads to the conclusion that in my test setup, fetching actual memory from the operating system had only a minor or even no effect on the measurement series, and these spikes in the warm-up graph of OpenJ9 cannot directly be attributed to fetching memory from the operating system. Okay, so up to now we just had a look at each JVM individually; now I'd like to compare them. To get started, let's talk about the average execution times. On the left-hand side you see a histogram which includes the execution times for OpenJ9, HotSpot and GraalVM, all in JIT compiler mode. You can see that the histograms for HotSpot and GraalVM look quite similar, while OpenJ9 describes a rather different curve; however, they all have this tail to the right. The average execution time for HotSpot and GraalVM is around 0.4 milliseconds, GraalVM seems to be a little bit faster, and OpenJ9 follows tightly at almost 0.5 milliseconds. Then I also made some measurements where I enabled the AOT compiler for OpenJ9, and this one turned out to be faster than OpenJ9 in JIT mode, but still slower on average than GraalVM or HotSpot. That's also what you see on the right-hand side in the chart: the purple curve is OpenJ9 in AOT mode, and it's faster than OpenJ9 in JIT mode overall. Okay, let's dig deeper. One interesting thing to observe in the warm-up charts is the amount of time, the number of iterations, it takes to speed up the method execution from five milliseconds to 0.5 milliseconds. I'm talking in the unit of milliseconds because that's easier to pronounce, but just don't get confused by the scales here; it's still nanoseconds on the chart. The first red bar is at 5 milliseconds, the second one is at 2.5 milliseconds, and the third one is at 0.5 milliseconds. For this blue chart, which represents OpenJ9, it takes 150 iterations to gain a 90% performance improvement within the first iterations of the benchmark test. In numbers, this means that on average, each execution runs roughly 30 microseconds faster than the previous one. If we look at this KPI for HotSpot, we see that the negative slope is not as steep as for OpenJ9, and we can also prove that by calculating it: reaching the lower bound of 0.5 milliseconds from the starting point of 5 milliseconds takes around 700 executions of the benchmark method. So we can say that with every next execution, HotSpot can speed up the method execution by roughly 6.4 microseconds per operation compared to the previous one, which means that during the first few iterations where warm-up takes place, HotSpot VM warms up about 4.6 times slower than OpenJ9. The short worked calculation below reproduces these figures.
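The slope figures follow directly from the chart values quoted above; a short worked calculation (rounded):

    // SlopeCheck.java -- the worked arithmetic behind the slope figures above.
    public class SlopeCheck {
        public static void main(String[] args) {
            double dropNs = 5_000_000.0 - 500_000.0; // 5 ms down to 0.5 ms, in nanoseconds
            double openJ9PerIter = dropNs / 150;     // ~30,000 ns = 30 microseconds per iteration
            double hotspotPerIter = dropNs / 700;    // ~6,430 ns, about 6.4 microseconds per iteration
            double ratio = openJ9PerIter / hotspotPerIter; // ~4.7, i.e. HotSpot warms up ~4.6x slower
            System.out.printf("OpenJ9: %.0f ns/iter, HotSpot: %.0f ns/iter, ratio: %.1f%n",
                    openJ9PerIter, hotspotPerIter, ratio);
        }
    }
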
If we compare all three JVMs together, you will see that OpenJ9 only wins the race within the first few hundred iterations. If we compare them after around 600 iterations, we see that the blue chart is above the green and red charts of HotSpot and GraalVM, which means that in the end, OpenJ9 will be slower than its opponents; it's just warming up faster right at the beginning. I also promised to talk briefly about JIT compilers versus AOT compilers, and for that I made some measurements with OpenJ9 in JIT mode, which is the blue graph again, and in AOT mode, which is the purple graph. Here you can see the flags you need to provide to enable the AOT mode on OpenJ9, and you can easily spot that right from the beginning, the OpenJ9 AOT compiler starts at its maximum performance and always executes the code at the same speed, while the JIT compiler only catches up after a few hundred executions. So, having all these nice-looking charts is quite cool, actually, but I also wanted to know what's actually happening there. Why is the warm-up as it is, and what's causing it? For that I found JITWatch, which is a log analyser and visualizer for the HotSpot JIT compiler, and it's a really cool tool actually. You can enable the logging by providing certain runtime flags on your JVM (a hedged sketch follows below). However, you have to know that this will have a negative impact on performance, so do not do that during your benchmarking, but only afterwards to investigate. The output file, which is an XML log file, you can then just load into JITWatch, and JITWatch will show you the compilations for every single method. Here's an example of the compilation list for the method solve(int[]), which is one of the methods in my Sudoku benchmark tests. You can actually see that there are some C1 compilations happening, and also some C2 compilations and on-stack replacements, but all of them only after 20 seconds, which is actually about half of the time the benchmark test runs. This is way beyond the initial warm-up we saw, and they do not have a lot of effect on the execution time anymore. So I was wondering what else would then cause the warm-up in the initial 1,000 iterations, if all these compilations shown by JITWatch kick in much later. While JITWatch is a useful tool to visualize the actions of the JIT compiler, I encountered a discrepancy between the compilations shown by JITWatch and the JIT compiler actions logged on the terminal by providing the runtime flags PrintCompilation and PrintInlining. The terminal log output showed several inlining operations taking place already during the first iterations of the benchmark execution. These inlining operations also fit the warm-up charts, where we see a steep decline over the first few hundred or thousand iterations. So these inlining operations are probably the main driver for the fast decline in the warm-up graphs we've just seen.
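For reference, a hedged sketch of how these logging flags can be tried on a tiny hot loop; the flag names are standard HotSpot diagnostics, while the class and log file names are placeholders for this example:

    // JitLogDemo.java -- a sketch for trying the JIT logging flags on a tiny hot loop.
    //
    // XML log that JITWatch can load:
    //   java -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation \
    //        -XX:LogFile=compilation.xml JitLogDemo
    //
    // Direct terminal output, as discussed above:
    //   java -XX:+PrintCompilation \
    //        -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining JitLogDemo
    public class JitLogDemo {

        static int step(int x) {
            return x * 31 + 7;
        }

        public static void main(String[] args) {
            int acc = 0;
            for (int i = 0; i < 50_000; i++) {
                acc = step(acc); // gets compiled and inlined; events show up in the logs
            }
            System.out.println(acc); // keep the result alive
        }
    }
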
The difference between the XML compilation log file used by JITWatch and the compilation log output on the terminal can actually be explained by a limitation in the LogCompilation option, which means that the inlining decisions made by the C1 compiler early on are not included in the XML log file used by JITWatch. You can read the details via the link I provided to the OpenJDK wiki. Okay, now I'd like to draw a short conclusion and also make some additional remarks on my benchmark measurements. First of all, all the benchmark measurements I made were done with JDK version 11 for all the mentioned JVMs; I did not perform measurements on any other JDK version. Secondly, I conducted the benchmark measurements in October and November 2020, and meanwhile there are new versions of the JVMs, so it would be interesting to also take a look at those. And here are my final thoughts. As just said, GraalVM version 21 was recently released. It now comes with the Espresso JVM, which is a JVM fully written in Java; it's Java on Truffle, if you know what that means. Maybe I'll find the time to also do some warm-up performance benchmarking on the Espresso JVM. The second thought that comes to my mind is that OpenJ9's benefit is definitely its AOT mode: it performs better in AOT mode, at least in my measurements. But I'm asking myself, why don't they make this the default configuration, if they also advertise that they are faster than the HotSpot VM? Last but not least, I think there are many other JVMs that also deserve to be benchmarked on warm-up performance, because they become more and more important in the world of different JDK releases; for example, the Amazon Corretto JVM or Alibaba Dragonwell. All right, that was my presentation; I hope you liked it. It was the first talk I ever held in public, and if you want to check out my references or take a look at the source code, you can find many more details in my blog post about that topic, which is linked here. If you have any questions about my measurements, my talk, my learnings, or want to discuss something, just send me an email. Here's my contact information, and thank you for tuning in. See you.

Frank Kriegl

Software Developer & Java Enthusiast
