Conf42 Enterprise Software 2021 - Online

Mutation Testing with PIT

Abstract

How well tested is your system? How do you measure it? Code coverage can give you an answer, but can we trust our unit tests? Believe it or not, I used to work in the investment banking industry on a big project where a lot of unit tests had no assertions (!). And yes… the coverage was very high.

Mutation testing is a method to check the quality of your unit tests and to produce more meaningful code coverage reports. In this session I will describe the idea of mutation testing and show a live example using the PIT mutation testing framework.

Summary

  • Rafał is a Cloud Native Team Lead at Hazelcast; before that he worked at Google and CERN. Hazelcast is a distributed company in two senses: it produces distributed software, and everyone works remotely.
  • Mutation testing was proposed by Richard Lipton in 1971 but was not really used until around 2012. It is a method of testing your tests without writing more tests: introduce artificial bugs into the code and check whether the existing tests catch them.
  • For Java, there is a very good library called PIT. You can use this mutation testing tool from the command line, integrate it with Maven or Gradle, or use it as a plugin for IntelliJ. When mutants survive, the fix is to write better unit tests.
  • Richard Lipton described the idea in 1971, the first Java mutation testing framework was created around 2000, and PIT appeared in 2012. There were two reasons why mutation testing was not widely used for so long: the problem of equivalent mutants, and slow performance.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello, I'm Rafał, and my talk is titled Mutation Testing with PIT. But first, a few words about myself. I am a Cloud Native Team Lead at Hazelcast; before that I worked at Google and CERN. I'm also the author of the book Continuous Delivery with Docker and Jenkins, and from time to time I do conference speaking and trainings, but on a daily basis I'm a developer. I live in Krakow, Poland.

A few words about Hazelcast. Hazelcast is a distributed company, and it is distributed in two meanings. The first meaning is that we produce distributed software. The second meaning is that we all work remotely. So I'm from Krakow, Poland, but I'm the only one from Krakow; we have people all over the world. Our products are Hazelcast IMDG, an in-memory data grid, which is a library for distributed computation and caching; Hazelcast Jet, a library for stream processing; and Hazelcast Cloud, which is Hazelcast as a service.

So, do you know why a NASA spacecraft burned in the atmosphere of Mars in 1999? And do you know that it is somehow related to the fact that even though mutation testing was invented in 1971, it was not really used until 2012? The answers to these questions, as well as the whole idea and the implementation of mutation testing, you will find in this presentation.

But first, imagine the following scenario. Two engineers are talking. "I wrote code for the spacecraft." "How do you know that it works?" "I just know. I feel it." That is actually what happened at NASA: the code of the NASA spacecraft was not properly tested. In December 1998, NASA launched the Mars Climate Orbiter into space. The device weighed almost half a ton and was sent to Mars. It takes around half a year to go from Earth to Mars, and everything was fine until September 1999, when all of a sudden contact with the device was lost. What happened? There was a bug in the code. The ground computer calculated everything in a non-metric system: they used pounds. However, the orbiter used the metric system, and there was no conversion in between. So the speed of the orbiter was too fast, it passed too quickly through the atmosphere of Mars, and it burned. There were no people on board, so we could say, okay, shit happens, it's just money. But you know what? These NASA devices are super expensive. This one cost more than 300 million dollars. That's around 234 Americans working their whole lives to earn that money. In the case of Poland, it's even worse: you would need around 700 Polish people working all their lives to earn it. And all of that just because of a bug in the code.

So how should this conversation have looked? "I wrote code for the spacecraft." "How do you know that it works?" "I wrote unit tests." "But how do you know that your tests work?" Think about it. We write code, and then we write tests to test the code. And then we write tests to test the tests of the code, and then tests to test the tests of the tests of the code. It just doesn't make sense. So, if we are not sure that our tests are good, does testing make any sense at all?

And actually, yes, there is a method of testing your tests without writing more tests. Imagine we have the following code, the simplest possible code: a plus b. This could be the production code of, I don't know, some calculator.
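The slides themselves are not reproduced in this transcript, but the production code being discussed is essentially the following (class and method names are illustrative):

```java
// The simplest possible production code: a calculator that adds two numbers.
public class Calculator {

    public int sum(int a, int b) {
        return a + b;
    }
}
```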
So we write a unit test. How do we check that our unit test works? We could run test coverage. But what does code coverage actually check? According to common sense, or according to the guru of common sense, Uncle Bob: coverage does not prove that you have tested every line, all it proves is that you have executed every line. And that is a big difference. If you think about it, the return statement here would be perfectly covered by a test without any assertion. Just imagine a test like the one sketched below: with a test like this we have 100% coverage, but we haven't tested anything. So we need something better than code coverage.

And actually there is something better. It was invented by Richard Lipton in 1971. Richard Lipton asked a more fundamental question: why do we write tests? And he came to the conclusion that we write tests to detect bugs. Think about it for a moment: if you are sure that your code does not have any bugs, then there is no reason to write tests. So a test is good when it catches bugs. We could reverse this strategy: introduce artificial bugs and check whether our tests detect them. That's what Richard Lipton suggested in his paper. He mentioned that if you want to know whether a test suite has properly checked some code, introduce a bug.

How do we do it in practice? Let's go back to our example, the simplest example possible. We have our statement, a plus b. How can we introduce a bug here? We could reverse the statement and change it to minus. That is clearly a bug, because we wanted a sum, but we actually subtract the values. What I did here is create a mutation of the production code with an artificial bug. Now we can check whether our test suite fails on this bug. If yes, our tests are good. If no, our tests are useless.

You may ask: is that the whole idea behind mutation testing? And actually, yes, that's the whole idea of mutation testing. But before we go any further, let's set up the terminology we will use. The change that introduces this artificial bug is called a mutation operation. Code with the artificial bug is called a mutant. When a test fails on a mutant, we say it killed the mutant. However, if the tests still succeed even though we introduced the bug, even though we created a mutant, we say that the mutant survived. Coming back to our example: if our mutant, a minus b, is killed, our tests are good. If it survives, our tests are bad. So in this case, killing is good; if all the mutants are killed, our test suite is perfectly fine.

Now, the first thing you may think is: okay, but my code is much more complex than just adding two numbers. That is why there are a lot of different mutation operations, so many that we even put them into categories. The first way to mutate your production code is to make math changes: we change plus to minus, multiply to divide, minus to plus, so we change all the math operations. The second is to change boundaries: we had a less than b, so we change it to a less than or equal to b. We can also negate conditions, and there are more complex mutation operations, like removing if statements, removing method calls, modifying return statements, modifying constants, and even more.
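Putting the pieces above together in code (illustrative names, JUnit 5 assumed; not the exact code from the slides), here is an assertion-free test that still yields 100% line coverage:

```java
import org.junit.jupiter.api.Test;

public class CalculatorTest {

    // Executes every line of Calculator.sum(), so line coverage is 100%,
    // but there is no assertion, so the test can never fail and never
    // catches a bug.
    @Test
    public void testSum() {
        Calculator calculator = new Calculator();
        calculator.sum(1, 2);
    }
}
```

And here is the kind of mutant a math mutation operation produces:

```java
// A mutant: the math mutation operation replaces + with -.
// The assertion-free test above still passes, so this mutant survives.
public class Calculator {

    public int sum(int a, int b) {
        return a - b;
    }
}
```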
So now what do we do? We have our production code base, the whole code base. We use mutation operations to generate mutants. Actually, we create a lot of them, because we have a lot of mutation operations and our code is big. Now, if for every mutant at least one test fails, we have killed all the mutants, so it's all good. However, if there is at least one mutant that survived all our test scenarios, it means we didn't cover that code with any meaningful test, so it is bad.

The next thing you may ask is: do I need to change the code on my own? Luckily, no. For Java, there is a very good library called PIT. And apart from being a good library, it has a great logo, one of the best logos in my personal ranking of logos; it goes right after Docker. The logo with the bird is great. You can use this PIT mutation testing tool from the command line, you can integrate it with Maven or Gradle, or you can use it as a plugin for IntelliJ. I actually always use it with my IntelliJ: just click OK and check if my tests are good.

So let's see our example again. We have our calculator method, and we have our test, which provides 100% coverage but actually tests nothing. How does it look in practice? Let's see a short demo of how to run PIT mutation testing on this simple example. We have just one class with the calculator; this is exactly what we've seen on the slide, the simplest possible code. And we have a unit test for it, and our unit test provides 100% coverage, but tests nothing. We can run the tests; obviously they pass. They would pass even if we had no real production code at all. We can even check the test coverage, and yes, it's 100%: class, method, line, everything is perfect.

So now we need to improve this process of checking our tests, and we will use PIT. I already have the PIT plugin installed in my IntelliJ, so I can open Edit Configurations and add a configuration for a PIT mutation testing runner. When I edit it, I need to specify a few parameters: a name, which doesn't matter much, the target classes, the source directory, and the report directory, which just says where the PIT report should be generated. And that's basically it. If I click OK, I can run the mutation testing framework and see the results. Here it created two mutants automatically, and you see they were generated but not killed, which is obviously bad.

Coming back to the slides, this is the result of our PIT mutation testing: the mutant was generated because the plus was changed to minus, but it survived, meaning our tests are useless. How can we improve it? Obviously, write better unit tests. In our case, we can change it to a proper unit test: given some a and b values, when we sum them, we assert that the result is three. This looks like a valid unit test. And if we run PIT again, the output shows that the mutant was generated but it was killed. Perfect, our unit tests are great.

Now, do I need to read the console? Reading the console is not a great way of presenting test results, and luckily PIT generates a very nice HTML report. Let's see how we can use it and continue the demo with our corrected test.
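The corrected test described in the demo could look roughly like this (again an illustrative sketch, JUnit 5 assumed):

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

public class CalculatorTest {

    // Given a = 1 and b = 2, when we sum them, the result must be 3.
    // The "+ changed to -" mutant returns -1 instead, the assertion fails,
    // and PIT reports the mutant as killed.
    @Test
    public void shouldSumTwoNumbers() {
        Calculator calculator = new Calculator();

        int result = calculator.sum(1, 2);

        assertEquals(3, result);
    }
}
```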
If we go to our IntelliJ, we have corrected our unit test, and if we run our PIT test runner again, we see that the mutants were killed. Apart from that, at the end of the log we can see a link to open the report in a browser. So we can open, directly from IntelliJ, the report that was generated, and in the browser we get a very nice report with the code coverage according to mutation testing. It is a much better code coverage than the standard one. We can browse it by packages and by classes, and we can see what happened with our code. This code is perfectly well covered according to mutation testing: we can even see that two mutants were created from this line and all of them were killed. This is the output you are looking for.

Okay, but here is the next thing. When I first heard about mutation testing, I thought the idea was great, I bought it, but I thought it would never work for a bigger project, because my project is way bigger than just one calculator class. How is it possible that it works? What we actually tried at Hazelcast was to use it on one of our plugins, the Hazelcast Kubernetes plugin. It has around 5,000 lines of code, so it's still small, but a reasonable size; it has twelve classes, so not a very big code base, but already something useful.

So let's see in the demo how to run the same PIT framework on the Hazelcast Kubernetes plugin. If you go to the Hazelcast Kubernetes page, this is what we did: if you would like to run it from Gradle or Maven, you need to add the PIT dependencies. In Maven we added this with a profile, so we have a pit-test profile, and it's enough to add that part. That's everything you need to change in your project to automate it. With this in place, we can see how it works from the command line: we clone the project, open the project directory, and run the Maven command with our pit-test profile and the mutation coverage goal. This command generates the report for us. It actually takes some time for PIT to generate it, so maybe I will not show it here, but let's see how we did it.

What we did with this Kubernetes plugin is add a GitHub Action which runs this every time you push to master. We run exactly the same PIT mutation coverage command in our GitHub Action, but apart from that, we also publish the result to GitHub Pages, so we always have the current result of the mutation tests there. That is actually great: you can always go to this GitHub page and see the results. With this GitHub Action we can see how it was run; it's great because after every push to master we have PIT results. We can have a look at how it runs: it took about two and a half minutes for this GitHub Action to execute all the tests and all the mutations. You can see that some mutants were killed, and the report is automatically published to GitHub Pages. So we always have the current PIT test coverage report on GitHub Pages, and we can open it and see which things are not well covered by our tests. If we look at, for example, this class, it does not look well covered: this line looks uncovered, and this looks like a whole method that is actually not covered.
We can see what happens: if we change this to return null, no test catches it, so it's really bad, and so on and so on. So we get a code coverage report, but done much better than with the standard code coverage tool. I guess I have already convinced you to use it; it looks great.

So the next question is: why didn't NASA engineers use it in 1998? Richard Lipton described the idea in 1971. The first Java mutation testing framework was called, I believe, Jester, and it was created around 2000, and PIT appeared in 2012. So why did it take so many years for the idea to become something that you, as a developer, can actually use on your code? It turns out there were two reasons why mutation testing was not widely used for such a long time.

The first reason was the problem of equivalent mutants. Look at this code. It is good code, it makes sense. But now think about it: if we mutate the second line, the resulting code behaves exactly the same; the mutant is semantically equivalent to the original. So we create a mutant, but no test will ever kill it, and we get a false negative. That is a real problem, and PIT didn't solve it, because eliminating equivalent mutants is not easy. What PIT did instead is disable by default the mutations that are highly likely to generate equivalent mutants. You can still enable them, but we don't want to have false negatives here.
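The equivalent-mutant slide is not reproduced in the transcript, but a classic illustration of the problem looks like this (an assumed example, not the code from the talk):

```java
import java.util.List;

public class EquivalentMutantExample {

    // A mutation on the loop condition turns
    //   i < items.size()   into   i != items.size()
    // Since i starts at 0 and only ever increases by 1, both conditions stop
    // the loop at exactly the same moment, so the mutant behaves identically
    // to the original. No test can kill it, even though the tests are fine.
    public int sumAll(List<Integer> items) {
        int total = 0;
        for (int i = 0; i < items.size(); i++) { // mutant: i != items.size()
            total += items.get(i);
        }
        return total;
    }
}
```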
There was also a second reason why mutation testing was not widely used, and that is slow performance. Think about it: from our code base we create a lot of mutants, because we can change a lot of things, and we also have a lot of tests. You can already guess the problem: we would have to run every combination of test and mutant, which is extremely time consuming. What PIT did, and it was quite smart, is run the normal code coverage first to see which tests cover which part of the code. If we know which tests cover a given piece of code, we don't need to run all the combinations; it's enough to run those tests against the mutants related to that code. This speeds up the whole process very well: we don't run all tests on all mutants, we run only the tests that may kill the given mutant.

Okay, that's cool. I guess I have already convinced you to use mutation testing, but you may ask: what about my team? They are lazy, they will not use it. Let me tell you a story here. I worked in the banking industry some time ago, and you can imagine, the banking industry is like NASA: super important, big money. Our team was distributed across three locations: one in the US, one in Krakow in Poland, and the last one in Hyderabad in India. Each team was quite independent, so each team developed their own part of the system, their own modules. And we wanted to have 100% code coverage, because we wanted high quality; that was the most important thing, because in the banking industry, with all the compliance, you need to have the code well covered. The system was constructed in such a way that each module had a main class. And the main class, I call it the main class paradox, because a main class is difficult to test: you cannot create a simple unit test for it, because if you create a unit test that runs the main class, you're basically testing the whole module, not only the main class. So no matter how I tried to test it with unit tests, I could always get to 95% code coverage, but never 100%.

And when you don't know how to do something, when you see a problem, you check how other teams do it. So I checked the code from the team in India. I looked at their code and checked all their tests, and what I found was that for each test there was a given and a when, but there was no assertion. They had started to create tests with no assertions, because there was a requirement for 100% code coverage and they needed to fulfill it somehow. And I don't know what to do with people who write tests without assertions: should we laugh at them, or should we get angry? The moral of the story is that a test coverage threshold doesn't make any sense, because if you impose one on your project, people will work around it if they need to. But you still want to have some code coverage, and luckily there is a better way. Put the PIT framework into your continuous integration or continuous delivery pipeline and generate these code coverage reports, and once a week, or once every other week, have a meeting and look at the report. Not as a threshold of "you need to have 100% code coverage", but rather as an HTML report you review together, because that is how you end up with less technical debt and improve the quality of your code.

The last thing: do people really use mutation testing? Is it widely used in the industry? Actually, yes, a lot of companies use it. It is used at CERN, it was used in the Norwegian voting system, and more and more companies are introducing mutation testing into their pipelines, because it simply improves code quality and is a better measurement than plain code coverage. So test your tests, because it gives you freedom: you can refactor the tests and the code more easily, and you can trust them. Otherwise, some shit can happen: you can lose a lot of money, or feel shame, or face laughter or anger. Thank you for listening to this presentation.

Rafał Leszko

Cloud Native Team Lead @ Hazelcast
