Application-Level Chaos Engineering in JVM

Abstract

In this talk, I will introduce our recent research work on chaos engineering, with a focus on application-level chaos engineering in the JVM.

For example, ChaosMachine provides unique and actionable analysis on exception-handling capabilities in production, at the level of try-catch blocks. TripleAgent combines monitoring, perturbation, and failure-obliviousness for automated resilience improvement, at the level of methods.

Summary

My topic today is application-level chaos engineering in the JVM. Research work cited in the talk shows that almost 90% of failures or bugs are somehow related to exception-handling mechanisms. I'm going to share some of our recent research work with you based on our GitHub repo.
We could use chaos engineering to inject exceptions to trigger your system and see the abnormal behavior. But we could also combine self-healing techniques and evaluate them with chaos engineering. The core idea of failure-oblivious computing is that some failures can be caught and ignored before they reach the default handler, and some of these interceptions could provide more resilient behavior.
Of about 1,000 popular Java projects mined from GitHub, almost 600 really use Docker in their source code base. POPS is automated observability for Dockerized Java applications. We integrate TripleAgent and other monitoring tools into the base image. Developers can then declare a new base image that is augmented with these observability and fault injection features.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Okay, nice. Let's start from this page. My topic today is application-level chaos engineering in the JVM. I have already introduced myself, so let's continue to the second page.

Several years ago, when people started to discuss chaos engineering and the concept was first brought to developers, they began to explore the space of chaos engineering. It was not only about randomly shutting down instances; we could also explore different perturbation models. For example, we could kill a whole region of the cluster, inject exceptions into a function or a method, do time-travel perturbations, or max out the CPU cores of an Elasticsearch cluster, et cetera. So we have different levels of chaos engineering: the infrastructure level, the network level, and also the application level. At each level, we can learn different information from the perturbations and the abnormal behaviors they cause.

We focus on application-level chaos engineering because we are quite interested in observability at the application level. If a perturbation is directly related to the logic and the exceptions in your source code, then maybe we can precisely control these exceptions and failures to see how the system behaves.

For the sake of open science, we also put our work on GitHub: our papers, source code, and data sets from the research experiments. If you are interested in application-level chaos engineering, you can take a look at this GitHub repo. Because our university is called the Royal Institute of Technology, we call it royal-chaos.

Today I'm going to share some of our recent research work with you based on this GitHub repo. Basically, we have three cases here. The first one is called ChaosMachine. It's a chaos engineering system for live analysis and falsification of exception handling in the JVM.
This paper was accepted this year by one of the top journals, IEEE Transactions on Software Engineering. I also put an arXiv link here, which is free to access, for your convenience.

First, I would like to introduce some background. Some research work shows that almost 90% of failures or bugs are somehow related to exception-handling mechanisms. For example, if I'm writing a program in Java, I have to design lots of try-catch blocks. Lots of exceptions can happen in production, and we need to design recovery methods in the catch blocks. But at least for me, I often write bad catch blocks. Even though I catch the exception, the catch block doesn't provide enough recovery, or it's not resilient enough for my system to recover from the exception. That's why we would like to evaluate these try-catch blocks, to evaluate the exception-handling methods in the system.

Previously, lots of this work was done in the testing phase. For example, one testing method is called short-circuit testing. It injects an exception at the beginning of a try block, so the whole try block is made invalid and execution goes into the catch block. Then it runs the test suite again to see how the program behaves and whether it still passes the test cases. So we call it short-circuit testing.

With ChaosMachine, we would like to apply this fault model in a production-like environment, and we do this with bytecode instrumentation. At runtime, we instrument the bytecode and throw an exception at the beginning of the try block. That's how we trigger the catch block. In some scenarios, we design two different paths for the functionality or the handling logic, which gives you two different ways to fulfill the user's request. One is the normal way, in the try block.
If something bad happens, then we have another way, in the catch block, to still fulfill the user's request. That's something we would like to verify. I also like the word verification here, but in the title we call it falsification, because if we trigger an exception and your system behaves abnormally, that's good: we learn more from this abnormal behavior, rather than just building confidence in the system's exception-handling mechanisms.

This is the architecture of ChaosMachine. In a production-like environment, we don't have access to the source code anymore, but we can use a nice feature provided by the JVM, called the Java agent, to instrument the bytecode. We attach a monitoring sidecar and a perturbation injector as agents to different services or instances in the JVM. Every perturbation injector and sidecar is controlled by the chaos controller, which is responsible for conducting the chaos experiments. The input to ChaosMachine is the arbitrary software under study and the hypotheses, which I will introduce next.

A hypothesis is one of the most important concepts in chaos engineering: we monitor the steady state of the system, set up a set of hypotheses, and then run chaos experiments to verify or falsify them. For the try-catch handling mechanisms in Java, in the JVM, we designed four different hypotheses.

The first one is the resilience hypothesis. Our monitoring sidecar can monitor performance and functionality, for example the return code of an HTTP request or even the content of the HTTP body. We then compare this monitored information with the normal behavior to see if the hypothesis still holds.

The observability hypothesis is about user-visible effects.
So when we trigger the exception, does it have some impact on the end users? For example, is the user able to notice the exception, or do they just wait longer for the response?

The third one is the debug hypothesis. This is related to good logging practice in development. If we use a try-catch block to catch an exception, we need to log enough information when the exception happens so that it's helpful for debugging. So we designed the debug hypothesis to see if we can capture any information when the exception happens.

The silence hypothesis is interesting, but it's probably bad for development. It means that when the exception happens, the system fails to provide the expected behavior, and we also can't get any useful information from the user side or from the developer side.

So what can we learn from these four hypotheses? With ChaosMachine, we can basically classify the try-catch blocks. When we run these chaos experiments, we can find some fragile ones. For example, we try to handle an exception but provide an empty catch block. Sometimes that's normal, but sometimes it means we are lacking some thought about this exception and don't provide a recovery mechanism for it. We can also learn about the logging mechanisms at the bytecode level.

For a research paper, we usually pick some projects and run experiments on them to show how the tool works, what the performance issues are, and what the future work of ChaosMachine is. We used TTorrent to evaluate the prototype we developed. TTorrent is a file-downloading client in Java that implements the BitTorrent protocol. With the help of ChaosMachine, we could detect the different try-catch blocks. Only some of them are covered by our workload, because in production we can't reach 100% coverage if we just run our simple experiments. Then we could classify these try-catch blocks based on our hypotheses.
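
As an illustration of the short-circuit fault model described in the talk, here is a minimal source-level sketch in Java. It is not ChaosMachine's implementation: ChaosMachine injects the exception through bytecode instrumentation at runtime, while this sketch simulates the same effect with an explicit flag. The class name, method names, and the `inject` parameter are all hypothetical.

```java
import java.io.IOException;

public class ShortCircuitDemo {

    // Hypothetical normal path: declares IOException like a real I/O call would.
    static String readFromPrimary() throws IOException {
        return "primary:data";
    }

    // Hypothetical recovery path used by the catch block.
    static String readFromFallback() {
        return "fallback:data";
    }

    // The 'inject' flag stands in for the perturbation injector (an assumption
    // of this sketch). When it is true, an exception is thrown at the very
    // beginning of the try block, so the rest of the try block is skipped and
    // the catch block has to fulfill the request on its own.
    static String fetch(boolean inject) {
        try {
            if (inject) {
                throw new IOException("injected at the beginning of the try block");
            }
            return readFromPrimary();
        } catch (IOException e) {
            // Logging here is what a debug-style hypothesis would look for.
            System.err.println("recovering from: " + e.getMessage());
            return readFromFallback();
        }
    }

    public static void main(String[] args) {
        System.out.println(fetch(false)); // normal execution of the try block
        System.out.println(fetch(true));  // short-circuited into the catch block
    }
}
```

In this sketch the catch block both logs the exception and returns a usable result, so an observer comparing the two runs would see the request still being fulfilled. An empty catch block in the same position would return nothing useful and log nothing, which is the kind of difference the hypotheses are meant to expose.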
That's why I'm also interested in Chaos Monkey for Spring Boot and how we design these different types of exceptions.

So this is the first idea I would like to share, called ChaosMachine. I would say it brings some thoughts about different perturbation models: we can perturb the network, and we can also perturb the application level with bytecode instrumentation.

The goal of chaos engineering is to verify your system, to build confidence, and finally to improve the resilience of the system. So we would also like to combine different techniques: chaos engineering, fault injection, and also self-healing techniques, for example failure-obliviousness. This is the second system we developed, called TripleAgent: monitoring, perturbation, and failure-obliviousness for automated resilience improvement in Java applications. I think we could also call it automated resilience improvement suggestions, because we would like to leave the decisions to developers. As a research prototype, you can't directly use it in production; I think that's even more dangerous. But with the fault injection experiments, we can provide suggestions such as: you could handle these exceptions in another way, and maybe it's better. This information can be helpful for developers.

Since I come from China, I would like to share another Chinese kung fu with you. This one is called ambidexterity, which is quite interesting. In one of the novels, a character uses two different techniques with his two hands: one hand uses a technique that is mainly for defense, and the other hand acts as the attacker. He trained himself using these two different techniques, and both of them could be improved through practice.
And this inspires us: we could use chaos engineering to inject exceptions to trigger your system and see the abnormal behavior, but we could also combine some self-healing techniques to evaluate how chaos engineering works and, with the help of chaos engineering, how your self-healing strategies work. So we combine these two technologies, and the basis for both is monitoring. That's why we call it TripleAgent: monitoring, chaos engineering, and self-healing.

Here is a simple example. In Java we have lots of try-catch blocks. We also have chains of invocations: method 2 calls method 1, and method 1 invokes method 0. In method 0 we could have some methods which declare that they throw exceptions, and then you must handle them in your source code with a try-catch block.

In this example, the exceptions EA and EB are handled in method 2, in catch blocks written by developers. When we do the bytecode instrumentation and throw an exception on purpose in method 0, we can let the exception propagate to the default catch handler. But we can also intercept the exception in the middle of the chain: if I intercept the exception in method 1, and catch and ignore it there, then we have two different behaviors and we can compare them. Which one is better? This is the core idea of failure-oblivious computing. It's possible to catch and ignore some failures before they reach the default handler, and some of these interceptions could provide more resilient behavior. That's something we would like to confirm, to verify, and to compare.

We also did the experiments with TTorrent, and we found three different categories of perturbation points, that is, methods which declare that they throw exceptions. We have lots of fragile perturbation points.
Fragile means that if we inject only one exception, it makes your system crash or run abnormally forever. We also have some sensitive perturbation points: if we inject only one exception at such a point, everything is fine, but if we keep injecting exceptions, your system stalls or crashes. And by default we also have some immunized points. That means the catch block handles these exceptions pretty well: no matter how many exceptions we inject, TTorrent still behaves well enough to fulfill the user's request. Then we ran the failure-oblivious computing experiments to intercept exceptions and compare the behavior again. That gives the figures on the right, which contain improvement suggestions for developers.

So this is the second idea I would like to share. The overhead of TripleAgent is quite acceptable, because we only instrument the bytecode to inject exceptions, and that happens only in methods which declare that they throw exceptions. The only thing I could see is CPU time: sometimes it costs more computation resources.

Okay, and the final one. I think I still have time, so I can show you some simple commands first. This is a folder with a small demo of TripleAgent. Usually, when we run Java applications, we can use an option such as -agentpath to define which agents we would like to attach. I think everyone is familiar with this part: we just run the TTorrent application as usual, with java -jar and the JAR file, and we try to download a CentOS ISO image. Then we attach the monitoring agent and the fault injection agent with some parameters. For example, when the mode is throw exceptions, the agent gets a list of the possible positions based on the methods' throws keywords. We can then indicate the location of the perturbations. The default mode is coverage.
So this means we just output some information without really injecting exceptions. We can change the mode to throw exceptions if we would like to conduct experiments. But as you can see, there are lots of parameters, and it still feels tedious and complex to do this experiment. That's why we also came up with some ideas about pipelines and automation tools.

POPS is automated observability for Dockerized Java applications. Java developers now also use Docker a lot and wrap their applications in Docker images. So first of all, we conducted an empirical study of how Java applications use Docker on GitHub. We mined about 1,000 Java GitHub projects based on their popularity, and we found that almost 600 of them really use Docker in their source code base. We also analyzed their Dockerfiles to see which base images are the most popular. Java 8 and OpenJDK 8 [unclear] rank first and second, and we can also see Ubuntu in the list. So basically, some developers install the JDK by themselves, and some just use Java 8 or OpenJDK 8, which have the JDK installed already.

As a basic workflow, developers usually declare a base image, say Java 8, then add more commands to package their own application and build an image that they can publish on Docker Hub. With POPS, we can integrate some features into the base image. We integrate TripleAgent and other monitoring tools into the base image and provide this base image to developers. Developers then just replace the FROM line to declare a new base image, which is augmented with these observability and fault injection features, and build a new Docker image for their application.
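
The base image replacement described above amounts to changing the FROM line of a Dockerfile and leaving everything else untouched. The following Java sketch shows that rewrite on an in-memory Dockerfile. It is only an illustration, not the POPS tool itself: the class name, the method name, and the augmented image name `example/openjdk-pops:8` are hypothetical placeholders.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BaseImageSwap {

    // Returns a copy of the Dockerfile lines in which every line of the form
    // "FROM <originalImage>" is replaced by "FROM <augmentedImage>".
    // Simplification of this sketch: lines such as "FROM image AS stage"
    // are not matched and are copied unchanged.
    static List<String> swapBaseImage(List<String> dockerfileLines,
                                      String originalImage,
                                      String augmentedImage) {
        List<String> result = new ArrayList<>();
        for (String line : dockerfileLines) {
            String trimmed = line.trim();
            // Dockerfile instructions are case-insensitive, so compare ignoring case.
            boolean isFrom = trimmed.regionMatches(true, 0, "FROM ", 0, 5);
            if (isFrom && trimmed.substring(5).trim().equals(originalImage)) {
                result.add("FROM " + augmentedImage);
            } else {
                result.add(line);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> dockerfile = Arrays.asList(
                "FROM openjdk:8",
                "COPY app.jar /app.jar",
                "CMD [\"java\", \"-jar\", \"/app.jar\"]");
        // "example/openjdk-pops:8" is a made-up name for an augmented base image.
        for (String line : swapBaseImage(dockerfile, "openjdk:8", "example/openjdk-pops:8")) {
            System.out.println(line);
        }
    }
}
```

Only the first line of the example Dockerfile changes, which matches the point made in the talk: the application's own build steps stay as they are, and the extra monitoring and fault injection tooling comes in through the base image.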
In this way it's also convenient that we can have two different types of containers in a production-like environment: one just runs normally, and in the other one I can turn on more monitoring tools and fault injection experiments. Then we can compare the behavior of these two containers.

I can also show you the POPS system here. This is quite a simple Dockerized Java program which has TTorrent in it. I can build a Docker image here and run the original image to download the file. So this is a simple, normal execution of the application. Then we can use the augmentation tool to add TripleAgent and an extra monitoring tool to the application. Now it shows that we augmented the OpenJDK image, and it provides a new base image called OpenJDK POPS. Developers only need to replace this line: previously it was OpenJDK, and now we can use the royal-chaos OpenJDK POPS image to get these fault injection features. Now I'm going to build the augmented base image, and finally I would like to run it. You see it's still running normally, but with this one you get an extra dashboard. This is another nice monitoring tool called Glowroot. We can monitor the JVM and some application-level metrics. And based on this configuration file, we can also monitor the different perturbation points: at this location, in which method and which class, I can actively throw exceptions at a given rate and with some extra parameters.

Okay, as a summary: today I shared this nice repo with you, and hopefully you will like it. I also introduced three different ideas from our research group. The first one is ChaosMachine. At the try-catch level, it actively injects exceptions in the try block. The second one is TripleAgent. It works at the method level.
It analyzes the throws keyword and injects exceptions in the method body, so that you can compare your system under perturbation and under normal conditions. The final one is about Dockerized Java applications: POPS is a pipeline which augments base images to provide extra observability and fault injection functionality. I think that's all for today. Thanks for listening.
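
The method-level idea recapped above (inject an exception in a method that declares throws, then compare default propagation with catching and ignoring the exception in the middle of the call chain) can be illustrated with a small source-level sketch. This is not TripleAgent's implementation, which works through instrumentation at runtime. The class, the `ExceptionA` type, the two boolean switches, and the three-call loop are hypothetical and exist only to make the comparison runnable.

```java
public class FailureObliviousDemo {

    // Hypothetical checked exception, standing in for "EA" in the talk's example.
    static class ExceptionA extends Exception {
        ExceptionA(String message) {
            super(message);
        }
    }

    // Switches that simulate the fault injector and the failure-oblivious
    // interceptor (assumptions of this sketch).
    static boolean inject = false;
    static boolean interceptInMethod1 = false;

    // Perturbation point: a method that declares "throws".
    static int method0() throws ExceptionA {
        if (inject) {
            throw new ExceptionA("injected in method0");
        }
        return 1;
    }

    // Middle of the chain: either lets the exception propagate,
    // or catches and ignores it (failure-oblivious mode).
    static int method1() throws ExceptionA {
        int pieces = 0;
        for (int i = 0; i < 3; i++) {
            if (interceptInMethod1) {
                try {
                    pieces += method0();
                } catch (ExceptionA ignored) {
                    // Deliberately ignored: skip this unit of work and continue.
                }
            } else {
                pieces += method0();
            }
        }
        return pieces;
    }

    // Default handler: the catch block a developer wrote at the top of the chain.
    static String method2() {
        try {
            return "ok:" + method1();
        } catch (ExceptionA e) {
            return "handled in method2: " + e.getMessage();
        }
    }

    static String run(boolean injectFault, boolean intercept) {
        inject = injectFault;
        interceptInMethod1 = intercept;
        return method2();
    }

    public static void main(String[] args) {
        System.out.println(run(false, false)); // normal behavior
        System.out.println(run(true, false));  // exception reaches the default handler
        System.out.println(run(true, true));   // exception intercepted in method1
    }
}
```

The three runs give three observable behaviors that can be compared against the normal one. Whether the intercepted run counts as more resilient depends on the application: here it completes without an error but with no work done, which is exactly the kind of outcome that should be left to developers to judge.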

Conf42 Chaos Engineering 2020 - Online

January 23 2020 - premiere 5PM GMT

Long Zhang

PhD Student in Computer Science @ KTH Royal Institute of Technology
