This transcript was autogenerated. To make changes, submit a PR.
Okay, nice. So let's start from this page.
My topic today is application-level chaos engineering in
the JVM. I have already introduced myself.
Let's continue to the second page.
So, several years ago, when people started to
discuss chaos engineering and this concept was brought
to developers' attention, they began to
explore the space of chaos engineering.
It was not only about randomly shutting
down instances; we could also explore different perturbation
models. For example, we could kill a whole region of the cluster.
We could inject exceptions in
a function or a method. We could also do
time-travel perturbations, or max out CPU
cores on an Elasticsearch cluster, et cetera.
So then we have different levels of
chaos engineering: the infrastructure level,
the network level, and also the application level.
At different levels, we could learn different information
from these perturbations and abnormal behaviors.
So that's why we focus on application-level
chaos engineering, because we are quite interested in
observability at the application level. And if it's
directly related to the logic and the exceptions in
your source code, then maybe we could precisely
control these exceptions and failures to see how the
So for the sake of open science, we also
put our work on GitHub: our papers, source code, and data,
alongside some research experiments.
So if you are interested in application-level
chaos engineering, you could also take a look at this GitHub repo.
Because the university is called the Royal Institute
of Technology, we call it royal chaos.
Yeah. So today I'm going to share some of our recent research
work with you based on the GitHub repo. Basically we
have three cases here. The first one is called ChaosMachine.
It's a chaos engineering system for live analysis
and falsification of exception handling in the JVM.
This paper has been accepted this year in one of the top journals,
IEEE Transactions on Software Engineering.
And I put an arXiv link, which is free to
access, for your convenience. Okay, so about this,
I would like to introduce some background work.
First, some research work shows that almost
90% of failures or bugs are
somehow related to exception handling mechanisms.
So for example, if I'm writing a program in Java,
I have to design lots of try-catch blocks, because lots
of exceptions could happen in production, and then we need to design
some recovery methods in the catch blocks.
But at least for me,
I often write bad catch blocks.
So even though I catch the exception, the block doesn't provide
enough recovery methods, or it's not resilient
enough for my system to recover
from the exception. So that's why we would like to evaluate
these try-catch blocks, to evaluate the exception handling methods
in the system. Previously, lots of work was
done in the testing phase. For example, this is one of the
testing methods. It's called short-circuit testing. It
injects an exception at the beginning of a try block,
so that the whole try block is
made invalid and execution goes into the catch block.
Then it runs the test suite again to see how the program
behaves and whether it still passes the test cases.
So we call it short-circuit testing. But with ChaosMachine,
we would like to apply this fault model in
a production-like environment. So we do bytecode
instrumentation: at runtime, we instrument the bytecode
and we throw an exception at the beginning of the
try block. That's how we trigger the catch block.
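The short-circuit fault model can be sketched in plain Java. This is only an illustration, not ChaosMachine's implementation: the class and method names are hypothetical, and a static flag stands in for what bytecode instrumentation would do at runtime.

```java
import java.io.IOException;

// Sketch of the short-circuit fault model: an exception is thrown at the very
// beginning of a try block, so the catch block has to serve the request.
// All names here are hypothetical and chosen for illustration only.
class ShortCircuitDemo {
    // Stands in for the switch that bytecode instrumentation would provide.
    static boolean injectionEnabled = false;

    // The statement an injector would place first in the try block.
    static void maybeInject() throws IOException {
        if (injectionEnabled) {
            throw new IOException("injected by chaos experiment");
        }
    }

    static String handleRequest() {
        try {
            maybeInject();        // perturbation point
            return "primary";     // normal path in the try block
        } catch (IOException e) {
            return "fallback";    // recovery path in the catch block
        }
    }

    public static void main(String[] args) {
        System.out.println(handleRequest()); // prints "primary"
        injectionEnabled = true;
        System.out.println(handleRequest()); // prints "fallback"
    }
}
```

Comparing the two outputs is the small-scale version of checking whether the catch block still fulfills the request.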
So in some scenarios, if we design
two different paths for the functionality or
the handling logic, maybe it provides you with two
different ways to fulfill the user's request.
One is the normal way in the try block. If something
bad happens, then we have another way in the catch block to still fulfill
the user's request. And that's something we would like to
verify. So I also like this word verification in
this title. We also call it falsification, because if we
trigger an exception and your system behaves
abnormally, it's good because we learn more from
it. We could learn more information from this abnormal behavior,
not just build confidence in the system's
exception handling mechanisms. So this is
the architecture of ChaosMachine.
In a production-like environment, we don't have access to
the source code anymore, but we could use a
nice feature provided by the JVM, which is called the
Java agent, to instrument the bytecode. So we attach
a monitoring sidecar and a perturbation injector as agents
to different services or different instances
in the JVM. And then all perturbation injectors and sidecars
are controlled by the chaos controller, which is responsible
for conducting the chaos experiment.
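For readers unfamiliar with Java agents, here is a minimal sketch of the mechanism using the standard java.lang.instrument API. It is not the ChaosMachine agent: the class names are hypothetical, and the transformer only records class names instead of rewriting bytecode.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical agent skeleton. In a real agent this class would be public,
// packaged in a JAR whose manifest names it in the Premain-Class attribute.
// It is package-private here only to keep the sketch in a single file.
class SketchAgent {
    static final List<String> seenClasses = new CopyOnWriteArrayList<>();

    // A transformer sees the bytecode of every class as it is loaded.
    static class LoggingTransformer implements ClassFileTransformer {
        @Override
        public byte[] transform(ClassLoader loader, String className,
                                Class<?> classBeingRedefined,
                                ProtectionDomain protectionDomain,
                                byte[] classfileBuffer) {
            if (className != null) {
                seenClasses.add(className);
            }
            // Returning null tells the JVM to keep the class unchanged.
            // A perturbation injector would return rewritten bytecode here.
            return null;
        }
    }

    // Entry point the JVM calls before main() when the agent is attached.
    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(new LoggingTransformer());
    }

    public static void main(String[] args) {
        // Direct call for demonstration; normally the JVM invokes transform().
        new LoggingTransformer().transform(null, "com/example/Foo", null, null, new byte[0]);
        System.out.println(seenClasses);
    }
}
```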
And the input to ChaosMachine is then
the arbitrary software and the hypotheses, which I will
introduce later.
So the hypothesis is one of the most important
concepts in chaos engineering, because we would like to
monitor the steady state of the system and
then set up a set of hypotheses. And then we run these
chaos experiments to verify or falsify the hypotheses.
So regarding the try-catch handling mechanisms
in Java, in the JVM, we designed these
four different hypotheses. The first one is the resilience hypothesis.
Our monitoring sidecar can monitor the
performance and the functionality, for example the return
code of the HTTP request or even the
content of the HTTP body. And then we compare this
monitored information with the normal behavior to see
if this hypothesis still holds.
The observability hypothesis is about the
user-visible effect: when we trigger the exception,
whether it has some impact
on the end users, like whether the user is able
to notice the exception or is just waiting for
the response for a longer time. The third
one is the debug hypothesis. This is related to
good practice in logging during
development. If we
use a try-catch block to catch an
exception, we need to log
enough information when the exception happens,
so it's helpful for debugging. So we designed
this debug hypothesis to see if we could capture
any information when the exception happens. And the
silence hypothesis is interesting, but maybe it's bad for
development. It means that when the exception
happens, the system fails to provide the expected
behavior, and we also can't get
any useful information from the user side or from
the developer side. So what
can we learn from these four hypotheses with ChaosMachine?
Basically, we could try to classify these try-catch
blocks. When we do these chaos experiments, we could find some
fragile ones. For example, we try to
handle the exception, but we provide an empty catch block.
Sometimes it's normal, but sometimes it means we are
lacking some thought about this exception and we
don't provide recovery mechanisms for it.
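One way to picture how the four hypotheses relate to monitored observations is the following sketch. The three boolean inputs and the exact conditions are assumptions made for illustration; the paper's formal definitions may differ.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical mapping from sidecar observations to the four hypotheses.
class HypothesisCheck {
    enum Hypothesis { RESILIENCE, OBSERVABILITY, DEBUG, SILENCE }

    // Inputs describe what was observed after one injected exception:
    // whether behavior still matches the steady state, whether the end user
    // could notice an effect, and whether the exception was logged.
    static Set<Hypothesis> evaluate(boolean behaviorMatchesSteadyState,
                                    boolean userVisibleEffect,
                                    boolean exceptionLogged) {
        Set<Hypothesis> holds = EnumSet.noneOf(Hypothesis.class);
        if (behaviorMatchesSteadyState) holds.add(Hypothesis.RESILIENCE);
        if (userVisibleEffect) holds.add(Hypothesis.OBSERVABILITY);
        if (exceptionLogged) holds.add(Hypothesis.DEBUG);
        // Silence: expected behavior is lost and neither users nor
        // developers receive any useful signal.
        if (!behaviorMatchesSteadyState && !userVisibleEffect && !exceptionLogged) {
            holds.add(Hypothesis.SILENCE);
        }
        return holds;
    }

    public static void main(String[] args) {
        System.out.println(evaluate(true, false, true));
        System.out.println(evaluate(false, false, false));
    }
}
```

Under these assumptions, an empty catch block that swallows an exception and breaks the request would end up in the silence category.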
Then we could also learn about log handling mechanisms at the bytecode
level. So for a research paper,
we usually pick some projects and
run experiments on them to show how the tool works,
what the performance issues are, and
the future work for ChaosMachine. So we use TTorrent
to evaluate the prototype we developed. TTorrent
is a file downloading client in Java which implements the
So with the help of ChaosMachine we
could detect different try-catch blocks.
Only some of them are covered by our workload,
because in production we can't reach 100%
coverage if we just run our simple experiments.
Then we could try to classify these try-catch
blocks based on our hypotheses. That's why I'm also interested in
Chaos Monkey for Spring Boot, like how we design these different
types of exceptions. So this is the first
idea I would like to share. It's called ChaosMachine.
And I would say it brings some thoughts on
different perturbation models. We could perturb the network,
and we could also perturb at the application level with
a bytecode instrumentation method.
And the goal of chaos engineering is to
verify your system, to build confidence, and finally
to improve the resilience of the system.
So we would also like to combine different techniques,
like chaos engineering, fault injection, and also
self-healing techniques, for example failure-obliviousness.
This is the second system we developed, called TripleAgent:
monitoring, perturbation, and failure-obliviousness
for automated resilience improvement in Java applications.
I think we could also call it automated resilience improvement
suggestions, because we would like to leave these decisions to developers.
As a research prototype, you can't
directly use it in production. I think that's even more dangerous.
But with the fault injection experiments,
we could provide some suggestions like, okay,
you could handle these exceptions in another way, maybe it's better.
And this information could be helpful for
developers. And since I come from China, I would
like to share another Chinese kung fu with you. This one
is called ambidexterity, which is quite
interesting. In one novel, this guy
uses two different kinds of techniques,
one with each hand. With his left hand he uses
a technique which is mainly for defense;
the other hand is used to attack.
He trained himself using these two
different techniques, and both of them could be improved through
practice. And this inspires us: okay,
we could use chaos engineering to inject exceptions
to trigger your system and see the abnormal behavior.
But we could also combine it with self-healing techniques to
evaluate, okay, how chaos engineering works, and with
the help of chaos engineering, how your
self-healing strategies work. So we combine
these two technologies, and one of the basic things
is monitoring. So that's why we call it TripleAgent:
monitoring, chaos engineering, and self-healing.
As a simple example: in Java we have lots
of try-catch blocks. We also have a chain of invocations, like
method two calls method
one, and method one invokes method zero. But in method
zero we could have some methods which declare that they
throw exceptions, and then you must handle them in your
source code with a try-catch block. So this
is a simple example. The exceptions
EA and EB are handled in method
two; the catch blocks are written by developers.
But when we do the bytecode instrumentation,
when we throw an exception on purpose in method
zero, we could let the exception propagate to the
default catch handler. But we could also intercept
the exception in the middle of the chain: if I
intercept the exception in method one, and
catch and ignore it there, then we have these
two different behaviors and we could compare them.
Which one is better? This is the
core idea of failure-oblivious computing.
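The two behaviors being compared can be sketched as follows. This is a hand-written stand-in, not TripleAgent's instrumentation: the method names follow the slide's method zero, one, and two, while the flags and the default value returned after ignoring the exception are assumptions.

```java
import java.io.IOException;

// Hypothetical call chain: method2 -> method1 -> method0.
class FailureObliviousDemo {
    static boolean inject = false;             // throw on purpose in method0
    static boolean interceptInMethod1 = false; // failure-oblivious mode

    static int method0() throws IOException {
        if (inject) throw new IOException("injected");
        return 42;
    }

    static int method1() throws IOException {
        if (interceptInMethod1) {
            try {
                return method0();
            } catch (IOException e) {
                return 0; // catch and ignore: carry on with a default value
            }
        }
        return method0(); // otherwise the exception propagates upward
    }

    static String method2() {
        try {
            return "result=" + method1();
        } catch (IOException e) {
            return "handled-by-developer"; // the developer-written catch block
        }
    }

    public static void main(String[] args) {
        inject = true;
        System.out.println(method2()); // prints "handled-by-developer"
        interceptInMethod1 = true;
        System.out.println(method2()); // prints "result=0"
    }
}
```

Whether "result=0" or the developer's handler is the better outcome depends on the application, which is why the comparison is left as a suggestion for developers.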
It's possible to catch and ignore some of
the failures before they reach the default handler,
and in some cases that could provide more
resilient behavior. That's something we would like to confirm,
to verify, and to compare. Also,
we did the experiments with TTorrent and we found
three different categories of perturbation
points, that is, the methods which
declare that they throw exceptions. We have lots of fragile
perturbation points. It means that if we inject only one
exception, it makes your system crash or makes
the system run abnormally
forever. We also have some sensitive perturbation points.
If we inject
only one exception at such a perturbation point,
everything is fine. But if we keep injecting exceptions,
then your system stalls or crashes.
And we also have some points that are immunized by default.
It means the catch block behaves
well enough to handle
these exceptions. No matter how many exceptions we inject,
the TTorrent system still behaves
well enough to fulfill the user's request.
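The three categories can be written down as a small decision rule. The two boolean inputs are a simplification assumed for this sketch; a real experiment would derive them from monitoring data.

```java
// Hypothetical classifier for perturbation points.
class PerturbationPointClassifier {
    enum Category { FRAGILE, SENSITIVE, IMMUNIZED }

    // okAfterSingleInjection: the system still behaves acceptably after
    // exactly one injected exception at this point.
    // okUnderRepeatedInjection: it still behaves acceptably when exceptions
    // keep being injected at this point.
    static Category classify(boolean okAfterSingleInjection,
                             boolean okUnderRepeatedInjection) {
        if (!okAfterSingleInjection) {
            return Category.FRAGILE;   // one exception is already too much
        }
        if (!okUnderRepeatedInjection) {
            return Category.SENSITIVE; // survives one, not a stream of them
        }
        return Category.IMMUNIZED;     // handles any number of injections
    }

    public static void main(String[] args) {
        System.out.println(classify(false, false)); // prints FRAGILE
        System.out.println(classify(true, false));  // prints SENSITIVE
        System.out.println(classify(true, true));   // prints IMMUNIZED
    }
}
```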
And then we ran the failure-oblivious computing experiments
to intercept exceptions and compare the behavior
again. Then we could have the figures on
the right, which show some improvement suggestions for
developers. So this is the second
idea I would like to share. Actually the overhead of
TripleAgent is quite acceptable, because we only
instrument the bytecode to inject some exceptions, and
it happens only in the methods which
declare that they throw exceptions. So the only
thing I could see is the CPU time. Sometimes it costs more
computation resources. Okay,
and the final one. So I think I still have time,
so I could show you some simple commands here first.
This is a folder with a small demo of TripleAgent.
Usually when we run
a Java application we could use an option
like agent path to define what
kind of Java agents we would like to attach.
I think everyone is familiar
with this part. We just run the TTorrent application by
default, with java -jar and the JAR file, and then we try to
download a CentOS ISO package.
And then we attach the monitoring agent
and the fault injection agent with some parameters.
For example, the mode is throw exceptions, so it
gets a list of the possible positions
based on the methods and their throws keywords.
And then we could indicate the
location of the perturbations. The
default mode is coverage, which means we just output some
information without really injecting exceptions.
We could change it to throw exceptions if we would like to conduct
But as you can see, there are lots of
parameters and it still feels tedious and
complex to do this experiment.
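To give a feel for how such parameters reach an agent: the JVM passes everything after the agent's `=` sign to the agent as one string. The sketch below parses a hypothetical `key:value,key:value` format with a hypothetical `defaultMode` key; the real TripleAgent parameter names and syntax are not shown in the talk and may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser for an agent argument string such as
// "defaultMode:throw,rate:0.5". Keys and format are assumptions.
class AgentArgs {
    static Map<String, String> parse(String agentArgs) {
        Map<String, String> options = new LinkedHashMap<>();
        // Safe default: only report candidate locations, inject nothing.
        options.put("defaultMode", "coverage");
        if (agentArgs == null || agentArgs.isEmpty()) {
            return options;
        }
        for (String pair : agentArgs.split(",")) {
            int sep = pair.indexOf(':');
            if (sep > 0) {
                options.put(pair.substring(0, sep).trim(),
                            pair.substring(sep + 1).trim());
            }
        }
        return options;
    }

    // Exceptions are injected only when the mode was switched explicitly.
    static boolean shouldInject(Map<String, String> options) {
        return "throw".equals(options.get("defaultMode"));
    }

    public static void main(String[] args) {
        System.out.println(shouldInject(parse(null)));                // prints false
        System.out.println(shouldInject(parse("defaultMode:throw"))); // prints true
    }
}
```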
So that's why we also came up
with some ideas about pipelines
and automation tools. So POBS is automated
observability for Dockerized Java applications, because now
Java developers also put lots of things
in Docker and wrap their applications in Docker images.
So first of all we conducted an empirical study of
Java applications, of how
they use Docker on GitHub.
So we mined about 1000 Java
GitHub projects based on their popularity. And then
we found that almost 600 Java
projects really use Docker in their source
code base. We also analyzed their Dockerfiles to see,
okay, what are the most popular base images
in their Dockerfiles. Java
8, OpenJDK 8, and Alpine rank
at the top. And we could also see Ubuntu here in
the list. So basically some developers install the JDK by
themselves and some of them just use Java
8 or OpenJDK 8, which
have the JDK installed already.
As a basic workflow, developers
usually declare a base image, like, okay, I would like to
use Java 8 as a base image, and then I add
more commands to package my own application and build an image
so that I can publish it on Docker Hub. But with
POBS we could also integrate some of the features into
the base image. Then developers could just
replace the FROM line to declare a new base image
which is augmented with these observability and fault
injection features. So we
integrate TripleAgent and other monitoring tools into the
base image and then we provide this base
image to developers, so they just replace the FROM line and
build a new Docker image for their application.
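The FROM-line replacement is simple enough to sketch as a text rewrite. This is not the POBS tool itself; the helper and the augmented image name `example/openjdk-pobs:8` are hypothetical, and FROM flags such as `--platform` are not handled.

```java
// Hypothetical helper that swaps the base image in a Dockerfile's FROM line.
class BaseImageSwap {
    static String swapBaseImage(String dockerfile, String oldImage, String newImage) {
        // The -1 limit keeps trailing empty lines so the file round-trips.
        String[] lines = dockerfile.split("\n", -1);
        for (int i = 0; i < lines.length; i++) {
            String[] tokens = lines[i].trim().split("\\s+");
            // Dockerfile instructions are case-insensitive, so match FROM loosely.
            if (tokens.length >= 2
                    && tokens[0].equalsIgnoreCase("FROM")
                    && tokens[1].equals(oldImage)) {
                tokens[1] = newImage; // any "AS <stage>" suffix is preserved
                lines[i] = String.join(" ", tokens);
            }
        }
        return String.join("\n", lines);
    }

    public static void main(String[] args) {
        String original = "FROM openjdk:8\nCOPY app.jar /app.jar\n";
        // The augmented image name below is made up for this example.
        System.out.print(swapBaseImage(original, "openjdk:8", "example/openjdk-pobs:8"));
    }
}
```

Everything below the FROM line stays untouched, which is what makes the approach cheap for developers to adopt.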
In this way it's also convenient that we could have two different
types of containers in a production-like environment.
One is just running normally; in the other one I could
turn on more monitoring tools and fault injection experiments.
So we could compare the behavior of these
two different containers. And I could also
show you the POBS system here.
So this is quite a simple Dockerized Java
program which has TTorrent in it.
I could build a Docker image
here so that I could run the
original image to download the file. So this is
a simple normal execution
of this application. Then we could use the
augmentation tool to add TripleAgent and
extra monitoring tools to the application.
Now it shows that, okay, we augmented the OpenJDK
file. Then it provides a new base image
called OpenJDK POBS.
So developers
only need to replace this line: previously
it was OpenJDK, and now we could use royal-chaos OpenJDK
POBS to have these fault injection features.
And now I'm going to build the
augmented base image.
And finally I would like to run the
augmented base image. So you see it's still running normally,
but if I use this
one, you get an extra dashboard. This is another nice
monitoring tool called Glowroot. We could monitor the JVM,
we could monitor some application-level metrics,
and based on this configuration file,
we could also monitor
the different perturbation places, like, okay,
this location, in which method and which class, where I could
actively throw exceptions with
a given rate and some extra parameters
to do that.
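Throwing exceptions at a configured rate can be sketched like this. The class is hypothetical and hand-written; in the demo, the equivalent check would be injected into the target method by the agent according to the configuration file.

```java
import java.io.IOException;
import java.util.Random;

// Hypothetical rate-based injector for a single perturbation point.
class RateInjector {
    private final double rate;   // probability in [0, 1] of injecting per call
    private final Random random;

    RateInjector(double rate, Random random) {
        this.rate = rate;
        this.random = random;
    }

    // nextDouble() is in [0, 1), so rate 0.0 never fires and 1.0 always fires.
    boolean shouldInject() {
        return random.nextDouble() < rate;
    }

    // What the instrumented method body would start with.
    void perturbationPoint() throws IOException {
        if (shouldInject()) {
            throw new IOException("injected by chaos experiment");
        }
    }

    public static void main(String[] args) {
        RateInjector injector = new RateInjector(0.3, new Random());
        int injected = 0;
        for (int i = 0; i < 1000; i++) {
            try {
                injector.perturbationPoint();
            } catch (IOException e) {
                injected++;
            }
        }
        System.out.println("injected " + injected + " of 1000 calls");
    }
}
```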
Okay, and as a summary,
today I shared this nice repo with you.
Hopefully you will like it. And I also introduced three different ideas
from our research group. The first one is ChaosMachine.
At the try-catch level, it actively injects
exceptions in the try block. The second one
is TripleAgent. It works at the method level. It analyzes
the throws keyword and injects exceptions
in the method body so that you could compare your system
under perturbation and under normal conditions.
And the final one is about Dockerized Java applications.
It's a pipeline which augments base images to
provide extra observability and fault injection functionalities.
I think that's all for today, and thanks for listening.