Transcript
Good day, and thanks for joining me today for this talk about OpenTelemetry and some of the
things you may run into when working in an OpenTelemetry environment. First, my thanks to
Conf42 for letting me talk about this today. OpenTelemetry is near and dear to my heart.
Having spent the last four years working in the observability space, I can say OpenTelemetry
is definitely taking the world by storm.
We're going to talk about how you can add OpenTelemetry to a modern application. But to do
that, we need to start by talking about what we mean by a modern application. Well, every
company is on a cloud journey, moving from their old-style, monolithic, tightly coupled
environments all the way out to a cloud-native or rearchitected environment. And we're seeing
this more and more, driven by a combination of cloud economics as well as the ability and
ease of scaling, management, and support.
But a modern application is actually a little bit more than that. It's not defined by its
implementation; it's defined by its capabilities. We think of modern apps in terms of
portability, scalability, observability, reproducibility, and debuggability; that last word,
by the way, is quite a mouthful to say. So when we look at this, modern applications really
have certain feature sets to them, but those are not implementation details. I don't care
whether you write this in COBOL, heaven forbid, or some flavor of Fortran; it doesn't matter.
What really matters is that it works. And with that, you can look at this list and see the
characteristics that define this modern application space.
In the observability and OpenTelemetry space we're talking about here, it really comes down
to three things. First, quickly find performance bottlenecks: where are things not working
the way we expect them to? After all, user happiness is heavily dependent on performance as
well as results. Second, provide answers to platform engineering questions: how are people
working on the platform side, how is the application itself behaving, and what do I need to
find out about that? And then finally, provide context on errors and crashes. This doesn't
mean simply reporting an error; it means finding the context: what's going on around the
error, what was leading up to it. These things lead us into this larger category of things
that are important for modern apps in this observability and OpenTelemetry space.
And so what I've done, with my engineering team, is build this thing called MARA, the Modern
Application Reference Architecture. It's a microservices architecture, it uses Kubernetes,
the leader in orchestration, of course, and it is designed to be production ready. As you can
see, the application actually only makes up part of this. The part that's kind of hidden
below the surface is the monitoring, observability, and management aspects, the
infrastructure aspects, and Kubernetes itself.
The application is really only one part of this question. The larger amount is the stuff that
the application is dependent on, which is hidden below the surface. And so this is a quick
look at the application itself. It's a banking application. It's written in Python at the
front end and uses Python for certain pieces in the back end, as well as Java for other
pieces in the back end. It has two PostgreSQL databases that it makes use of. Each of these
pieces is an independent microservice in its own right. But this does lead us to some
interesting things. We need a solution that has the ability to support more than one
language, and we need a solution that understands the context of the to and fro: where things
are coming from, where things are going to, and what the results in each piece are.
And when we started looking at this, we built it so that it was portable. Remember that
portability aspect: it can run on lots of different infrastructures, with tracing, metrics,
logging, visualization, and management built in at this point. We also built a testing
structure for load generation as well as a continuous integration model. This mimics what
goes on in production applications every single day. Maybe not completely, or as complex as
yours, but this is what a reference architecture is designed to do. In our case, we wanted
the reference architecture to be more than just a reference drawing. We wanted this to
actually work, so we could test what it meant. So, quickly drawing it out, you can see the
various pieces inside of here: Kubernetes driving our node and pod arrangements; application
networking pieces, in this case using the NGINX open source project; and then other pieces
around it for automation: the pipelines, the code repos which are on GitHub or GitLab, and
the automation code to drive all of this. So there are lots of moving parts across this
entire application space.
Now, all those moving parts mean that things have become complicated. Microservices mean
things have become complicated. Failures don't always repeat, and debugging can be really
painful because those failures don't repeat. And the scale of data that we're dealing with is
massive. So when we started looking at this, we used the Cynefin framework to ask what we
need to do to be able to probe, sense, and respond to the emergent technologies that are
being driven by cloud-native or modern application spaces. And so observability is a data
problem. The more observable the system is, the faster we can understand why it's acting up
and fix it. In general, we look at this as having visibility across the entire stack, based
on three classes of data: metrics, traces, and logs. Or, put another way: do I have a
problem, where is the problem, and what's causing the problem? Each of these pieces plays a
role in helping us look at that application stack.
But when we started looking at the needs for real observability in a production application,
we found that logs, metrics, and traces were a great place to start. We also needed things
like error aggregation, runtime state introspection, health checks, and the ability to look
at core dumps. All these pieces play a role in actually pulling the data out. Today I'm
probably going to spend most of my time talking about logs, metrics, and traces, but this is
an open source effort, so you can go through, play with it, and take a look at the underlying
structures for all of these different pieces and how we built them together.
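As one tiny example of the kind of hook I mean by health checks, here is a hedged Flask
sketch; the path and payload are illustrative assumptions, not the MARA code itself:

```python
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/healthz")  # illustrative path; Kubernetes probes can target it
def healthz():
    # Report liveness; a readiness variant could also check database connectivity.
    return jsonify(status="ok"), 200
```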
And so that is a starting wish list. As you can see, it runs from logging all the way over to
heap dumps, and then we listed some technologies off the top of our heads that might actually
make sense: Elastic (the Elastic APM model, Elastic Cloud, Elasticsearch, whatever you call
it), Grafana, Graylog, Jaeger, OpenCensus, OpenTelemetry, Prometheus, statsd, and Zipkin.
There were a few others that crept in and out during this discussion, but this is where we
started. And then we mapped across this and asked, does that offer this capability? What we
quickly found is that this chart is very pretty and, honestly, kind of meaningless. You can't
easily compare something like Zipkin to Elastic APM; even though they have some crossover,
they are very different structures solving very different problems. And so we decided to stop
doing this and start taking a qualitative view of the different projects, seeing what they
could meet, not against checklist items, but against our underlying needs for this
functionality.
So we actually started by looking at OpenCensus. Interestingly enough, at the top of the
OpenCensus page it says that OpenCensus and OpenTracing have merged into OpenTelemetry. Hmm.
And the OpenTracing project is archived: learn more, migrate to OpenTelemetry today. This
kind of gave us the feeling that we should probably be looking at OpenTelemetry. And so that
was where we went.
So what is OpenTelemetry? It's a standards-based set of agents and cloud integrations
offering observability capabilities. It has automated code instrumentation, and it supports
many frameworks and pretty much any code at any time. Remember that we are using multiple
languages and multiple frameworks in here. OpenTracing and OpenCensus merged to form
OpenTelemetry, so we did have some backward compatibility for older implementations. And as
we'll find out when we get into OpenTelemetry, it has also considered how we work with
existing applications in feeding any of those classes of data into the back end, where we can
do the aggregation and analysis.
And so the nice thing, when we really looked at this, is that OpenTelemetry provided the
classes of data we want and provided the conduit we could use, and we could start trying to
solve the specific problems we were looking at: how to reach an observable state in a modern
application made up of microservices under Kubernetes control. And when we looked at this, we
found a couple of interesting things. First of all, tracing, metrics, and logs are all
important for this observability space. But we also had to worry about instrumentation. How
do we get the data out? How do we see the SDKs and the canonical representations coming
across? How standardized is the data structure? And then, can we use it any place we want to?
Because we didn't want to solve a single problem in a way that would lock us into a solution
we couldn't change to meet our ever-changing needs. OpenTelemetry actually did provide all of
those aspects, and this made it a lot easier for us to start looking at the OpenTelemetry
space.
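Since automated instrumentation keeps coming up, here is a minimal, hedged sketch of what it
can look like in Python; this is not the MARA code, it assumes the
opentelemetry-instrumentation-flask package is installed, and the route is purely
illustrative:

```python
# Hypothetical example: auto-instrumenting a Flask service with OpenTelemetry.
from flask import Flask
from opentelemetry.instrumentation.flask import FlaskInstrumentor

app = Flask(__name__)

# One call adds a span per incoming request; no handler code changes needed.
FlaskInstrumentor().instrument_app(app)

@app.route("/balance")  # illustrative route, not from the MARA application
def balance():
    return {"balance": 0}
```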
So when we started working with this, some interesting things showed up. First, let me note
that this activity was a point-in-time snapshot, based on work we did roughly five months
ago, maybe six months ago. And so the rules may have changed; we have not gone back and done
the exercise to constantly keep this up to date. We will be revisiting future implementations
as OpenTelemetry develops, both the standardization of the specs for the data classes and its
evolving means of transmitting that data.
But let's start with logs. Everybody kind of knows what a log is: something the system
decides to write off someplace to let us know what it thinks is going on. But that very
simplicity leads to some really complex decisions. The simple side: grab the log output.
Sounds simple. But what the log output looks like, what format it is in, what's included in
there, and how I can scrape the data to make sense of it becomes incredibly important,
especially when we reach production-level scales. The complicated side starts to become: how
do we get the data from where it is? And remember, in a Kubernetes space, we don't
necessarily know where it is. It could be lots of places, it could be ephemeral, it could be
elastic, shrinking and expanding. Where do we store this data? We could have a lot of data
coming in. Do we need to index it so that we can actually search it easily for the things we
expect to happen? And then, how long do we retain it? Just those four questions alone become
a really challenging point when we start looking at logs in production-class applications.
To be really useful, though, those log files need to be easily searchable, based on varying
criteria. We shouldn't be limited to just the indexing structures; we should be able to
search easily and efficiently on anything we may want to look at. Developer, SRE, platform
operations: we need to be able to find the data that we need, when we need it, without a huge
amount of overhead.
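As a small illustration of why log format matters for searchability, here is a hedged Python
sketch, standard library only and not taken from the reference architecture, that writes each
record as a single JSON object so fields can be indexed and queried rather than grepped:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object so fields stay machine-searchable."""
    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
root = logging.getLogger()
root.addHandler(handler)
root.setLevel(logging.INFO)

logging.getLogger("payments").info("transfer accepted")  # example field values only
```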
And so we can start looking at the various players, the indexing, and things like this. Also,
honestly, not to say it this way, but we like to take advantage of things that people have
built that are efficient and work correctly. We also tend to favor open source, and those
things led us to our first viewpoint. We did look at a number of options, but we started with
the Elastic Stack.
So Filebeat, running as a Kubernetes DaemonSet, became our data transport. It's easy to put
into Kubernetes, it manages to put the right pieces where we need them to go, and then we
transmitted that from Beats into Elasticsearch and visualized it in Kibana. We used the
Bitnami chart to split that deployment, so we had an ingest engine, a coordinating engine,
and master and data node aspects. The nice thing about Kibana was that it let us do the
search, along with a bunch of preloaded decisions that were made for us. The nice thing: it
works. But it turned out to be extremely resource hungry, and the way we had to deal with
queries varied. It was okay, but it did vary. If you were doing something that was in that
preloaded indexing strategy, it was not bad. If you were trying to do something a little bit
outside their index limitations, life became a little bit more challenging. And so, as you
can see here, this is one of the areas where we can tell you it works, but it needs
improvement.
One of the driving pieces behind this is that OpenTelemetry, even though it has logs as one
of its classes of data, is in an early beta stage there, and we didn't quite feel comfortable
at this point in time depending on that emerging functionality. We wouldn't recommend putting
it into production, very honestly, simply because it's still in a state of change, a state of
flux. It's a great specification, probably going to be finalized very shortly, but then we
have implementation details, SDK details, and the languages themselves. All those pieces have
to come into consideration.
Now, when we looked at distributed tracing: distributed tracing is where OpenTelemetry
started, coming out of OpenTracing and OpenCensus. It is complex, it can also be
semi-chaotic, and we had some very definitive requirements. First one: it must not impact the
quality of service for the application. We've all seen it where log files get written off in
a batch format: every hour I'm going to take my accumulated log files and write them off to
disk, and we watch as we get a sawtooth effect in performance, because the system is now busy
trying to dump its buffers. We also needed to support all the languages of interest; in our
case it was two, Java and Python. We also had two frameworks, Spring Boot and Flask.
So we wanted to make sure that we accomplished all those things. And what we found was
something really nice. When we looked at OpenTelemetry, it had this collector, this agent, if
you will, that could pull the data, and it had the ability to pull that data from any of the
independent services, but it could also in turn act as an aggregator. So we could feed the
OpenTelemetry collector at each point into an aggregation model, which we could then hand
off, in this case, to Jaeger. So we had the ability to look at all the different things that
we needed to look at. It also gave us some unique capabilities. It was very simple to set up;
we could roll data through very quickly, take a look at what it looked like while we were in
the development process, and be able to test what was going on.
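As a rough sketch of that shape in Python, assuming the OpenTelemetry SDK and OTLP exporter
packages and a collector reachable at an illustrative endpoint, wiring a service to the
collector looks something like this:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# The endpoint is an assumption for illustration; point it at your collector.
exporter = OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True)

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))  # batching avoids per-span overhead
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("example-service")  # illustrative name
with tracer.start_as_current_span("handle-request"):
    pass  # application work happens inside the span
```

The collector can then forward those spans to whatever back end you choose, Jaeger in our
case.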
And so here's the output of a trace. Starting from NGINX, we can see what the front-end
access log looks like and what the various spans look like. We can also look at what was
happening once we moved through this, so each of these pieces became very easy to look at.
This is a Jaeger chart, if you're not familiar with it, but most distributed traces look like
this; this acyclic chart model is something you'll find very commonly in distributed tracing.
Honestly, though, when we get into this space, it is really all about the language, and
remember we have two here. Well, we started with Python, and Python was actually pretty
straightforward. We added a couple of files and we updated our requirements file. We made use
of Bunyan for JSON logs so that we could make sure our log formats were also tied to it. And
as you'll see, one of the things we did was take the logging pieces and attach the trace IDs
to the logs where it was useful. That gave us the ability to do easy coordination across
those, to be able to backtrack the context from a problem space.
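The details of how we attached the IDs are in the repository, but as a hedged sketch of the
general idea in Python (the names here are illustrative, not the MARA code), a logging filter
can stamp the current trace ID onto every record:

```python
import logging
from opentelemetry import trace

class TraceIdFilter(logging.Filter):
    """Attach the active trace id to each record so logs and traces can be correlated."""
    def filter(self, record):
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else "-"
        return True

handler = logging.StreamHandler()
handler.addFilter(TraceIdFilter())
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s trace=%(trace_id)s %(message)s")
)
logging.getLogger().addHandler(handler)
```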
Java, on the other hand, was a little bit more challenging. First of all, let me point out
that simple, greenfield Java wasn't too bad: pull the libraries in, use the APIs, and you're
pretty much done. But interestingly enough, that wasn't quite the model we were in; we were
using an existing application, and that existing application brought some challenges to life.
With the Spring framework it looked pretty easy: we could use Spring Cloud Sleuth, which adds
trace and span IDs, instruments common ingress and egress points, adds traces to scheduled
tasks, and can directly generate Zipkin traces (Zipkin being another open source tracing
project like Jaeger). But at that point in time, the autoconfiguration was a milestone
release, and it supported some really old, out-of-date OpenTelemetry versions. We also had to
pull from the Spring snapshot repository due to hard-coded dependency references. And so this
made it a little bit more problematic: we were looking at an old product set and losing
functionality. Remember, OpenTelemetry is under constant development. We also weren't able to
control the environments and the code repositories quite as easily as we wanted to. Some of
these things have challenges, but when I went out right before this talk and looked again, it
looks like they are still listing older versions of OpenTelemetry tracing and
instrumentation, at 1.11 and 1.12, I believe, whereas I think the current versions are about
1.14 or 1.15. So you need to be careful. When we looked at doing this, how we could make use
of existing structures, and what they're tied to, became a dependency that we needed to
constantly check.
So we needed to be able to work around some of these limitations. What we decided to do was
build a common OpenTelemetry telemetry module for the data we're passing across these pieces,
and this gave us the ability to extend our tracing functionality. So we could build
auto-configuration classes and add additional trace attributes. There were certain pieces of
information that we really wanted to make sure were carried forward, as well as certain
things that we wanted to make sure were constantly up to date and shared. We also wanted to
be able to clearly see what the impact was going to be, so we built a no-op implementation of
the production tracing that we can simply flip on and off, so we can see what the impact of
tracing versus no tracing looks like.
We decided to standardize our trace names; they cross different languages and different
structures, so we needed to be able to look and see how we did this, and standardizing
allowed us to do that. We added an error handler so that we could output errors to both logs
and traces, which gives us the ability to coordinate back across those pieces. And then we
added some additional trace attributes: a service name, the instance ID, the machine ID, and
a few others. We also made sure, since we're in a database world, that we put the trace ID
into comments preceding our SQL statements. It's really important to know where your SQL
statements are coming from, what request generated them, all the way back to what the user
started. All of those things were built into an open source module, the OpenTelemetry NGINX
module, which you can find on our NGINX Inc. GitHub site.
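To illustrate the SQL-comment idea, here is a sketch only; our actual implementation lives in
the Java common library and the NGINX module mentioned above, and the function, service, and
attribute names here are assumptions:

```python
from opentelemetry import trace

tracer = trace.get_tracer("accounts-service")  # illustrative service name

def traced_query(cursor, sql, params=()):
    """Run a query inside a span and prepend the trace id as a SQL comment,
    so a slow statement seen in the database logs can be tied back to its request."""
    with tracer.start_as_current_span("db.query") as span:
        span.set_attribute("service.instance.id", "accounts-1")  # example extra attribute
        trace_id = format(span.get_span_context().trace_id, "032x")
        cursor.execute(f"/* trace_id={trace_id} */ {sql}", params)
```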
So the last part was the metrics piece. And honestly, we kind of skipped Python. It's not
difficult to put metrics into Python, but the type and class of metrics we were getting from
those front-end pieces was not really meaningful for what we were trying to accomplish as
part of this. And so we broke it down and said, okay, what are we doing in Java? The original
code, Bank of Anthos, used something called Micrometer, connected to GCP Stackdriver. When we
looked at that, we found that Micrometer was actually a very mature layer for Java virtual
machines and the default metrics API in Spring. And so Micrometer became near and dear to our
hearts very quickly. We also looked at the current OpenTelemetry work for metrics. At that
point in time, and actually, I will admit, still today, there are discussions going on around
what limits there are to OTel metrics, in particular what kinds of metrics are not being
covered, what kinds of metrics are missing from this picture. The OpenTelemetry metrics
specification is stable; the languages vary. But is it complete enough for what you want?
Well, the nice thing was that OpenTelemetry and Micrometer together were not a blocking
factor. If you remember back to that OpenTelemetry collector: the collector can take metrics
from anything. It can pull them from Prometheus, from statsd, it can receive them over the
OpenTelemetry Protocol (OTLP); it can take pretty much anything. And that's the strength of
the collector: the ability to collect the various data, process it in the middle, and export
it. Micrometer supported lots of options; it already supported Prometheus, it already
supported statsd. And so this was obviously a natural fit: we could have Micrometer produce
Prometheus metrics and use the collector to coordinate and aggregate our data together. The
OpenTelemetry metrics SDK allowed us to then send those metrics via OTLP. So we could use the
Micrometer model, but use OTLP as the collection agent of choice. From there we could send
them to anything: we could send them to Lightstep, we could send them to Grafana, we could
send them anywhere we want to, again because we have that receiver concept and that exporter
concept.
The way we ended up doing this is not a streaming model; it was actually a pull model for the
collector, and the collector would pull the various metrics to be able to pass them forward.
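For completeness, here is a hedged Python sketch of the OpenTelemetry metrics SDK path; the
Java side used Micrometer as described, and the endpoint, meter name, and counter shown here
are illustrative assumptions rather than the MARA configuration:

```python
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# Periodically export measurements to a collector over OTLP (endpoint is illustrative).
reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://otel-collector:4317", insecure=True)
)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("frontend")  # illustrative meter name
requests_total = meter.create_counter("http.requests", description="Handled HTTP requests")
requests_total.add(1, {"route": "/login"})  # example attribute
```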
So, without too much more, a quick summary of what's going on here, and this is a more
complete list. When we did this, OpenTelemetry was clearly the right choice for dealing with
the production categories of observability. For distributed tracing, we did have a number of
interesting pieces. Java used Spring Cloud Sleuth with the OTel exporter, leading to the OTel
collector, which then led to a pluggable store. Python used the Python libraries to the
OpenTelemetry collector to the pluggable store. And NGINX, which was the driving heart of
this application space: the ingress controller is not currently OTel-ized, not currently
traceable, but we had an OTel module that lets us pull data from NGINX itself into the
collector. The metrics piece was one of those that we thought was going to be easier, but it
actually turned out to be easy with some caveats. For this we used Micrometer, via Spring, to
a Prometheus exporter, to the collector. Likewise, with Python, we used Gunicorn with statsd
to Prometheus via the ServiceMonitor, out to the OTel collector; beyond that, we did not
implement metrics for Python. And in NGINX we have a Prometheus endpoint that we also passed
to Prometheus. And then logs: we collected all the container log files through Filebeat,
which went into Elasticsearch and were exposable via Kibana. Is this perfect?
No. We would like to have a more centralized source. We would like to be able to pull all
these things together and have them easily correlated, without necessarily having to use lots
of different tools with lots of different query structures. But this was a demonstration of
the fact that OTel can become the underlying structure for any of these pieces. So,
continuing the summary with things I didn't cover here: for error aggregation, we use OTel
distributed traces. For health checks, we have the Spring Boot Actuator through Kubernetes
and Python Flask management endpoints fed into Kubernetes. For introspection, again, Spring
and Python gave us the ability to do introspection. And then for the heap and core dumps, the
Java Spring Boot Actuator lets us have thread dumps, and right now we do not have support for
this in Python. So, the TL;DR for this: three classes of data, metrics, traces, and logs. All
of these took different approaches to get what we wanted, but we managed to get what we
wanted out of all of them. And the OpenTelemetry collector is the thing that made this
actually work. The OpenTelemetry collector is our friend, and it made it possible to pull off
what six months ago looked like a very challenging project. It was, by the way, a very
challenging project; the people who were helping code this used to scream bloody murder at
various times. It's probably easier now, and we will be revisiting this in the next few
months just to see what's happening.
Metrics and traces had some interesting gotchas based on the languages we were looking at.
Also, metrics and traces, in conjunction with logs, needed to be correlated, so we needed to
make sure that those trace IDs showed up in our log files so that we could track them. And as
I said before, this was a snapshot in time; this was the state of the industry about six
months ago, and things have changed. We will flat out admit that things have changed, and we
will be looking at updating this to see how the changes impact our decisions. We're also
going to be looking at other solutions. Should we go to Loki for log files, or should we look
at Graylog?
Or can we start making use of the nascent OTel logging information? And by the way, there are
auto-configuration instrumentation files, and they are amazingly cool: drop one into your
Java application and you get all these great traces out. But they don't necessarily give you
what you want. So when it works, it's great, and when it gives you what you want, it's great.
You can also take this for yourself. We built this as open source: take a look at our
repositories, play with the application itself, and give us some feedback on what you're
interested in hearing about or where you'd like to see us go next. For that, feel free to
join our community, look for MARA, the Modern Application Reference Architecture, and give us
feedback. Tell us what you're thinking about this and tell us what your world looks like. Let
me thank you for listening to me today, and again, I hope you enjoy the rest of the
conference. Thanks, and have a good day.