Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello. My name is Roni Dover, and I'm extremely excited to
be talking to you today about OpenTelemetry and coding with the
lights on. The reason I'm so excited about
this technology and the possibilities it presents is
because I think it represents a very big
change in how we write code.
Just in regard to my personal
background, I've been a developer for over
20 years. I've been a product manager,
and I kind of oscillated between the two roles. It was
very difficult for me to stop thinking about design when
I was developing, or to leave the code behind when I was doing
feature design. However, throughout that time,
I was really fascinated by technologies that really change the way we
code. And I think to some extent, we saw that when testing
started becoming more widespread, asynchronous programming,
event sourcing, there are a lot of technologies that kind of changed how
code got written, how we thought about code.
And I think observability today, and in particular OpenTelemetry
and continuous feedback, represent just such a
change. And I hope that by the end of this presentation, I will at least
kind of have convinced you that these things are really worth looking
at right now. So, to illustrate that, I want to
start with the story of one of my developers. This is
Bill. And Bill has been tasked with a very
common task for developers, which is take a feature,
develop it all the way through, and then deploy it into production,
something that is fairly routine for developers today.
Now, Bill's job used to be extremely
simple, and this was kind of the situation when I
got started around 20 years ago, which was, you build a
feature, you design the feature, you develop the feature, you wrap it up
real nice, and then you take it to the guys across the hall
in the QA department, and they start looking at it. There would be
some perhaps philosophical arguments about what's a feature,
what's a bug? And then eventually it gets rolled on into production,
and you would probably never hear about that feature again unless there is
some bug, in which case you may be called to correct
it. But of course, that's no longer the case.
So as teams become more cross-functional, as developers
started taking on more responsibility,
Bill, as well as the rest of the team, started assuming
more ownership. So now a part of Bill's job is actually
to also write the tests, to validate that there's a test
plan, integration tests, load tests, other types of tests.
Often, Bill needs to worry about how to deploy his
service. So do I deploy it using Helm or
Terraform? What's the RAM requirement? These things that used to be
kind of the sole responsibility
of the DevOps or IT people are now
everyone's to care about and to
know, because eventually there's going to be an issue, and the person that might
need to investigate that issue could be Bill.
So a lot has changed. But the
question is, let's say that, you know,
he's a top-notch developer, he's using the best tools
available, he has the best CI/CD pipeline
in the world, and he's just released the feature into production.
So the question is, what happens then?
What happens, or what should Bill do, the moment that
he has finished rolling his feature into production?
And the answer to that question is also interesting,
because my expectation from Bill was to
ask a lot of questions. I'm very kind of evidence based in how
I like to think about things. So my first
instinct would be, well, check whether your code actually worked.
Did it work well, is anyone using it?
I've witnessed enough horror stories where meticulously
written code was perhaps just a
few bad if statements away from actually getting
executed in production. So did it actually get run?
Did it change anything for the good or for the bad?
You just changed the data access layer and added something.
Did it actually make things better for everyone?
So that is my expectation.
However, 99% of the time, what would
happen in this situation is that Bill would
move on to the next feature.
And this is something that I tried changing.
So to figure out why this was happening,
I went back to basics, and here
is a diagram that I pulled from an online source showing
the DevOps loop that by now is completely overused,
but it's still a good model to kind of think about the
process that releases go through.
And I would challenge you to look at this particular diagram,
and you may notice that something
is a bit off about it. So this
is a pretty accurate representation of the different stages of development,
from building to continuous integration, deployment,
operations and so on. But there is one segment here
that actually appears in the diagram that only has one
tool associated with it, which is Salesforce for some reason,
and that is continuous feedback.
So although we have plenty of tools to
take our code across the chasm and into production,
to operate it in production and so on, we have very few
tools, to none in this diagram, that
can actually take the information back from production
and make it into something useful that we can use in development.
So to think about Bill in this sense,
Bill has a lot of feedback when coding
in his local environment. He has at least some limited feedback
from testing. Limited, I say, because tests are usually
more kind of a red, green, black, white kind of a thing, rather than a
very qualitative way to measure improvements, let's say.
But it's still some feedback. But there is almost
no feedback that he can use in his day-to-day
work from the production environment.
So I thought to myself, well, if only we had
access to instant objective data about
the code. Like, if only it was kind of a non-issue,
that whenever Bill would want,
he would just glance over the edge
and kind of see exactly how this code is working.
And this is kind of the perfect segue to talk about OpenTelemetry.
So OpenTelemetry is a spec, it's a standard.
It defines how to do observability. And there are lots of implementations
of that spec for different languages, platforms and so on.
And OpenTelemetry, in my humble opinion, is not
important because it is something amazing or revolutionary
in terms of the technology. Although it's a great technology,
it is important first and foremost
because everyone agrees on it. So the
fact that there is a consensus around OpenTelemetry,
that we're no longer talking about kind of this fragmented landscape of
different proprietary agents, protocols,
instrumentations and so on by different vendors,
makes it very easy for two things to happen. First, for an ecosystem
to emerge, as often happens with open source tools.
So suddenly there are a lot of tools that are kind of coming together
and providing the value-add of how we can
make this data actually useful, how we can analyze
it, and how we can take that data and
make sure that Bill can use it.
The other aspect is in terms of coverage. So if I'm
a platform tool designer, or if
I'm a maintainer of, I don't know, some major library,
it doesn't matter if it's a backend server platform
or a web server or anything else,
the choice for me is very easy now. I don't need to worry about,
well, should I allow instrumentation or enable instrumentation for
Datadog or Splunk or whoever it is;
I just support OpenTelemetry.
And as a result of that, what we're seeing is that first
of all, many programming languages actually
integrate with it in a very, very easy way. I think .NET
even made it a part of the standard library just to make it
much easier for people to use. But also, it doesn't
matter kind of which tool you're using or what
type of project, there is a very good
chance that you'll find that there is already automatic instrumentation
available, which basically means that we can get data at
practically no cost about our project.
So for this particular example, I created a sample app
that we're going to use to kind of explore these
information pieces that we can now get about our application at
runtime. So because I'm a bit allergic to very
simple CRUD apps that, you know, basically just do
basic database operations, I created
a more involved application. As I was watching the Harry
Potter movies at the time with my kids, I created an API for
the Gringotts vault, and I tried to
use a variety of technologies, in this case a queuing system, RabbitMQ,
a FastAPI server, some Postgres,
an external API with mock data. But all of that is
just details. The same would very easily translate
to any platform and any programming language.
And what surprised me right from the start was just the
amount of out-of-the-box instrumentation for OpenTelemetry
that exists for all of these libraries and frameworks. And from
my experience, this repeats itself no matter what you're using.
So in this case, I was using FastAPI, which is a very popular
Python server. It has out-of-the-box instrumentation.
I was using RabbitMQ with a package called
Pika, which also has an instrumentation.
I was using SQLAlchemy, psycopg, a lot of
different libraries, and each of them already had
a very easy way to
instrument it and get data. So the
ramification of that is that I was able to get from an
application that has zero data (okay,
as Bill, I would look at this application and I would start searching
in logs, trying to find clues, which would take me a lot of time)
to a situation where the application was basically
spewing out tons of data about how it was
behaving. Now let's understand
what is the type of data that I'm collecting. So one of the interesting things
that OpenTelemetry provides is called tracing. If you're
not familiar with tracing, here's a very quick
101 on that. So a trace essentially describes a
flow within the system. So in this case,
in my application, a user goes to the FastAPI service,
let's say he calls the evaluate-vault
operation, which gets translated to a message on a
queue that gets picked up by a worker that actually does
the work. That entire distributed operation
is a trace, and we can keep track of it and understand what
are the different sections or sub-activities
there, how long did each of them take, how does it work
over time? So all of that information is very easily available.
And the other term that we use is a span,
and a span is just a subset of a trace.
So within a trace, let's say within the segment where we're
making the API call and before it gets to RabbitMQ
for the next phase, we have various activities:
actually handling the request, then checking permissions,
maybe validating with some other authentication sources,
then enqueuing the job. So each of these is an activity that we
can also track and keep tabs on.
Who called it, how long did it take, what errors did we have there,
what logs and so on. And that is what we call a
span. And in a sec we can actually see how we can use open
source tooling. In this case, we'll use Jaeger in
order to visualize that entire trace
so that my experience as
Bill will be upgraded: all of a sudden I'll be
able to completely understand how my code is working
in the real world and maybe assess my changes,
which is what we were going for when we got started.
But let's look at some sample code because I
think that would illustrate it the best.
So this is the source code.
All of the links will be provided at the end of this presentation.
I'm looking at some basic operation like authentication.
So a lot of the data I don't need to change the code to get,
as I mentioned. So for example, the FastAPI service
already tells me about things that happened,
events, and keeps track of
the traces as they happen. The same goes with
the database and the RabbitMQ instrumentation
and all of these other pieces. Now, just to illustrate,
getting all of that to work was extremely simple. Here you can
see kind of the entirety of that code.
Let me make this a bit bigger. As you can see, it's basically
turning on the instrumentation: calling a specific instrumentor,
let's say for requests or FastAPI or Postgres,
and then just calling the instrument method. And there are ways
to make that automatic as well, so even that code
is not strictly necessary today. Just by including the
right packages, you'll have all of the data that you
need.
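To give a sense of what that setup looks like, here is a minimal sketch in Python. It assumes the standard opentelemetry-instrumentation-* packages for FastAPI, requests, psycopg2 and Pika; the exact package names for your own stack may differ:

```python
# Minimal sketch: enabling out-of-the-box instrumentation for the libraries
# used in the sample app. Assumes the opentelemetry-instrumentation-* packages
# (fastapi, requests, psycopg2, pika) are installed.
from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.psycopg2 import Psycopg2Instrumentor
from opentelemetry.instrumentation.pika import PikaInstrumentor

app = FastAPI()

# Each instrumentor hooks into its library, so every incoming request,
# outgoing HTTP call, SQL query and published message becomes a span.
FastAPIInstrumentor.instrument_app(app)
RequestsInstrumentor().instrument()
Psycopg2Instrumentor().instrument()
PikaInstrumentor().instrument()
```

And if you prefer the fully automatic route, the opentelemetry-distro package provides an opentelemetry-instrument command that applies the available instrumentors at startup without touching the code at all.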
In addition to that, you can include within the
code specific manual instrumentations, which basically says,
I'm defining a scope and I want to track that scope in
the code manually. You can think about it like logging on steroids.
So it's not just a message here
or this code was called, but it automatically tracks who called
it, the duration, start and end. So for example,
here we see authenticating the vault owner with the key, and we create a scope.
This is the Python way of doing things; we call this scope
"authenticate vault owner and key". But of course there are equivalent ways to do it
in every programming language, and there is ample documentation
about how to use it with OpenTelemetry.
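Just as a rough sketch of what that looks like in Python, using the standard OpenTelemetry tracing API; the function, attribute and helper names here are illustrative rather than the actual Gringotts code:

```python
from opentelemetry import trace

# A tracer identifies the source of the spans; the name is just a label.
tracer = trace.get_tracer(__name__)

def authenticate_vault_owner(owner_id: str, vault_key: str) -> bool:
    # Everything inside the 'with' block is recorded as a span: its caller,
    # duration, start/end timestamps, and any exception that escapes it.
    with tracer.start_as_current_span("Authenticate vault owner and key") as span:
        span.set_attribute("vault.owner_id", owner_id)  # illustrative attribute
        return check_owner_key(owner_id, vault_key)     # hypothetical helper
```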
Now once we have this up and running,
let's see how we can actually get data out of it.
So let's take a look at a quick example.
So here we have our application.
It is the API for the
Gringotts vault that we're using. And let's trigger
a specific operation. Let's say we want
to trigger a vault appraisal, as we
mentioned previously. In this case we just provide the
vault ID, we run the operation, we get back a result.
Something happened. What? Who knows, right? And at
this point we can go back to the IDE and maybe look at the code,
imagine what would happen. Or we can, as we wanted to
accomplish in the beginning, kind of get that immediate feedback
about, okay, what happens when this operation is called.
And here is Jaeger. It's a very popular
open source tool that I like a lot that just allows you to visualize
the traces. It's very easy to set up. You just export data
to it from OpenTelemetry as part of
the boilerplate setup, which is very easy. I won't go into it in depth now,
but it's very well documented.
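For reference, that boilerplate is roughly the following; it's a sketch that assumes a local Jaeger instance accepting OTLP on the default gRPC port 4317, and the service name is just illustrative:

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Name the service so Jaeger can group its traces ("vault-service" is illustrative).
provider = TracerProvider(resource=Resource.create({"service.name": "vault-service"}))

# Batch spans and ship them to the local Jaeger collector over OTLP/gRPC.
exporter = OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
```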
So let me look at the latest data that we have about our
vault service. And immediately you can see this is the operation I just triggered.
And we can see that there are two services actually in this distributed
trace. One is the goblin worker, the other is the vault service.
And we can actually go and kind of explore the entire request.
And bear in mind, all of this data I got for free, just
by enabling the instrumentations. So we have
here the HTTP call to the appraise endpoint. We have
here some database statements that are happening.
We have here some logical spans
that we declared in the code, like in this case.
And we can go all the way to see the individual
DB statements and we
can see that the goblin worker picked up the request here and what
happened there. So it's very easy to track and visualize
exactly what happens with such a request.
So this kind of impediment or information
gap between the developer and production no
longer exists once we have the code
being monitored in this way and all of that information is now
readily available for the developer.
And if you think about it,
we can actually make this information much more useful than just validating the
code once we've deployed it. So as you
might think, this is a loop. It's not like every
piece of code that we write is completely new. We continually update code,
and there's already very useful information about the pieces
of code that we're writing that could help us design it.
So if you think about it, even before Bill got started on his
upgrade feature,
there are a lot of questions that, if
he had access to the right data, he could actually ask:
who is using this code? Is it even used? Who will break
if I change it incorrectly? What are some issues I
should know about? What's the baseline I should compare myself to?
What should I optimize for? Where does concurrency happen?
And then later, when reviewing the changes, we can get
data from the test environments and start asking more
questions like, what should I watch for? What are some historical issues
associated with this code?
What can we learn from that same observability data
just by looking at the tests? So there is a lot of data
here, and that data has the potential to completely revolutionize
how we write code, because it can be available at every
turn, not just when we validate our code changes, but also when we
design them. Because whether we look at it or not, the data
is already there. So now the question is,
can we open our eyes and actually use it?
But the answer to this question is that
99% of the time, Bill would still not use that data.
And I spent a lot of time trying to figure
out why that is the case and
why, despite my best efforts to convince Bill,
hey, look, there is this really awesome pile of data over there.
Why don't you look and see what you find?
Often Bill, or whichever developer it
was, would prefer to move on to the next feature.
And here are some reasons that I found. There are a few small
reasons, and I think one very big one. So the first has
to do with expertise. And it's not by chance that I put here a picture
of house repairs, because that's my personal blind spot and something that
I would procrastinate on as much as possible rather than do.
And it's the same for many developers with domains that they're less
familiar with. For example, not all of us have brushed up on our statistics
101. And to make the data that I just showed
you useful, I actually need to know how to remove outliers,
to calculate the median or the p99,
sometimes to do more complex statistics, just to get to
meaningful conclusions about what this means
about my code (see the small sketch after this paragraph). In addition, I need
to actually stop what I'm doing and start learning a new tool,
move between my IDE and whatever
dashboard it is continuously, and kind of look for
trouble, in simple terms.
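Here is roughly what I mean; a tiny sketch using NumPy on a hypothetical list of span durations:

```python
import numpy as np

# Hypothetical span durations in milliseconds, pulled from a tracing backend.
durations_ms = [12.1, 13.4, 11.9, 250.0, 12.7, 14.2, 12.3, 13.1]

median = np.percentile(durations_ms, 50)  # the typical request
p99 = np.percentile(durations_ms, 99)     # tail latency, dominated by the outlier

print(f"median = {median:.1f} ms, p99 = {p99:.1f} ms")
```

The median tells a very different story from the p99 here, which is exactly the kind of nuance that raw log lines don't surface by themselves.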
But I think the more profound reason is
that it's not continuous. So consider
the equivalent and symmetrical process of continuous integration,
which takes code into production and
happens continuously. You don't think about, hey, I'm going to run
some tests. These tests run automatically just by checking
in code. If you're using continuous deployment, you don't
think about, oh, I'm going to deploy to production. No, you just designed a
very good pipeline, and as a result of your
commits and merges and pull requests, everything will get deployed into production.
In the same way, if we want to make this information useful, it can't
be something that I need to think, oh, let me ask Bill to go search
for stuff after he did his check in. That needs to be continuous,
it needs to happen automatically.
And this is my own kind of personal journey with observability
and continuous feedback, because once I noticed that, I became
really obsessed with the idea of how we can actually create
a continuous feedback platform, something that can
continuously look at the data that the application is
already collecting with technologies such as OpenTelemetry and try
to make that extremely useful for the developer.
Now I want to show you an example of this, and by the way,
I'm very happy to see that there are other tools, platforms
and ecosystem libraries, besides Digma,
which is the one that I'm working on, that are providing the same value.
I want to show you an example, not particularly to talk about Digma,
but to just show you my vision of
where I think development is changing towards, and
what a modern developer might do in his code that's very,
very different from how we code today. So to
do that, let me pull up that same code that we were
looking at earlier. Let's look at this vault service.
And what I'm going to do now is simply enable
continuous feedback. In this case, one of the outlets of that feedback is
an IDE plugin that I'm going to enable.
Now, bear in mind, I'm looking at the code. I have no idea
whether it's good or bad or what's going on with it, but now I've turned
on these new spectacles, which are basically the information
that I get back from, in this case, Digma.
So immediately I notice things about this code
and I can drill in to know more. I can see that this
is actually an area of the code that sees pretty low traffic. I can find
an issue: in this case, there's an N+1 query
that can be very easily identified by
looking at the traces; I just need to do it (a minimal sketch of that query
pattern follows below). I can look
at the bottlenecks, understand who's using this code and so on. And let
me transition over to where this issue is happening. And I can
see the culprit, the query that in this case
is repeating over 101 times in each trace,
and I can understand who's being impacted by it. And again,
this time I can look at the trace visualization.
Sorry, in this case it doesn't exist on this machine, but I
can see that trace visualization from
the point of view of the issue.
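To make the N+1 pattern concrete, here is a minimal sketch in SQLAlchemy terms; the Vault and Item models and the session are hypothetical stand-ins for the real code:

```python
from sqlalchemy.orm import joinedload

# N+1 pattern: one query for the vaults, then one additional query per vault.
vaults = session.query(Vault).all()
for vault in vaults:
    items = session.query(Item).filter(Item.vault_id == vault.id).all()  # repeated per vault

# The same data fetched in a single round trip using an eager join,
# which is what the repeating query in the trace suggests we should do.
vaults = session.query(Vault).options(joinedload(Vault.items)).all()
```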
So instead of looking around fishing for
trouble, I'm kind of starting from the code, starting from
a concrete example of something that I found,
and then I can continue to explore and kind of
look for trouble in dashboards. But it's now contextual
to my work. So the vision is taking all of
that amazing information. And I know that
all of that information exists, specifically
because whenever something is wrong, we go
and we dig deeper into logs and traces
and we find troves of interesting things that, if only
we had known them earlier, we would have fixed, and making
that just a part of my coding, making it
so much closer to production. This code I'm looking at right now is already
running in production. And here is what those production
insights are telling me. And by the way, I'm learning a lot of things.
For example, I'm seeing that this code only
gets called in production and not by my local tests. I can
also see that there is code here that's never reached, which is
also interesting. So there's a lot of things that we can do just
by putting on these new spectacles that allow us
to understand how this code is actually running in production
and not just theoretically.
So how does one get started with continuous feedback? These are
completely new, uncharted waters, but there are a
lot of people who are also making really great forays
into this new and great methodology.
So first of all, I've created a web page which has
a lot of quick-start links that you can get started with.
So if you go to continuousfeedback.org,
let me go there right now. Just one second.
I've included some really interesting links,
including getting started with OpenTelemetry and Jaeger.
I talk a little bit about Digma here, some example projects,
including the project that you just saw now with the Gringotts
vaults, which you can easily get started running just using
Docker Compose. Everything here is containerized and so on.
So that's extremely easy and something that I would recommend
everyone doing.
I think one of the more fundamental things that need to change is more
around culture. So in a similar way to what we had
when we got started with testing, for example, it was very
hard to convince developers that testing is a part
of their job. I remember having conversations with developers
telling me, you know, this is QA's job, why am I doing testing?
And in a similar manner, I think that today we're kind of taking the
next step and saying, well, we need to own our code all the
way to production, and that's a cultural change that's already happening.
But I think embracing it and understanding what it means in terms of the ramifications
for me as a developer is something that we all need to
kind of learn more about.
If we don't use and harvest the observability data
we already have, then why are we collecting it?
I think that's the second really important point. I've seen
organizations that had amazing dashboards
for observability, and they might as well have been screenshots or pictures
on the wall. If we don't actually use them, make sure that
we're using them in practice,
then there's no point in collecting them, right? And if we're not
using them, we're also kind of creating a very crippled
process because we don't have any feedback loop between
what is happening and what we're doing.
Feedback is something that we need to implement in the process.
And in the same way that we have scrum rituals like dailies
or scrum of scrums, we also need to have feedback
meetings. So it needs to be on the agenda. And I'm putting
on my product manager hat here. If it's not on the agenda,
we're just going to be completely biased towards the next feature and
the next feature, and we're not going to care about the feedback that
we're receiving and whether it's actually doing what we think it's doing.
So we need to have at least a biweekly feedback
meeting where we're discussing the features that got into production. What do we know about
them? What more do we need to know about them? This is the only way
that we can make it a part of the agenda.
So I'll be very happy to hear your
thoughts about this really interesting topic and also
to have you join in the thinking. We have a Slack group that's also
in the links that I presented here. You're welcome
to join it and share your thoughts. My contact details are
also there; I'm always happy to talk about this topic.
This is it. It was a really great and amazing opportunity to
talk here at Conf42. And please do
reach out. Thank you.