Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, I'm Shakar from Groundcover, and I'll be talking about cloud native
observability at scale today.
A little about me: I am the co-founder and CEO of Groundcover. I have many years
of experience as an engineer and an R&D leader, and in the different positions
I've been in, I've also been responsible many times for creating and maintaining
the observability stack at the companies I've worked with. That is part of the
motivation for setting up and creating Groundcover, which is a company that
reinvents Kubernetes application monitoring.
Before talking about anything, I just wanted to set the ground about the three
pillars of observability. They start out with the bottom layer of logs, or log
management, which most teams know and use today in production.
The layer on top is infrastructure monitoring: basically figuring out whether
your infrastructure is healthy, and healthy enough to host your applications in
production. And the upper layer, perhaps the most intricate one and the one most
related to the business you're trying to create, is application monitoring, or
APM (application performance monitoring) solutions. APM is super critical to
monitoring what your business is trying to achieve, since it correlates really
deeply with the targets you set for yourself and for your customers.
Just to describe two of the critical missions an APM usually performs. One of
them is trace capture: actually capturing the behavior of the API-driven system
you're running in production. Most companies today are based on a microservices
architecture, which is heavily API driven, and by capturing the different traces
between these microservices, teams can troubleshoot and figure out where things
start to break or behave differently than expected. The other one is highly
related to the SLOs the business is trying to maintain, and that's application
metrics, or golden signals.
In these application metrics you will find things like throughput, or requests
per second, error rates, or latencies of the different APIs, which measure how
fast or slow your application is actually doing its task facing your users. I'm
not going to dive in too deeply into why APM is so important, but this is a quote
from the Google SRE book just saying that if you could measure only four metrics
about what your user-facing system is actually doing, you should focus on those
four golden signals, since they can highly predict whether something is starting
to drift or break in your production. And that only goes to say how important an
APM is in the stack that you're maintaining inside the company with regards to
observability.
So if APM is so important, it must be adopted really well in the industry, right?
But reality actually says something completely different. Looking at the
statistics, log management solutions are highly adopted in the industry, with
almost 70% of teams actually using some kind of log management solution.
Infrastructure monitoring is not following too far behind, with almost 60% of
teams using some way to monitor their infrastructure. Yet application monitoring,
or APM, is clearly under-adopted, with only 22% of the population actually using
an APM solution of some kind. That raises the question of why this is happening.
If APM is so important, why is it so under-adopted in the industry as we know it?
To understand the answer to this specific question, we have to dig much deeper
into what an APM is and how it is built. There are many reasons why APMs today
are under-adopted.
But in this talk we're going to focus on one specific and painful answer to this
question, and it is the eventual understanding that APMs don't scale well with
modern architectures. The reason they don't scale well, just to jump for a second
all the way to the final conclusion of what we're trying to say, is that they
store tons of irrelevant data just to get you the little insight you need about
what's actually going on in production. Monitoring so much data basically makes
the cost unbearable, so teams can't really scale with their APM solutions as
their business and their customer base grow, and can't maintain them and use them
in production.
But another question is starting to emerge here. How did we get here? How did we
get into a state where an APM solution is storing tons of irrelevant data just to
give me the little information I need to monitor my production? To understand
that, we have to go back in time, ten or fifteen years, back to when APMs started
out as they are today. The solutions dominating the APM industry today were
formed over a decade ago, and they are based on a centralized architecture, which
we're going to talk about in a second. The tale of these centralized
architectures, and the architectural decisions the APM vendors made over a decade
ago, has led us to the point where we are today, where APMs just don't fit the
current scale of modern microservices architectures.
When things started out in observability, you had an infrastructure agent, which
is usually an external process of some sort, monitoring the infrastructure as the
bedding of your running applications in production, figuring out what the
infrastructure is doing and providing metrics about the infrastructure. But when
APM providers started trying to monitor applications and what those applications
are doing, they had to turn to a different solution, which was much more heavily
based on instrumentation inside the code. Instrumentation means that to monitor
the application, I have to monitor it from within; I have to give the R&D teams
or the developers some kind of SDK that they can instrument into their code,
basically integrate into their code. This SDK would allow me to monitor the
application from within, running as part of the application, giving me the
opportunity to suddenly measure things like the latency of a specific API or the
error rate on a specific API, things that I couldn't have done with an external
agent up until that point.
But what exactly is instrumentation? What does it actually mean? This is a sketch
of a typical microservices kind of behavior. A production system based on this
architecture would be very API driven. You would get some kind of request from
the outside world, from the user; some microservice would handle it and pass it
on down into the different microservices inside your production, taking care of
the different business logic you're trying to create, eventually returning some
kind of response to the actual user that triggered this entire flow.
Instrumentation means that you're actually injecting or inserting a small piece
of monitoring code that wraps this entire API-handling behavior inside each of
the microservices. For example, to monitor the latency of a specific API, I would
have to wrap it with external monitoring code given to me by the observability
vendor
to actually time the start and the end of the handling of, say, an HTTP request
flowing into my web server.
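As a rough illustration of what that wrapping looks like, here is a minimal
sketch of SDK-style instrumentation around a plain Go net/http handler. The
instrument wrapper and the recordSpan call are hypothetical stand-ins for a
vendor SDK, not any real product's API.

```go
package main

import (
	"log"
	"net/http"
	"time"
)

// instrument wraps an HTTP handler with vendor-style monitoring code: it times
// the start and end of each request and records the status code.
func instrument(name string, next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next(rec, r)
		recordSpan(name, r.Method, rec.status, time.Since(start))
	}
}

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (s *statusRecorder) WriteHeader(code int) {
	s.status = code
	s.ResponseWriter.WriteHeader(code)
}

// recordSpan is a hypothetical stand-in for the SDK call that samples the raw
// span and ships it to the APM vendor's backend.
func recordSpan(name, method string, status int, latency time.Duration) {
	log.Printf("span api=%s method=%s status=%d latency=%s", name, method, status, latency)
}

func main() {
	http.HandleFunc("/checkout", instrument("checkout", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	}))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Note how the handler body itself doesn't change; the monitoring code simply
brackets it, which is also why that code has to stay very cheap.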
But this specific choice of actually using instrumentation is part of the reason
we got to the point we are at today.
I mean, when APM vendors chose to invest in an SDK and instrumentation of code by
the developer teams, they basically did it because this was the best technology
to achieve their task over a decade ago. This also has a lot of implications for
what the architecture actually looks like. If I'm using instrumentation and I'm
inserting my own code into my user's application code, it has to be very simple
and it has to be very fast. The reason is that, for example, you wouldn't expect
your web server efficiency to decrease by 50% just because you're starting to
monitor it with an APM, right? It means that the code I'm wrapping my actual
application with, this instrumentation code, has to be very fast, so it doesn't
create bottlenecks or slow down my applications.
This means it has to be very simple, make very simple decisions, and run very
simple algorithms that keep it efficient, fast, and not error prone, so it
doesn't create any bugs or crashes that could also endanger my application. And
this means that eventually, instead of putting a lot of weight into the
sophistication of this instrumentation code, I'm basically moving all the
responsibility back to
the APM vendor. This is the basic architecture behind the centralized approach
that APM vendors chose over a decade ago. It says: let's make the SDK as simple
as possible. It will be fast, it will be free of errors, it will be very simple
to understand, and we'll let it sample raw data and just send it back to the APM
vendor. All the crunching, all the complex algorithms, and all the intricate
decisions we have to make as an APM provider will happen on the backend side of
the APM vendor. Things like capturing specific traces, things like creating
span-based metrics, these golden signals we're talking about, we're going to
create them from the actual raw data sampled on the customer side, on the APM
vendor's backend side. But this decision, even though very logical at the time,
created a major debt towards scale, and this is part of what is starting to shift
in the industry right now.
Basically, what it says is that you have to store raw data to get insights. So if
we look at all the green rectangles here as healthy requests and the red one as a
faulty request, you would actually store the spans of the different requests,
when what you wanted, most of the time, was just the digested insights around
them, like the golden signals, the metrics that depict their behavior in
production. You can imagine that if you have a million requests per second on a
Kubernetes cluster, for example, if you're a company working at high scale, you
wouldn't want to store that many spans and pay for their storage just to get
things like the latency of your different microservices, the things that would be
the entry point to figuring out if your production is doing okay or not.
The other implication is that by making simple decisions in the instrumentation
code, at the SDK side, you also store data at equal depth. Meaning that, for
example, if I want to decide what to store in different cases, that wouldn't
happen with a common APM today. So for example, imagine I want to get the payload
of the faulty red request here, since I want to troubleshoot with this payload.
It would usually mean that I would have to store payloads for all requests or for
none of the requests. But clearly that's not reality, right? Not everything is
equally important. If that faulty request comes one in 10,000 requests, you would
have wanted to get as much information as you could on this specific request,
like logs from the different pods communicating around this request, or other
requests flowing through the system at the time, to figure out if there's some
kind of interdependency. But for the green healthy requests, you probably want to
capture much less and pay much less for all this data being stored on the APM
vendor side.
And this trade off is exactly what we define as the visibility depth versus cost
trade off. It means that if you want to go deeper in the cases you care about,
like for example getting as much information as you can about that faulty
request, you have to pay an increasingly high price by storing an increasing
amount of data across your system, since you can't really decide what to store
and what not to store given the different cases you care about in production. And
we all know that intuitively. Every developer knows it from the different
experiences he or she has had handling some kind of observability stack in their
production environment.
We know it very well from logs, right? We store, say, warning log level and up,
since we can't store every trace, debug, or info message flowing through
production; it doesn't make sense, right? But imagine that you could get the info
or trace logs from the web server from the 10 seconds around any faulty request.
Wouldn't you want to do that? Clearly there are different depths of information
you would want to get in different cases. We also know it from traces really
well.
I mean, observability vendors or APM vendors allow us to randomly sample our
production to reduce the volume of data we capture. That would mean we capture,
say, one in 100 traces, but that's not exactly what we want. We want to capture
the specific traces we care about, like the ones that take ten times the time of
other traces, and not just randomly sample the production.
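To make that difference concrete, here is a small sketch, in Go, of random
sampling versus a latency-aware keep decision. The moving-average baseline and
the ten-times threshold are illustrative choices, not a description of any
particular vendor's sampler.

```go
package sampling

import (
	"math/rand"
	"time"
)

// Baseline keeps a rough exponential moving average of latency for one API,
// a toy stand-in for whatever statistic a real agent would maintain.
type Baseline struct {
	avg time.Duration
}

func (b *Baseline) update(d time.Duration) {
	if b.avg == 0 {
		b.avg = d
		return
	}
	b.avg += (d - b.avg) / 16 // EMA with weight 1/16
}

// KeepRandom is classic random sampling: keep roughly one in 100 traces,
// regardless of whether they are interesting.
func KeepRandom() bool {
	return rand.Intn(100) == 0
}

// KeepTargeted keeps the traces we actually care about: anything taking more
// than ten times the usual latency is always kept, the rest is sampled randomly.
func KeepTargeted(b *Baseline, d time.Duration) bool {
	slow := b.avg > 0 && d > 10*b.avg
	b.update(d)
	return slow || KeepRandom()
}
```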
So centralized architectures are being replaced today by something we call edge
observability. It's a terminology that's starting to be very common in modern
solutions today. Basically it says that the centralized architecture has reached
a point where it doesn't make sense anymore; it can't scale well, and it doesn't
make sense to maintain and pay for the data being stored by this approach.
Basically, the weight of the decision making and the sophistication is being
moved from the observability vendor's side to the actual agent or SDK running at
the edge, close to where the data is. And that's the future as we see it. Before
trying to describe what Groundcover does with this specific approach of how to
actually monitor with edge observability, I want to talk a bit about eBPF as an
enabler of all that.
eBPF is basically a revolutionary technology that has become more and more
prominent in the last couple of years. It's a very interesting technology for
many different use cases, from networking to security, to monitoring and
observability. Basically, what it says, without going too deep, is that you can
run complex business logic inside the Linux kernel, and for monitoring that
allows you, for example, to monitor all of user space, all the applications
running in user space, without actually being part of the code. You can, for
example, get things like golden signals, things that you would expect from an APM
that had to be very tightly integrated into the code itself, by just running an
external agent at the kernel level and capturing the same data.
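As a very rough sketch of what running such an out-of-band agent can look like,
here is a Go snippet using the cilium/ebpf library to attach a kprobe and read
counters that a kernel-side program maintains. The object file probe.o, the
program name count_tcp_sendmsg, and the map bytes_by_pid are all hypothetical,
and the kernel-side eBPF C code itself is not shown; this is nowhere near a full
agent.

```go
package main

import (
	"log"
	"time"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
)

func main() {
	// Load a precompiled eBPF object (hypothetical "probe.o") containing a
	// kprobe program named "count_tcp_sendmsg" and a map named "bytes_by_pid".
	coll, err := ebpf.LoadCollection("probe.o")
	if err != nil {
		log.Fatalf("loading eBPF collection: %v", err)
	}
	defer coll.Close()

	// Attach the program to the tcp_sendmsg kernel function. From here on,
	// every TCP send on the host is observed out of band, without touching
	// the application code at all.
	kp, err := link.Kprobe("tcp_sendmsg", coll.Programs["count_tcp_sendmsg"], nil)
	if err != nil {
		log.Fatalf("attaching kprobe: %v", err)
	}
	defer kp.Close()

	// Periodically read the per-process byte counters the kernel program
	// maintains, the raw material for throughput-style golden signals.
	for range time.Tick(10 * time.Second) {
		var pid uint32
		var bytes uint64
		it := coll.Maps["bytes_by_pid"].Iterate()
		for it.Next(&pid, &bytes) {
			log.Printf("pid=%d sent=%d bytes", pid, bytes)
		}
	}
}
```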
So eBPF is a really interesting technology, but the reason it's so interesting
for edge observability is that it suddenly allows you to achieve what an APM
would have achieved only with instrumentation deep inside the code, with an
out-of-band agent that just runs alongside the application without being part of
it. That suddenly opens up two interesting use cases, or ways to think about what
we can do. It means that everything we do inside this agent doesn't create an
overhead on the application. So for example, if we make some kind of complex
decision or run some kind of complex algorithm inside this agent, it doesn't
directly mean that we may slow down the performance of the application we're
trying to monitor. And that means we can do distributed decision making in this
agent which is much more complicated than before. If before we had to keep that
SDK as simple as possible, suddenly we can put much more sophistication into this
agent, since it's running completely out of band and not impacting the
application it's trying to monitor.
And this is exactly where things start to get really interesting in terms of what
you can do. So if the first thing we talked about was that I don't want to store
raw data and pay for so much raw data to get a simple digested metric like
latency on my specific APIs, we can suddenly move all that sophistication into
the eBPF agent. Imagine an API flowing into your application at a very high rate.
Instead of storing or sampling these spans and sending them back to the APM
vendor, we can now create the metrics inside the eBPF agent on the fly, without
storing the spans or shipping them out of the node or the actual host at any
time. And that allows us to store so much less data to create the actual insights
teams need to monitor production, like golden signals and application metrics.
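Here is a minimal sketch, in Go, of what creating metrics on the fly inside the
agent might look like: a per-API aggregator that turns every observed request
into running counters and periodically flushes only the digested golden signals,
never the spans themselves. The type and field names are illustrative, not
Groundcover's actual implementation.

```go
package agent

import (
	"fmt"
	"sync"
	"time"
)

// apiStats aggregates golden signals for one API endpoint without keeping any
// of the underlying spans around.
type apiStats struct {
	count    uint64
	errors   uint64
	totalLat time.Duration
	maxLat   time.Duration
}

// Aggregator turns a stream of observed requests into digested metrics on the
// fly, so only the aggregates ever leave the node.
type Aggregator struct {
	mu    sync.Mutex
	stats map[string]*apiStats
}

func NewAggregator() *Aggregator {
	return &Aggregator{stats: make(map[string]*apiStats)}
}

// Observe is called once per request seen by the agent; nothing is stored
// beyond the running counters.
func (a *Aggregator) Observe(api string, latency time.Duration, isError bool) {
	a.mu.Lock()
	defer a.mu.Unlock()
	s, ok := a.stats[api]
	if !ok {
		s = &apiStats{}
		a.stats[api] = s
	}
	s.count++
	if isError {
		s.errors++
	}
	s.totalLat += latency
	if latency > s.maxLat {
		s.maxLat = latency
	}
}

// Flush emits the golden signals accumulated over the last window (say, every
// 10 seconds) and resets the counters.
func (a *Aggregator) Flush(window time.Duration) {
	a.mu.Lock()
	defer a.mu.Unlock()
	for api, s := range a.stats {
		rps := float64(s.count) / window.Seconds()
		errRate := float64(s.errors) / float64(s.count)
		avgLat := s.totalLat / time.Duration(s.count)
		fmt.Printf("%s rps=%.1f err=%.2f avg=%s max=%s\n", api, rps, errRate, avgLat, s.maxLat)
	}
	a.stats = make(map[string]*apiStats)
}
```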
Another interesting point is that this also allows us to create variant-depth
capturing. So imagine I have, again, this API flowing in at a high throughput
into the application. I can decide to do very different things in different
scenarios. For example, let's imagine it's an HTTP API for a second. If I get a
bad status code returning from the server, like a 500, I may want to decide to
capture the full payload of this specific request so I can troubleshoot, and
store the logs from a few seconds around that request from the different
participants in this API call. But perhaps if it's a high latency event, like an
HTTP span suddenly taking ten times the usual time, I may decide to do different
things. For example, I may want to store other requests flowing to the server, to
figure out if there's specific API load from other APIs on the server at the same
time. I may also want to store the CPU usage of this specific HTTP server at a
very high resolution at the time of the request, to figure out if there was some
kind of CPU spike that eventually caused the slow response. And it also opens up
different behavior for the normal flow, right? For the usual APIs in a common
production environment, the normal flow is clearly very prominent, and most spans
are healthy and describe the normal case. So I may want to do something
completely different: I may not want to store payloads or logs or anything in
this situation, and just sample, say, 1% of the normal flow to allow the users to
see actual healthy spans and how they look in a normal flow. This variation opens
up a lot of different ways to store data based on the different scenarios the
user or the development team cares about.
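A minimal sketch of such a variant-depth capture policy, again in Go, might look
like the following. The scenarios and thresholds mirror the examples above (a 500
status, a span ten times slower than usual, the healthy default), but the
CaptureAction structure itself is purely illustrative.

```go
package capture

import (
	"math/rand"
	"time"
)

// CaptureAction describes how much data to keep for a single request. The
// field names are illustrative; a real agent would have far richer options.
type CaptureAction struct {
	KeepSpan       bool
	KeepPayload    bool
	KeepLogsAround time.Duration // capture surrounding pod logs for this window
	KeepConcurrent bool          // capture other requests in flight on the server
	KeepCPUProfile bool          // capture high-resolution CPU usage around the request
}

// Decide implements the variant-depth idea: go deep on the scenarios we care
// about (errors, unusually slow requests) and keep almost nothing for the
// healthy, common case.
func Decide(statusCode int, latency, typicalLatency time.Duration) CaptureAction {
	switch {
	case statusCode >= 500:
		// Faulty request: capture the full payload plus logs from a few
		// seconds around it so it can be debugged later.
		return CaptureAction{KeepSpan: true, KeepPayload: true, KeepLogsAround: 10 * time.Second}
	case typicalLatency > 0 && latency > 10*typicalLatency:
		// High-latency event: keep the other requests hitting the server at
		// the same time and a CPU snapshot, to explain the slowdown.
		return CaptureAction{KeepSpan: true, KeepConcurrent: true, KeepCPUProfile: true}
	default:
		// Normal flow: just sample ~1% of healthy spans so users can see
		// what a healthy request looks like.
		return CaptureAction{KeepSpan: rand.Intn(100) == 0}
	}
}
```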
And that breaks the trade off of storing too much information to create the
little insight you need. You can figure out what scenarios you care about and
store much more information about them, while storing much less information about
the other cases where you're not that interested in the details.
This may sound like theory, taking the centralized approach into the edge
observability approach and doing things differently to reduce costs and make
things much more scalable. But this is current reality, and it is the current
reality of Groundcover and how we're built. As an APM vendor, we're based on
exactly these two assumptions, or approaches, that kind of decentralize the way
an APM is built. One is that we use eBPF instrumentation instead of SDK
instrumentation inside the code. What that means is, first, that we allow an
immediate time to value, because we don't require any R&D effort in the process
of integrating our solution into the code. But we also get this out-of-band
deployment, which we just talked about, which allows us to take complex decisions
and create a sophisticated eBPF agent that can carry most of the load of the
observability at the edge side, where the data actually is. The other one is that
we're creating these edge-based algorithms to create metrics and to sample the
data smartly, so they can be built for scale. So we can scale very well with
production environments as they grow into hundreds and hundreds of microservices,
breaking the trade-off between how deep we want to get the data in specific cases
and the overall volume of raw data that would be captured to describe the
production and let users figure out if their production is doing well.
This is a really interesting discussion, and we think the industry is definitely
shifting from one approach to a completely new, modern approach. We would love to
answer any questions or keep talking about it in other discussions; we find this
topic really interesting and important for the entire observability community.
Feel free to join our Slack and ping us at any time with questions or ideas you
have, or experiences you have from monitoring your production at the scale you're
used to today. Also, feel free to contact me directly. Thank you, it was great to
talk about this topic and share our thoughts with you. Thanks, guys.