Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello, my name is Brian Loomis and I'm with Progress Chef.
I'll be speaking with you today about the practicalities of using OpenTelemetry, and telemetry in general, to provide feedback on business metrics and on operational systems.
We've built a prototype, which I'll share with you, based on some specific needs of operating a software-as-a-service, SaaS, offering.
Logging and tracing are fine for small numbers of customers and for direct developer interaction and debugging, but they have some limitations when you start getting into larger systems, especially with regard to understanding customer behavior within the context of what they're licensed to use, and maybe some limits on those licenses, and with automating the scalability of large clusters of services, right?
So when we get really big, those logs get similarly big, and they can't really drive much high-level behavior.
Recognizing that OTel and OTel-like systems have some press out there
indicating they're hard to implement, we found it pretty approachable, and we'll
try and give you some idea of the specific types of metrics we're trying to collect,
how we're using tracing, and events to make those actionable and to achieve
superior performance on our services.
Let's take a look at what's coming up.
There's a quick agenda.
We'll talk about topics in roughly this order here, a little bit about
traditional one way telemetry and some of the challenges we're hoping to overcome.
We'll talk about our design with feedback loops included.
And that's a key concept in there, that the telemetry is going to
drive some other services and other events elsewhere in the system.
Our example application, how enterprise systems work, and what we need from a licensing and business-metric perspective to show back to the customer, or our tenant in SaaS, and then we'll wrap up with some final comments.
The data we've collected through telemetry is pretty important in driving two
very important pieces of the business.
One is obviously revenue, and the other one is customer satisfaction.
Those two are pretty tightly coupled, right?
Our solution, the Chef 360 system that we're putting this into, is a multi-tenant SaaS application, right?
Meaning we have multiple organizations, and each of them may have one or more tenancies, or, I would say, operations and data subdivisions.
And the whole solution operates on consumption-based billing, right?
So it's truly as-a-service: you order up a certain quantity of operations that you want to achieve.
Fundamentally, we want the SaaS customer to dynamically change their demand
proportional to where they see value.
In DevOps, this means going from a limited prototype, evaluating
functionality and maybe some candidate workloads in a limited capacity, to
increasingly broader use across the enterprise, eventually having the most
critical systems under management.
Often this is tens or hundreds of thousands of machines to us.
The customer buys in incremental workloads of the operations they perform.
Seeing value in new cases as the system adoption grows, as they add more machines,
as they add new types of jobs in our case.
This presents unique challenges to the standard MELT model of how we traditionally look at logging, right?
Events in that model are usually tied to very small, smaller-than-business-increment operations, right?
So maybe an API usage, but not necessarily a single operation that the customer is paying for.
Or they're very strongly correlated with the logical implementation and not the business value.
Meaning that sometimes we're tracking events and logging based on an API, but that's not really what the customer is paying for.
We may be tracking things on a CPU basis or a disk-space basis, but the customer may be ordering in very different units.
Tracing similarly is constrained often to non production debugging
because it's so verbose.
And even that logging is often too voluminous to support any actionable
insights for either the service provider, our team, or for our customers.
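As a sketch of that mismatch, here's how several low-level API events might roll up into the one business operation the customer is actually paying for. This is our own illustration, not Chef 360's real API; the event and endpoint names are hypothetical.

```go
package main

import "fmt"

// APIEvent is a hypothetical low-level event, the kind a logging
// pipeline would normally emit once per endpoint hit.
type APIEvent struct {
	Tenant   string
	Endpoint string // e.g. "POST /jobs", "GET /jobs/1/status"
}

// CountBillableOps rolls many low-level API events up into the unit
// the customer actually buys: here, one billable operation per job
// submission; status polls and other chatter don't count.
func CountBillableOps(events []APIEvent) map[string]int {
	ops := make(map[string]int)
	for _, e := range events {
		if e.Endpoint == "POST /jobs" { // only the submission is billable
			ops[e.Tenant]++
		}
	}
	return ops
}

func main() {
	events := []APIEvent{
		{"tenant-a", "POST /jobs"},
		{"tenant-a", "GET /jobs/1/status"},
		{"tenant-a", "GET /jobs/1/status"},
		{"tenant-b", "POST /jobs"},
	}
	// Four API events, but only two billable operations.
	fmt.Println(CountBillableOps(events))
}
```

The point is the shape of the transform, not the specific endpoint: five or ten API calls in a row become one count on a business metric.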
And the final issue that we were facing was really that we're not being responsive enough at scaling for large numbers of tenants, right?
We'd often have to have them wait for a little while until we provisioned some more hardware or whatever it happened to be.
So there's a delay between when they want to scale up their demand and when we actually provision it.
To address this, we developed a telemetry-driven approach that provides visibility into the licensed features and the number of operations, maybe the demand quanta, that the customers ordered.
And that enables us to dynamically scale our workloads, and it also
enables us to bill very accurately.
So I'll keep going here.
This is a typical enterprise application life cycle, and certainly true for enterprise-grade SaaS applications, right?
A customer is going to purchase a tenant, a tenancy in the application, through maybe a license or a transaction in an ERP system.
They may install the app, it may be hosted in the hosted SaaS service,
or it may also be available on prem.
It may be in their cloud, right?
Through an AWS marketplace or an Azure marketplace, or maybe in an air gap
environment, even further removed from maybe internet accessibility, right?
Now, for your application, you want to have your application behave
the same way everywhere, right?
And so that's going to be one theme that we see coming through this is that our
telemetry structure is exactly the same, whether any of these cases are true.
If the application is multi-tenant like ours, then there's possibly more than one tenant per organization, right?
A customer may have a tenancy for manufacturing in Asia and one for manufacturing in Europe, and they may be separate, possibly the same org, but more likely there are going to be multiple tenants in there from different organizations that need really significant privacy and security walls between them.
And so one of the keys also is that our telemetry can't bleed through,
from one customer to the next customer.
And maybe not even between one tenant and another tenant within the same customer.
Once provisioned, each tenant is going to have users which
perform their operations, right?
The license may grant them a certain quantity of these, as a service, right?
You're ordering on demand in billable chunks, maybe, tranches,
or maybe by feature set even.
Whatever they'd like to purchase, right?
They may want this feature, but not this other one.
And that may change over the lifetime of their license, right?
It may be updatable.
If you're thinking in-app purchases, yes, that's certainly true.
Probably bigger things than Candy Crush power ups, but maybe more like
DevOps minutes in our case, in Chef 360's case, by the thousands, right?
So you may order in chunks of big computing power.
These are some of the events we want to capture, right?
We want to know what these are, and we want to translate them into
metrics to show usage over time.
We also want to know how the system, the underlying system, is performing.
When this increased load gets put on, we're going to take metrics on the average load, and that could be as simple as Kubernetes metrics.
In more complex terms, we call these aggregate events, and they inform our customer usage signals, which go back to the customer and tell them: hey, you're licensed for 500 units.
You're getting close.
We're going to throttle you down as you get closer to that.
Maybe you're at 95 percent of your capacity and you're
going to hit a limit there.
We'll also show that back to the customer on a dashboard that brings these OTel events and metrics back to them, for full transparency and control from their side, right?
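The throttle-as-you-approach-the-limit behavior above can be sketched as a tiny decision function. The 95% threshold matches the example in the talk; the function and status names are ours, not Chef 360's.

```go
package main

import "fmt"

// UsageStatus compares a tenant's consumed operations against their
// licensed quantity and decides what feedback to send: throttle as
// they approach the cap, hard-stop when they hit it.
func UsageStatus(used, licensed int) string {
	switch {
	case used >= licensed:
		return "limit" // licensed quantity exhausted
	case float64(used) >= 0.95*float64(licensed):
		return "throttle" // 95%+ of capacity: slow down, warn on dashboard
	default:
		return "ok"
	}
}

func main() {
	fmt.Println(UsageStatus(400, 500)) // ok
	fmt.Println(UsageStatus(480, 500)) // throttle (96% of 500)
	fmt.Println(UsageStatus(500, 500)) // limit
}
```

In practice this signal would be derived from the aggregate events described above and surfaced on the customer dashboard.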
We want them to see the workloads in business terms, right?
We want to see it in units of measure that they're buying, not necessarily what we're
charged as the service provider, right?
We may be billed for CPU minutes or disk space or something like that,
but the customer is actually buying something that's different, right?
They're buying a capability.
In business terms, right?
They're buying DevOps minutes or a number of jobs.
We typically bill and show Chef 360 by the logical job running on a customer node, which we call a job node, strangely enough.
Okay.
And that's our per instance measure that they can scale up or scale down, right?
They don't necessarily care how many EKS containers are running in the background.
We want to abstract our platform cost effectively from what the customer is
actually buying as a service, right?
So why not do non-OTel-type solutions, right?
Why do we go to the extra trouble of adding OTel in here?
The typical scenario we're dealing with, when I say enterprise-grade solutions: imagine you have a customer that has 100,000 containers that only last for a few minutes each.
It's a very dynamic environment, scales up and down very quickly.
Some of those particular nodes, as we would call them, execute very quickly, and then they disappear, right?
They get dehydrated.
They're ephemeral, as we call them.
These are some of the questions we realized we were asking ourselves as we went down this OpenTelemetry journey, and why OpenTelemetry?
Think of how many times you've just shipped a system to production that has had debug tracing turned on.
You're like, just in case something goes wrong, I'm just going to turn
on all the logging and we'll scale it down if there's, a need to scale
it down and just take less logs.
But we're designing for the doomsday scenario where we have to have all the logging all the time, right?
And this gets worse, obviously, with distributed systems, because you have uncorrelated logs, right?
You have to bring them all together somehow; they may have come from different time zones, and they're interleaved.
You might call this Datadog diving, I don't know, I'm making that one up.
But also we have to realize that someone's going to have to look through all these logs, right?
So even though we may be taking in a lot of trace information, a lot of debug on every API call, on every request, every response into the system, fundamentally, at the end of the day, your developers are then your tier-three support, right?
If something goes wrong, the developer is going to have to look through this massive data and try to intuit something from it.
I would also argue that without an intelligent assistant or some way of parsing it, that becomes unscalable really quickly.
Imagine you have a hundred customers that are feeding into a SaaS service log.
Figuring out where that one customer was when they had an issue is looking for the proverbial needle in a haystack, right?
How many times have you just bolted on events, every time
an API is called, right?
Every time an endpoint gets hit, you may be logging an awful lot of data, but you're also probably logging the request and response.
Do you actually take the time to apply privacy filters to that export before it gets out to maybe the service provider or something like that?
And many times we go: oh, customer, you just go ahead and collect all the logs, and then you implicitly tell us that we can see everything in those logs, and you just ship them to us, you zip them and FTP them over, and we'll take a look at them, right?
It's a very asynchronous process, but it also doesn't provide any privacy.
And in a multi tenant environment, this is very important, because we're going
to want to do things like auditing events and show the customer who had access
to do what thing at what point in time.
But we're looking at a multi-tenant log scenario unless we've parsed them out by tenant.
Another one we run into a lot is how many times do you then go, okay,
I've got this big body of log files.
I'm going to have to put a reporting solution on top of it, right?
I've got to build a dashboard.
I've got to do all this stuff.
And then you realize, Oh, wait, there are tools out there.
And other people seem to be able to do this a lot easier than I'm able to do it.
And then you go, okay, what if that log file isn't actually answering the question, the fourth bullet on here: what if that telemetry is not useful to the rest of the people in your organization?
It's useful to developers, maybe, to troubleshoot a particular debug instance, to find a bug or something like that.
But what about the next time the customer comes up for a licensing discussion?
And you want to be able to show them, Hey, you've been using,
80 percent of your allocation.
Do you want to buy another 10 percent or something like that so that
you have enough capacity to grow?
We're not collecting those metrics.
What we're collecting is a lot of very low level APIs, right?
And this drives some very particular customer behaviors.
I put a couple in here.
The customer may overbuy, in which case they're spending a lot of extra
money on their solution, which kind of looks like shelfware to them, right?
And they go: ooh, maybe when budgets get tight, I actually go in and evaluate how much I'm spending on this solution over here.
Or, even worse behavior, they limit themselves and go: I don't want to go back to procurement and add another quantum of things.
And so they're constraining their own value, because they can't see how much they're using that solution, right?
And they don't want to go over, because that would be very difficult organizationally.
And they'll not even ask you the question of could I get better service, right?
And so with both of these, the business metrics get impacted, and your revenue as a service provider gets impacted, because the customer is holding back from the conversation you really want to have, which is: can I pay for just what I'm using?
Okay, and the last one is more of an operational concern.
We call this the "horizontal pod autoscaling only gets you so far" problem.
What do you set your pod size to, if you're adding multiple customers in there that are variable-sized, may get bigger over time, and may eventually, either themselves or in combination, exceed the reasonable size of the cluster, right?
You have vertical pod auto scaling.
You can add more physical nodes.
You can get monolithically bigger.
But how do we actually get real auto scaling that can burst out?
Here we're sharing the architecture diagram behind our solution, showcasing multi-stage telemetry processing and a basic dashboard.
On the far left side, you'll see the license service, which talks to our tenant provisioner service, which then allocates capacity in one of potentially many Kubernetes clusters for the new tenant, based on current usage and requested usage.
If the customer gets too large for a single cluster, we want them to burst to the next available cluster and have that automatically be created for them.
Telemetry is captured identically on prem as in our SaaS service.
Multiple clusters will feed into a local telemetry dashboard for
a single organizational customer.
This is then optionally anonymized and fed to a global OTel collector operated by Chef and Progress, which allows our organization to monitor any on-prem clusters that opt in, and all of our own SaaS clusters.
There are three things to note with this.
The first one is that the lines going backwards between the local collector and the provisioner allow for feedback when the cluster gets full, at capacity, causing a new cluster to be spun up and new workloads to be allocated to that new cluster, cluster number two, for instance.
The second piece of feedback is that the global collector can send a signal back on billing, to tell the customer they're actually close to exceeding their contractual limit and see if they want to upgrade their licensed usage.
The third thing to note here is that basically we have the same tiered pattern for any customer type.
Our Helm chart does not change whether we deploy on prem, in an air gap, in a customer-supplied cloud, or in a cloud hosted by us.
It's the same exact pattern.
The customer would get on prem everything up to that orange local
customer telemetry dashboard.
And then we would still have some of the global services wrapping this, but
otherwise they get everything in the middle of this picture and they operate
exactly the same way and they can see all of the telemetry that they want to see.
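The first feedback loop, bursting to a new cluster when the existing ones are full, can be sketched roughly like this. This is a simplified placement model of ours; a real provisioner would also weigh region pinning and co-tenancy, and the capacity numbers here are illustrative.

```go
package main

import "fmt"

// Cluster is a simplified view of what the provisioner tracks:
// capacity in job nodes and how many are already allocated.
type Cluster struct {
	ID        int
	Capacity  int
	Allocated int
}

// Place allocates a workload of `want` job nodes to the first cluster
// with room; if none fits, it bursts: spins up cluster N+1 and puts
// the workload there. Returns the updated fleet and the chosen cluster.
func Place(clusters []Cluster, want int) ([]Cluster, int) {
	for i := range clusters {
		if clusters[i].Capacity-clusters[i].Allocated >= want {
			clusters[i].Allocated += want
			return clusters, clusters[i].ID
		}
	}
	// No room anywhere: the feedback signal triggers a new cluster.
	next := Cluster{ID: len(clusters) + 1, Capacity: 100, Allocated: want}
	return append(clusters, next), next.ID
}

func main() {
	clusters := []Cluster{{ID: 1, Capacity: 100, Allocated: 90}}
	clusters, id := Place(clusters, 25) // 25 doesn't fit in cluster 1
	fmt.Println("placed in cluster", id, "of", len(clusters))
}
```

In our architecture the trigger for this comes from the local collector's aggregate metrics flowing back to the provisioner, rather than from the placement call itself.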
This talks a little bit about the sample application that will be out there.
This is just a sample OTel implementation.
It's roughly equivalent to our prototype; it shows all the items that are in the diagram just before this.
Really, the differences here are in the bottom bullets: our main application is a bunch of scratch containers written for AWS EKS, in Golang, and so our OTel implementation is in Golang, not in .NET.
.NET seems to be an easier one to present to people and has a really good walkthrough and everything; you'll be able to download the code samples from the repository in the references here.
Just to quickly go through this: there's a client tool that asks for jobs to be submitted.
It basically runs for multiple customers and simulates demand into the system.
The ones in parentheses I'm not supplying; they would typically be part of this service, but a routing service, or a load balancer basically, takes those client requests and puts them in the right clusters where that job was deployed, right?
Obviously we'd have to talk to the provisioner service to go: okay, where did the customer with tenant A and this particular workload or license go?
Oh, that's over in cluster five, right?
So that routing service we leave as an exercise to the reader.
We have the job service, which takes client requests and may have some limits implied, right?
So it goes: okay, you are only licensed for five operations, so we'll cap you at that point and will not take requests after that.
But it also has its own internal capacity, right?
So that once it gets full, it's expecting that the load balancer or
the provisioner is going to redirect all that load to a new cluster.
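The job service's admission logic described above, a per-tenant license cap plus the service's own internal capacity, might look roughly like this. The type and status strings are our own sketch, not the real Chef 360 service.

```go
package main

import "fmt"

// JobService sketches the admission logic: enforce the tenant's
// license cap, and signal a redirect when this cluster instance
// itself is full so the load balancer can burst elsewhere.
type JobService struct {
	LicenseLimit int            // operations the tenant is licensed for
	Capacity     int            // jobs this cluster instance can hold
	used         map[string]int // operations consumed per tenant
	running      int
}

func (s *JobService) Submit(tenant string) string {
	if s.used == nil {
		s.used = make(map[string]int)
	}
	if s.used[tenant] >= s.LicenseLimit {
		return "rejected: license limit reached"
	}
	if s.running >= s.Capacity {
		return "redirect: cluster full, burst elsewhere"
	}
	s.used[tenant]++
	s.running++
	return "accepted"
}

func main() {
	svc := &JobService{LicenseLimit: 5, Capacity: 3}
	for i := 0; i < 4; i++ {
		fmt.Println(svc.Submit("tenant-a")) // three accepted, then a redirect
	}
}
```

Note the two outcomes are different in kind: a license rejection goes back to the customer, while a capacity redirect is a signal to the provisioner and load balancer.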
There's the cluster management service itself, which we just talked about, which will provision new clusters once at or near capacity.
And then finally, on the outside, the observability platform that you might
bolt on to your telemetry service might be a Prometheus or a Grafana
or something to visualize this so that an operator can go, okay, what's
actually going on in my cluster?
It might bring in the Kubernetes metrics as well, CPU, disk, network load, things like that, as well as the business metrics that we're getting from the OTel side.
Coming over here, I wanted to show you a quick animation, and I'll see if I can play this video.
I did this in ChatGPT, so apologies for the simple nature of it and everything.
I put the prompt in the repository as well, so you'll see that out there if you want to do your own animations and things like that.
This basically shows a red and a blue customer, each with a limited number of jobs.
eventually the starting cluster gets full of requests and
bursts out to a second cluster.
And you'll see that blue eventually goes, Oh, wait, I'm limited by number of jobs.
I don't want to have to wait like red customer did to start my next one.
I actually just want to go ahead and bump up my license and you'll
see all of blue's jobs then come in.
So I'll just play this quickly for you here.
Red has to wait to submit his next job.
Red wanted a total of eight, so it's done for right now.
Let's see what blue does.
Ah, blue asks for five more.
Has a little more space in there.
So his are just going to go into cluster number two, which is a little bit bigger
size than maybe cluster number one was.
And there we go.
Blue is done.
There we go.
So let's talk a little bit here about the differences between where we were, our legacy solution, and our new solution, and especially the types of metrics that we're able to track right now.
And I apologize, you'd probably want to blow this up on your screen to see a little bit of the detail, because there was so much stuff that we added in here.
Really, there are three big things to note here.
One, we're using tiered OTel collectors, right?
So one set of collectors will forward those over OTLP to another, the global OTel collector, right?
That gives us a chance to anonymize data before sending it on to the next stage of processing.
It also lets the customer, who's in control of that local one, opt in to letting us see some, none, or all of their data.
Having that local on-prem collector lets the customer manage their own telemetry, but if there's an issue, they can change how much they're sending us, right?
We could be opted in to help troubleshoot, and then they can dial it back when they're all clear.
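The anonymize-and-opt-in stage at the local collector can be sketched as a filter over telemetry attributes. The opt-in levels ("none", "metrics", "all") and the attribute names here are hypothetical, not OTel semantic conventions or Chef 360's real schema.

```go
package main

import "fmt"

// Anonymize decides what flows from the customer's local collector up
// to the global collector. "none" means nothing leaves; "metrics"
// strips identifying attributes; "all" forwards everything.
func Anonymize(attrs map[string]string, optIn string) map[string]string {
	if optIn == "none" {
		return nil // nothing leaves the customer's collector
	}
	out := make(map[string]string)
	for k, v := range attrs {
		// Strip identifying fields unless the customer opted in to "all".
		if optIn != "all" && (k == "tenant.id" || k == "user.email") {
			continue
		}
		out[k] = v
	}
	return out
}

func main() {
	attrs := map[string]string{"tenant.id": "t-42", "jobs.count": "17"}
	fmt.Println(Anonymize(attrs, "metrics")) // only jobs.count survives
}
```

In a real deployment this logic would live in a collector processor rather than application code, so the customer can change the level without redeploying services.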
And the third thing to note here is that the business metric of job nodes, the sizing of a job, right, our sort of quantum, the little red and blue dots from before, how big or small it is compared to the capacity of the cluster, lets us decouple the actual cluster the customer is running on from the service that they're buying, right?
They're buying a set of those dots.
They're not buying whether it's in cluster number one or number two.
They may be on a portion of one cluster, maybe in there with other tenants that sort of fill it up.
Or they may span multiple clusters because they've gotten very large.
Or they may even decide that they want to move some workload and expand on prem, or vice versa, from on prem into the cloud, to gain efficiencies.
That way the customer is paying for the operations they need, not more or less.
Okay, so can you do this in logging alone?
You can.
For all the questions that we brought up here, logging is certainly an easy solution, but it does have a couple of drawbacks, ones where you're going to have to write some code, some bits, around things that OpenTelemetry really gives us for free.
It doesn't help us, for instance, when we get to implementing privacy and auditing.
The logs are low tech, right?
They're shared between organizations at a very granular level.
You want to be able to get rid of PII in there.
You want to be able to provide traces for debugging.
You don't want all the other noise that's going on from maybe other
tenants in the infrastructure, even if they are from the same organization.
Telemetry processing can give us that, right?
It can aggregate some of the things that we would normally track as five or ten different API calls in a row, and go: that's one count on a business metric, that is one operation that we would bill you for.
It basically helps us as a SaaS provider as well to isolate customer data, right?
So that helps too, because we'll have a little customer ID on there.
The other thing is that remember we talked about customer buying behavior.
We don't want them to behave irrationally, right?
We don't want them to overbuy because they're afraid they're going to run out.
We don't want them to limit their usage because they're afraid they'll run
over in terms of billing or they'll have dynamic billing applied to them.
We provide customers a flexible mechanism where they can buy chunks, quanta, of operations, right?
So you can buy 500 or a thousand, in chunks of reasonably useful increments, and then that set of increments mirrors their actual usage, right?
So they see steps up in their licensing.
It's not a gradual curve, or some performance problem where they maybe submit a giant job and then realize, oh, that wasn't the job I wanted, and it spikes their load, and we don't have a way to go: wait, let's hold off on that one, have that discussion with the customer, ask if that's really what they wanted, and get them into the plan they're really looking for.
What we don't want to have is a mismatch between billing and the value they're actually seeing, right?
So the business metrics give us that aggregation, that abstraction.
Okay.
And then finally, of course, on the troubleshooting side: our operators already knew, with our previous implementation, that with 100 customers in the same logged cluster, it's tough to follow a single logical operation through, right?
Much less through one server.
Then think of 30 or 40 containers, where you're having to aggregate across them and trace between them what really happened, and you don't want to bother the other 99 customers in that cluster.
Okay.
So some technical notes on the prototype here.
We're using simple OTel emitters and collectors, the local collectors followed by a global collector again.
Each of those is a container that can embellish those input events, the tracing and the metrics, and create other events as part of feedback to the manager processes, like provisioning and billing.
Decoupling our cluster capacity from logical customer deployment helps solve the "HPA only gets you so far" problem; HPA being horizontal pod autoscaling, right?
If a customer needs 500 scale units and we can only get up to 100 per cluster, then our provisioner knows it has to put another four out there.
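That back-of-the-envelope math, round the tenant's scale units up to whole clusters, is just a ceiling division; the function name is ours.

```go
package main

import "fmt"

// ClustersNeeded rounds a tenant's required scale units up to whole
// clusters: 500 units at 100 per cluster means 5 clusters total,
// so with one already running, the provisioner adds four more.
func ClustersNeeded(units, perCluster int) int {
	return (units + perCluster - 1) / perCluster
}

func main() {
	fmt.Println(ClustersNeeded(500, 100)) // 5
	fmt.Println(ClustersNeeded(101, 100)) // 2: a partial cluster still counts
}
```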
There are some things we don't do yet, right?
One is being totally transparent with moving workloads between cloud and on prem.
That's tricky; it's not an easy thing, and it's not something that OTel necessarily gives you out of the box.
Another is pinning allocations from that provisioner to specific clusters, so you can stay in a GDPR region or have some special licensing like that, or putting true AI in there, or maybe machine learning, to make intelligent provisioning choices.
Those are all areas for us to grow.
Again, to summarize new versus old, this is at the business level here.
We're using tiered OpenTelemetry.
This helps us grow with our customers and be very transparent.
We have the same code base for on prem as for our hosted service.
And it's a multi-tenant solution where the customer sees their own dashboard and controls how much they want us, as the service provider, to see.
Between this and another complementary on-prem solution, we can bring business metrics together with the customer licensing and Kubernetes performance data, to show a more complete picture back to the customer, so they can take their own appropriate action and really self-serve based on what they want to use, right?
Isn't that the whole purpose of telemetry?
And finally, I'll say why we picked OpenTelemetry.
There are a lot of other solutions out there: Prometheus, Datadog, Splunk; a lot of these will enhance logging to a great degree.
We didn't want something that would add extra load onto the services that we're running.
It had to be something native in the service, so it could emit telemetry without the extra burden of another pull mechanism or an agent of some sort also running on there, right?
Those options weren't really, I would say, cloud native, or really not container friendly.
So finally, here are some resources for the talk.
An OTel getting-started in .NET: really easy, takes five minutes.
An OWASP logging checklist: if you haven't been out to the security site and seen what you should log and what you shouldn't log, it's a good example of what the security folks might ask for.
And our repository for our .NET sample, the one that I wrote, which will be available within the month.
If you shoot me your email, I'll notify you directly.
Otherwise, just follow the space and you should see the update when it hits.
And then finally, the documentation of the Chef 360 system, if you want to understand some background on what we talked about: jobs and runners and things like that.
Okay, and that actually should be GA within days of this talk as our hosted service.
So we're pretty proud of that.
Finally, thank you for listening to this talk.
We wish you a very helpful and inspiring Conf42 from Brian and Chef.
Goodbye.