Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone.
Welcome to our Conf42 observability session, where we will talk about how we
built an observability platform using Fluent Bit, OpenSearch, and Prometheus.
My name is Nian Costa, and I work as an associate technical lead at WSO2
within the Choreo observability team.
And I'm Isla, and I'm a senior software engineer within Choreo observability.
Before we go into the observability platform, let me
first explain why we needed that.
Every company nowadays is a software company.
For them to serve their customers, they need some sort of software,
and this all starts with the developer writing some code and
committing it to a code repository.
This code needs to come out as an end product for their customers.
This could be something like an API, a web application, or a mobile app.
There are several different things that the code must go through
for that end result to happen.
This could be security scans, CI/CD processes, and so on.
To do all of that, a platform is required: an internal developer platform.
Building an internal developer platform is certainly possible for every company.
However, the problem is that it takes a lot of time and expertise.
We at WSO2 have already built a platform called Choreo, and
we are providing it as a service.
Now, an internal developer platform needs several different parts,
and observability is one of them.
That is because when you run some sort of a program, you need to
understand how it is performing.
You need to know whether there are any problems within it, whether
there are any errors, and so on.
And to do that, you need observability.
Both of us are in the Choreo observability team, and we built this observability
platform within our team for Choreo.
This talk is about how we created that observability platform.
Let's talk a little bit about the history of this observability side of things.
Initially, WSO2 was heavily invested in Azure, and the Choreo
platform we were building was heavily centered around Azure as well.
For the observability part, we also used Azure-native
solutions like Log Analytics, ADX (Azure Data Explorer), and Event Hubs.
From this simplified architecture diagram, you can see what our
initial setup looked like.
We have a few services running on our nodes, and each node has the Azure OMS
agent. This agent extracts the logs and telemetry, such as CPU metrics, from
the services and writes them to Azure Log Analytics.
For tracing and metrics, we deployed the OpenTelemetry Collector, to which
all the services publish data, and it in turn publishes data to ADX.
We wrote a simple API, which queries both Log Analytics and ADX
for both logs and metrics.
This data is then published to our dashboard, where we show the
logging and metrics views.
However, there were some challenges after we kept running this for some time.
The first one was that observability doesn't come cheap.
Publishing logs and metrics into a cloud platform and querying them costs a lot,
so we wanted to try and reduce this cost as much as possible.
Furthermore, although we started with Azure in the first place, as time went
on, we got requirements from customers to support other cloud providers.
Some wanted to have their Choreo data planes, which are part of the Choreo
platform, on AWS, and then there were requirements to run it on Vultr.
Then we also wanted to support on-premise deployments as well, so at
the end of the day, we had to support multiple different cloud platforms.
Also, we wanted to have more control over this observability stack.
That is because if we have full control over it, we would be able
to build more features on top of it.
So these were some challenges that we wanted to address.
And in order to do that, we decided to create a new platform
using some open source tools.
So with those challenges, we defined our next-generation observability stack
with two main perspectives in mind. For Choreo users,
our primary customers, we wanted to provide speed, meaning
drastically reduced log latency.
We also wanted to provide smarter tools, such as
UIs where you can see how all their services are interacting.
Another key decision we made was that we want this
platform to be zero instrumentation.
That means the user doesn't have to do anything; they just deploy
and we take care of the rest.
On the platform side, one of the main concerns we had was cost,
because observability is not cheap.
We were racking up a lot of infrastructure cost, and we wanted to see if we
could reduce it. Another problem that arose, as Nian said, was cloud agnosticism.
Even though we initially developed this system to run within Azure, there were
new prospects coming our way asking, can we run this on AWS, can we run this on-prem?
So we wanted to provide the observability stack for those customers as well.
And within the EU region especially,
data sovereignty is a really important topic.
For example, if a Choreo cluster is running in the EU region, we
can't export that data to our central Log Analytics solution running in the US.
So we wanted to make sure we could scope these logs to the data plane
itself, so we comply with this regulation.
Another main thing we wanted was to have control over our logging
stack, so we can build and ship features faster since we have full control.
So we decided to create this new platform, and we decided to use OpenSearch,
Fluent Bit, and Prometheus as the main tools.
Fluent Bit and OpenSearch were used for the log side of things.
In Choreo, when users come and deploy their applications, they get
deployed within a Kubernetes cluster.
So we have Fluent Bit running as a DaemonSet, collecting logs from all these
containers and publishing them to OpenSearch.
OpenSearch was used as the log storage location, and we use the OpenSearch
operator to manage OpenSearch.
Finally, we use the logging API, our own internal API,
to do the log fetching part.
When a user goes into the Choreo console, the UI makes a request to fetch the
logs; the request comes through some ingress controllers and the API gateway,
and then it reaches the logging API, which talks to OpenSearch, fetches the
required logs, and sends them back to the Choreo console.
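To make that flow concrete, here is a minimal sketch of the kind of query such a logging API could run against OpenSearch, using the opensearch-py client. The host, credentials, index pattern, and field names (which depend on how Fluent Bit's Kubernetes filter enriches records) are assumptions for illustration, not our exact setup.

from opensearchpy import OpenSearch

# Placeholder endpoint and credentials; in a real data plane these come from
# the cluster's service DNS name and a secret.
client = OpenSearch(
    hosts=[{"host": "opensearch-cluster-master", "port": 9200}],
    http_auth=("logging-api", "changeme"),
    use_ssl=True,
    verify_certs=False,
)

def fetch_recent_logs(component: str, minutes: int = 15, size: int = 100):
    """Fetch the latest log lines for one component, newest first."""
    body = {
        "size": size,
        "sort": [{"@timestamp": {"order": "desc"}}],
        "query": {
            "bool": {
                "filter": [
                    {"term": {"kubernetes.labels.component": component}},
                    {"range": {"@timestamp": {"gte": f"now-{minutes}m"}}},
                ]
            }
        },
    }
    hits = client.search(index="container-logs-*", body=body)["hits"]["hits"]
    return [hit["_source"] for hit in hits]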
Okay.
For metrics, we adopted Prometheus, which is the industry standard for metrics.
To manage Prometheus, we use the Prometheus operator, because we wanted to
roll out updates to Prometheus and the surrounding infrastructure smoothly.
And for long-term data retention and high reliability, we use Thanos.
If you're familiar with Prometheus, it works in a pull model rather than
push, and as pull targets, we had three.
The first one was kube-state-metrics, which gathers Kubernetes context and
exports it, for example pod labels, pod status, resource allocations, requests,
and limits. Those get into Prometheus through kube-state-metrics.
We had cAdvisor running on all kubelets.
cAdvisor is part of the standard kubelet distribution, and it can
be used to scrape the CPU, memory, and other metrics used by each individual container.
We also used a tool called Hubble, which is a subproject of Cilium,
where we can extract HTTP-level,
basically layer 7 and layer 3/4, metrics without
any user instrumentation.
For example, through Hubble, we were able to extract all the HTTP status
codes and latencies and save them in Prometheus, so we can show how the services
are behaving if it's an HTTP service.
All of this information is stored in Prometheus.
We have replicas of Prometheus, backed by a Thanos Query,
which aggregates this data and serves it to our metrics API, which in turn
feeds the data to our dashboards.
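As a rough illustration of that last hop, the metrics API can treat Thanos Query as a drop-in Prometheus, since it exposes the same HTTP query API. The URL is a placeholder; the cAdvisor metric name in the example query is the standard one.

import requests

THANOS_QUERY_URL = "http://thanos-query:9090"  # placeholder in-cluster address

def instant_query(promql: str) -> list:
    """Run an instant query against the Prometheus-compatible /api/v1/query endpoint."""
    resp = requests.get(
        f"{THANOS_QUERY_URL}/api/v1/query", params={"query": promql}, timeout=10
    )
    resp.raise_for_status()
    payload = resp.json()
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]

# Example: per-pod CPU usage over the last five minutes, from cAdvisor data.
for series in instant_query(
    'sum by (pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))'
):
    print(series["metric"].get("pod"), series["value"][1])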
That was the high-level architecture of the logs and metrics
platform, the observability platform.
Next, we will do a deep dive into each of these areas, and I will
start with the log side of things.
Now, our journey with OpenSearch and Fluent Bit actually started
because of a problem that we faced when we were using Log Analytics.
As I mentioned earlier, when our customers come and deploy their
applications, they end up as containers within our Kubernetes cluster.
The Azure agent that was collecting these logs would publish them
to Log Analytics, and we were querying Log Analytics to fetch those logs.
However, there was a log latency problem: the time between a log
entry being generated by the container and the time it appeared within
Log Analytics was a little too long.
As per the Azure documentation, this latency could be anywhere between 20
seconds and three minutes, and we were experiencing a latency around that number.
This affected the developer experience in our internal developer platform,
because once customers deploy their applications, our observability view was
the place where they could look at the logs.
So if a log line is generated and they have to wait a minute or two, that
hindered the developer experience a bit.
So we wanted to try and improve this by reducing the log latency.
During some internal discussions,
we decided on a goal:
we wanted to bring this latency down to five seconds or less.
That is because we wanted to give a tail -f-like
experience when it comes to logs.
Someone comes and deploys their application, and once they go into the
runtime logs, they would see logs appearing as soon as they're generated.
So we wanted to provide that experience and we were looking for tools to do that.
We evaluated so many tools, including some external observability platforms
as well.
Finally, we decided to go with Fluent Bit and OpenSearch.
The initial deployment of Fluent Bit and OpenSearch showed a log
latency of about two to three seconds, so that was very good for us.
We were able to achieve our goal through that, so we decided to go ahead with that.
Now, this deployment was a bit different from the current deployment,
because back then we used OpenSearch as a cache.
Whenever logs were generated, we pumped them into OpenSearch as well as Log
Analytics, and we used two different APIs to fetch data and show it in the UI.
We had a live logs API, which fetched data from OpenSearch,
and a historical logs API, which fetched data from Log Analytics.
OpenSearch retained data for about an hour to an hour and a half,
and Log Analytics retained it for a longer period than that.
Since OpenSearch had a lower latency, we were able to show the logs quickly
by fetching that data from the live logs API. At the same time, for users
who wanted to look at historical data,
logs that were about a day or seven days old, when they set that time range
in the runtime logs view, the platform would switch to the historical logs
API and fetch them from Log Analytics.
This initial deployment worked well and we got a good learning
experience with Fluent Bit and OpenSearch, and we were actually impressed
by their performance and ease of use.
So we decided that if we were going to build a new platform, we could
certainly expand this, that is, Fluent Bit and OpenSearch, to store all the logs.
We created the log stack based on Fluent Bit and OpenSearch, and
we added some other tools as well.
Fluent Bit was used to collect all the container logs.
As I mentioned earlier, it ran as DaemonSets,
collected all the logs, and published them into OpenSearch.
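For readers who want a picture of what that shipping step looks like, here is a sketch of a Fluent Bit OpenSearch output stanza, shown as a Python string only to keep all the examples in one language. The host, index name, and credential variables are placeholders, not our production values; the match pattern and TLS settings would need to be tuned per deployment.

# Sketch of the Fluent Bit output that ships container logs to OpenSearch.
FLUENT_BIT_OPENSEARCH_OUTPUT = """
[OUTPUT]
    Name                opensearch
    Match               kube.*
    Host                opensearch-cluster-master
    Port                9200
    Index               container-logs
    HTTP_User           ${OPENSEARCH_USER}
    HTTP_Passwd         ${OPENSEARCH_PASSWORD}
    tls                 On
    Suppress_Type_Name  On
"""

if __name__ == "__main__":
    # Write the stanza out so it can be dropped into a fluent-bit.conf include.
    with open("opensearch-output.conf", "w") as f:
        f.write(FLUENT_BIT_OPENSEARCH_OUTPUT)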
OpenSearch was the log storage location, and all the querying
and ingestion happened there.
When we built this updated observability platform, we deployed OpenSearch
as a distributed deployment.
Instead of just running it as an all-in-one deployment, we separated
out the master nodes and the data nodes.
We also used the OpenSearch operator to manage it, and I will
explain why in a little while.
Then we used OpenSearch Dashboards for our internal users.
So let's say there is some sort of an incident or a bug that is reported.
Then what usually happens is our SRE team needs to access the logs of those
services to understand what is happening.
With Log Analytics and Azure, we had the Azure portal to do that.
But with OpenSearch, we went ahead with OpenSearch Dashboards, so our SRE
team could log into the dashboard and access these logs.
Finally, we kept using the logging API, our internal service, to talk
to OpenSearch, fetch the logs, and return them to the Choreo console.
Now, when it came to the deployment, we decided to go
with a Helm-based approach.
What we did was base our Helm chart
on the upstream Helm chart and customize it the way we wanted.
With the OpenSearch operator-based deployment, we had objects like OpenSearch
index templates, OpenSearch Index State Management policies, and so on.
We combined all of this into a Helm chart with the configurations we
wanted, and we wrote it to support different cases where we needed
different customizations as well.
And the reason why we went with the distributed deployment of OpenSearch
was that it is better suited to handle production traffic when we separate out
the master nodes and the data nodes.
All the cluster management activities are done by the master nodes, and
things like log ingestion and log querying are handled by the data nodes.
So if there is a burst of logs, or a high amount of query traffic,
all of that goes to the data nodes and nothing happens to the master nodes.
They can continue managing the cluster.
So by separating this, we were able to achieve better availability and not
affect different functionalities because of something like a spike in logs.
Also, one of the main reasons why we went with the operator
was that it made upgrades much easier.
When it comes to OpenSearch, if you want to do a version upgrade,
there is a particular order in which you need to upgrade the nodes.
If we wanted to do that across multiple Kubernetes clusters, that would require
a lot of manual intervention, because with Choreo we have this
thing called private data planes, where we set up private Choreo data
planes for customers, and those are dedicated to that customer.
So when we get a lot of customers, that means we will have to manage
upgrades in each of these data planes.
So doing things manually was error prone, and then it would
take a lot of time as well.
But with the operator, it made the upgrade process much easier.
We could specify the new version and the operator would handle
the restart order and everything.
Furthermore, when it comes to managing the cluster in day-to-day running
scenarios as well, the operator takes care of things like rollout restarts,
other index management, and so on.
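As a hedged sketch of what "specify the new version and let the operator do the rest" amounts to, here is the kind of patch you could apply to the operator's OpenSearchCluster custom resource with the Kubernetes Python client. The group, plural, and field path follow the upstream opensearch-k8s-operator CRD as I understand it, and the namespace, resource name, and version are placeholders; in practice we change this value in our Helm chart rather than patching by hand.

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
api = client.CustomObjectsApi()

# Bump the OpenSearch version on the cluster CR; the operator then performs
# the node-by-node rolling upgrade in the required order.
api.patch_namespaced_custom_object(
    group="opensearch.opster.io",
    version="v1",
    namespace="observability",
    plural="opensearchclusters",
    name="logs",
    body={"spec": {"general": {"version": "2.11.1"}}},
)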
So we went with the operator, and to ensure that there are backups, we
implemented backups at several levels.
The first layer is backups within the OpenSearch cluster.
To do that, we went with primary shards and replica shards in OpenSearch.
What happens is, when a log entry is ingested into OpenSearch, there is
one primary copy and one or more replica copies. OpenSearch then spreads
all these shards across different OpenSearch nodes, and these nodes are
spread across different failure domains,
something like availability zones.
As an example, let's say there is an OpenSearch setup where we have one
primary shard, two replica shards, and three OpenSearch data pods.
One OpenSearch pod would store the primary shard,
the other two pods would store the replica shards, and these
three pods would be spread across three different availability zones.
The advantage of that is, if one of the pods goes down, or even if the PVC
that has the logs stored on it gets deleted,
once a new pod comes up it is able to replicate data
from the other OpenSearch pods. That happens within OpenSearch
internally, without any intervention.
And by having failure domains like availability zones, even if an
entire availability zone goes down,
the observability stack can continue working.
Log ingestion and querying can continue with the remaining pods, so that
was the first layer of backup.
Then we also enabled OpenSearch snapshots, because we could
use those as a last resort.
Let's say we have all these pods spread across different
availability zones, but due to some disaster we lose everything.
In that case, we are able to restore logs from the snapshot, and the snapshots
store logs in a different location from the OpenSearch cluster.
By running this platform, we were able to achieve a lot of cost savings as well.
By switching from Log Analytics to Fluent Bit and OpenSearch, we were able to
save about $1,800 per month
in Log Analytics ingestion and storage costs.
Then there was one particular customer on AWS who was able
to save about 2,500 USD per day on their AWS infrastructure costs.
There was an incident where the workloads deployed by this
customer started printing error logs,
and all of that was ingested into AWS CloudWatch Logs. When they started
querying the logs, that added querying cost as well, and that's why
their infrastructure bills skyrocketed.
By moving to a platform like OpenSearch, there is no extra querying cost.
There is some cost for storage, but the querying cost is not there.
That's why we were able to save so much by using a platform like this.
There were several lessons that we learned during this time.
When it comes to a platform where you have containers publishing different types
of logs, it's better to store the logs in the raw format in which they get generated.
If someone wants to show them in different formats, they
can use internal services to break them up into those formats.
Then, log throttling is something that is useful. When Fluent Bit starts
publishing logs to OpenSearch,
there could be cases where the OpenSearch cluster gets overwhelmed.
To prevent something like that from happening, we could use log
throttling, and we were able to do that by writing a custom Lua
script and adding it to Fluent Bit.
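Our actual filter is a Lua script running inside Fluent Bit, so the snippet below is only an illustration of the idea in Python: a simple per-source budget over a fixed window. The window size and limit are made-up numbers, not our real thresholds.

import time
from collections import defaultdict

WINDOW_SECONDS = 10           # hypothetical window
MAX_RECORDS_PER_WINDOW = 500  # hypothetical per-source budget

# tag -> [window_start_timestamp, records_seen_in_window]
_counters = defaultdict(lambda: [0.0, 0])

def should_keep(tag: str, now: float | None = None) -> bool:
    """Return False once a tag exceeds its budget for the current window."""
    now = time.time() if now is None else now
    window_start, count = _counters[tag]
    if now - window_start >= WINDOW_SECONDS:
        _counters[tag] = [now, 1]      # new window, keep the record
        return True
    if count < MAX_RECORDS_PER_WINDOW:
        _counters[tag][1] = count + 1  # within budget, keep the record
        return True
    return False                       # over budget, drop the record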
Health monitoring is also important in a platform like this.
To do health monitoring, we used an external tool to check the health
of OpenSearch, Fluent Bit, and our logging API, and health monitoring
has to be done at different levels.
You need to see whether all the pods are up and running, whether the CPU
and memory usage of all of them are fine,
whether the disk is reaching its capacity, and so on.
All of that needs to be monitored.
The storage medium that is used for OpenSearch plays
an important role as well.
There was one deployment that we did with EFS on AWS, and we noticed that it
didn't have the performance we expected. Thereafter, we switched to EBS, and
things started working properly.
So the storage medium plays an important role.
Also, when you have an observability stack like this, you need to look at
everything as a whole when you are debugging.
There could be cases where the Fluent Bit pods are not able to publish logs,
but the underlying cause could be something to do with the storage
medium used in OpenSearch.
So simply by looking at the metrics of Fluent Bit, you can't determine whether
you need to increase its resources because of a usage increase, or whether
there's an underlying problem elsewhere.
That's why you need to look at everything as a whole.
Then, when it comes to Fluent Bit configurations, there are so
many of them, and if you don't configure them properly,
log ingestion can be affected.
Some time back we were using the inotify watcher in the Fluent Bit tail
plugin, and there was a problem where it stopped collecting logs after some time.
By switching to the file stat watcher, we were able to fix that problem.
So you need to know the configurations that you're using in these stacks if
you want them to run smoothly.
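For reference, the change is a single option on the tail input; here it is sketched as a config string in Python, to keep one language across the examples. The path and tag are the usual container-log defaults rather than our exact values; setting Inotify_Watcher to Off makes the tail plugin fall back to its stat-based file watcher.

# Sketch of a tail input using the stat watcher instead of inotify.
FLUENT_BIT_TAIL_INPUT = """
[INPUT]
    Name              tail
    Tag               kube.*
    Path              /var/log/containers/*.log
    Inotify_Watcher   Off
    Refresh_Interval  10
"""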
What's on screen right now is a screenshot of the runtime logs view.
Here we are showing logs of the application,
that is, the workloads deployed by the customers, and the gateway.
Users are able to do different types of filtering on those logs
and query using different strings, and all of that is powered
by the observability stack we built using Fluent Bit and OpenSearch.
Okay, now let's switch to the metrics side of our observability.
For metrics, we had three high-level goals
to provide to our users.
One is HTTP metrics: a clear insight into request rates, error
rates, and latencies of their services.
The second is usage metrics:
we wanted to show our customers how many resources were allocated to them,
and how much of that allocated CPU and memory is actually used.
Another very exciting goal we had was to provide a service diagram.
This is an automatically generated visualization of their services and how
they interact with each other, without any user intervention.
A key enabler of these HTTP metrics, and especially the zero instrumentation
and the beautiful diagram I will show on a later slide, is Hubble, which
leverages Cilium and eBPF.
On the left of the diagram, you can see how a traditional service
mesh, something like Istio, would work.
Every request that comes to a particular application
goes through a proxy within the pod itself,
and then the request gets proxied to the backend.
During this proxying, we can extract telemetry like
error rates and so on. On the right, you can see the eBPF-powered
solution the Cilium team has made.
Here, when a request is
directed at a backend, it first goes to an Envoy proxy residing
on the node of the destination pod.
Within this Envoy instance, they extract the telemetry
data, and then, again with eBPF, they route the request to the correct backend.
This gets rid of a lot of the overhead that would be required to run a
service mesh at this scale to provide zero instrumentation.
Through this, we are able to extract RED metrics like
error rates and HTTP durations for HTTP, and for other layer
3/4 protocols, we directly use the data extracted through eBPF.
So that's the HTTP part. Talking about CPU, memory,
file system, and network usage:
for that, we use a solution called cAdvisor, which is a standard component
integrated into the kubelet on each node.
The cAdvisor subprocess talks to the container runtime running within
the node and builds up a cache of the containers running within it.
Then it interacts with the cgroups file system of each particular container,
and from those cgroups it reads the values for CPU, memory, and the other
metrics that need to be extracted.
After reading those values, cAdvisor exposes a metrics endpoint, which
the Prometheus instance can scrape.
With this, we get detailed resource usage for
every container out of the box.
To enrich our metrics with Kubernetes context, particularly labels and annotations,
we use another solution called kube-state-metrics, which establishes a connection
with the Kubernetes API and provides a real-time, up-to-date snapshot of the cluster state.
This also exposes an HTTP metrics endpoint, which Prometheus can come and read.
This data gets into Prometheus, and we correlate it with other information:
for example, from cAdvisor we can see the pod name, and we correlate that
with the labels extracted through kube-state-metrics, so users can query the CPU usage by a label.
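That correlation is a standard PromQL join; a sketch is below. It assumes kube-state-metrics is configured to expose pod labels (via its allow-list) and that the pods carry an app label, which kube-state-metrics surfaces as label_app.

# CPU usage aggregated by the user's own `app` pod label: cAdvisor usage
# joined onto kube-state-metrics' kube_pod_labels on (namespace, pod).
CPU_BY_APP_LABEL = (
    "sum by (label_app) ("
    '  rate(container_cpu_usage_seconds_total{container!=""}[5m])'
    "  * on (namespace, pod) group_left(label_app) kube_pod_labels"
    ")"
)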
For a production-grade metrics system, reliability
and scalability, as well as long-term data retention, are key
requirements. For this, we use Thanos.
Thanos is able to integrate with lots of existing blob storage solutions like
S3, Google Cloud Storage, and Azure Blob Storage.
Thanos also provides a load balancing capability for
Prometheus to run in HA mode.
If you look at this diagram, this is a high-level architecture
view of how Thanos runs.
I won't go through every single thing, but I just wanted to
highlight this diagram.
At the bottom of the screen, you can see the
Prometheus server as well as the Thanos sidecar.
These two run in a single pod, and the Prometheus server talks to all
the scrape endpoints, extracts the data, and writes it to disk.
For the files Prometheus creates, the sidecar has a subcomponent
called the shipper, which reads these files and ships them to a blob storage
we configure.
So this provides the long-term retention.
And for load balancing,
Thanos also exposes an API called the Store API, which is exposed
by all the Thanos components, including the sidecar.
When we deploy the Thanos Query component, as you can see
at the top of the diagram, it also has a Store API client, which connects
to the Store API running within each sidecar
and is able to extract data.
So when we run a single query on Thanos Query, it distributes that query
across however many instances are running, aggregates the
results, and gives us a single point to get the state of the cluster.
There are other things like Thanos Query Frontend for response caching;
we are not currently using those, but what we do use is the Store Gateway.
The Store Gateway allows us to query the data that was
stored within the object storage.
This is another component that gets plugged into Thanos Query
via the Store API, and when we query for data that is not available
within the running Prometheus instances, the query automatically goes to the
Store Gateway, which reads the data off the blob storage and gives us the result.
So this gives us a clean way to query historical metrics, and
all of this complexity translates into these beautiful dashboard views.
Here you can see the requests per minute, total successes, total failures, as
well as latencies and other key metrics.
For example, you can see the memory usage
and CPU usage as well.
Not just the usage:
we can see the limits and the requests too, and all of this data is
provided by Thanos and Prometheus in near real time.
So users can easily understand when there's a spike happening, if a
service is slowing down, and what's going on.
Another key feature here is that when you see a spike and click on
that point in the UI, we call the
logging API to gather the related logs.
This provides an efficient and scalable way for users to debug in production.
Okay.
This is the diagram I was talking about, the service diagram.
It is a dynamically generated UI which shows how each of your
components interacts with the others.
From here, you can see
the error rates and latencies
that occur between two components.
For example,
if one service is failing, we can easily use this diagram to identify
the root cause of the problem.
And this is all powered by the zero-instrumentation metrics approach.
Okay.
There were a lot of things we learned the hard way while developing this system.
One of the key issues we faced as soon as we started developing
it was that request counts derived from the request rate aren't exact.
If you send 15 requests, the diagram may show 13 or 13.8 or something.
Our older system, which used OpenTelemetry and event-based
metrics, was showing the exact count,
so we were wondering what the reason for this was.
We came across some Prometheus documentation saying that,
with Prometheus, you only see summarized and aggregated data.
So if you want exact counts and events, especially for, let's say,
billing purposes, you can't use metrics.
For that, you need to use logs.
Once we understood that, everything clicked.
Another thing we learned while developing was about
using an operator:
if there is an operator, use the operator, because
Kubernetes operators reduce the complexity of managing StatefulSets
and the underlying infrastructure, and make it very easy.
Otherwise, when we are doing upgrades, it's a multi-day planning exercise.
Right now, what we do is just bump the Helm chart and the
operator takes care of the rest.
Another issue we actually faced during our initial production
deployment was that the /metrics endpoint can screw up the whole thing.
The reason for this was, as I mentioned, we use Hubble to collect per-pod
metrics data, and there was a bug in Hubble's metrics endpoint where sometimes
the pod information is not cleared from the metrics.
That means if the pod gets deleted, that metric
information lives on forever.
This is particularly challenging when we have a cluster that's
running a lot of CronJobs, which create tons of pods each minute.
It turns out this was an issue with Cilium's pod management cycle, and
they had fixed it in a later version.
Once we bumped to that newer version, everything was resolved.
Another issue we ran into in the early days of developing this platform was
that we often saw file corruptions.
When you're querying Prometheus, if a single block is corrupted and you're
querying a longer time range, the whole query fails, saying the file is unreadable.
While browsing the documentation again, we noticed that Prometheus
explicitly mentions not to use NFS,
and we were using EFS, AWS's NFS solution.
Once we found that out, we moved to block storage
on all the platforms, and we haven't had any issues since then.
Another issue we are currently facing is sums of rates versus rates of sums.
Prometheus has a lot of aggregation functions, especially rate and sum,
and the order of operations matters significantly.
We had a recording rule that aggregated data with sum and put
that sum in our long-term storage,
and when we took the rate of that particular time
series, we noticed some errors.
Only while investigating did we notice this particular issue.
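In PromQL terms, the lesson is that rate() has to be applied to the raw counters before any sum; if a recording rule stores the sum first, taking the rate of that series is skewed whenever a counter resets or pods churn. The recording-rule name below is purely illustrative.

# Correct: rate each counter series first, then aggregate.
SUM_OF_RATES = "sum by (service) (rate(hubble_http_requests_total[5m]))"

# Misleading: taking the rate of a pre-summed recording rule; counter resets
# and pod churn inside the sum distort the result.
RATE_OF_SUMS = "rate(service:http_requests:sum[5m])"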
So let's revisit the cost, because we did a lot of things, right?
Previously, we spent close to $1,200 on ADX and a thousand dollars on Log
Analytics per month, per data plane,
to retain this data and serve it to our customers and to ourselves as well.
But with our newer open source stack, we only pay for compute and storage.
Since we manage everything, for compute we spend around 250 USD,
and for storage it's around a hundred USD.
This roughly translates to around 70% overall cost savings,
and it also gives us more flexibility, more
control, and better performance.
So that brings us to the end of our deep dive sections.
And next we'll go through some summary points about what we
learned when building this platform.
When it comes to cloud providers, providers like Azure and
AWS have very good support when it comes to observability.
They have powerful query languages and easy-to-integrate
solutions when you want to get observability data from something
like a Kubernetes cluster into Azure.
All of this works really well, but the thing is, when you want to
work with multiple cloud providers,
you will have to go through some overhead to do that.
That is because different cloud providers have different query
languages, different APIs, and different deployment mechanisms.
So if you want to use the cloud providers' native solutions and support multiple
vendors, you will have to add support for each and every cloud provider.
But if you use a common platform, where you use a common set of services everywhere,
it becomes very easy to enable observability for
different cloud providers.
The other thing is, when you have a stack that you have full
control over, it is very easy to add additional features.
We recently introduced an alerting feature within Choreo, and
we built that alerting feature using the stack we created.
It is powered by Prometheus, Fluent Bit, and OpenSearch.
And debugging also becomes easier when you have the same setup everywhere.
Let's say you have different cloud providers and different
APIs to connect to each of them.
If there is a problem, you need to figure out where it went wrong;
you need to understand that API, the way it's deployed, and so on.
But if it's the same setup, debugging is similar everywhere,
so debugging becomes easier.
And cost is also reduced when you use a platform like this, an
open source platform that you built yourself.
And that brings us to the end of today's session on how we built this
observability platform using some popular tools.
Thank you all for joining us for this session, and we hope you were
able to learn something from it.
Thank you.
Thank you all.