Conf42 Observability 2025 - Online

- premiere 5PM GMT

Faster, Cheaper, Smarter: Lessons from Building Zero-Instrumentation Observability with Fluent Bit, OpenSearch & Prometheus


Abstract

We rebuilt our platform’s observability using Fluent Bit, OpenSearch & Prometheus, improving user experience with <5s log latency and ~70% cost savings. This talk explores our migration, achieving zero-instrumentation telemetry via eBPF and lessons building a faster, cheaper OSS stack.

Summary

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone. Welcome to our Conf42 Observability session, where we will talk about how we built an observability platform using Fluent Bit, OpenSearch and Prometheus. My name is Nilushan Costa, and I work as an Associate Technical Lead at WSO2 within the Choreo observability team. And I'm Isala, and I'm a Senior Software Engineer within Choreo observability.

Before we go into the observability platform, let me first explain why we needed it. Every company nowadays is a software company. To serve their customers, they need some sort of software, and it all starts with a developer writing some code and committing it to a code repository. That code needs to come out as an end product for their customers. It could be something like an API, a web application or a mobile app. There are several different things the code must go through for that end result to happen: security scans, CI/CD processes and so on. To do all of that, a platform is required: an internal developer platform. Building an internal developer platform is certainly possible for every company; the problem is that it takes a lot of time and expertise. We at WSO2 have already built a platform called Choreo, and we provide it as a service.

Now, an internal developer platform needs several different parts, and observability is one of them. That is because when you run some sort of program, you need to understand how it is performing. You need to know whether there are any problems within it, whether there are any errors, and so on. To do that, you need observability. Both of us are in the Choreo observability team, and we built this observability platform for Choreo. This talk is about how we created it.

Let's talk a little bit about the history of the observability side of things. Initially, WSO2 was heavily invested in Azure, and the Choreo platform we were building was heavily centered around Azure as well. For the observability part, we also used Azure-native solutions like Log Analytics, ADX (Azure Data Explorer) and Event Hub. From this simplified architecture diagram, you can see how our initial setup looked. We had a few services running on our nodes, and each node had the Azure OMS agent. This agent extracted the logs and telemetry from the services and wrote them to Azure Log Analytics. For tracing and metrics, we deployed an OpenTelemetry Collector, to which all the services published data, and it in turn published that data to ADX. We wrote a simple API which queried both Log Analytics and ADX for logs and metrics, and this data was then surfaced on our dashboards, where we show the logging and metrics views.

However, there were some challenges after we kept running this for some time. The first was that observability doesn't come cheap. Publishing logs and metrics into a cloud platform and querying them costs a lot, so we wanted to try and reduce this cost as much as possible. Furthermore, although we started with Azure, as time went on we got requirements from customers to support other cloud providers. Some wanted to have their Choreo data planes, that is, a part of the Choreo platform, on AWS, and there were requirements to run it on Vultr. We also wanted to support on-premise deployments, so at the end of the day we had to support multiple different cloud platforms. Also, we wanted to have more control over this observability stack, because if we have full control over it, we can build more features on top of it. So these were the challenges we wanted to address, and to do that we decided to create a new platform using some open source tools.

With those challenges, we defined our next-generation observability stack with two main perspectives in mind. For Choreo users, our primary customers, we wanted to provide speed, meaning drastically reduced log latency, and smarter tools, such as UIs where you can see how all their services are interacting. Another key decision was that this platform should be zero-instrumentation: the user doesn't have to do anything, they just deploy and we take care of the rest. On the platform side, one of our main concerns was cost, because observability is not cheap; we were racking up a lot of infrastructure cost and we wanted to see if we could reduce it. Another concern, as Nilushan said, was cloud agnosticism. Even though we initially developed the system to run on Azure, new prospects were asking whether we could run it on AWS or on-prem, so we wanted to provide the observability stack for those customers as well. And especially in the EU region, data sovereignty is a really important topic. For example, if a Choreo cluster is running in the EU, we can't export that data to our central Log Analytics solution running in the US, so we wanted to make sure we could scope these logs to the data plane itself and comply with that regulation. Another main goal was to have control over our logging stack, so we can build and ship features faster because we have full control.

So we decided to create this new platform, and we chose OpenSearch, Fluent Bit and Prometheus as the main tools. Fluent Bit and OpenSearch were used for the log side of things. In Choreo, when users deploy their applications, they get deployed into a Kubernetes cluster. We have Fluent Bit running as a DaemonSet, collecting logs from all those containers and publishing them to OpenSearch. OpenSearch is the log storage location, and we use the OpenSearch operator to manage it. Finally, we use the Logging API, our own internal API, to do the log fetching. When a user goes into the Choreo console (the UI) and makes a request to fetch logs, the request comes through some ingress controllers and API gateways, and then reaches the Logging API, which talks to OpenSearch, fetches the required logs, and sends them back to the Choreo console.
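To make that log pipeline a little more concrete, here is a minimal sketch of what such a Fluent Bit pipeline can look like, written in Fluent Bit's YAML configuration format (a classic .conf format is also accepted). The service name, index and TLS settings are illustrative placeholders, not our actual configuration.

```yaml
# Minimal sketch of a Fluent Bit DaemonSet pipeline (YAML config format).
# Host, index and TLS settings below are illustrative placeholders.
service:
  flush: 1                               # flush buffered records every second to keep latency low
pipeline:
  inputs:
    - name: tail
      path: /var/log/containers/*.log    # container logs on each node
      tag: kube.*
      multiline.parser: docker, cri
  filters:
    - name: kubernetes                   # enrich records with pod metadata
      match: kube.*
  outputs:
    - name: opensearch
      match: kube.*
      host: opensearch-data.observability.svc   # hypothetical OpenSearch service name
      port: 9200
      index: container-logs
      tls: on
```

In practice the output section would also carry authentication, buffering and retry settings, which are omitted here.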
For metrics, we adopted Prometheus, which is the industry standard, and to manage Prometheus we use the Prometheus Operator, because we wanted to roll out updates to Prometheus and the surrounding infrastructure smoothly. For long-term data retention and high reliability, we use Thanos. If you're familiar with Prometheus, it works in a pull model rather than a push model, and as pull targets we had three. The first was kube-state-metrics, which gathers the Kubernetes context and exposes it, for example pod labels, pod statuses, and resource allocations (requests and limits). Those get into Prometheus through kube-state-metrics. We also had cAdvisor running on all kubelets; cAdvisor is part of the standard kubelet distribution and can be used to scrape the CPU, memory and other metrics used by each individual container. And we used a tool called Hubble, which is a sub-project of Cilium, through which we can extract HTTP-level metrics, basically layer 7 and layer 3/4 metrics, without any user instrumentation. For example, through Hubble we were able to extract all the HTTP status codes and latencies and save them in Prometheus, so we can show how a service is behaving if it is an HTTP service.

All of this information is stored in Prometheus, and we run replicas of Prometheus backed by Thanos Query, which aggregates this data and serves it to our Metrics API, which in turn feeds the data to our dashboards.
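As a rough illustration of that setup, the prometheus-operator's Prometheus custom resource can run replicated Prometheus servers with a Thanos sidecar attached. The sketch below assumes that operator is installed; the names and the object storage Secret are placeholders rather than our actual manifests.

```yaml
# Sketch of a replicated Prometheus with a Thanos sidecar, managed by
# the prometheus-operator. Names and the Secret are illustrative only.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: observability
  namespace: monitoring
spec:
  replicas: 2                      # HA pair; Thanos Query deduplicates across them
  retention: 24h                   # short local retention; history lives in object storage
  serviceMonitorSelector: {}       # scrape every ServiceMonitor in scope
  thanos:
    objectStorageConfig:           # points at a Secret holding the objstore config
      name: thanos-objstore        # hypothetical Secret name
      key: objstore.yml
```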
That was the high-level architecture of the logs and metrics sides of the observability platform. Next, we will do a deep dive into each of these areas, and I will start with the log side of things.

Our journey with OpenSearch and Fluent Bit actually started because of a problem we faced when we were using Log Analytics. As I mentioned earlier, when our customers deploy their applications, they end up as containers in our Kubernetes clusters. The Azure agent collecting those logs published them to Log Analytics, and we queried Log Analytics to fetch them. However, there was a log latency problem: the time between a log entry being generated by the container and that entry appearing in Log Analytics was a little too long. As per the Azure documentation, this latency could be anywhere between 20 seconds and three minutes, and we were experiencing a latency around that number. This affected the developer experience in our internal developer platform, because once customers deploy their applications there is a place in our observability views where they can look at the logs, and if a log line is generated and they have to wait a minute or two to see it, that hinders the developer experience.

So we wanted to improve this by reducing the log latency. During some internal discussions, we decided on a goal: bring this latency down to five seconds or less. That is because we wanted to give a tail -f like experience when it comes to logs: someone deploys their application, goes into the runtime logs, and sees logs appearing as soon as they are generated. We were looking for tools to do that. We evaluated many tools, including some external observability platforms, and finally we decided to go with Fluent Bit and OpenSearch. The initial deployment of Fluent Bit and OpenSearch showed a log latency of about two to three seconds, which was very good for us; we were able to achieve our goal with it, so we decided to go ahead.

Now, this deployment was a bit different from our current one, because back then we used OpenSearch as a cache. Whenever logs were generated, we pumped them into OpenSearch as well as Log Analytics, and we used two different APIs to fetch data and show it in the UI: a Live Logs API, which fetched data from OpenSearch, and a Historical Logs API, which fetched data from Log Analytics. OpenSearch retained data for about an hour to an hour and a half, and Log Analytics retained it for longer than that. Since OpenSearch had lower latency, we were able to show the logs quickly through the Live Logs API. At the same time, for users who wanted to look at historical data, logs that were a day or seven days old, when they set that time range in the runtime logs view, the platform would switch to the Historical Logs API and fetch the logs from Log Analytics.
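That short retention window in OpenSearch can be expressed as an Index State Management policy that deletes indices once they pass a certain age. The sketch below is a hedged example of such a policy, shown in YAML form for readability; the two-hour threshold and the index pattern are assumptions, not our exact values.

```yaml
# Sketch of an OpenSearch ISM policy that drops short-lived "cache" indices.
# The 2h age and the ism_template pattern are illustrative assumptions.
policy:
  description: "Delete hot log indices shortly after ingestion"
  default_state: hot
  states:
    - name: hot
      actions: []
      transitions:
        - state_name: delete
          conditions:
            min_index_age: 2h          # roughly the 1-1.5h cache window plus margin
    - name: delete
      actions:
        - delete: {}
      transitions: []
  ism_template:
    - index_patterns: ["container-logs-*"]   # hypothetical index pattern
```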
This initial deployment worked well, and we got a good learning experience with Fluent Bit and OpenSearch. We were actually impressed by the performance and ease of use, so we decided that if we were going to build a new platform, we could certainly expand this, that is, Fluent Bit and OpenSearch, to store all the logs. So we created the log stack based on Fluent Bit and OpenSearch and added some other tools. Fluent Bit was used to collect all the container logs; as mentioned earlier, it ran as DaemonSets, collected the logs and published them into OpenSearch. OpenSearch was the log storage location, and all the querying happened there.

When we built this updated observability platform, we deployed OpenSearch as a distributed deployment. Instead of running it as an all-in-one deployment, we separated out the master nodes and the data nodes. We also used the OpenSearch operator to manage it, and I will explain why in a little while. Then we used OpenSearch Dashboards for our internal users. Let's say there is some sort of incident or a reported bug; what usually happens is that our SRE team needs to access the logs of those services to understand what is happening. With Log Analytics and Azure we had the Azure portal for that, but with OpenSearch we went with OpenSearch Dashboards, so our SRE team can log into OpenSearch Dashboards and access these logs. Finally, we kept using the Logging API, our internal service, to talk to OpenSearch, fetch the logs, and return them to the Choreo console.

When it came to the deployment, we decided to go with a Helm-based approach. We based our Helm chart on the upstream Helm chart and customized it the way we wanted. With the OpenSearch-operator-based deployment, we had objects like OpenSearch index templates, OpenSearch Index State Management policies, and so on. We combined all of this into a Helm chart with the configurations we wanted, and we wrote it to support different cases where we needed different customizations.

The reason we went with the distributed deployment of OpenSearch is that it is better suited to handling production traffic. When the master nodes and data nodes are separated, all the cluster management activities are done by the master nodes, while things like log ingestion and log querying are handled by the data nodes. So if there is a burst of logs, or a high amount of query traffic, all of that goes to the data nodes and nothing happens to the master nodes; they can continue managing the cluster. By separating these, we achieved better availability and avoided affecting different functionalities because of something like a spike in logs.
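With the OpenSearch Kubernetes operator, that master/data split is typically described as separate node pools in the OpenSearchCluster resource. The sketch below is an assumption-laden illustration, not our production manifest: pool sizes, disk sizes, the version and the role names may differ depending on the operator and OpenSearch versions in use.

```yaml
# Sketch of a distributed OpenSearch cluster for the opensearch-k8s-operator.
# Pool sizes, resources and the version are illustrative placeholders.
apiVersion: opensearch.opster.io/v1
kind: OpenSearchCluster
metadata:
  name: logs
  namespace: observability
spec:
  general:
    version: "2.11.0"
    serviceName: logs
  nodePools:
    - component: masters            # dedicated cluster-management nodes
      replicas: 3
      roles: ["cluster_manager"]    # "master" on older versions
      diskSize: "10Gi"
    - component: data               # ingestion and query nodes
      replicas: 3
      roles: ["data"]
      diskSize: "100Gi"
```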
Also, one of the main reasons we went with the operator was that it made upgrades much easier. When it comes to OpenSearch, if you want to do a version upgrade, there is a particular order in which you need to upgrade the nodes. Doing that across multiple Kubernetes clusters would require a lot of manual intervention, because in Choreo we have private data planes, where we set up private Choreo data planes dedicated to individual customers. As we get more customers, that means managing upgrades in each of these data planes. Doing things manually was error prone and time-consuming. With the operator, the upgrade process became much easier: we specify the new version, and the operator handles the restart order and everything else. Furthermore, for day-to-day cluster management, the operator takes care of things like rollout restarts, index management and so on. So we went with the operator.

To ensure there are backups, we implemented them at several levels. The first layer is backups within the OpenSearch cluster itself. For that, we went with primary shards and replica shards in OpenSearch. When a log entry is ingested into OpenSearch, there is one primary copy and one or more replica copies. OpenSearch then spreads these shards across different OpenSearch nodes, and those nodes are spread across different failure domains, such as availability zones. As an example, say there is an OpenSearch setup with one primary shard, two replica shards, and three OpenSearch data pods. One pod stores the primary shard, the other two pods store the replica shards, and these three pods are spread across three different availability zones. The advantage is that if one of the pods goes down, or even if the PVC storing the logs gets deleted, once a new pod comes up it can replicate data from the other OpenSearch pods; that happens inside OpenSearch without any intervention. And by having failure domains like availability zones, even if an entire availability zone goes down, the observability stack can keep working: log ingestion and querying continue with the remaining pods. That was the first layer of backup.
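That one-primary, two-replica layout comes down to index settings plus shard-allocation awareness. Below is a hedged sketch of the relevant settings, expressed in YAML for consistency with the other examples; the template name and index pattern are made up, and the awareness attribute assumes nodes are labelled with a zone attribute.

```yaml
# Sketch of the index settings behind the "1 primary + 2 replicas across
# 3 zones" example. Template name and pattern are illustrative.
index_template:
  name: container-logs
  index_patterns: ["container-logs-*"]
  template:
    settings:
      index.number_of_shards: 1        # one primary shard per index
      index.number_of_replicas: 2      # two replica copies of that shard

# Cluster-level setting so primaries and replicas land in different zones
# (requires each node to carry a "zone" node attribute).
cluster_settings:
  persistent:
    cluster.routing.allocation.awareness.attributes: zone
```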
Then we also enabled OpenSearch snapshots, because we could use those as a last resort. Say we have all these pods spread across different availability zones, but due to some disaster we lose everything. In that case, we can restore logs from a snapshot, and the snapshots store logs in a different location than the OpenSearch cluster.

By running this platform, we achieved a lot of cost savings as well. By switching from Log Analytics to Fluent Bit and OpenSearch, we saved about $1,800 per month in Log Analytics ingestion and storage costs. There was also one particular customer on AWS who was able to save about 2,500 USD per day on their AWS infrastructure costs. There was an incident where the workloads deployed by this customer started printing a lot of error logs, all of which were ingested into AWS CloudWatch Logs, and when they started querying those logs the querying cost added up as well, which is why their infrastructure bill skyrocketed. By moving to a platform like OpenSearch, there is no extra querying cost; there is some cost for storage, but not for querying. That's why we were able to save so much with a platform like this.

There were several lessons we learned during this time. When you have a platform where containers publish different types of logs, it's better to store the logs in the raw format in which they are generated; if someone wants to show them differently, internal services can break them up into other formats. Log throttling is also useful: when Fluent Bit starts publishing logs to OpenSearch, there can be cases where the OpenSearch cluster gets overwhelmed. To prevent that, we use log throttling, which we implemented by writing a custom Lua script and adding it to Fluent Bit.

Health monitoring is also important in a platform like this. We used an external tool to check the health of OpenSearch, Fluent Bit and our Logging API, and health monitoring has to be done at different levels: whether all the pods are up and running, whether CPU and memory usage is all right, whether the disk is reaching its capacity, and so on. All of that needs to be monitored.

The storage medium used for OpenSearch plays an important role as well. There was one deployment we did with EFS on AWS, and we noticed it didn't have the performance we expected. We then switched to EBS, and things started working properly. So the storage medium matters.

Also, when you have an observability stack like this, you need to look at everything as a whole when debugging. There can be cases where the Fluent Bit pods cannot publish logs, but the underlying cause is something to do with the storage medium used by OpenSearch. Simply by looking at Fluent Bit's metrics, you can't determine whether you need to increase its resources because of a genuine usage increase or whether there is an underlying problem elsewhere. That's why you need to look at everything as a whole.

And when it comes to Fluent Bit, there are many configuration options, and if you don't configure them properly, log ingestion can be affected. Some time back we were using the inotify watcher in the Fluent Bit tail plugin, and there was a problem where it stopped collecting logs after some time. By switching to the file stat watcher, we fixed that problem. So you need to know the configurations you are using in these stacks if you want them to run smoothly.
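Both of those lessons, the throttling hook and the watcher change, live in the Fluent Bit configuration. Here is a hedged sketch of what that can look like in Fluent Bit's YAML config format; the throttle.lua script path and its throttle function are hypothetical stand-ins for the custom script mentioned above, and the OpenSearch host is a placeholder.

```yaml
# Sketch only: tail input using the file-stat watcher instead of inotify,
# plus a Lua filter hook for custom throttling. throttle.lua is hypothetical.
pipeline:
  inputs:
    - name: tail
      path: /var/log/containers/*.log
      tag: kube.*
      inotify_watcher: false      # fall back to the stat-based file watcher
  filters:
    - name: lua
      match: kube.*
      script: /fluent-bit/scripts/throttle.lua   # custom throttling logic (not shown)
      call: throttle                              # function name inside the script
  outputs:
    - name: opensearch
      match: kube.*
      host: opensearch-data.observability.svc    # placeholder
      port: 9200
```

Fluent Bit also ships a built-in throttle filter, which is another option for rate limiting if a custom script is not needed.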
What is on screen right now is a screenshot of the runtime logs view. Here we show logs of the application, that is, the workloads deployed by the customers, and of the gateway, and users can do different types of filtering on those logs, including querying by different strings. All of that is served by the observability stack we built with Fluent Bit and OpenSearch.

Okay, now let's switch to the metrics side of our observability. For metrics, we had three high-level goals for our users. One was to provide HTTP metrics: clear insight into the request rates, error rates and latencies of their services. The second was usage metrics: we wanted to show customers how many resources were allocated to them and how much of that allocated CPU and memory they are actually using. And another very exciting goal was to provide a service diagram: an automatically generated visualization of their services and how they interact with each other, without the user doing anything.

A key enabler of these HTTP metrics, and especially of the zero instrumentation and the diagram I will show on a later slide, is Hubble, a sub-project of Cilium that leverages Cilium and eBPF. On the left of the diagram, you can see how a traditional service mesh, something like Istio, works: every request that reaches a particular application goes through a proxy within the pod itself, and during that proxying telemetry like error rates can be extracted. On the right, you can see the eBPF-powered solution the Cilium team has built. Here, when a request is directed at a backend, it first goes to an Envoy proxy residing on the node of the destination pod. Within that Envoy instance the telemetry data is extracted, and then, again with eBPF, the request is routed to the correct backend. This removes much of the overhead that would be required to run a service mesh at this scale, and it gives us zero instrumentation. Through Hubble, we can extract RED metrics such as error rates and HTTP durations for HTTP traffic, and for other layer 3/4 protocols we directly use the data extracted through eBPF.
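If Cilium is deployed with Helm, this kind of L7 visibility is typically switched on through Hubble's metrics options. The values below are a hedged sketch based on the Cilium Helm chart's Hubble settings; the exact metric names, label options and chart version may differ from what we run.

```yaml
# Sketch of Cilium Helm values enabling Hubble metrics for Prometheus.
# Metric list and label context are illustrative; check the chart for exact options.
hubble:
  enabled: true
  relay:
    enabled: true
  metrics:
    enableOpenMetrics: true
    enabled:
      - dns
      - drop
      - tcp
      - flow
      - "httpV2:exemplars=true;labelsContext=source_namespace,destination_namespace"
```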
So that was the HTTP part. Talking about CPU and memory usage, as well as file system and network usage: for that we use cAdvisor, a standard component integrated into the kubelet on each node. This cAdvisor subprocess talks to the container runtime running on the node and builds up a cache of the containers running within it. It then interacts with the cgroups file system of each container, and from those cgroups it reads the values for CPU, memory and the other metrics that need to be extracted. After reading those values, cAdvisor exposes a metrics endpoint which the Prometheus instance can scrape. With this, we get detailed resource usage for every container out of the box.

To enrich our metrics with Kubernetes context, particularly labels and annotations, we use another solution called kube-state-metrics, which connects to the Kubernetes API and provides a real-time, up-to-date snapshot of the cluster state. This also exposes an HTTP metrics endpoint which Prometheus scrapes. The data gets into Prometheus and we correlate it with other information; for example, from cAdvisor we can see the pod name, and we correlate that with the labels extracted through kube-state-metrics, so users can query their CPU usage by a label.

For a production-grade metrics system, reliability and scalability, as well as long-term data retention, are key requirements, and for this we use Thanos. Thanos integrates with many existing blob storage solutions like S3, Google Cloud Storage and Azure Blob Storage, and it also provides load balancing so Prometheus can run in a highly available mode. This diagram is a high-level architecture view of how Thanos runs. I won't go through every part, but I want to highlight a few things. At the bottom of the screen, you can see the Prometheus server together with the Thanos sidecar; these two run in a single pod. The Prometheus server scrapes all the endpoints, extracts the data and writes it to disk, and the sidecar has a subcomponent called the shipper, which reads those files and ships them to the blob storage we configure. This provides the long-term retention.

For load balancing, Thanos exposes an API called the Store API, which all the Thanos sidecar endpoints expose. When we deploy the Thanos Query component, shown at the top of the diagram, it speaks the Store API too, connects to the Store API running within the sidecars, and extracts data from them. So when we run a single query against Thanos Query, it distributes that query across however many Prometheus instances are running, aggregates the results, and gives us a single point from which to read the state of the cluster. There are other components, such as Thanos Query Frontend for response caching, which we are not currently using, but we do use the Store Gateway. The Store Gateway lets us query the data that was stored in the object storage. It is another component plugged into Thanos Query via the Store API, and when we query for data that is not available in the running Prometheus instances, the query automatically goes to the Store Gateway, which reads the data off the blob storage and returns the result. This gives us a clean way to query historical metrics.
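The glue between the sidecar, the Store Gateway and the bucket is a small object-storage config file. Here is a hedged sketch of what a Thanos objstore config can look like for an S3-compatible bucket; the bucket name, endpoint and credential handling are placeholders, and the Azure or GCS variants use a different set of keys.

```yaml
# Sketch of a Thanos object storage config (objstore.yml), mounted into the
# sidecar and the Store Gateway. Bucket, endpoint and region are placeholders.
type: S3
config:
  bucket: thanos-metrics-blocks        # hypothetical bucket name
  endpoint: s3.us-east-1.amazonaws.com
  region: us-east-1
  # Credentials are usually injected via IAM roles or environment variables
  # rather than hard-coded here.
```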
All of this complexity translates into these clean dashboard views. Here you can see requests per minute, total successes, total failures, as well as latencies and other key metrics. For example, you can see the memory usage and CPU usage, and not just the usage: we show the limits and the requests as well. All this data is provided by Thanos and Prometheus in near real time, so users can easily understand when a spike is happening or a service is slowing down and what is going on. Another key feature is that when you see a spike and click on that point in the UI, we call the Logging API to gather the related logs. This gives users an efficient, scalable way to debug in production.

Okay, so this is the diagram I was talking about: the service diagram. It is a dynamically generated UI which shows how each of your components interacts with the others, and from it you can see the error rates and latencies between two components. If one service is failing, we can easily use this diagram to identify the root cause of the problem. This is all powered by the zero-instrumentation metrics approach.

There were also a lot of things we learned the hard way while developing this system. One of the first issues we faced was that the request count derived from a request rate isn't exact. If you send 15 requests, the diagram may show 13.8 or something like that. Our older system, which used OpenTelemetry and event-based metrics, showed the exact count, so we were wondering what the reason was. We came across documentation from Prometheus explaining that with Prometheus you only get summarized, aggregated data. So if you want exact counts and events, especially for something like billing, you can't use metrics; you need to use logs. Once we understood that, everything clicked.

Another thing we learned: if there is an operator, use the operator. Kubernetes operators reduce the complexity of managing StatefulSets and the underlying infrastructure. Otherwise, an upgrade becomes a multi-day planning exercise; right now we just bump the Helm chart version and the operator takes care of the rest.

Another issue we faced during our initial production deployment was that a /metrics endpoint can mess up the whole thing. As I mentioned, we use Hubble to collect HTTP metrics data, and there was a bug in Hubble's metrics endpoint where sometimes the pod information was not cleared from the metrics. That means that if a pod gets deleted, its metric series lives on forever. This is particularly challenging in a cluster running a lot of cron jobs, which create tons of pods every minute. It turned out to be an issue with Cilium's pod management lifecycle, and they had fixed it in a later version; once we bumped to that newer version, everything was resolved.

Another issue we ran into in the early days of this platform was file corruption. When querying Prometheus, if even a single block is corrupted and you query a longer time range, the whole query fails saying the file is unreadable. While browsing the documentation again, we noticed that Prometheus explicitly says not to use NFS, and we were using EFS, AWS's NFS solution. Once we found that out, we moved to block storage on all platforms, and we haven't had any issues since.

An issue we are still careful about is sums of rates versus rates of sums. Prometheus has many aggregation functions, especially rate and sum, and the order of operations matters significantly. We had a recording rule that aggregated (summed) data and stored the sum in our long-term storage, and when we took the rate of that time series, we noticed some errors. Only while investigating did we spot this particular issue.
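A hedged sketch of the pattern that avoids this: record the sum of per-series rates, rather than recording a plain sum and taking rate() over it later. The rule name and the http_requests_total metric are illustrative, not our actual rule set.

```yaml
# Sketch of a Prometheus recording rule: take rate() per series, then sum.
# Taking rate() over a pre-summed series misbehaves when the underlying
# series churn, because counter resets can no longer be detected per series.
groups:
  - name: http-aggregations
    rules:
      - record: namespace:http_requests:rate5m     # hypothetical rule name
        expr: sum by (namespace) (rate(http_requests_total[5m]))
      # Avoid the inverse pattern:
      #   record: namespace:http_requests:sum
      #   expr:   sum by (namespace) (http_requests_total)
      # and then rate(namespace:http_requests:sum[5m]) at query time.
```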
So let's revisit the cost, because we changed a lot of things. Previously, we spent close to $1,200 per month on ADX and about $1,000 per month on Log Analytics, per data plane, to retain this data and serve it to our customers and to ourselves. With our newer open source stack, we only pay for compute and storage: since we manage everything ourselves, compute is around 250 USD and storage is around 100 USD. This roughly translates to around 70% overall cost savings, and it also gives us more flexibility, more control and better performance.

That brings us to the end of our deep-dive sections. Next, we'll go through some summary points about what we learned building this platform. Cloud providers like Azure and AWS have very good observability support: powerful query languages and easy-to-integrate solutions when you want to get observability data from something like a Kubernetes cluster into Azure. All of that works really well, but when you want to work with multiple cloud providers, you take on overhead, because different cloud providers have different query languages, different APIs and different deployment mechanisms. If you want to use the cloud providers' native solutions and support multiple vendors, you have to add support for each and every provider. But if you use a common platform, with a common set of services everywhere, it becomes very easy to enable observability across different cloud providers.

The other thing is that when you have a stack you fully control, it is very easy to add features. We recently introduced an alerting feature within Choreo, and we built it on the stack we created; it is powered by Prometheus, Fluent Bit and OpenSearch. Debugging also becomes easier when you have the same setup everywhere. If you have different cloud providers and different APIs to connect to each of them, and there is a problem, you need to figure out where it went wrong and understand that API and the way it's deployed. But if it's the same setup, debugging is similar everywhere, so it becomes easier. And cost is also reduced when you use an open source platform that you built yourself.

That brings us to the end of today's session on how we built this observability platform using some popular open source tools. Thank you all for joining us, and we hope you were able to learn something from it. Thank you. Thank you all.
...

Isala Piyarisi

Observability Engineer @ WSO2


Nilushan Costa

Associate Technical Lead @ WSO2



