Conf42 Observability 2025 - Online

- premiere 5PM GMT

Building a lightweight observability system for Lean Teams


Abstract

Have fewer resources to spend? Don’t worry. Discover innovative open-source tools and practical strategies for building lightweight telemetry pipelines to monitor and debug your distributed system.

Summary

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Are you a small team with a minimal budget, unable to spend on enterprise-grade tools, but at the same time you want to gain some really useful insights about your system? This talk is just for you. Distributed systems are very complex, and tracing issues in a distributed system is very difficult. At the same time, enterprise solutions, although very helpful, are very expensive, especially for small teams. So in this talk my objective is to show how to build a minimal observability stack on Kubernetes, keeping the costs low while still gaining useful insights about your system, and finally, how to avoid some obvious pitfalls that you can run into while designing these systems. So let's get started.

Before getting into the actual tools and how to set up the stack, I would like to discuss what I call the four pillars of observability. First is logging. Logging is the process of tracking specific events in your application, and it helps you answer questions like: what is happening in my system? Second, we have metrics. Metrics are numerical measurements of your system's performance over time. They help you quantify that performance, so they answer questions like: how is my system behaving over time? Third is tracing. Tracing is the process of looking at the flow of requests through your system. If you are someone aiming for P99 performance, or a really high-performance system in general, tracing will help you identify the specific bottlenecks in your application so that you can target that specific area, reduce the latency, and increase the overall performance. Finally, a dashboard is a visual representation of all the data you are aggregating across these pillars, shown in one place so that you can act on that information as quickly as possible. Dashboards help you answer questions like: how quickly can we act once we identify an issue?

For each of these pillars I have identified one of the most popular open-source tools available. Starting with logging, Fluentd is one of the most popular log aggregators in the open-source community. For metrics we have Prometheus, arguably one of the most used components in Kubernetes clusters for aggregating metrics data. Coming to tracing, we have a tool called Jaeger. There are other options like OpenTelemetry, but Jaeger is one of the easiest to set up, especially for smaller teams. And finally, for dashboards, Grafana is arguably the most used open-source tool for building dashboards on top of your observability stack.

Now we are going to look at how to set up every one of these components with the minimum number of steps, starting with logging. If you have a Kubernetes cluster, you can directly run a kubectl apply command that pulls the YAML manifest written by the Fluentd team itself; this makes sure your cluster is always running the latest version of the Fluentd components. If you are someone who uses Helm inside your cluster, you can also run a Helm command that pulls the chart directly from the Helm repository that Fluent maintains and installs it in your cluster.
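For reference, the two install paths described above look roughly like this; the manifest URL and chart names are assumptions based on the Fluent project's public repositories, so check them against the current docs:

    # Option 1: apply the DaemonSet manifest maintained by the Fluentd team
    kubectl apply -f https://raw.githubusercontent.com/fluent/fluentd-kubernetes-daemonset/master/fluentd-daemonset-elasticsearch-rbac.yaml

    # Option 2: install from the Helm charts that the Fluent project maintains
    helm repo add fluent https://fluent.github.io/helm-charts
    helm repo update
    helm install fluentd fluent/fluentd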
Once you have the component running, you can have your applications push their logs directly to this aggregator, which exposes port 24224. The best practice I would recommend is to use JSON logs, so that they are easier for Fluentd to parse.

Next up, we have metrics. If you are someone who uses Helm, you can run a command that pulls the Prometheus components directly from the Helm repository and installs them in your Kubernetes cluster. Once you have that running, you can use the client libraries available in different programming languages, like Go, Java, Python, and C++, to export data from your applications at a metrics endpoint. Finally, you also have to do one more step: specify the scrape targets for your application. In my example, I am overriding the values.yaml file in my Helm config and specifying where exactly my application lives within the cluster; here it is my app Service, which exposes port 8080. In your case it may be the DNS name of your own application.

Coming to tracing, you can install Jaeger either directly via the Kubernetes YAML file that the Jaeger project provides, or via Helm, like the other steps we have seen so far. Once you do that, you will have a Jaeger agent running inside your Kubernetes cluster. Again, you can use the client libraries available in different programming languages to export trace data directly to Jaeger's collector endpoint. In my example from the Python library, I specify the hostname of my Jaeger agent, which is nothing but its Service name inside the Kubernetes cluster, and the port where Jaeger exposes its reporting endpoint, which is usually 6831; if yours is different, use that one.

Finally, we have the dashboard, where we provide a visual representation of the data we have collected so far. You can get your Grafana components running directly via the Kubernetes YAML file or via Helm, as we have seen so far. Once you have that, you need to specify data sources for your Grafana component. The data sources are Prometheus for the metrics, Jaeger for the tracing data, and the logs you are collecting with Fluentd. In my example, I am overriding the values.yaml file in the Helm config and specifying the data source for my Prometheus component, so that whatever data I am collecting in Prometheus is available on my Grafana dashboard.
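To make the logging step concrete, here is a minimal sketch of an application pushing structured JSON records to the aggregator on port 24224 using the fluent-logger Python package; the hostname "fluentd" is an assumption standing in for whatever Service name your aggregator has inside the cluster:

    # pip install fluent-logger
    from fluent import sender

    # "fluentd" is assumed to be the aggregator's Service name in the cluster
    logger = sender.FluentSender("my-app", host="fluentd", port=24224)

    # Structured, JSON-like records are the easiest for Fluentd to parse downstream
    logger.emit("request", {"path": "/checkout", "status": 200, "latency_ms": 42})
    logger.close()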
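For the metrics step, a minimal sketch with the official Python client could look like the following; the metric names and port are illustrative only:

    # pip install prometheus-client
    import random
    import time

    from prometheus_client import Counter, Histogram, start_http_server

    REQUESTS = Counter("app_requests_total", "Total requests handled")
    LATENCY = Histogram("app_request_latency_seconds", "Request latency in seconds")

    if __name__ == "__main__":
        start_http_server(8080)      # serves the /metrics endpoint on :8080
        while True:
            with LATENCY.time():     # record how long the simulated work takes
                time.sleep(random.random() / 10)
            REQUESTS.inc()           # count each handled request

Prometheus then needs the matching scrape target (for example my-app-service:8080) added to its configuration, which is what the values.yaml override described above does; the exact key, such as extraScrapeConfigs, depends on the chart version you use.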
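For the tracing step, a minimal sketch with the jaeger-client Python package could look like this; the agent hostname "jaeger-agent" is an assumption standing in for the agent's Service name in your cluster, and 6831 is the default UDP reporting port:

    # pip install jaeger-client
    from jaeger_client import Config

    config = Config(
        config={
            "sampler": {"type": "const", "param": 1},  # sample every trace; fine for low traffic
            "local_agent": {
                "reporting_host": "jaeger-agent",      # Jaeger agent Service name (assumed)
                "reporting_port": 6831,                # default UDP port for span reporting
            },
        },
        service_name="my-app",
        validate=True,
    )
    tracer = config.initialize_tracer()

    # Wrap a unit of work in a span so it shows up in the Jaeger UI
    with tracer.start_span("checkout") as span:
        span.set_tag("cart.items", 3)

    tracer.close()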
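And for the dashboard step, here is a sketch of the Grafana values.yaml override that wires Prometheus in as a data source; the keys follow the public grafana Helm chart, and the URL assumes the Prometheus server is exposed as a Service named prometheus-server in the same namespace:

    datasources:
      datasources.yaml:
        apiVersion: 1
        datasources:
          - name: Prometheus
            type: prometheus
            url: http://prometheus-server
            access: proxy
            isDefault: true

    # applied with something like:
    # helm upgrade --install grafana grafana/grafana -f values.yaml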
Now, once you have this observability stack set up, there are some best practices, lessons, and obvious pitfalls that I would want you to be aware of. The first one is alert fatigue. I have seen multiple teams set up alerts for every single issue that pops up in their applications. Once you do that, you get so many alerts that it is no longer signal but noise, so you are more likely to miss some useful information. This is exactly what happened at my previous company: we set up alerts for every single issue, and soon enough the mailbox of our entire team was so full that we stopped receiving alerts altogether and were more likely to miss the useful ones.

So I would want you to be very specific about which issues you want to track and set up alerts only for those; for everything else you can get away with logging through your Fluentd component.

The second pitfall I would want you to avoid is over-sampling metrics. What that means is: control the frequency of the data you are sending to the metrics endpoint. Do not send data too frequently, for example every second. This also depends on the type of system you are supporting, but having a good balance is always better, because if you send data too often it soon becomes noise rather than signal, and you are more likely to miss some useful information. So it is better to send data at an acceptable rate rather than a very frequent one; something like one minute is a good balance.

Finally, poor dashboard hygiene is one of the biggest mistakes that SREs often make. The thinking is that the more data we put on the dashboard, the better the insights we can extract from it. That is actually far from the truth, because the more data you provide, the more noise you create, and the more likely you are to miss some useful information. What I would recommend is to keep the data focused on the individual you are building the dashboard for. For example, if you are building a dashboard for a manager, provide data that gives a 10,000-foot overview of the system rather than very specific data from inside the application; that way you keep the noise out and provide more signal for that particular person. Similarly, if you are building a dashboard for a developer, provide data that gives a deeper look into the application itself; something like the log aggregator is a good fit for developers. So, based on the individual you are creating the dashboard for, keep the data very focused so that you do not create more noise and you do provide some useful insights.

Quickly summarizing what we have discussed so far: we covered the four pillars of observability. Logging helps you answer questions like, what is happening in my system, so that I can identify issues just by looking at the logs. Metrics help you answer questions like, how can I quantify the performance of my system over time? Tracing helps you answer questions like, where is the bottleneck in my system, so that I can target that specific area and increase the overall performance? And dashboards provide a visual representation of the data you have collected, so that you can identify issues in a particular area and act on them as quickly as possible.

I hope this talk was helpful and that it gave you the minimum number of steps for getting an observability stack running in your system. If you want to discuss more on this topic, this is my LinkedIn, this is my GitHub, and this is my blog, where I post similar software engineering content that I learn on a day-to-day basis. Thank you.
...

Vignesh Iyer

Software Engineer @ Nvidia



