Transcript
Are you a small team with a minimal budget, unable to spend on enterprise-grade applications, but at the same time you want to gain some really useful insights about your system? Then this talk is just for you.
I know that distributed systems are very complex and tracing issues in a distributed system is very difficult. At the same time, enterprise solutions, although very helpful, are very expensive, especially for small teams.
So in this talk, my objective is going to be: how to build a minimal observability stack on Kubernetes, keeping the costs low while also gaining some useful insights about your system, and finally, avoiding some obvious pitfalls that you can run into while designing these systems.
So let's get started now.
Before getting into the actual tools and how to set up the actual stack,
I would like to discuss what I call the four pillars of observability.
First is logging.
Logging is the process of tracking specific events in your application.
Now these help you answer questions like, what is happening in my system?
Second, we have metrics. Metrics are nothing but numerical measurements of your system's performance over time. They really help you quantify the performance of your system, so they can help you answer questions like: how is my system behaving over time?
The third is tracing. Tracing is the process of looking at the flow of requests through your system. Now, if you're someone who is aiming for P99 performance or a really high-performance system, tracing will help you identify the specific bottlenecks in your application, so that you can target that specific area, reduce the latency, and increase the overall performance.
The fourth is dashboards. A dashboard is nothing but a visual representation of all the data that you're aggregating across these pillars, shown in one place so that you can act on that information as quickly as possible. So dashboards help you answer questions like: how quickly can we act upon identifying an issue?
Now, for each of these pillars I have identified one of the most popular open source tools available.

Starting with logging: Fluentd is one of the most popular log aggregators in the open source community.

For monitoring, we have Prometheus. Prometheus is arguably one of the most used components in a Kubernetes cluster for aggregating your metrics data.

Coming to tracing, we have a tool called Jaeger. You could also find other tools like OpenTelemetry, but Jaeger is one of the easiest to set up, especially for smaller teams.

And finally, coming to dashboards: Grafana is arguably one of the most used open source tools for setting up dashboards for your observability stack.
Now we are going to look at how to set up each and every one of these components in your stack with the minimum number of steps. Starting with logging: if you have a Kubernetes cluster, you can directly run a kubectl apply command. This will pull the YAML file that is written by the Fluentd team itself, and it will make sure that your cluster is always running the latest version of the Fluentd components.
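As a sketch, assuming the DaemonSet manifests that the Fluentd team publishes in the fluent/fluentd-kubernetes-daemonset repository (the exact file name depends on the log backend you choose, so treat this URL as a placeholder):

```sh
# Sketch only: the manifest URL below is an assumption; pick the
# DaemonSet variant that matches your log backend.
kubectl apply -f https://raw.githubusercontent.com/fluent/fluentd-kubernetes-daemonset/master/fluentd-daemonset-elasticsearch.yaml
```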
Coming to Helm: if you're someone who's using Helm inside your cluster, you can also run a Helm command. This will pull the chart directly from the Helm repository that Fluent maintains and install it in your cluster.
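A minimal sketch, assuming the chart repository that the Fluent project publishes:

```sh
# Add the Fluent project's chart repository and install Fluentd.
helm repo add fluent https://fluent.github.io/helm-charts
helm repo update
helm install fluentd fluent/fluentd
```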
Once you have the component running, you can have your applications push your logs directly to this aggregator, where it exposes the port 24224. And the best practice I would recommend is to use JSON logs, so that it's easier for Fluentd to parse them.
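As a sketch of that practice, here is what emitting one JSON object per log line might look like in Python (the field names are just illustrative):

```python
# Minimal sketch: write structured JSON logs to stdout so Fluentd
# can parse each line without custom regex rules.
import json
import logging
import sys

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(logging.Formatter("%(message)s"))
log = logging.getLogger("app")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info(json.dumps({"level": "info", "event": "user_signup", "user_id": 42}))
```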
Next up, we have metrics. Now, if you're someone who uses Helm, you can run a Helm command that pulls the Prometheus components directly from the Helm repository and installs them in your Kubernetes cluster.
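A minimal sketch, assuming the prometheus-community chart repository:

```sh
# Add the community chart repository and install Prometheus.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/prometheus
```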
Once you have that running, you can use the client libraries available in different programming languages, like Go, Java, Python, and C++, to directly export your data through a metrics endpoint.
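For example, a minimal sketch with the official Python client library, prometheus_client (the metric name and port are illustrative):

```python
# Minimal sketch: expose a /metrics endpoint that Prometheus can scrape.
import time

from prometheus_client import Counter, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests handled")

if __name__ == "__main__":
    start_http_server(8080)  # serves /metrics on port 8080
    while True:
        REQUESTS.inc()       # stand-in for real request handling
        time.sleep(1)
```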
Finally, you would also have to do one more step: you would have to specify the scrape targets for your application. So in this example, I'm overriding the values.yaml file in my Helm config and specifying where exactly my application lies within the cluster. In my example, this is my app service, which is exposing port 8080.
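A sketch of that override, assuming the extraScrapeConfigs value exposed by the prometheus-community chart; "my-app-service" and port 8080 stand in for your own service:

```yaml
# Sketch of a values.yaml override; the service name and port are
# placeholders for your own application.
extraScrapeConfigs: |
  - job_name: my-app
    static_configs:
      - targets:
          - my-app-service:8080
```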
In your case, it may be the DNS name of your application itself.

Coming to tracing: you can either install it directly via the Kubernetes YAML file that Jaeger provides, or you can also use Helm, as in the other steps that we have seen so far.
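A minimal sketch of the Helm route, assuming the jaegertracing chart repository:

```sh
# Add the Jaeger chart repository and install Jaeger.
helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
helm repo update
helm install jaeger jaegertracing/jaeger
```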
Once you do that, you will have a Jaeger agent running inside your Kubernetes cluster.
Once you have that, you can again use the client libraries available in different programming languages and export the data directly to Jaeger's collector endpoint. Now, this is an example from the Python library, where I'm specifying the host name for my Jaeger agent.
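A sketch of what that might look like with the jaeger-client Python library (the service name and span are illustrative):

```python
# Minimal sketch: configure a Jaeger tracer pointing at the agent's
# in-cluster service name and default UDP port.
from jaeger_client import Config

config = Config(
    config={
        "sampler": {"type": "const", "param": 1},  # sample every trace
        "local_agent": {
            "reporting_host": "jaeger-agent",  # Kubernetes service name
            "reporting_port": 6831,            # default agent UDP port
        },
        "logging": True,
    },
    service_name="my-app",
)
tracer = config.initialize_tracer()

with tracer.start_span("handle-request") as span:
    span.set_tag("endpoint", "/users")
```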
The host name is nothing but the service name inside the Kubernetes cluster, and this is usually the port where the Jaeger agent exposes its exporter endpoint, which is 6831. If in your case it is a different one, you can use that one.
Now, finally, we have the dashboard, where we provide a visual representation of the data that we have collected so far. You can either get your Grafana components running directly via the Kubernetes YAML file, or via Helm, as we have seen so far.
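A minimal sketch of the Helm route, assuming the Grafana project's chart repository:

```sh
# Add the Grafana chart repository and install Grafana.
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install grafana grafana/grafana
```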
Once you have that, you need to specify data sources for your Grafana component. Now, the data sources may be Prometheus, which is the metrics; Jaeger, which is nothing but the tracing data that you've collected; and also the logs that you're collecting from Fluentd. So in this example, I'm overriding the values.yaml file inside the Helm config and specifying the data source for my Prometheus component.
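A sketch of that override, assuming the datasources value exposed by the Grafana chart; the Prometheus URL assumes the default service name created by the Prometheus chart:

```yaml
# Sketch of a values.yaml override; the URL below is an assumption
# based on typical in-cluster service naming.
datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        url: http://prometheus-server
        access: proxy
        isDefault: true
```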
This will make sure that whatever data I'm collecting inside the Prometheus component is available in my dashboard on the Grafana component.
Now, once you have this observability stack set up, there are some best practices and some obvious pitfalls that I would want you to avoid.
The first one is: avoid alert fatigue. Now, I've seen this with multiple teams: setting up alerts for every single issue that pops up in their applications. Once you do that, you get so many alerts that soon it's no longer signal, it becomes noise, so you are more likely to miss some useful information.

This exactly happened at my previous company, where we set up alerts for every single issue, and soon enough the mailbox of our entire team was so full that we stopped reading alerts altogether, and we were more likely to miss some useful alerts. So I would want you to be very specific about which issues you want to track, and set up alerts only for those issues; for other issues, you can get away with logging them through your Fluentd component.
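As an illustration of keeping alerts specific, here is a sketch of a narrowly scoped Prometheus alerting rule; the metric names and threshold are entirely illustrative:

```yaml
# Sketch: alert only on a sustained, user-visible condition rather
# than on every individual error.
groups:
  - name: app-alerts
    rules:
      - alert: HighErrorRate
        expr: rate(app_errors_total[5m]) / rate(app_requests_total[5m]) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Error rate above 5% for 10 minutes"
```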
Now, the second pitfall that I would want you to avoid is: don't over-sample metrics. What that means is, control the frequency of the data that you're sending to the metrics endpoint. Do not send data too frequently, for example every second. This also depends on the type of systems that you are supporting, but having a good balance is always better, because if you're sending data too often, it soon becomes noise rather than signal, so you're more likely to miss some useful information. So it's better to send data at an acceptable rate rather than very frequently; something like one minute is a good balance.
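In Prometheus terms, that balance is typically controlled by the global scrape interval; a sketch:

```yaml
# Sketch of a Prometheus configuration snippet capping how often
# targets are scraped.
global:
  scrape_interval: 1m  # one minute, per the balance suggested above
```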
Finally, dashboard hygiene. This is one of the biggest mistakes that SREs often make. What they think is: the more data we provide on the dashboard, the better the insights we can extract from it. That is actually far from the truth, because the more data you provide, the more noise you create, so the more likely you are to miss some useful information.

So what I would recommend is: based on the individual that you're creating the dashboard for, keep the data that you're providing very focused for that individual. For example, if you're providing the dashboard for a manager, have data that gives a 10,000-foot overview of the system rather than very specific data from inside the application. That way you keep the noise out and you provide more signal for that particular individual. Similarly, if you're creating a dashboard for a developer, I would want you to provide data that gives a deeper look into the application itself; something like the logging aggregator would be a good use for developers.

So, based on the individual that you're creating the dashboard for, keep the data very focused for that particular individual, so that you don't create more noise and you provide some useful insights.
Now, quickly giving a summary of what we have discussed so far. We discussed the four pillars of observability, starting with logging: this helps you answer questions like, what is happening in my system, so that I can identify any issues that are going on just by looking at the logs. Metrics help you answer questions like, how can I quantify the performance of my system? Coming to tracing: it helps you answer questions like, where is the bottleneck in my system, so that I can target a specific area and increase the overall performance of my system? And dashboards are nothing but a visual representation of the data that you have collected, so that you can identify issues in a particular area and act on them as quickly as possible.
I hope that this talk was helpful and that it provided you with the minimum number of steps for getting an observability stack running in your system. If you want to discuss more on this topic, this is my LinkedIn, this is my GitHub, and this is my blog, where I post similar software engineering content that I learn on a day-to-day basis.
Thank you.