Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone.
My name is Mohammad Ahmad Saeed and today I'm going to talk
about building observability into cloud native applications.
Let's first look at what observability is.
So observability is the ability to understand the internal state of a
system based on its external outputs.
In simpler terms, observability is about answering the question, what's happening
inside my application or system and why?
Observability is often confused with monitoring, but they are different.
Observability goes beyond traditional monitoring. Monitoring simply
tells us whether something is wrong or not.
For instance, monitoring will tell us whether CPU usage is high
or response time is slow, but observability helps
us understand why it's wrong.
It empowers us to ask arbitrary questions about our systems and get answers
without needing to predict every possible failure scenario beforehand.
Let's look at the pillars of observability.
So observability has three pillars.
The first is logs.
Logs are the timestamped records of events that occurred in the system.
For instance, a log entry is written whenever a booking is made, a login
happens, or a payment is made.
They are structured or unstructured records of events that happened in the
system, and they provide detailed context about
specific operations. They can be extremely valuable for debugging. For
instance, if an exception or some other failure
happens, the log will tell me the exact message: why it happened
and what exactly the problem was.
The next pillar is metrics.
Metrics are quantitative data about system performance,
for instance CPU usage, request latency, memory usage,
error rates, and throughput. There can be a lot of metrics, but
it's more about quantitative data than
data about individual events.
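To make the difference between individual events and metrics concrete, here is a minimal Python sketch that collapses a handful of request events into aggregate numbers. The RequestEvent shape, the summarize function, and the sample values are all assumptions made up for illustration; they do not come from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class RequestEvent:
    """One handled request (hypothetical shape, for illustration only)."""
    endpoint: str
    latency_ms: float
    status: int


def summarize(events: list[RequestEvent], window_seconds: float) -> dict[str, float]:
    """Collapse individual request events into a few aggregate metrics."""
    total = len(events)
    if total == 0:
        return {"throughput_rps": 0.0, "error_rate": 0.0, "avg_latency_ms": 0.0}
    # Treat HTTP 5xx responses as errors (an assumption of this sketch).
    errors = sum(1 for e in events if e.status >= 500)
    return {
        "throughput_rps": total / window_seconds,
        "error_rate": errors / total,
        "avg_latency_ms": sum(e.latency_ms for e in events) / total,
    }


# Made-up sample data covering a 2-second window.
events = [
    RequestEvent("/checkout", 120.0, 200),
    RequestEvent("/checkout", 80.0, 200),
    RequestEvent("/checkout", 400.0, 500),
    RequestEvent("/login", 40.0, 200),
]
metrics = summarize(events, window_seconds=2.0)
print(metrics)
```

The metrics say that a quarter of the requests failed, but not which request failed or why. That detail lives in the logs and traces.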
And traces are the end-to-end tracking of requests as they propagate
through the distributed system.
Traces help pinpoint performance bottlenecks and understand dependencies.
Using traces, we can identify exactly how a request went from
one service to another,
and we can track the whole route that a given request took.
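As a rough sketch of how a request can be followed from one service to another, the Python below passes a trace ID and a parent span ID along with each call. The header names (x-trace-id, x-parent-span-id), the Span class, and the two services are hypothetical. Real tracing systems define their own propagation formats, for example the W3C traceparent header.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One operation inside a trace (simplified, hypothetical model)."""
    trace_id: str
    service: str
    name: str
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])


# Stand-in for a tracing backend: every finished span ends up here.
COLLECTED: list[Span] = []


def start_span(service: str, name: str, headers: dict) -> Span:
    """Continue the trace found in the incoming headers, or start a new one."""
    trace_id = headers.get("x-trace-id") or uuid.uuid4().hex
    span = Span(trace_id, service, name, parent_id=headers.get("x-parent-span-id"))
    COLLECTED.append(span)
    return span


def outgoing_headers(span: Span) -> dict:
    """Headers to attach when calling the next service."""
    return {"x-trace-id": span.trace_id, "x-parent-span-id": span.span_id}


def payment_service(headers: dict) -> None:
    start_span("payment", "charge_card", headers)


def booking_service(headers: dict) -> None:
    span = start_span("booking", "create_booking", headers)
    # The downstream call carries the trace context with it.
    payment_service(outgoing_headers(span))


booking_service({})  # a request arriving with no trace context
route = [f"{s.service}:{s.name}" for s in COLLECTED]
print(route)
```

Because both spans share one trace ID and the payment span points at the booking span as its parent, the route of the request can be rebuilt afterwards.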
Now, let's see why observability is crucial for cloud native.
Cloud native applications are inherently complex. They are
distributed: they have a lot of microservices, and those
microservices communicate over the network, which introduces latency
and a lot of failure points.
The next is that they are dynamic.
Containers and orchestration platforms like Kubernetes constantly
spin up and tear down resources.
For instance, when the demand grows, they can
create new instances, and when the demand drops, they can kill
the specific instances that were created.
So they are creating and destroying resources all the time.
And then they are ephemeral.
Instances are short lived,
and because of that, it is really hard to track what happened.
By the time we start investigating something,
the instance or the platform may no longer be there. Without observability,
we are essentially flying blind.
So when something goes wrong, we need to quickly identify the root cause, whether
it's like a misconfigured service, a network bottleneck or a cascading failure.
Let's look at the consequences of a lack of observability.
The very first is an increased mean time to resolution.
If we don't have a proper observability setup, diagnosing
and resolving issues
takes significantly longer, which can impact the user experience
and operational efficiency.
For instance, if it takes a few hours to recover from a failure, it can
have severe consequences for the business.
Ideally, if we have observability set up,
we should be able to quickly identify the root cause, which will help
reduce the mean time to resolution.
The next is limited performance optimization.
If we don't have a proper observability setup, performance
bottlenecks in one microservice can cascade, affecting the overall application,
and we won't be able to pinpoint exactly what is causing the problem.
Without metrics and traces, optimizing performance
becomes just a guessing game.
We can try to optimize something, but we don't
know the exact root cause.
For instance, if we see an increased response time
for a given endpoint and we don't have an observability setup,
we might believe that this endpoint has a slow
implementation or needs some kind of optimization.
But in reality, it is possible that at that specific time
the response time from the database increased.
Because we didn't have observability set up, we believe
that our endpoint is slow,
and we cannot identify that the database was the actual root cause.
And the next is the difficulty in ensuring reliability.
If there are any gaps in
observability, we cannot ensure proper reliability, because as the number
of microservices grows, ensuring the reliability of the entire system
becomes challenging without a holistic view of the system's health.
Let's look at the steps for implementing observability into
the cloud native application.
So on a high level, there are five steps.
The first is instrumentation.
Instrumentation is basically the foundation of observability.
Here we need to instrument our code to collect metrics, logs, and traces.
We can use libraries that are designed for our
chosen language and frameworks.
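As a minimal sketch of what instrumenting code can look like, the Python decorator below records a call count, an error count, and the total time spent for each function, and logs any failure. It uses only the standard library; the instrumented decorator, the in-memory counters, and the make_booking function are hypothetical. In practice an instrumentation library for the chosen language would take the place of these hand-rolled counters.

```python
import functools
import logging
import time
from collections import defaultdict

logger = logging.getLogger("instrumentation_sketch")

# In-memory stand-ins for a metrics backend (an assumption of this sketch).
CALL_COUNT = defaultdict(int)
ERROR_COUNT = defaultdict(int)
TOTAL_SECONDS = defaultdict(float)


def instrumented(func):
    """Wrap a function so that every call emits metrics and failures are logged."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        name = func.__name__
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception:
            ERROR_COUNT[name] += 1
            logger.exception("call failed: %s", name)
            raise
        finally:
            # Runs for successful and failed calls alike.
            CALL_COUNT[name] += 1
            TOTAL_SECONDS[name] += time.perf_counter() - start
    return wrapper


@instrumented
def make_booking(user_id: str, seats: int) -> str:
    if seats <= 0:
        raise ValueError("seats must be positive")
    return f"booking for {user_id} ({seats} seats)"


make_booking("user-1", 2)
try:
    make_booking("user-2", 0)
except ValueError:
    pass
print(dict(CALL_COUNT), dict(ERROR_COUNT))
```

The business logic in make_booking stays unchanged, and the decorator keeps the measurement code in one place.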
The next is centralized collection and storage:
collecting data from all the services and then storing
it in a centralized location.
We can use tools like Prometheus for metrics, Elasticsearch or Loki for logs,
and Jaeger or Zipkin for traces here.
Once we have the instrumentation set up and the
data centralized in one place,
the next part is visualization and analysis.
Here we can use dashboards and visualization tools like Grafana to
explore the data and gain insights.
We can also set up alerts to be notified when there is any
critical issue in the system.
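An alert can be thought of as a rule evaluated over recent metric samples. The Python sketch below fires only when the last few samples all breach a threshold, so a single spike does not page anyone. The AlertRule class, its field names, and the sample values are hypothetical; alerting tools have their own rule formats.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    """A threshold rule (hypothetical shape, for illustration only)."""
    metric: str
    threshold: float
    for_samples: int  # consecutive breaching samples required; assumed >= 1


def should_fire(rule: AlertRule, samples: list[float]) -> bool:
    """Fire only if the most recent `for_samples` values all exceed the threshold."""
    if len(samples) < rule.for_samples:
        return False
    recent = samples[-rule.for_samples:]
    return all(value > rule.threshold for value in recent)


rule = AlertRule(metric="error_rate", threshold=0.05, for_samples=3)
print(should_fire(rule, [0.01, 0.09, 0.02, 0.08]))  # isolated spikes
print(should_fire(rule, [0.01, 0.06, 0.07, 0.09]))  # sustained breach
```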
Once we have visualization and analysis done, we can move towards contextualization.
So here we can connect our metrics, logs, and traces together to provide
a holistic view of the system.
This allows us to correlate events and understand the relationships between
different parts of the application.
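One common way to connect the signals is a shared identifier. The Python sketch below assumes that every log entry and every span carries a trace_id field and groups them by it. The field names and the sample records are made up for illustration.

```python
from collections import defaultdict

# Made-up records; the shared "trace_id" field is an assumption of this sketch.
logs = [
    {"trace_id": "t1", "service": "booking", "level": "INFO", "message": "booking received"},
    {"trace_id": "t1", "service": "payment", "level": "ERROR", "message": "card declined"},
    {"trace_id": "t2", "service": "booking", "level": "INFO", "message": "booking received"},
]
spans = [
    {"trace_id": "t1", "service": "booking", "duration_ms": 310},
    {"trace_id": "t1", "service": "payment", "duration_ms": 290},
    {"trace_id": "t2", "service": "booking", "duration_ms": 45},
]


def correlate(logs: list[dict], spans: list[dict]) -> dict:
    """Group log entries and spans that belong to the same request."""
    by_trace = defaultdict(lambda: {"logs": [], "spans": []})
    for entry in logs:
        by_trace[entry["trace_id"]]["logs"].append(entry)
    for span in spans:
        by_trace[span["trace_id"]]["spans"].append(span)
    return dict(by_trace)


def traces_with_errors(correlated: dict) -> list[str]:
    """Trace IDs that have at least one ERROR log entry."""
    return [
        trace_id
        for trace_id, data in correlated.items()
        if any(entry["level"] == "ERROR" for entry in data["logs"])
    ]


correlated = correlate(logs, spans)
failing = traces_with_errors(correlated)
print(failing, correlated["t1"]["spans"])
```

Starting from an error log, we can jump straight to the spans of the same request and see where the time went.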
The final step is automation.
We can automate the whole process of collecting, analyzing, and
visualizing the observability data.
This allows us to focus on understanding the system
rather than managing tools.
Using all five of these steps, we can properly implement observability
in cloud native applications.
Let's look at the
famous tools and technologies that are available in the market.
For metrics, we can use
Prometheus, Telegraf, or StatsD.
For logs, it's mostly Elasticsearch, Loki, Filebeat, and Fluentd.
For traces, the most famous is OpenTelemetry;
there is also Zipkin and Jaeger.
For visualization, the most famous
are Kibana and Grafana.
Some of these tools are open source;
some might not be.
Let's look at the platforms.
Why do we need those platforms?
We need them because
they provide all the features in one place.
We don't have to set up individual tools and technologies and
try to connect everything.
So using a platform can speed up the integration of observability
into our cloud native applications.
The first is Datadog.
Datadog is a market leader.
It offers comprehensive monitoring and analytics across cloud native
environments, including logs, metrics, traces and events.
The next is New Relic.
New Relic is a comprehensive observability platform with strong
application performance monitoring capabilities.
It is also a very famous platform, used by millions of users across the world.
The third in this list is Honeycomb.
Honeycomb stands out because it focuses on high-cardinality data.
High-cardinality data means data that
has many unique values.
Honeycomb excels at event-driven instrumentation. Instead
of just monitoring a metric like CPU
usage or request count, Honeycomb lets us analyze individual events,
for example a user login, an API call, or a database query.
There are more platforms than these, but these are the top three in my list.
Now let's look at the key features that are available in all of these platforms.
The first key feature is real-time insights:
collecting and analyzing metrics, logs, traces, and events
from various sources to provide real-time visibility into application
performance and infrastructure health.
All of these platforms support microservices.
Many microservices have a really complex distributed architecture,
but using these platforms, we can easily monitor all of those
microservices in just one place.
And all of them allow us to create customizable dashboards.
For instance, if I want to see some trend over time, or what the P50 and P90
are for any kind of metric, we can easily set up
customized dashboards using any of these platforms.
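For anyone unfamiliar with P50 and P90, the Python sketch below computes them with the nearest-rank method over a small list of made-up latencies. This is one of several percentile definitions, chosen here for simplicity; a dashboard tool may interpolate or estimate from histograms instead.

```python
import math


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Made-up request latencies in milliseconds, with one slow outlier.
latencies_ms = [12, 15, 11, 240, 14, 13, 18, 16, 17, 19]
p50 = percentile(latencies_ms, 50)
p90 = percentile(latencies_ms, 90)
mean = sum(latencies_ms) / len(latencies_ms)
print(p50, p90, mean)
```

Here the single 240 ms outlier pulls the mean well above the median, which is one reason dashboards usually show percentiles rather than only an average.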
Now let's look at the best practices.
The very first thing is to instrument early and often. We shouldn't
leave it till the last moment.
The earlier we start implementing observability in a cloud
native application, the less painful it will be when things start scaling.
Then, for metrics, we have to use meaningful metrics.
We can focus on the metrics that are relevant to the business,
and we have to separate signal from noise.
There can be a lot of metrics, but we have to pick the metrics that
are most relevant for the business.
Then for logs, the best practice is to structure the logs.
We have to use a consistent format to make it easier to
search and analyze the logs.
For instance, when we are logging
events, there can be standard fields:
a timestamp, a system name, an application ID.
There can be a lot of fields, but
whatever they are, the format has to be consistent. It's also better if we
use the same logging format across all the services, because all of these
logs might end up in the same place, and they will be easier to analyze
if the format is the same.
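A minimal sketch of such a consistent format, using Python's standard logging module to emit one JSON object per log line with the fields mentioned above. The JsonFormatter class, the build_logger helper, and the field names (timestamp, system, application_id) are assumptions for illustration, not a standard schema.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render every record as a single JSON object with a fixed set of fields."""

    def __init__(self, system: str, application_id: str):
        super().__init__()
        self.system = system
        self.application_id = application_id

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "system": self.system,
            "application_id": self.application_id,
            "level": record.levelname,
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(system: str, application_id: str) -> logging.Logger:
    """Create a logger whose output always follows the shared format."""
    logger = logging.getLogger(f"{system}.{application_id}")
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(system, application_id))
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# Hypothetical system and service names.
logger = build_logger("booking-platform", "payment-service")
logger.info("payment made for booking %s", "B-1001")
```

If every service builds its logger this way, the log store can filter on system, application_id, or level without any per-service parsing rules.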
Next is data correlation.
Here we are connecting metrics, logs, and traces together to provide
a holistic view of the system.
Then we have to automate the observability pipeline.
The whole process of collecting, analyzing, and visualizing the
data has to be automated.
There shouldn't be any manual intervention required whenever we
want to visualize something.
I should just need to open the dashboard or the platform,
and the data should already be there.
Everything should be set up in an automated way.
The final thing is cost management.
Observability can be expensive, especially at scale,
for example if we are logging millions or billions of events per hour or per day.
So we have to be careful about the costs associated
with storage, processing, and so on.
To reduce the cost and the volume of data we are collecting, we can set
up data retention policies and use tiered storage solutions.
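A retention policy with tiers can be as simple as a rule based on the age of the data. The Python sketch below is only an illustration: the tier names and the 7-day and 30-day cutoffs are hypothetical, and real values depend on budget and on how far back investigations usually need to look.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical policy: recent data stays on fast storage, older data moves
# to cheaper storage, and anything past the retention window is deleted.
HOT_DAYS = 7
WARM_DAYS = 30


def storage_tier(event_time: datetime, now: datetime) -> str:
    """Decide where a piece of observability data should live based on its age."""
    age = now - event_time
    if age <= timedelta(days=HOT_DAYS):
        return "hot"
    if age <= timedelta(days=WARM_DAYS):
        return "warm"
    return "delete"


now = datetime(2024, 6, 30, tzinfo=timezone.utc)
print(storage_tier(datetime(2024, 6, 28, tzinfo=timezone.utc), now))
print(storage_tier(datetime(2024, 6, 10, tzinfo=timezone.utc), now))
print(storage_tier(datetime(2024, 3, 1, tzinfo=timezone.utc), now))
```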
Let's look at a case study.
Assume there is a cloud native e-commerce
application which is experiencing periodic slowdowns.
We can use the three pillars of observability to pinpoint
the root cause.
The very first is the logs.
Logs will tell us if there is an increase in the error
rates in a microservice.
Metrics will tell us if there is high CPU utilization on a node.
Traces will pinpoint a slow database query affecting response times.
Using logs, metrics, and traces, we have identified that the problem
goes back to a slow database query.
The resolution is straightforward.
We can optimize the query
and auto-scale the service, and that will reduce the latency.
So we can use these three pillars of observability to find
the root cause of periodic slowdowns in a system.
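To show how a trace can point at the slow database query in a scenario like this, the Python sketch below computes the "self time" of each span (its duration minus the time spent in its children) and picks the largest. The span records, service names, and durations are invented for illustration, and the calculation assumes child spans run one after another rather than in parallel.

```python
# Made-up spans for one slow request in the hypothetical e-commerce application.
spans = [
    {"id": "a", "parent": None, "service": "api-gateway", "operation": "GET /orders", "duration_ms": 1250},
    {"id": "b", "parent": "a", "service": "order-service", "operation": "list_orders", "duration_ms": 1200},
    {"id": "c", "parent": "b", "service": "database", "operation": "SELECT orders", "duration_ms": 1100},
    {"id": "d", "parent": "b", "service": "cache", "operation": "GET user", "duration_ms": 20},
]


def self_times(spans: list[dict]) -> dict[str, int]:
    """Time spent in each span itself, excluding time spent in its children.

    Assumes children execute sequentially (a simplification of this sketch).
    """
    child_total: dict[str, int] = {}
    for span in spans:
        parent = span["parent"]
        if parent is not None:
            child_total[parent] = child_total.get(parent, 0) + span["duration_ms"]
    return {
        span["id"]: span["duration_ms"] - child_total.get(span["id"], 0)
        for span in spans
    }


def bottleneck(spans: list[dict]) -> dict:
    """The span with the largest self time."""
    times = self_times(spans)
    return max(spans, key=lambda span: times[span["id"]])


slowest = bottleneck(spans)
print(slowest["service"], slowest["operation"])
```

The gateway and the order service look slow by total duration, but almost all of that time is spent waiting on the database query underneath them.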
I hope this talk was useful. Thank you.