Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi.
Welcome to this session on Observability 2.0: much more than just logs, metrics, and traces.
Hi, this is Neel Shah, a DevOps community guy.
I work as a developer advocate at Middleware, and I also organize
a lot of communities around DevOps in my local region, like
Google Cloud, CNCF, and HashiCorp, plus hackathons, and I have given
talks at 10+ conferences, including PlatformCon and KCDs.
So, coming to today's agenda: the evolution of observability,
OpenTelemetry's role, reducing downtime with the OTel Collector,
ML in observability, optimizing observability cost, the AI-driven future, and takeaways.
So, looking at the evolution of observability: around 2015, people were
mostly running monolithic applications, and they were using tools like
Prometheus and Grafana. Then came the era of the telemetry explosion:
a lot of data coming out of applications, infrastructure, and everything
else, and there was a lot of chaos, lakes of data.
It created a scenario where a lot of data is there, but it becomes
very challenging to handle it.
And then, around 2023, the era of AI came, and people started using AI
to improve their observability.
So: from firefighting to foresight.
Imagine a scenario where it's 2:00 AM, your production application
is down, there are tons of data floating around, and you don't know
what you actually need to dig into.
It takes a lot of manual effort to understand the problem, and
meanwhile there is downtime, which also hurts the business.
Now, in a better world with an Observability 2.0 system, issues are
surfaced early: you have AI flag the growing risk before any outage,
and you act proactively, avoiding the incident altogether.
Firefighting to foresight: with Observability 2.0, we have shifted
from reactive to predictive, just like this.
We don't want manual troubleshooting.
We want auto-explained incidents, so that we don't need to hurry at the
last minute and do a lot of things manually.
Next come the enhanced capabilities.
There is correlation and contextualization, meaning automatic data
correlation from front end to back end.
There is AI-driven anomaly detection, and there is faster root cause
analysis, because along with a lot of AI automation we gain a lot of
context, and that helps us understand and reach a better root cause.
So in today's journey into Observability 2.0, we'll cover the evolution
of observability (which we just covered), the rise of OpenTelemetry,
ML and LLMs in observability, and reducing downtime with the Collector,
where we'll understand how the OTel Collector works.
We'll also see cutting costs with smarter ingestion, and then the
AI-driven future of observability.
So here's the evolution of observability.
There are a lot of siloed data sources, siloed meaning there are
bunches of data and we don't actually know what to do with them, so we
end up performing manual root cause analysis.
On top of that, growing telemetry volume is leading to higher cost
and complexity.
Then comes the evolution: unified Telemetry with OpenTelemetry.
It comes with proactive anomaly detection with insights, and
also optimized data pipelines.
Then came the rise of OpenTelemetry.
Before OpenTelemetry, people were collecting a lot of data,
but it was in different formats.
So if person A wanted to share their data with person B, it was not
compatible, and that created a lot of chaos.
Then OpenTelemetry came, and it created a vendor-neutral framework
for collecting telemetry data.
Continuing with the rise of OpenTelemetry: OpenTelemetry is one of the
biggest projects of the CNCF after Kubernetes.
So if you go and search on CNCF, a lot of people are contributing to
OpenTelemetry, and a lot of people are using OpenTelemetry in their
application backends for observability purposes.
Why OpenTelemetry?
Because it created a unified, standardized way of doing instrumentation.
It created easier correlation between different signals.
And it created the freedom to route data to multiple backends and automated platforms.
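To make that concrete, here is a minimal sketch of vendor-neutral
instrumentation with the OpenTelemetry Python SDK; it assumes the
opentelemetry-sdk and opentelemetry-exporter-otlp packages are installed,
and the service name and endpoint are placeholder assumptions.

    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

    # One standard SDK; the backend is chosen by configuration, not code.
    provider = TracerProvider()
    provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
    )
    trace.set_tracer_provider(provider)

    tracer = trace.get_tracer("demo-service")  # hypothetical service name
    with tracer.start_as_current_span("handle-request"):
        pass  # your business logic; the span is batched and exported for you

Because the export format is OTLP, the same instrumented code can send
data to any compatible backend, which is exactly the freedom to route
data that OpenTelemetry gives you.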
Next were extensible processors and exporters.
The impact this created: it brought proper engineering to monitoring,
it enabled smarter pipelines for data processing, and it formed the
foundation of AI-driven platforms and systems.
Now, the OTel Collector. The Collector is the core component of OpenTelemetry.
It consists of three major pieces: receivers, processors, and exporters.
Receivers fetch the data from your application.
Processors transform the data into whatever shape you need before
feeding it onward, for example to some SaaS stack.
An exporter then exports the data to a SaaS platform; Middleware, for
example, is a full-stack observability platform. If you want to
understand more about the OTel Collector, you can scan this QR code;
there is a detailed blog on the OTel Collector.
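As a rough illustration of those three pieces, here is a minimal
Collector configuration sketch; the endpoint values are placeholder
assumptions, not a recommended production setup.

    receivers:
      otlp:                      # receiver: fetches data from your application
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317

    processors:
      batch: {}                  # processor: transforms/batches data before export

    exporters:
      otlphttp:                  # exporter: ships data to a backend or SaaS platform
        endpoint: https://backend.example.com/otlp   # hypothetical backend URL

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [otlphttp]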
Next, ML and LLMs in observability. There are a lot of anomalies that
get detected by ML models. They also detect incident patterns that we
were not able to catch before, and they prioritize incidents based on
severity and production impact.
LLMs can summarize incidents in human-readable language and predict
potential future issues based on historical patterns.
If you look at LLMs in observability, some of the major uses are
monitoring metrics and performance, improving issue identification,
and enhancing health and reliability. ML and LLMs have transformed
observability from passive monitoring to intelligent problem solving.
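As a toy illustration of ML-based anomaly detection on a metric series,
here is a small Python sketch using scikit-learn's IsolationForest; the
model choice and the sample CPU values are illustrative assumptions,
not what any particular platform uses.

    import numpy as np
    from sklearn.ensemble import IsolationForest

    # Hypothetical CPU utilization samples; 97 and 95 are the spikes.
    cpu = np.array([41, 43, 40, 44, 42, 97, 43, 41, 95, 42]).reshape(-1, 1)

    model = IsolationForest(contamination=0.2, random_state=0).fit(cpu)
    flags = model.predict(cpu)           # -1 = anomaly, 1 = normal
    print(cpu[flags == -1].ravel())      # -> [97 95]

An LLM-generated incident summary would sit on top of detections like
these, turning the flagged points into a human-readable explanation.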
So what are the key data types in Observability 2.0?
There are events, then profiles, dependencies, and eBPF.
eBPF changed a lot of things, because it can fetch data from the kernel
level: all the request data. It can also figure out a lot of things,
like which ports are open and which packets are traveling over which
routes; all of this is captured by it, and all of it can be tracked
at the kernel level, which also helps with security.
Now, comparing the features of Observability 2.0 with traditional
observability: I just wanted to give you an overview so you can better
understand how the different things work, and I will just go over one
or two examples.
The first one is log collection: log collection was manual and
fragmented; with 2.0, it uses automated pipelines and is centralized.
In a similar manner, the other features have changed too.
Next, the observability cost explosion. There is massive telemetry
growth due to lots of microservices, which needs huge storage, and it
costs a lot to store everything, every log, metric, and trace, even
when much of it is irrelevant.
Traditional observability models are not designed for cloud-native scale.
The key cost drivers are high-cardinality metrics, verbose logs, and
over-sampling of traces. The result is increased infrastructure spend,
slower query and analysis performance, and unnecessary, unmanageable noise.
Then come smarter data pipelines, because we have to reduce the noise.
In Middleware, we have a feature called the ingestion control pipeline
that helps exclude unnecessary logs from being stored in storage, like
an S3 bucket or some other external storage.
So what are the key strategies?
Some of the key strategies are dynamic sampling, filtering and
deduplication, aggregation, and edge processing; a small sketch follows below.
These are some of the techniques and strategies that help reduce the
noise and also help reduce the cost.
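As a hedged sketch of what such noise reduction can look like in an
OTel Collector pipeline, here is a fragment using the contrib filter
and probabilistic_sampler processors; the severity threshold and the
sampling percentage are illustrative assumptions.

    processors:
      filter/drop-debug:               # filtering: drop low-severity log records
        logs:
          log_record:
            - severity_number < SEVERITY_NUMBER_INFO
      probabilistic_sampler:           # sampling: keep roughly 10% of traces
        sampling_percentage: 10

Dropping debug-level logs and sampling traces before they ever reach
storage is what keeps both the noise and the storage bill down.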
Those are the benefits, which I already covered. So what about AI and
the future of observability?
One of the major things that I love here: consider a scenario where you
have an alert that your CPU utilization is more than 80%, and the fix
is to increase the storage or double the CPU.
So the alert will say you have 80%+ CPU utilization, and if you want to
double the CPU, you just click yes and the CPU is doubled.
That automates a lot of things.
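As a purely hypothetical sketch of what that "click yes" action could
do behind the scenes, here is a Python fragment using the official
kubernetes client; the function name, the deployment name, and the
assumption that the CPU limit is expressed in millicores are all illustrative.

    from kubernetes import client, config

    def double_cpu(deployment: str, namespace: str = "default") -> None:
        """Double the CPU limit of a deployment's first container."""
        config.load_kube_config()
        apps = client.AppsV1Api()
        dep = apps.read_namespaced_deployment(deployment, namespace)
        container = dep.spec.template.spec.containers[0]
        current = container.resources.limits["cpu"]   # assumes a limit like "500m"
        container.resources.limits["cpu"] = f"{int(current.rstrip('m')) * 2}m"
        apps.patch_namespaced_deployment(deployment, namespace, dep)

    # Wired to the alert's "yes" button (hypothetical):
    # double_cpu("checkout-service")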
Next is AI-assisted root cause analysis.
AI can help a lot in understanding issues better and in automating many
of the routine steps. And nowadays, people are also using LLM
observability to improve the performance of the LLMs they are using.
These are the future observability trends.
In 2025, we have covered most of the major themes: AI integration,
OpenTelemetry, AI workload automation, and so on.
So what are the outcomes of AI and the future of observability? Faster
recovery from incidents, greater reliability, and engineers focusing
on innovation instead of manual monitoring.
If you want a detailed blog on Observability 2.0, you can scan this QR
code; there is a detailed blog covering all the aspects of Observability 2.0.
You can connect with me on LinkedIn or any other platform if you have any queries.
Thank you for listening.
Hope you have a great day.