Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone.
Welcome to this session.
My name is Ajani, and I'm excited to share my approach to one of the most pressing challenges in the telecom industry: assuring reliability in an increasingly complex environment. As telecom networks evolve and adapt to cloud native architectures, the traditional ways of monitoring and managing service health are just not enough. This talk walks you through our transformation from reactive monitoring tools to proactive AI-powered observability.
Let's dive in.
Before we dive in, let me share a quick story. A couple of years ago, we had a critical outage, and we were flooded with alerts, dozens coming in from different systems. None of them were clear, so it took hours to trace the root cause. It turned out the signals were there days earlier. That moment made something clear: we don't need more alerts. We need better intelligence.
So let's talk about scale.
Telecom networks can generate up to 25 terabytes of operational data each day, and that number is only growing as 5G and IoT rise. But here's the problem: despite all the data, nearly 70% of outages are preventable. We just don't have the right tooling in place to catch early signals. AI helps bridge that gap. In our experience, applying anomaly detection reduces mean time to repair by over 40%, which is massive in an industry where seconds of downtime cost millions. This sets the foundation for our shift in mindset from data overload to data-driven action.
And here's the business reality. Every minute of telecom downtime can cost thousands, if not millions.
Gartner estimates the average cost at around $5,600 per minute.
When we reduced mean time to repair by 41%, it wasn't just a technical win; it was a massive cost-saving, SLA-boosting outcome.
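As a rough back-of-the-envelope illustration of why that matters, here is the arithmetic. Only the per-minute figure and the 41% reduction come from this talk; the baseline MTTR and incident count below are assumed numbers for illustration, not figures from our deployment.

```python
# Back-of-the-envelope downtime savings. Only the $5,600/minute figure and
# the 41% MTTR reduction come from the talk; the baseline MTTR and the
# incident count are illustrative assumptions.
COST_PER_MINUTE = 5_600        # Gartner's average cost of downtime (USD/min)
BASELINE_MTTR_MIN = 60         # assumed mean time to repair before AI (minutes)
MTTR_REDUCTION = 0.41          # reduction achieved with AI-driven observability
INCIDENTS_PER_YEAR = 100       # assumed number of service-impacting incidents

minutes_saved = BASELINE_MTTR_MIN * MTTR_REDUCTION * INCIDENTS_PER_YEAR
print(f"~${minutes_saved * COST_PER_MINUTE:,.0f} saved per year")  # ~$13,776,000 here
```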
Traditional monitoring has three big flaws. First, it's reactive: we find issues after they've already affected users. Second, it's siloed, which means each part of the system, the network, the infrastructure, and the applications, is monitored separately. Third, it lacks context: alerts are based on narrow metrics without any bigger picture to help understand them. This makes incident response slower and troubleshooting harder.
It's like trying to solve a puzzle with half the pieces missing.
Observability changes the game.
Instead of focusing on predefined metrics and thresholds, it's about understanding
the system behavior as a whole.
We collect telemetry data across domains, metrics, logs, and traces, and use AI to spot anomalies before they become incidents. Dynamic baselining helps us avoid alert storms from normal fluctuations, while self-learning models continuously evolve. This allows us to predict, prevent, and self-heal. It's not just monitoring anymore; it's intelligent service assurance.
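To make "dynamic baselining" concrete, here is a minimal sketch of the idea: instead of a fixed threshold, the baseline and alert band are recomputed from a sliding window of recent history. The window length and the sigma multiplier here are illustrative assumptions, not our production settings.

```python
import numpy as np

def dynamic_baseline_alerts(values: np.ndarray, window: int = 288, k: float = 3.0):
    """Flag points that fall outside a rolling mean +/- k * rolling std band.

    `window` (e.g. 288 five-minute samples = one day) and `k` are assumptions.
    A fixed threshold would fire on every daily traffic peak; the rolling band
    adapts to normal fluctuations and only flags genuine deviations.
    """
    alerts = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        baseline, spread = history.mean(), history.std()
        if abs(values[i] - baseline) > k * spread:
            alerts.append(i)
    return alerts
```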
So how does AI enhance the observability components? AI-enhanced observability has four pillars. First, metrics: high-dimensional time series data where subtle shifts can reveal performance degradations. Second, traces: these provide end-to-end visibility into service requests across microservices, invaluable in modern cloud native systems. Third, logs: the story behind the numbers, offering rich context with error codes, stack traces, and more. And then comes AI analysis. This is the brain. It correlates the data across all of these types, identifies patterns, predicts issues, and then prescribes fixes. Together, these elements give us real-time, contextual, and predictive insights into service health.
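To give a feel for what "correlating across data types" means in practice, here is a hypothetical sketch: anomalies from metrics, logs, and traces that touch the same service inside a short time window are grouped into one candidate incident instead of raising one alert each. The field names and the window are illustrative, not our actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Signal:
    source: str       # "metric", "log", or "trace"
    service: str      # affected service or network function
    timestamp: float  # epoch seconds
    detail: str       # e.g. anomaly description, error code, slow span

def correlate(signals: list[Signal], window_s: float = 300.0) -> list[list[Signal]]:
    """Group signals by service, then merge those arriving within `window_s`
    seconds into one candidate incident instead of one alert per signal."""
    incidents: list[list[Signal]] = []
    by_service: dict[str, list[Signal]] = defaultdict(list)
    for s in sorted(signals, key=lambda s: s.timestamp):
        by_service[s.service].append(s)
    for items in by_service.values():
        current = [items[0]]
        for s in items[1:]:
            if s.timestamp - current[-1].timestamp <= window_s:
                current.append(s)
            else:
                incidents.append(current)
                current = [s]
        incidents.append(current)
    return incidents
```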
Let me get a bit more technical. For metrics, we used LSTM-based models to understand temporal anomalies. For logs, we applied natural language processing to surface meaningful patterns across unstructured data. This hybrid approach, combining structured time series with unstructured logs, lets us correlate signals across systems in a way that traditional tools could not match.
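Here is a minimal sketch of the metrics side: an LSTM trained to predict the next sample of a KPI, with large prediction errors flagged as temporal anomalies. The window size, layer width, and 3-sigma rule are illustrative assumptions; this is a sketch of the technique, not our production model.

```python
# Minimal LSTM-based temporal anomaly detection on a single metric stream.
import numpy as np
from tensorflow import keras

WINDOW = 30  # look-back window of past samples used to predict the next one

def make_windows(series: np.ndarray, window: int = WINDOW):
    """Slice a 1-D metric series into (window, 1) inputs and next-step targets."""
    X = np.stack([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X[..., np.newaxis], y

def build_model(window: int = WINDOW) -> keras.Model:
    model = keras.Sequential([
        keras.layers.Input(shape=(window, 1)),
        keras.layers.LSTM(32),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

def detect_anomalies(series: np.ndarray, k: float = 3.0) -> np.ndarray:
    """Flag points whose prediction error is more than k sigma above the mean error."""
    X, y = make_windows(series)
    model = build_model()
    model.fit(X, y, epochs=5, batch_size=64, verbose=0)  # train on "normal" history
    errors = np.abs(model.predict(X, verbose=0).ravel() - y)
    threshold = errors.mean() + k * errors.std()
    return np.where(errors > threshold)[0] + WINDOW  # indices in the original series
```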
Okay.
Our journey started with building the data foundation: aggregating 18 months of telemetry from various systems. Then we trained deep learning models specific to telecom patterns, which are very different from standard enterprise data. Next, we integrated tracing at every layer, from hardware to virtual functions. And most importantly, we worked closely with the ops team to continuously refine what we built. It was a feedback-driven loop: test, learn, improve. That loop was essential for building the solution and driving adoption. But the biggest hurdle wasn't just the tech; it was the culture.
People were skeptical, obviously. So we brought the operations team on from day one. We didn't just build tools for the operations team; we built with them, for them. They tested the models, flagged false positives, and suggested new inputs. That buy-in changed everything.
So let's talk outcomes. First, we saw a major boost in detection accuracy; we were catching issues we never saw before. Mean time to repair dropped, which meant fewer escalations and faster resolutions. We also saw fewer service-impacting incidents overall. And because our alerts were more accurate, the ops team experienced less alert fatigue, so they could focus on real problems instead of chasing ghosts. From a business perspective, this translated into better SLAs, happier customers, and real operational savings.
Of course, this did not come without challenges. First, there was the sheer data volume: terabytes of data per day. We had to implement edge filtering and intelligent sampling without losing the signals that matter. Second, cross-domain correlation: we developed a unified entity model that tied together infrastructure, virtual functions, and services. And on top of that, we built explainable AI using visual tools that let operations see why an alert was triggered, improving trust and cutting false positives by 27%. Overall, these made our solution usable and scalable in a real-world environment.
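The unified entity model can be pictured as a small topology graph: each service knows which virtual functions it runs on, and each virtual function knows which infrastructure it runs on, so an anomaly anywhere can be walked up to the services it may impact. The entity names and fields below are hypothetical; this is only a sketch of the idea, assuming unique names and an acyclic topology.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str
    kind: str                                             # "infrastructure", "virtual_function", or "service"
    runs_on: list["Entity"] = field(default_factory=list)  # dependencies one layer down

def impacted_services(entity: Entity, all_entities: list[Entity]) -> set[str]:
    """Walk the dependency graph upward: which services depend, directly or
    indirectly, on this entity? This is what lets a host-level anomaly be
    correlated with the customer-facing services it can degrade.
    Assumes unique entity names and an acyclic topology."""
    dependents = {e.name for e in all_entities if entity in e.runs_on}
    services = {e.name for e in all_entities if e.name in dependents and e.kind == "service"}
    for e in all_entities:
        if e.name in dependents:
            services |= impacted_services(e, all_entities)
    return services

# Example (made-up names): an anomaly on "compute-node-7" is walked up to the
# "5g-data-session" service it ultimately supports.
host = Entity("compute-node-7", "infrastructure")
vnf = Entity("upf-instance-2", "virtual_function", runs_on=[host])
svc = Entity("5g-data-session", "service", runs_on=[vnf])
assert impacted_services(host, [host, vnf, svc]) == {"5g-data-session"}
```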
So here's what the architecture looked like. First, a data layer: distributed agents collected high-fidelity telemetry from across the network. Then a processing and analytics layer: we used stream processing with ML pipelines to analyze the data in real time. And finally a knowledge and automation layer, which drove automation, auto-remediation, and model refinement. This structure ensured we could scale and enable closed-loop operations with no human in the loop.
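A minimal sketch of how the three layers fit together: the data layer feeds a stream of events, the processing layer scores each event with a model, and the knowledge and automation layer decides whether to trigger remediation. Every name here is hypothetical; the real system used distributed agents and ML pipelines rather than in-process callables.

```python
from typing import Callable, Iterable, Iterator

def processing_layer(events: Iterable[dict],
                     score: Callable[[dict], float],
                     threshold: float = 0.9) -> Iterator[dict]:
    """Stream-process telemetry events: attach an anomaly score and pass
    through only the events worth acting on."""
    for event in events:
        event["anomaly_score"] = score(event)
        if event["anomaly_score"] >= threshold:
            yield event

def automation_layer(anomalies: Iterable[dict],
                     remediate: Callable[[dict], None]) -> None:
    """Knowledge and automation layer: map each confirmed anomaly to a
    playbook and trigger auto-remediation, with no human in the loop."""
    for anomaly in anomalies:
        remediate(anomaly)

# Wiring the layers together; the data layer stands in for any iterator of
# events produced by the distributed agents:
# automation_layer(processing_layer(agent_event_stream(), model_score), run_playbook)
```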
Achieving five nines reliability is the holy grail in telecom, and our observability loop helped us get there. We went from anomaly detection to root cause analysis, to auto-remediation, and back to learning. Each incident made the system smarter. This closed feedback loop is what allowed us to recover quickly and improve the system continuously.
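The closed loop can be summarized in a few lines of control flow: detect, diagnose, remediate, then feed the outcome back into the models so each incident makes the system smarter. The objects and method names below are placeholders for the corresponding stages, not real APIs.

```python
def closed_loop(telemetry_stream, detector, rca, remediator):
    """Detect -> root cause -> auto-remediate -> learn, continuously.
    Each argument is a placeholder for the corresponding subsystem."""
    for window in telemetry_stream:
        for anomaly in detector.detect(window):
            cause = rca.analyze(anomaly)              # root cause analysis
            outcome = remediator.apply(cause)         # auto-remediation playbook
            detector.update(anomaly, cause, outcome)  # feedback: refine baselines / retrain
```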
For anyone who wants to start this journey, here is the roadmap to follow. Assess your current stack and identify where you lack observability, then pilot one domain to validate your architecture and models. From there, scale the solution gradually, making sure the data is unified across domains, and then automate remediation using AI to truly close the loop. A phased approach lets you build momentum, demonstrate value, and reduce deployment risk.
And that brings us to the end.
So we've seen how observability, when combined with AI, turns raw
data into actionable intelligence.
This isn't just about cool tech. It's about delivering more reliable, scalable services to the people who count on us. Thanks so much for watching. If you'd like to connect or have any questions, please feel free to reach out or chat with me. I would love that, and thank you.