Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone.
This is Tawa.
I'm a senior network engineer, senior network delivery lead at Cairos
Technologies based out of Dallas, Texas.
Today I'm going to talk about building production-ready, ML-based network routing.
We'll discuss traditional routing and ML-based routing, and compare the two.
I'll also talk about how to train models, real case studies, and the ROI of implementing ML-based routing.
The evolving network routing landscape.
Routing protocols like BGP and OSPF have been around for decades.
They are reliable, but they were designed for a different time.
They use fixed rules.
They take too long to recover when something breaks, and they still rely a lot on manual tuning from engineers.
In today's world of cloud, microservices, and real-time applications, that's not good enough.
We need something smarter, faster, and more adaptive to the current scenarios.
Coming to the core concept of ML-based routing.
ML-based routing brings flexibility, which is the key.
Instead of relying on fixed rules, ML systems learn from network data.
They watch traffic, spot patterns, and make predictions.
That means they can route traffic more intelligently, balancing speed, cost, and reliability all at once.
They can even detect unusual patterns that might mean a failure or a security issue.
And the best part: they get better as they learn and adapt to the current environment.
Coming to the technical architecture of ML-based routing, let's see how it works.
First, data collection.
We collect telemetry, all the signals from the network, like latency, packet loss, and throughput.
That data goes into a pipeline that cleans it up and makes it usable.
Then we have the ML training system, where models are built, tested, and stored.
Finally, we have the inference system, which applies those models in real time to make routing decisions.
And of course, we always keep fallbacks and safety switches so that the network
stays stable if something goes wrong.
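To make the flow just described a bit more concrete, here is a minimal sketch of the cleaning step and the inference step with a safety switch. Everything named here (Telemetry, clean, choose_path, toy_model, the 0.8 confidence cutoff, the path names) is an assumption for illustration, not something from the talk's actual system, and toy_model is only a stand-in for a trained model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Telemetry:
    latency_ms: float
    packet_loss: float       # fraction between 0 and 1
    throughput_mbps: float


def clean(raw: Dict[str, dict]) -> Dict[str, Telemetry]:
    """Pipeline step: drop paths whose samples are missing or out of range."""
    cleaned = {}
    for path, sample in raw.items():
        try:
            t = Telemetry(
                float(sample["latency_ms"]),
                float(sample["packet_loss"]),
                float(sample["throughput_mbps"]),
            )
        except (KeyError, TypeError, ValueError):
            continue  # incomplete or non-numeric sample
        if t.latency_ms >= 0 and 0.0 <= t.packet_loss <= 1.0 and t.throughput_mbps >= 0:
            cleaned[path] = t
    return cleaned


Model = Callable[[Dict[str, Telemetry]], Tuple[str, float]]


def choose_path(raw: Dict[str, dict], model: Model, fallback_path: str,
                min_confidence: float = 0.8) -> str:
    """Inference step: use the model's pick only when it is usable and confident."""
    telemetry = clean(raw)
    if not telemetry:
        return fallback_path          # no usable telemetry
    try:
        path, confidence = model(telemetry)
    except Exception:
        return fallback_path          # model failure: safety switch
    if path not in telemetry or confidence < min_confidence:
        return fallback_path
    return path


def toy_model(telemetry: Dict[str, Telemetry]) -> Tuple[str, float]:
    """Hypothetical stand-in for a trained model: lowest latency, fixed confidence."""
    best = min(telemetry, key=lambda p: telemetry[p].latency_ms)
    return best, 0.9
```

The point of the design is that the fallback path (for example, whatever the existing routing protocol would have chosen) is returned on every failure branch, so the ML layer can only ever override a route when its input and output both look sane.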
Let's compare ML-based routing versus traditional routing.
Traditional routing is rule-based.
It reacts only after a problem happens.
ML routing is smarter.
It looks at hundreds of factors, predicts issues before they even happen, and can reroute traffic proactively.
So instead of waiting for something to break, the network
can fix itself in advance.
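As a rough illustration of the reactive versus proactive difference, the sketch below compares a rule that fires only once latency has already crossed a threshold with one that fires when a forecast crosses it. The forecast here is a plain least-squares trend line, used only as a simple stand-in for a trained model; the function names, the threshold, and the sample values are all assumptions.

```python
from typing import Sequence


def forecast_next(samples: Sequence[float], steps_ahead: int = 3) -> float:
    """Fit a least-squares line through equally spaced samples and extrapolate."""
    n = len(samples)
    if n < 2:
        return float(samples[-1]) if samples else 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    var = sum((i - mean_x) ** 2 for i in range(n))
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + steps_ahead)


def reactive_should_reroute(samples: Sequence[float], threshold_ms: float) -> bool:
    """Rule-based: act only after the latest sample has breached the threshold."""
    return samples[-1] > threshold_ms


def proactive_should_reroute(samples: Sequence[float], threshold_ms: float,
                             steps_ahead: int = 3) -> bool:
    """Predictive: act when the forecast breaches the threshold."""
    return forecast_next(samples, steps_ahead) > threshold_ms
```

With latency samples of 20, 30, 40, 50, 60 ms and an 80 ms threshold, the reactive rule stays quiet because 60 is still under 80, while the proactive rule extrapolates the trend to 90 ms three steps ahead and asks for a reroute.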
Coming to data collection and model training approaches.
Data is the fuel here.
We need to collect everything: link usage, latency, packet loss, application patterns, and even things like time of day and scheduled maintenance.
When it comes to training, there are options like supervised learning, which uses past best practices to train the models, and reinforcement learning, where the model tries things and learns from feedback.
Or there is the hybrid approach, mixing ML with traditional routing.
The better and more diverse the data, the smarter the model.
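For the reinforcement learning option, where the model tries things and learns from feedback, a very small sketch is an epsilon-greedy bandit over candidate paths. This is one simple technique chosen for illustration, not necessarily what any production system uses; PathBandit, train, the path names, and the simulated latencies are all hypothetical.

```python
import random
from typing import Dict, Iterable


class PathBandit:
    """Epsilon-greedy bandit: mostly pick the best-known path, sometimes explore."""

    def __init__(self, paths: Iterable[str], epsilon: float = 0.1, seed: int = 0):
        self.paths = list(paths)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {p: 0 for p in self.paths}
        # Rewards below are negative latencies, so the initial 0.0 is optimistic
        # and every path gets tried at least once by the greedy branch.
        self.avg_reward = {p: 0.0 for p in self.paths}

    def choose(self) -> str:
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.paths)                        # explore
        return max(self.paths, key=lambda p: self.avg_reward[p])      # exploit

    def learn(self, path: str, reward: float) -> None:
        """Update the running average reward for the path that was used."""
        self.counts[path] += 1
        self.avg_reward[path] += (reward - self.avg_reward[path]) / self.counts[path]


def train(bandit: PathBandit, latency_ms: Dict[str, float], rounds: int = 200) -> str:
    """Feedback loop against simulated latencies; reward is negative latency."""
    for _ in range(rounds):
        path = bandit.choose()
        bandit.learn(path, -latency_ms[path])
    return max(bandit.avg_reward, key=bandit.avg_reward.get)
```

In a hybrid setup, the same loop could be restricted to the set of paths the traditional protocol already considers valid, so the model only chooses among routes that are known to be safe.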
Coming to the production deployment patterns.
You don't just flip a switch and hand over your network to ML overnight.
It actually needs to be rolled out in a phased manner.
The first phase is shadow mode: ML makes decisions, but only on paper.
The second phase is advisory mode: ML suggests routes, and the engineers need to approve them.
The third phase is selective automation: ML takes over certain traffic, such as non-critical data and certain regions of low criticality.
The final phase is full automation: ML runs the show, with the traditional protocols as a backup.
This step-by-step approach builds trust and avoids surprises.
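The four phases can be sketched as a small gate that decides whose route actually gets installed. This is a minimal sketch under assumptions: the Phase enum, the effective_route function, the "non_critical" traffic class label, and the approval flag are illustrative names, not part of any real product.

```python
from enum import Enum
from typing import FrozenSet, Optional


class Phase(Enum):
    SHADOW = 1      # ML decides on paper only
    ADVISORY = 2    # ML suggests, an engineer approves
    SELECTIVE = 3   # ML controls selected traffic classes
    FULL = 4        # ML controls everything, protocols are the backup


def effective_route(phase: Phase, ml_route: Optional[str], protocol_route: str, *,
                    approved: bool = False, traffic_class: str = "critical",
                    automated_classes: FrozenSet[str] = frozenset({"non_critical"})) -> str:
    """Return the route to install, given the rollout phase."""
    if ml_route is None:
        return protocol_route                       # nothing from ML: use the protocol
    if phase is Phase.SHADOW:
        return protocol_route                       # ML output is only logged
    if phase is Phase.ADVISORY:
        return ml_route if approved else protocol_route
    if phase is Phase.SELECTIVE:
        return ml_route if traffic_class in automated_classes else protocol_route
    return ml_route                                 # Phase.FULL
```

Keeping the protocol route as an explicit argument in every phase means that moving back one phase is just a configuration change, which is what makes the step-by-step rollout reversible.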
Coming to monitoring and observability challenges.
Monitoring ML routing isn't just about checking if links are up.
We also need to see how the model is doing, how accurate it is,
whether it's drifting, and how confident it is in each decision.
Dashboards need to show side-by-side comparisons of ML and traditional
routes and explain why the model picked a certain path.
This transparency builds confidence and helps operators trust their system.
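One way to track the three model signals mentioned here (accuracy, drift, and confidence) is a rolling window over recent predictions. The sketch below is an assumption-heavy illustration: ModelHealth, the use of mean absolute error on predicted latency, the 2x drift factor, and the 0.7 confidence floor are all made-up choices, not values from the talk.

```python
from collections import deque
from statistics import mean


class ModelHealth:
    """Rolling view of prediction error and confidence for a dashboard or alert."""

    def __init__(self, baseline_mae_ms: float, window: int = 50,
                 drift_factor: float = 2.0, min_confidence: float = 0.7):
        self.baseline_mae_ms = baseline_mae_ms    # error measured at validation time
        self.drift_factor = drift_factor
        self.min_confidence = min_confidence
        self.errors = deque(maxlen=window)
        self.confidences = deque(maxlen=window)

    def record(self, predicted_ms: float, observed_ms: float, confidence: float) -> None:
        """Store one prediction outcome; old entries fall out of the window."""
        self.errors.append(abs(predicted_ms - observed_ms))
        self.confidences.append(confidence)

    def report(self) -> dict:
        if not self.errors:
            return {"mae_ms": None, "avg_confidence": None,
                    "drifting": False, "low_confidence": False}
        mae = mean(self.errors)
        conf = mean(self.confidences)
        return {
            "mae_ms": mae,
            "avg_confidence": conf,
            # Drift here means recent error is well above the validation baseline.
            "drifting": mae > self.drift_factor * self.baseline_mae_ms,
            "low_confidence": conf < self.min_confidence,
        }
```

The report dictionary is the kind of thing a dashboard could show next to the traditional link metrics, and the two boolean flags are natural triggers for an alert.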
Coming to security considerations for ML routing, security is key.
If someone poisons your data, they can mess with your routes.
If they figure out your model, they can exploit it.
So we need to protect it: encrypt the telemetry, validate inputs, use anomaly detection, and make
sure there's always a human override.
In short, the smarter the system, the smarter the security has to be.
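As a small illustration of the "validate inputs" part, the sketch below rejects telemetry that fails an integrity check or falls outside plausible ranges before it can reach the model. Note the assumptions: it uses an HMAC signature, which gives integrity and authenticity but is not encryption (encryption in transit would be handled separately, for example by TLS), and the field names, the shared key, and the range limits are invented for the example.

```python
import hashlib
import hmac
import json


def sign(payload: dict, key: bytes) -> str:
    """HMAC-SHA256 over a canonical JSON encoding of the telemetry sample."""
    body = json.dumps(payload, sort_keys=True).encode()
    return hmac.new(key, body, hashlib.sha256).hexdigest()


def accept_sample(payload: dict, signature: str, key: bytes) -> bool:
    """Accept a sample only if it is authentic and its values are plausible."""
    # Constant-time comparison avoids leaking how much of the signature matched.
    if not hmac.compare_digest(sign(payload, key), signature):
        return False
    try:
        latency = float(payload["latency_ms"])
        loss = float(payload["packet_loss"])
    except (KeyError, TypeError, ValueError):
        return False
    # Range limits are illustrative assumptions, not recommended values.
    return 0.0 <= latency <= 10_000.0 and 0.0 <= loss <= 1.0
```

A tampered sample fails the signature check, and a correctly signed sample with an impossible value (say, a packet loss of 3.0) is still dropped, which covers both an attacker on the wire and a misbehaving collector.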
Operational considerations.
Running ML routing day to day brings new challenges.
What happens if the model suddenly starts making weird choices?
What if confidence drops?
What if telemetry breaks?
We need runbooks for all these things, like how to roll back to a safer
version, how to gradually shift traffic, how to retrain after an incident.
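Two of those runbook items, gradually shifting traffic and rolling back, can be sketched together as one loop. This is only a sketch under assumptions: shift_traffic, the percentage steps, and the healthy callback (which in practice might check model confidence or error rates) are hypothetical names and values.

```python
from typing import Callable, List, Sequence, Tuple


def shift_traffic(steps: Sequence[int],
                  healthy: Callable[[int], bool]) -> Tuple[int, List[int]]:
    """Raise the ML-routed traffic share step by step; roll back to 0 on trouble.

    Returns the final share and the history of shares that were applied.
    """
    history: List[int] = []
    for share in steps:
        history.append(share)          # apply this share of traffic to ML routing
        if not healthy(share):
            history.append(0)          # roll back: all traffic to the safe path
            return 0, history
    return (steps[-1] if steps else 0), history
```

For example, with steps of 5, 25, 50, and 100 percent and a health check that starts failing at 50 percent, the history is 5, 25, 50, 0, and the final share is 0.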
Think of it like adding a new playbook to your NOC, but for ML-specific failures.
Real-world case studies.
Let's look at the real examples.
In financial services, ML routing cut latency spikes by almost 70% and saved millions in circuit costs.
In e-commerce, ML reduced page load time by 30% during traffic peaks and cut manual interventions by 95%.
But there were challenges: compliance concerns, false alarms, and model performance dropping during unusual events.
The payoff is real, but you need patience and good planning for a successful ML implementation.
ROI assessment and risk management.
Leaders always ask two things.
What do we gain?
What do we risk?
On the gain side, we see faster recovery, lower latency, and less manual work, which means real cost savings.
On the risk side, there are skill gaps, data issues, and model drift.
The answer is balance: phased rollouts, strong fallbacks, and training your team to handle both ML and traditional routing together.
Implementation roadmap and best practices.
Phase one, get the telemetry and baseline in place.
Phase two is shadow-mode testing, and phase three is a limited rollout for non-critical traffic.
Phase four is the full deployment with automation and retraining pipelines.
Coming to the team structure and skills.
This only works if you have the right people.
You need network engineers, data scientists, platform engineers, and ops teams all working together, each bringing in different skill sets.
And you need cross-training so they understand each other's world.
To close, here are the key points.
ML routing is mature enough for real use cases.
Hybrid approaches, like ML plus traditional routing, work best in today's scenario.
Operations, monitoring, and people matter as much as the tech.
The future is very clear.
Networks will become self-healing and adaptive.
The teams that start now will be the ones leading the future.
Thank you for this opportunity.