Building Production-Ready ML-Based Network Routing: Lessons from Real-World Platform Engineering Deployments

Abstract

Production ML routing systems are crushing traditional protocols—40% faster convergence, fewer outages, real ROI. Platform engineers reveal actual architectures, deployment gotchas, and code you can use. No marketing fluff, just engineering wins.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Hello everyone. This is Tavva. I'm a senior network engineer and senior network delivery lead at Kairos Technologies, based out of Dallas, Texas. Today I'm going to talk about building production-ready, ML-based network routing. We'll discuss traditional routing and ML-based routing and compare the two. I'll also talk about how to train models, real case studies, and the ROI of implementing ML-based routing.

The evolving network routing landscape. Routing protocols like BGP and OSPF have been around for decades. They are reliable, but they were designed for a different time. They use fixed rules, they take too long to recover when something breaks, and they still rely a lot on manual tuning from engineers. In today's world of cloud, microservices, and real-time applications, that's not good enough. We need something smarter, faster, and more adaptive to current conditions.

Coming to the core concept of ML-based routing. ML-based routing brings flexibility, which is the key. Instead of relying on fixed rules, ML systems learn from network data. They watch traffic, spot patterns, and make predictions. That means they can route traffic more intelligently, balancing speed, cost, and reliability all at once. They can even detect unusual patterns that might mean a failure or a security issue. And the best part: they get better as they learn and adapt to the current environment.

Coming to the technical architecture of ML-based routing, let's see how it works. First, data collection: we collect telemetry, all the signals from the network, like latency, packet loss, and throughput. That data goes into a pipeline that cleans it up and makes it usable. Then we have the ML training system, where models are built, tested, and stored. Finally, we have the inference system, which applies those models in real time to make routing decisions.
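As a rough illustration of the collect, clean, and infer stages just described, here is a minimal Python sketch. It is not the speaker's implementation: the `LinkSample` schema, the `clean` rules, and the weighted `score` function (a stand-in for a trained model) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LinkSample:
    """One telemetry reading for a candidate path (hypothetical schema)."""
    path: str
    latency_ms: Optional[float]
    loss_pct: Optional[float]
    throughput_mbps: Optional[float]


def clean(samples: List[LinkSample]) -> List[LinkSample]:
    """Pipeline stage: drop readings with missing or impossible values."""
    good = []
    for s in samples:
        if s.latency_ms is None or s.loss_pct is None or s.throughput_mbps is None:
            continue  # incomplete reading
        if s.latency_ms < 0 or not (0 <= s.loss_pct <= 100) or s.throughput_mbps <= 0:
            continue  # physically impossible reading
        good.append(s)
    return good


def score(s: LinkSample, weights=(1.0, 50.0, 0.01)) -> float:
    """Stand-in for a trained model: a lower score means a better path.

    The weights are illustrative assumptions, not learned values.
    """
    w_latency, w_loss, w_throughput = weights
    return (w_latency * s.latency_ms
            + w_loss * s.loss_pct
            - w_throughput * s.throughput_mbps)


def choose_path(samples: List[LinkSample]) -> Optional[str]:
    """Inference stage: pick the best-scoring path, or None if no usable data."""
    usable = clean(samples)
    if not usable:
        return None  # caller should keep the traditional route
    return min(usable, key=score).path


samples = [
    LinkSample("path-a", 20.0, 0.1, 900.0),
    LinkSample("path-b", 12.0, 2.0, 950.0),   # faster but lossy
    LinkSample("path-c", None, 0.0, 1000.0),  # broken telemetry
]
print(choose_path(samples))
```

Returning `None` when no clean telemetry is available leaves the decision to whatever routing is already in place, rather than guessing from bad data.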
And of course, we always keep fallbacks and safety switches so that the network stays stable if something goes wrong.

Let's compare ML-based routing with traditional routing. Traditional routing is rule-based. It reacts only after a problem happens. ML routing is smarter. It looks at hundreds of factors, predicts issues before they happen, and can reroute traffic proactively. So instead of waiting for something to break, the network can fix itself in advance.

Coming to data collection and model training approaches. Data is the fuel here. We need to collect everything: link usage, latency, packet loss, application patterns, and even things like time of day and scheduled maintenance. When it comes to training, the options are supervised learning, which uses past best practices to train the models; reinforcement learning, where the model tries things and learns from feedback; or a hybrid approach that mixes ML with traditional routing. The better and more diverse the data, the smarter the model.

Coming to production deployment patterns. You don't just flip a switch and hand your network over to ML overnight. It needs to be rolled out in phases. The first phase is shadow mode: ML makes decisions, but only on paper. The second is advisory mode: ML suggests routes, and engineers need to approve them. The third phase is selective automation: ML takes over certain traffic, like non-critical data and certain low-criticality regions. The final phase is full automation: ML runs the show, with the traditional protocols as a backup. This step-by-step approach builds trust and avoids surprises.

Coming to monitoring and observability challenges. Monitoring ML routing isn't just about checking if links are up.
We also need to see how the model is doing: how accurate it is, whether it's drifting, and how confident it is in each decision. Dashboards need to show side-by-side comparisons of ML and traditional routes and explain why the model picked a certain path. This transparency builds confidence and helps operators trust their system.

Coming to security considerations for ML routing. Security is key. If someone poisons your data, they can mess with your routes. If they figure out your model, they can exploit it. So we need to encrypt the telemetry, validate inputs, use anomaly detection, and make sure there's always a human override. In short, the smarter the system, the smarter the security has to be.

Operational considerations. Running ML routing day to day brings new challenges. What happens if the model suddenly starts making weird choices? What if confidence drops? What if telemetry breaks? We need runbooks for all these things: how to roll back to a safer version, how to gradually shift traffic, how to retrain after an incident. Think of it like adding a new playbook to your NOC, but for ML-specific failures.

Real-world case studies. Let's look at real examples. In financial services, ML routing cut latency spikes by almost 70% and saved millions in circuit costs. In e-commerce, ML reduced page load time by 30% during traffic peaks and cut manual interventions by 95%. But there were challenges: compliance concerns, false alarms, and model performance dropping during unusual events. The payoff is real, but you need patience and good planning to succeed.

ML implementation: ROI assessment and risk management. Leaders always ask two things: what do we gain, and what do we risk? On the gain side, we see faster recovery, lower latency, and less manual work, which means real cost savings. On the other side, there are skill gaps, data issues, and model drift.
The answer is balance: phased rollouts, strong fallbacks, and training your team to handle both ML and traditional routing together.

Implementation roadmap and best practices. Phase one: get the telemetry and baseline in place. Phase two: shadow mode testing. Phase three: limited rollout for non-critical traffic. Phase four: full deployment with automation and retraining pipelines.

Coming to team structure and skills. This only works if you have the right people. You need network engineers, data scientists, platform engineers, and ops teams all working together, each bringing different skill sets, and you need cross-training so they understand each other's world.

To close, here are the key points. ML routing is ready enough for real use cases. Hybrid approaches, like ML plus traditional routing, work best in today's scenario. Operations, monitoring, and people matter as much as the tech. The future is very clear: networks will become self-healing and adaptive. The teams that start now will be the ones leading the future. Thank you for this opportunity.
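The phased rollout from the talk (shadow mode, advisory mode, selective automation, full automation, always with traditional routing as the fallback) could be expressed as a small gating function. The following Python sketch is one possible reading, not the speaker's code: the `Phase` enum, the `decide` function, its parameters, and the 0.8 confidence threshold are hypothetical names and values chosen for illustration.

```python
from enum import Enum
from typing import Optional, Tuple


class Phase(Enum):
    SHADOW = 1      # ML decides only on paper
    ADVISORY = 2    # ML suggests, an engineer approves
    SELECTIVE = 3   # ML handles non-critical traffic only
    FULL = 4        # ML handles everything, traditional stays as backup


def decide(phase: Phase,
           ml_route: Optional[str],
           ml_confidence: float,
           traditional_route: str,
           *,
           critical: bool = False,
           approved: bool = False,
           min_confidence: float = 0.8) -> Tuple[str, str]:
    """Return (route_to_install, source) for one routing decision.

    min_confidence is an assumed threshold; a real deployment would tune it.
    """
    fallback = (traditional_route, "traditional")

    # Safety switch: no ML answer or low confidence always falls back.
    if ml_route is None or ml_confidence < min_confidence:
        return fallback
    if phase is Phase.SHADOW:
        return fallback  # the ML choice would only be logged for comparison
    if phase is Phase.ADVISORY:
        return (ml_route, "ml") if approved else fallback
    if phase is Phase.SELECTIVE:
        return fallback if critical else (ml_route, "ml")
    return (ml_route, "ml")  # Phase.FULL


print(decide(Phase.SHADOW, "via-r2", 0.95, "via-r1"))
print(decide(Phase.SELECTIVE, "via-r2", 0.95, "via-r1", critical=False))
print(decide(Phase.FULL, "via-r2", 0.40, "via-r1"))
```

Checking confidence before the phase means the fallback applies in every phase, including full automation, which matches the idea of keeping the traditional protocols as a backup.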

Conf42 Platform Engineering 2025 - Online

September 04 2025 - premiere 5PM GMT

Rahul Tavva

Network Delivery Manager @ Kairos Technologies Inc
