Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everybody.
Good morning.
This talk explores how network infrastructure defines platform potential.
Traditional web apps scale differently than AI platforms, which require ultra
low latency and massive scalability, and the network can either empower or
cripple modern systems. I'll take a deep dive into how the network is crucial
in the modern era.
We all say the strength of a chain is its weakest link, and that applies to
the network as well. The network is either your platform's superpower or its
weakest link.
All the devices along the way do packet processing: the packet travels from
the user all the way to the server, and the return packet travels from the
server back to the user.
How fast the internet feels to you as a user is what decides any product in
the market, be it a simple application, an e-commerce platform, or an app.
Underneath, there are so many networking concepts, so much networking involved.
So today we're looking at some case studies of how networking plays a key role.
We'll cover architectural evolution, networking failure modes, principles of
fault tolerance, network observability strategies, and real world case studies.
Around those concepts: we have moved from traditional three-tier
architectures, with core, distribution, and access layers, to spine-leaf
models, where every leaf switch connects to every spine, optimized
particularly for east-west traffic.
Because of the tremendous AI activity in data centers, the traffic is not
only north-south, meaning user to server; there is also a lot of computation,
a lot of calculation within the data center before the response is sent to
the user.
This has become even more prevalent in AI data centers, where the number of
nodes involved has grown tremendously: from the two to four of simple
web-scale traffic to 128 or even 256 nodes in the case of AI workloads.
Software defined networking, of course, because of the large number of nodes
involved in data centers: SDN enables programmability. Basically, you are
able to program a lot of devices at once, so the duties revolving around
DevOps become much easier, more predictable, and less error-prone.
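A minimal sketch of what that programmability looks like, assuming a
hypothetical REST-style management endpoint on each switch; the addresses,
token, and /api/v1/config path are placeholders, not a real vendor API:

```python
# Push one intended configuration to many switches in parallel.
# The device addresses and the /api/v1/config endpoint are hypothetical.
from concurrent.futures import ThreadPoolExecutor

import requests

SWITCHES = [f"10.0.0.{i}" for i in range(1, 129)]  # e.g. 128 leaf switches
DESIRED_CONFIG = {"mtu": 9216, "ecmp": True, "buffer_profile": "ai-fabric"}

def push_config(switch_ip: str) -> tuple[str, bool]:
    """Push the desired config to one switch and report success."""
    resp = requests.post(
        f"https://{switch_ip}/api/v1/config",   # hypothetical management endpoint
        json=DESIRED_CONFIG,
        headers={"Authorization": "Bearer <token>"},
        timeout=5,
    )
    return switch_ip, resp.ok

# Program all devices at once instead of box-by-box CLI changes.
with ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(push_config, SWITCHES))

failed = [ip for ip, ok in results if not ok]
print(f"configured {len(results) - len(failed)}/{len(results)} switches; failed: {failed}")
```

The point is the shape of the workflow: one declared configuration, pushed to
the whole fabric at once, with failures reported back centrally.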
Of course, AI scale requires specialized ASICs and ultra low latency fabrics.
There is so much development happening on the hardware side as well. In
particular, ASICs are built for ultra high speeds and ultra low latency, and
whole data center fabrics are built to handle high bandwidth and low latency.
Each stage reflects growing demands for bandwidth, low latency, and
flexibility.
Traditional versus modern network requirements: web applications typically
need two to four backend connections, tolerate 50 to 200 milliseconds of
latency, and follow predictable traffic patterns. AI workloads connect to 128
to 256-plus nodes, need terabit-per-second-scale bandwidth, and require less
than 10 milliseconds of latency. They're bursty, unpredictable, and dominated
by east-west traffic.
If you take a case study, a simple cascading network failure during peak
load: let's say a financial AI platform experiences API timeouts. The loss is
huge, and confidence takes a hit. There are so many problems that can be
caused if a financial AI platform experiences some timeouts.
In particular, this is often identified as TCP incast congestion. Basically,
TCP incast congestion is where multiple servers respond to an aggregator node
at once. This cascades into a lot of packet drops, retransmissions, and
service degradation.
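To make the incast picture concrete, here is a toy back-of-the-envelope model
(not a real TCP simulation; all sizes and rates are illustrative assumptions):
many servers answer the aggregator in the same instant, the port buffer
overflows, and the overflow becomes drops and retransmissions.

```python
# Toy illustration of incast: bytes that cannot fit in the aggregator's
# port buffer within the first millisecond are dropped. Numbers are assumed.
def incast_drops(num_servers: int,
                 response_bytes: int = 64_000,        # each server's reply
                 port_buffer_bytes: int = 1_000_000,  # shared buffer on the port
                 drain_bytes_per_ms: int = 1_250_000  # ~10 Gb/s drained in 1 ms
                 ) -> int:
    arriving = num_servers * response_bytes
    capacity = port_buffer_bytes + drain_bytes_per_ms
    return max(0, arriving - capacity)

for n in (4, 32, 128, 256):
    print(f"{n:>3} servers -> {incast_drops(n):>10,} bytes dropped in the first ms")
```

With a small fan-in the buffer absorbs the burst; at AI-style fan-in of 128 or
256 responders, most of the burst has nowhere to go.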
In data centers nowadays, we have remote direct memory access over Converged
Ethernet, with flow control to prevent collapse. This is called RoCE: RDMA
over Converged Ethernet. So this is one case study where we can apply RDMA so
that the congestion can be avoided.
Network bottlenecks, a few key ones. Bufferbloat: buffer sizes are increased
so much that latency goes up. Packet processing limits: traditional switches
can't handle billions of packets per second, while the new ASICs can handle a
lot of traffic, so that also has to be factored in. Topology constraints,
like poor oversubscription ratios, break AI workloads. And bandwidth
saturation: ML training can overwhelm even high-capacity links in seconds.
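On the bufferbloat point above, a quick back-of-the-envelope sketch, with
assumed buffer sizes and link speed, of how deep buffers turn into queuing
latency:

```python
# Worst-case queuing delay: the time to drain a completely full buffer
# at a given link speed. All numbers are illustrative assumptions.
def max_queuing_delay_ms(buffer_bytes: int, link_gbps: float) -> float:
    link_bytes_per_sec = link_gbps * 1e9 / 8
    return buffer_bytes / link_bytes_per_sec * 1000

for buf_mb in (1, 8, 64):
    print(f"{buf_mb} MB buffer on a 10 Gb/s link -> "
          f"up to {max_queuing_delay_ms(buf_mb * 1_000_000, 10):.1f} ms of added latency")
```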
So how should a fault tolerant network design look? The principles: redundant
fabrics, meaning you have an alternate path in case of failure; graceful
degradation; and isolated failure domains. You isolate failure domains so
that traffic is not impacted, or only a small portion of traffic is impacted.
Some of the strategies: ECMP, equal cost multipath, for path diversity, as I
said, so there is an alternate path for the traffic to flow; BFD,
bidirectional forwarding detection, for sub-second failover, meaning at the
hardware level, if there is a failure of a link between two devices, it is
quickly identified and propagated to the upper layers, and an alternate path
is chosen for the traffic; and segment routing for traffic engineering.
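As a rough sketch of how ECMP spreads traffic, assuming a simple software
hash over the flow 5-tuple (real switches do this in hardware, and the
next-hop names here are illustrative):

```python
# Pick one of the equal-cost next hops by hashing the flow 5-tuple, so every
# packet of a given flow takes the same path while flows spread across spines.
import hashlib

NEXT_HOPS = ["spine-1", "spine-2", "spine-3", "spine-4"]  # equal-cost paths

def ecmp_next_hop(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
                  proto: str = "tcp") -> str:
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return NEXT_HOPS[digest % len(NEXT_HOPS)]

print(ecmp_next_hop("10.1.1.5", "10.2.2.9", 51512, 443))
print(ecmp_next_hop("10.1.1.6", "10.2.2.9", 51512, 443))  # different flow, possibly different spine
```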
Modern spine-leaf architecture, a Clos topology: the advantages are
predictable latency, linear scalability, and optimization for east-west
traffic.
Design considerations. Keep oversubscription below 3:1. Oversubscription in
data center networking refers to the ratio of the bandwidth provisioned to
the downlink servers to the bandwidth available on the uplinks. Which means
for every three downlink paths, at least one uplink path should be available.
That's what I mean by oversubscription.
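A tiny sketch of that oversubscription arithmetic, with assumed port counts
and speeds:

```python
# Oversubscription ratio for a leaf switch: total downlink bandwidth to
# servers divided by total uplink bandwidth to the spines.
def oversubscription_ratio(downlinks: int, downlink_gbps: int,
                           uplinks: int, uplink_gbps: int) -> float:
    return (downlinks * downlink_gbps) / (uplinks * uplink_gbps)

# 48 x 25G server ports, 4 x 100G uplinks -> 3.0, the 3:1 ceiling mentioned above
print(oversubscription_ratio(48, 25, 4, 100))
```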
Use ECMP for path diversity. Again, this is equal cost multipath: the paths
calculated by the various routing protocols are inserted into the routing
tables so that traffic can be spread across multiple paths and the bandwidth
is efficiently utilized. Size buffers for microbursts, meaning there are
traffic patterns where microbursts can happen, so your buffer size can be
dynamically increased to accommodate them.
Consider RDMA, remote direct memory access, for ultra low latency.
Network observability also plays a key role, because we'd better find
problems before the users do. For example, collect telemetry: one-second
metrics might be too frequent, but you can have telemetry at some regular
interval.
Monitor KPIs like latency distributions, utilization, and retransmissions;
use machine learning for anomaly detection and digital twins for what-if
analysis. Observability always plays a key role when you want to catch
problems before they even happen.
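A minimal sketch of regular-interval telemetry with a simple anomaly check;
the collect_metrics() function is a stand-in for whatever device or telemetry
pipeline you actually poll, and the 3-sigma rule is just one simple
illustrative detector rather than the machine-learning approach mentioned
above:

```python
# Sample per-link KPIs at a regular interval and flag values that sit far
# outside the recent baseline.
import statistics
import time
from collections import deque

history: dict[str, deque] = {
    "p99_latency_ms": deque(maxlen=60),
    "retransmits_per_s": deque(maxlen=60),
    "link_utilization_pct": deque(maxlen=60),
}

def collect_metrics() -> dict[str, float]:
    """Stand-in for polling a switch or telemetry pipeline; replace with a real source."""
    return {"p99_latency_ms": 4.2, "retransmits_per_s": 12.0, "link_utilization_pct": 61.0}

def is_anomalous(series: deque, value: float) -> bool:
    if len(series) < 10:
        return False  # not enough baseline yet
    mean = statistics.mean(series)
    stdev = statistics.pstdev(series) or 1e-9
    return abs(value - mean) > 3 * stdev

for _ in range(240):  # e.g. one hour of 15-second samples
    sample = collect_metrics()
    for kpi, value in sample.items():
        if is_anomalous(history[kpi], value):
            print(f"ALERT: {kpi}={value} deviates from recent baseline")
        history[kpi].append(value)
    time.sleep(15)  # regular interval, coarser than 1-second metrics
```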
The hidden cost of network latency: in ML pipelines, even one millisecond of
latency can add minutes to training. That shows up as 32% training efficiency
loss, a 2.5 times infrastructure cost multiplier, 47% wasted GPU cycles, and
82 milliseconds of added inference latency hurting user response time.
Network delays directly translate into cost and time-to-market risks.
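As a worked illustration of how a single extra millisecond compounds, with an
assumed iteration count and one synchronization per step (these are
assumptions, not figures from the talk):

```python
# One extra millisecond per communication step, accumulated over a training run.
extra_latency_ms = 1.0       # added network latency per all-reduce step
steps = 500_000              # assumed number of training iterations
comm_rounds_per_step = 1     # assumed synchronizations per iteration

added_seconds = extra_latency_ms / 1000 * steps * comm_rounds_per_step
print(f"~{added_seconds / 60:.0f} extra minutes of wall-clock time")  # ~8 minutes
```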
Legacy switches process about 1.2 billion packets per second; AI workloads
need around three times that, roughly 3 billion packets per second. Modern
ASICs add programmable pipelines, hardware-accelerated RDMA, deep buffers,
congestion management, and sub-microsecond forwarding. Vendors like Nvidia,
Broadcom, and Intel design silicon for AI traffic patterns.
Just a case study of an e-commerce platform transformation. Initially it was
a three-tier design with 10 gigabit-per-second links and seven to nine second
page loads. After moving to microservices, east-west traffic grows to 60
percent and latency spikes. After AI adoption, it requires 100
gigabit-per-second links for the GPU cluster. Finally, a spine-leaf design
with a 400 gigabit-per-second backbone cut incidents by 98.5 percent and
scaled to petabytes per day.
So the spine-leaf architecture has become very powerful in data centers,
where speeds have increased many-fold over previous generations. The
incidents are cut by 98.5% and the platform scaled to petabytes per day.
Key takeaways: build for the future and for change. Always design your
networks to be ready for future changes. Invest in observability to catch
issues early. Prioritize resilience with fault tolerant designs. Optimize for
latency, since it directly impacts cost and time to market.
Your network is no longer plumbing.
It's a core platform capability.
Thank you.