Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everybody.
Good morning.
This talk explores how network infrastructure defines platform potential.
Traditional web apps scale differently than AI platforms, which require ultra
low latency and massive scalability, and the network can either empower or
cripple modern systems. I'll take a deep dive into how the network is crucial
in the modern era.
We all say the strength of a chain is its weakest link, and that applies to
the network as well. The network is either your platform's superpower or its
weakest link.
All the devices along the way do packet processing: the packet travels from
the user all the way to the server, and the return packet travels from the
server back to the user.
How fast the internet feels to you as a user is what decides any product in
the market, be it a simple application, an e-commerce platform, or an app.
Underneath, there are so many networking concepts, so much networking involved.
So today we're looking at some case studies of how networking plays a key role.
We'll cover architectural evolution, networking failure modes, principles of
fault tolerance, network observability strategies, and real world case studies.
Around those concepts: we have moved from traditional three-tier
architectures, with core, distribution, and access layers, to spine-leaf
models, where every leaf switch connects to every spine, optimized
particularly for east-west traffic.
Because of the tremendous AI activity in data centers, the traffic is not
only north-south, meaning user to server; there is also a lot of computation,
a lot of calculation within the data center before the response is sent to
the user.
This has become even more prevalent in AI data centers, where the number of
nodes involved has grown tremendously: from the two to four of simple
web-scale traffic to 128 or even 256 nodes in the case of AI workloads.
Software defined networking, of course, because of the large number of nodes
involved in data centers: SDN enables programmability. Basically, you are
able to program a lot of devices at once, so the duties revolving around
DevOps become much easier, more predictable, and less error-prone.
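A minimal sketch of what that programmability looks like, assuming a
hypothetical REST-style management endpoint on each switch; the addresses,
token, and /api/v1/config path are placeholders, not a real vendor API:

```python
# Push one intended configuration to many switches in parallel.
# The device addresses and the /api/v1/config endpoint are hypothetical.
from concurrent.futures import ThreadPoolExecutor

import requests

SWITCHES = [f"10.0.0.{i}" for i in range(1, 129)]  # e.g. 128 leaf switches
DESIRED_CONFIG = {"mtu": 9216, "ecmp": True, "buffer_profile": "ai-fabric"}

def push_config(switch_ip: str) -> tuple[str, bool]:
    """Push the desired config to one switch and report success."""
    resp = requests.post(
        f"https://{switch_ip}/api/v1/config",   # hypothetical management endpoint
        json=DESIRED_CONFIG,
        headers={"Authorization": "Bearer <token>"},
        timeout=5,
    )
    return switch_ip, resp.ok

# Program all devices at once instead of box-by-box CLI changes.
with ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(push_config, SWITCHES))

failed = [ip for ip, ok in results if not ok]
print(f"configured {len(results) - len(failed)}/{len(results)} switches; failed: {failed}")
```

The point is the shape of the workflow: one declared configuration, pushed to
the whole fabric at once, with failures reported back centrally.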
Of course, AI scale requires specialized ASICs and ultra low latency fabrics.
There is so much development happening on the hardware side as well. In
particular, ASICs are built for ultra high speeds and ultra low latency, and
whole data center fabrics are built to handle high bandwidth and low latency.
Each stage reflects growing demands for bandwidth, low latency, and
flexibility.
Traditional versus modern network requirements: web applications typically
need two to four backend connections, tolerate 50 to 200 milliseconds of
latency, and follow predictable traffic patterns. AI workloads connect to 128
to 256-plus nodes, need terabit-per-second-scale bandwidth, and require less
than 10 milliseconds of latency. They're bursty, unpredictable, and dominated
by east-west traffic.
If you take a case study, a simple cascading network failure during peak
load: let's say a financial AI platform experiences API timeouts. The loss is
huge, and confidence takes a hit. There are so many problems that can be
caused if a financial AI platform experiences some timeouts.
In particular, this is often identified as TCP incast congestion. Basically,
TCP incast congestion is where multiple servers respond to an aggregator node
at once. This cascades into a lot of packet drops, retransmissions, and
service degradation.
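To make the incast picture concrete, here is a toy back-of-the-envelope model
(not a real TCP simulation; all sizes and rates are illustrative assumptions):
many servers answer the aggregator in the same instant, the port buffer
overflows, and the overflow becomes drops and retransmissions.

```python
# Toy illustration of incast: bytes that cannot fit in the aggregator's
# port buffer within the first millisecond are dropped. Numbers are assumed.
def incast_drops(num_servers: int,
                 response_bytes: int = 64_000,        # each server's reply
                 port_buffer_bytes: int = 1_000_000,  # shared buffer on the port
                 drain_bytes_per_ms: int = 1_250_000  # ~10 Gb/s drained in 1 ms
                 ) -> int:
    arriving = num_servers * response_bytes
    capacity = port_buffer_bytes + drain_bytes_per_ms
    return max(0, arriving - capacity)

for n in (4, 32, 128, 256):
    print(f"{n:>3} servers -> {incast_drops(n):>10,} bytes dropped in the first ms")
```

With a small fan-in the buffer absorbs the burst; at AI-style fan-in of 128 or
256 responders, most of the burst has nowhere to go.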
In data centers nowadays, we have remote direct memory access over Converged
Ethernet, with flow control to prevent collapse. This is called RoCE: RDMA
over Converged Ethernet. So this is one case study where we can apply RDMA so
that the congestion can be avoided.
Network bottlenecks, a few key ones. Bufferbloat: buffer sizes are increased
so much that latency goes up. Packet processing limits: traditional switches
can't handle billions of packets per second, while the new ASICs can handle a
lot of traffic, so that also has to be factored in. Topology constraints,
like poor oversubscription ratios, break AI workloads. And bandwidth
saturation: ML training can overwhelm even high-capacity links in seconds.
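On the bufferbloat point above, a quick back-of-the-envelope sketch, with
assumed buffer sizes and link speed, of how deep buffers turn into queuing
latency:

```python
# Worst-case queuing delay: the time to drain a completely full buffer
# at a given link speed. All numbers are illustrative assumptions.
def max_queuing_delay_ms(buffer_bytes: int, link_gbps: float) -> float:
    link_bytes_per_sec = link_gbps * 1e9 / 8
    return buffer_bytes / link_bytes_per_sec * 1000

for buf_mb in (1, 8, 64):
    print(f"{buf_mb} MB buffer on a 10 Gb/s link -> "
          f"up to {max_queuing_delay_ms(buf_mb * 1_000_000, 10):.1f} ms of added latency")
```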
So how should a fault tolerant network design look? The principles: redundant
fabrics, meaning you have an alternate path in case of failure; graceful
degradation; and isolated failure domains. You isolate failure domains so
that traffic is not impacted, or only a small portion of traffic is impacted.
Some of the strategies: ECMP, equal cost multipath, for path diversity, as I
said, so there is an alternate path for the traffic to flow; BFD,
bidirectional forwarding detection, for sub-second failover, meaning at the
hardware level, if there is a failure of a link between two devices, it is
quickly identified and propagated to the upper layers, and an alternate path
is chosen for the traffic; and segment routing for traffic engineering.
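As a rough sketch of how ECMP spreads traffic, assuming a simple software
hash over the flow 5-tuple (real switches do this in hardware, and the
next-hop names here are illustrative):

```python
# Pick one of the equal-cost next hops by hashing the flow 5-tuple, so every
# packet of a given flow takes the same path while flows spread across spines.
import hashlib

NEXT_HOPS = ["spine-1", "spine-2", "spine-3", "spine-4"]  # equal-cost paths

def ecmp_next_hop(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
                  proto: str = "tcp") -> str:
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return NEXT_HOPS[digest % len(NEXT_HOPS)]

print(ecmp_next_hop("10.1.1.5", "10.2.2.9", 51512, 443))
print(ecmp_next_hop("10.1.1.6", "10.2.2.9", 51512, 443))  # different flow, possibly different spine
```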
Modern spine-leaf architecture, a Clos topology: the advantages are
predictable latency, linear scalability, and optimization for east-west
traffic.
Design considerations. Keep oversubscription below 3:1. Oversubscription in
data center networking refers to the ratio of the bandwidth provisioned to
the downlink servers to the bandwidth available on the uplinks. Which means
for every three downlink paths, at least one uplink path should be available.
That's what I mean by oversubscription.
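A tiny sketch of that oversubscription arithmetic, with assumed port counts
and speeds:

```python
# Oversubscription ratio for a leaf switch: total downlink bandwidth to
# servers divided by total uplink bandwidth to the spines.
def oversubscription_ratio(downlinks: int, downlink_gbps: int,
                           uplinks: int, uplink_gbps: int) -> float:
    return (downlinks * downlink_gbps) / (uplinks * uplink_gbps)

# 48 x 25G server ports, 4 x 100G uplinks -> 3.0, the 3:1 ceiling mentioned above
print(oversubscription_ratio(48, 25, 4, 100))
```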
Use ECMP for path diversity. Again, this is equal cost multipath: the paths
calculated by the various routing protocols are inserted into the routing
tables so that traffic can be spread across multiple paths and the bandwidth
is efficiently utilized. Size buffers for microbursts, meaning there are
traffic patterns where microbursts can happen, so your buffer size can be
dynamically increased to accommodate them.
Consider RDMA, remote direct memory access, for ultra low latency.
Network observability also plays a key role, because we'd better find
problems before the users do. For example, collect telemetry: one-second
metrics might be too frequent, but you can have telemetry at some regular
interval.
Monitor KPIs like latency distributions, utilization, and retransmissions;
use machine learning for anomaly detection and digital twins for what-if
analysis. Observability always plays a key role when you want to catch
problems before they even happen.
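A minimal sketch of regular-interval telemetry with a simple anomaly check;
the collect_metrics() function is a stand-in for whatever device or telemetry
pipeline you actually poll, and the 3-sigma rule is just one simple
illustrative detector rather than the machine-learning approach mentioned
above:

```python
# Sample per-link KPIs at a regular interval and flag values that sit far
# outside the recent baseline.
import statistics
import time
from collections import deque

history: dict[str, deque] = {
    "p99_latency_ms": deque(maxlen=60),
    "retransmits_per_s": deque(maxlen=60),
    "link_utilization_pct": deque(maxlen=60),
}

def collect_metrics() -> dict[str, float]:
    """Stand-in for polling a switch or telemetry pipeline; replace with a real source."""
    return {"p99_latency_ms": 4.2, "retransmits_per_s": 12.0, "link_utilization_pct": 61.0}

def is_anomalous(series: deque, value: float) -> bool:
    if len(series) < 10:
        return False  # not enough baseline yet
    mean = statistics.mean(series)
    stdev = statistics.pstdev(series) or 1e-9
    return abs(value - mean) > 3 * stdev

for _ in range(240):  # e.g. one hour of 15-second samples
    sample = collect_metrics()
    for kpi, value in sample.items():
        if is_anomalous(history[kpi], value):
            print(f"ALERT: {kpi}={value} deviates from recent baseline")
        history[kpi].append(value)
    time.sleep(15)  # regular interval, coarser than 1-second metrics
```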
The hidden cost of network latency: in ML pipelines, even one millisecond of
latency can add minutes to training. That shows up as 32% training efficiency
loss, a 2.5 times infrastructure cost multiplier, 47% wasted GPU cycles, and
82 milliseconds of added inference latency hurting user response time.
Network delays directly translate into cost and time-to-market risks.
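As a worked illustration of how a single extra millisecond compounds, with an
assumed iteration count and one synchronization per step (these are
assumptions, not figures from the talk):

```python
# One extra millisecond per communication step, accumulated over a training run.
extra_latency_ms = 1.0       # added network latency per all-reduce step
steps = 500_000              # assumed number of training iterations
comm_rounds_per_step = 1     # assumed synchronizations per iteration

added_seconds = extra_latency_ms / 1000 * steps * comm_rounds_per_step
print(f"~{added_seconds / 60:.0f} extra minutes of wall-clock time")  # ~8 minutes
```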
Legacy switches process about 1.2 billion packets per second; AI workloads
need around three times that, roughly 3 billion packets per second. Modern
ASICs add programmable pipelines, hardware-accelerated RDMA, deep buffers,
congestion management, and sub-microsecond forwarding. Vendors like Nvidia,
Broadcom, and Intel design silicon for AI traffic patterns.
Just a case study of an e-commerce platform transformation. Initially it was
a three-tier design with 10 gigabit-per-second links and seven to nine second
page loads. After moving to microservices, east-west traffic grows to 60
percent and latency spikes. After AI adoption, it requires 100
gigabit-per-second links for the GPU cluster. Finally, a spine-leaf design
with a 400 gigabit-per-second backbone cut incidents by 98.5 percent and
scaled to petabytes per day.
So the spine-leaf architecture has become very powerful in data centers,
where speeds have increased many-fold over previous generations. The
incidents are cut by 98.5% and the platform scaled to petabytes per day.
Key takeaways: build for the future and for change. Always design your
networks to be ready for future changes. Invest in observability to catch
issues early. Prioritize resilience with fault tolerant designs. Optimize for
latency, since it directly impacts cost and time to market.
Your network is no longer plumbing.
It's a core platform capability.
Thank you.