Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone.
Thank you for joining today.
We are going to explore something absolutely foundational to the rapid expansion of AI: the network fabric.
Before we dive in, let me quickly introduce myself.
My name is Nam Khan and I'm a solutions engineer at Cisco on the data center
and the AI infrastructure team.
My job is to make sure the solutions we build align with our customers' real business needs.
I've been in the industry for over 23 years, working with companies like Motorola, Qwest Communications, and Cisco.
I hold a dual CCIE certification and have delivered multiple sessions at Cisco Live across the US and Europe.
Now, why are we here today?
Because we are living through one of the biggest infrastructure
shifts in the history of computing.
Every organization is racing to build or consume AI models.
And the interesting part is this: the bottleneck in AI isn't just GPU speed anymore.
It's how fast those GPUs can talk to each other.
You can buy the fastest GPUs in the world, but if they're connected through a slow or congested network, it's like driving a Ferrari in a traffic jam.
All that horsepower, completely wasted.
So in this session we'll look at how Ethernet, the same technology that powers the internet, has evolved into a high performance fabric for scalable, secure AI training and inference.
Here's a roadmap for our session.
We'll start with the silicon building blocks: the CPU, GPU, and DPU.
Then we'll look at the different types of AI clusters and the specific
network requirements for each.
The core of our discussion will be on network architecture, focusing on RDMA, RoCEv2, and congestion management.
Finally, we look at the future with the Ultra Ethernet
Consortium and discuss security.
Let's start by breaking down what actually sits in a modern AI server.
It typically contains three major processing units.
The CPU, or Central Processing Unit, is the brain of the AI system. It manages the operating system, orchestrates data loading, and handles all general purpose tasks.
Think of it as the conductor of an orchestra, making sure everything stays in sync and in harmony.
Next is the GPU, or Graphics Processing Unit, which can be termed the muscle of the AI system.
This is where the heavy lifting happens.
GPUs excel at massive parallelism, performing thousands of matrix multiplications at once.
In AI clusters, GPUs are the most expensive resources, and everything about the network is designed for one purpose: keep the GPUs busy.
If a GPU is waiting for data, you're literally burning money.
Finally, the DPU, or Data Processing Unit, is the traffic controller of the AI system. As networks scale to 400 gig and 800 gig, processing packets eats up CPU cycles, what we call the infrastructure tax.
The DPU offloads that tax, handling encryption, packet inspection, and routing, so the CPU and GPU stay focused on their workload.
Think of the DPU as a personal assistant, making sure the main performer can shine.
These three components together form the backbone of modern AI infrastructure.
So what is an AI cluster?
Isn't it just a rack of servers?
Essentially, it is an interconnected network of high performance GPUs,
acting as a single supercomputer.
The network here is not just the plumbing.
It is part of the compute fabric.
If the network slows down, the entire cluster stalls.
This brings us to collective communication.
In standard networking, A talks to B. In AI, a whole group of GPUs needs to exchange data simultaneously to function as a unit.
During training, every GPU calculates a gradient, and they must all share that data to update the model.
We use topology-aware algorithms like rings or trees to ensure this happens almost instantly.
If this synchronization lags, training time explodes.
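To make that gradient exchange concrete, here is a minimal sketch of the all-reduce step, assuming PyTorch with the NCCL backend and a process group that has already been initialized; the helper name sync_gradients is just for illustration.

```python
import torch.distributed as dist

def sync_gradients(model):
    """Average gradients across all GPUs after the backward pass (illustrative only)."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Every rank contributes its local gradient; the sum travels over the
            # GPU-to-GPU (backend) fabric, then we divide to get the average.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

# Typical placement inside a training loop, assuming dist.init_process_group("nccl")
# was called at startup and the model lives on this rank's GPU:
#   loss.backward()
#   sync_gradients(model)
#   optimizer.step()
```

In a real cluster, a ring or tree all-reduce inside the collective library moves those tensors, which is exactly why the network fabric sits on the critical path of every training step.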
Now let's distinguish between the workloads.
Training is where the model learns. It uses massive data sets to teach the model patterns.
It's like teaching a robot to read, walk, dance, whatever the robot was built for.
Inference is the application phase.
This is when you use the trained model to make predictions or generate text.
It's like asking that robot a question and getting an answer.
Let's look at the specific requirements for each of these clusters, as shown in the table.
First, bandwidth: training requires high node-to-node bandwidth because of that gradient synchronization we talked about. In inference, the bandwidth requirement is relatively lower.
Next, the key metrics for training and inference: for training, the key metric is how much time it takes to train the model.
For inference, it is about latency and high reliability.
So basically, if you think of it as building a robot: training is how much time it takes to build and train the robot so that it can perform the task it's intended for.
Inferencing is like operating the robot: the end user commands the robot to walk, dance, or whatever purpose it was built for, and it executes that purpose.
Training happens offline, like when the robot is being built.
While it is being trained, it is not available to the user; it's still being manufactured by the company or the system which is developing it.
Inference is online, right?
The robot is available for service whenever it's required.
From an infrastructure standpoint, training clusters are massive.
They're like centralized networks, because they require a lot of bandwidth, and the more resources they have, the less time it takes to train a particular model.
Inference clusters are often small and distributed, similar to the regular networks we have been using in our data centers; those could be used as inference clusters.
So with those differences in mind, let's talk about what it takes to build a
network that supports these workloads, specifically a lossless network fabric.
This is where technologies like RDMA, congestion control, and specialized architectures come into play.
To get the performance we need for training, we use RDMA, or Remote Direct Memory Access.
Standard networking is too slow because data has to pass through the CPU. RDMA allows the NIC to transfer data directly into GPU memory, bypassing the CPU entirely.
This gives us the low latency required for a lossless fabric.
I often joke that if humans had RDMA, you could just sit in the room
and by the end of the session, all my AI networking knowledge would be
copied into your brain instantly.
But for RDMA to work, the network must be lossless, meaning no packet drops even during spikes of congestion.
This requires careful planning around buffers, queues, and traffic engineering.
Now, we've said we are using RDMA as the technology. How about the physical network for AI?
Here is a high level topology.
We typically split the network into two: the front-end network and the backend network.
The front-end network is your standard Ethernet network, which you use for storage, out-of-band management, and user access. Just the regular data center network you have been using.
The backend network is where the magic happens.
This is a dedicated high speed GPU-to-GPU fabric.
So any model you're training, any robot you're training, it has to happen in the backend network.
It is purpose built and it carries only compute traffic: the GPU gradients we talked about, the activations, the model parameters. All of that happens in the backend network.
And of course, since the backend network uses RDMA for GPU-to-GPU communication, it has to be lossless, because if any packet drops, the entire collective operation, the collective communication we talked about between the GPUs, will stall or have to restart.
For the backend network, which runs RDMA, we have two choices in terms of technology or infrastructure: either we use InfiniBand, or we use Ethernet.
InfiniBand has been used in high performance computing. It's traditional, it's a standard, but InfiniBand is proprietary.
It is fast and it has better latency compared to Ethernet, but it is proprietary, and all its components, all the infrastructure to build an InfiniBand cluster, it's expensive.
RDMA is natively supported on InfiniBand. It does not require any tweaks or adjustments to run RDMA over InfiniBand; you can just run RDMA over InfiniBand as is.
However, when you talk about Ethernet, Ethernet as we know it is a best-effort technology. It's not lossless, and RDMA requires losslessness, as we have been discussing.
What we have done to run RDMA over Ethernet is develop an industry standard protocol, which we're going to talk about on the upcoming slide, and we also have to configure Ethernet with certain congestion mechanisms to make it on par with InfiniBand without compromising on its features.
Ethernet, as we know, is a standard that is widely used and very popular.
Most folks out there know how to operate Ethernet; the switches, the infra, the optics, all the infrastructure is quite standard, and we can run the entire backend network over Ethernet.
We don't have to change anything except for the part where we adapt RDMA to run over Ethernet.
So we have to customize Ethernet for RDMA, and nothing is non-standard; everything is quite a standard procedure.
So the big question here: what does it take to get Ethernet on par with InfiniBand?
The answer is RoCE, which is nothing but RDMA over Converged Ethernet.
What we do here is encapsulate RDMA inside an Ethernet frame.
We started with RoCE version one, which was limited to layer two domains because it did not have a UDP/IP header.
That was then upgraded to RoCE version two, where we added the UDP/IP headers so RDMA would be able to route across layer three networks.
This is what allows Ethernet to scale AI clusters across racks, rows, and entire data center halls.
So let's talk more about RoCEv2.
With RoCEv2, GPUs still use RDMA, but now the RDMA packets ride over Ethernet.
The NIC can place data directly into GPU memory without involving the CPU kernel, which cuts latency from microseconds to nanoseconds.
This is the magic that makes Ethernet a serious contender for AI networking.
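As an illustration of what this looks like in practice, here is a hedged sketch of how a training job is commonly pointed at a RoCEv2 backend fabric when using NCCL. The device names, interface name, and GID index below are placeholders for illustration; the right values depend on your NICs and switch configuration.

```python
import os

# Placeholder values, assuming a host with RDMA-capable NICs on the backend
# network and a separate front-end interface for bootstrap traffic.
os.environ.setdefault("NCCL_IB_HCA", "mlx5_0,mlx5_1")    # RDMA NICs to use (example device names)
os.environ.setdefault("NCCL_IB_GID_INDEX", "3")          # GID index that maps to the RoCEv2 (UDP/IP) address
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")      # front-end interface for NCCL's bootstrap/control traffic
os.environ.setdefault("NCCL_DEBUG", "INFO")              # check the logs to confirm the IB/RoCE transport was selected

import torch.distributed as dist

# Assumes a torchrun-style launcher has set RANK, WORLD_SIZE, and MASTER_ADDR.
dist.init_process_group(backend="nccl")  # collectives now run RDMA over the Ethernet backend
```

If the RDMA transport isn't available, the job can silently fall back to a TCP socket path, which is why checking the NCCL logs is worth the trouble.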
Of course, using Ethernet introduces one challenge.
RDMA assumes a lossless network, as we talked about, and Ethernet is best effort: if the buffer fills up, Ethernet drops packets.
But RDMA cannot tolerate drops.
So we introduce two mechanisms here, starting with ECN, which is Explicit Congestion Notification.
ECN does not drop packets; instead, it marks them.
When the switch starts to build queue pressure, the endpoints then slow down based on those marks.
This keeps the network stable and predictable, exactly what AI workloads need.
The second mechanism is PFC, or Priority Flow Control. PFC pauses traffic at specific priorities when congestion occurs.
This gives us per-priority lossless behavior, even when multiple traffic types share the same links.
Proper tuning of PFC, the headroom, the pause thresholds, and the queue management is critical to deploying RoCEv2 successfully.
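To see how these two mechanisms fit together, here is a toy model, not a real switch implementation, of a single lossless egress queue with an ECN marking threshold and a PFC pause threshold. The threshold numbers are arbitrary illustrative values, not tuning guidance.

```python
# Toy model of one lossless egress queue (illustrative only).
ECN_MARK_THRESHOLD_KB = 150    # above this depth, packets get marked instead of dropped
PFC_XOFF_THRESHOLD_KB = 700    # above this depth, pause the upstream sender on this priority
QUEUE_CAPACITY_KB = 1000       # headroom above XOFF absorbs packets already in flight

def handle_arrival(queue_depth_kb: int, packet_kb: int) -> str:
    """Return the action this simplified lossless queue takes for one arriving packet."""
    new_depth = queue_depth_kb + packet_kb
    if new_depth > PFC_XOFF_THRESHOLD_KB:
        return "PFC: send a pause frame for this priority (no drop, upstream holds traffic)"
    if new_depth > ECN_MARK_THRESHOLD_KB:
        return "ECN: enqueue the packet but mark it so the endpoints slow down"
    return "enqueue normally"

for depth in (50, 300, 800):
    print(f"queue depth {depth} KB -> {handle_arrival(depth, 4)}")
```

In a real deployment the goal is for ECN marking to slow senders early enough that the PFC threshold is rarely reached, since widespread pausing can spread congestion upstream.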
Looking ahead, the UEC, or Ultra Ethernet Consortium, is an industry group optimizing Ethernet specifically for AI.
They're developing standards for better congestion control, lower latency, and broad interoperability to replace proprietary solutions completely.
So what we are seeing here is the UEC, because we know that Ethernet in its raw form cannot be used for the backend network; that's why we discussed how we use RoCEv2 and congestion mechanisms.
The goal of the Ultra Ethernet Consortium is to develop a standard, modern Ethernet, what they call Ultra Ethernet, which will work flawlessly for building AI backend networks.
Performance is critical, but so is security.
AI introduces new attack surfaces that traditional tools don't always cover.
Some key risks, as I've put on the slide, are data poisoning, model extraction, privacy violations, and insider threats.
As AI systems handle more sensitive and proprietary data, securing the pipeline becomes just as important as accelerating it.
To address these risks, we rely on a few core strategies: encryption of data in transit and at rest, secure designs, and continuous monitoring.
We monitor the AI network continuously to discover any anomalies in traffic flows or any anomalies in user behavior.
And then access control, which is all very important for securing our network.
So these guardrails ensure AI workloads remain secure even
as they scale dramatically.
Talking about DPUs, they play a huge role here. They enforce these security policies, like encryption and isolation, right at the server edge, offloading the work so the GPU can focus a hundred percent on training.
They also help with model partitioning by managing data movement efficiently.
Let's bring everything together.
What we discussed so far is the scalability aspect: how we are making AI networks more scalable.
AI is moving from a single server to massive clusters, like we talked about, because we require massive numbers of GPUs to cater to the AI demand we have.
So the network must also scale with it.
We discussed performance: how Ethernet with RoCEv2 can provide the high throughput, low latency fabric AI workloads require.
We also talked about the UEC, or Ultra Ethernet Consortium, initiatives, where they're trying to develop Ultra Ethernet, which would work flawlessly with the AI backend network.
And if you get the network architecture right, you can build a system that's far more secure and ready for the future of AI.
Thank you for your time.
I hope this was informative, and I wish you a great conference.