Scaling Prompt Engineering: How Ultra Ethernet and UALink Accelerate Token-to-Token Performance
Abstract
Prompt engineering at scale needs more than clever text; it needs blazing-fast infrastructure. Learn how Ultra Ethernet and UALink boost throughput, reduce inference latency, and accelerate real-time LLM performance across distributed AI systems.
Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone.
My name is Raje.
I'm a principal engineer at Synopsys.
My presentation is about how Ultra Ethernet and UALink accelerate token-to-token performance.
In my presentation, I'll cover the AI infrastructure bottleneck, the need for enhanced interconnect technology, UALink, Ultra Ethernet, and finally the conclusion.
AI Infrastructure Bottleneck. In the present AI infrastructure, there is exponential growth in compute demand. Despite parallelization, model training times have risen from weeks to months.
If you look at the picture on the right-hand side, model parameters are doubling every three to four months.
The system is being stretched along multiple dimensions: network bandwidth, network latency, memory bandwidth, memory capacity, and compute. Design complexity is also increasing, driven by memory bandwidth and interconnect bandwidth demands.
Now, if you look at the extreme right side, the DDR, HBM, PCIe, and die-to-die standards are all advancing exponentially.
Therefore, we need AI infrastructure with enhanced interconnect technology
to meet current and future demand.
Need for Enhanced Interconnect Technology. In this slide, let's try to understand why GPU performance is important.
If you consider the AI/ML lifecycle in the picture: to build a model, we need to prep the data first, build the model, train it, test it, and then fine-tune it. Continuous feedback is provided to fine-tune the model, and this loop continues, during which data is split and fed to multiple GPUs, and sometimes to multiple machines at larger scale.
Therefore, GPU performance influences the timeline of deep learning.
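To make that loop concrete, here is a minimal data-parallel training sketch, assuming PyTorch's DistributedDataParallel; the model, sizes, and loop are my own illustration, not from the talk. Each GPU trains on its shard of the data, and gradients are synchronized over the interconnect after every backward pass, which is exactly where fabric bandwidth and latency show up in the training timeline.

```python
# Minimal data-parallel sketch (illustrative, not from the talk).
# Launch one process per GPU, e.g. with torchrun; DistributedDataParallel
# all-reduces gradients across GPUs after every backward pass, so the
# interconnect sits directly on the training-time critical path.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank: int, world_size: int) -> None:
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(1024, 1024).cuda(rank)
    model = DDP(model, device_ids=[rank])       # gradient sync over the fabric
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(100):
        x = torch.randn(32, 1024, device=rank)  # this rank's shard of the batch
        loss = model(x).square().mean()
        loss.backward()                          # all-reduce happens here
        opt.step()
        opt.zero_grad()
```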
UALink: Scale Up. UALink is an open-standard interconnect technology developed to scale up accelerators for AI workloads.
If you look at the picture here, we have a pod, which is also called a cluster. Racks are stacked vertically in it, and each rack has GPUs.
All these GPUs are interconnected through UALink. What we are trying to achieve here is to connect all the GPUs together to create one big, giant GPU: to enable memory sharing and synchronization between the accelerators, so that direct load, store, and atomic operations are possible between them.
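As a rough conceptual model of those semantics (my own illustration, not a real UALink API), the sketch below treats the pod as one flat address space in which any accelerator can load, store, or atomically update a peer's memory:

```python
# Toy model of scale-up memory semantics: every accelerator can load, store,
# and perform atomics directly on any peer's memory, as if the pod were one
# big GPU. A threading lock stands in for fabric-level atomicity.
import threading

class PodMemory:
    """One flat address space stitched from per-accelerator memories."""
    def __init__(self, num_accelerators: int, words_per_accelerator: int):
        self.words = words_per_accelerator
        self.mem = [[0] * words_per_accelerator for _ in range(num_accelerators)]
        self.lock = threading.Lock()

    def _locate(self, addr: int):
        return divmod(addr, self.words)   # (owning accelerator, local offset)

    def load(self, addr: int) -> int:     # remote read, no explicit message
        acc, off = self._locate(addr)
        return self.mem[acc][off]

    def store(self, addr: int, value: int) -> None:
        acc, off = self._locate(addr)
        self.mem[acc][off] = value

    def atomic_add(self, addr: int, value: int) -> int:
        acc, off = self._locate(addr)
        with self.lock:                   # the fabric guarantees atomicity
            old = self.mem[acc][off]
            self.mem[acc][off] = old + value
            return old

pod = PodMemory(num_accelerators=4, words_per_accelerator=1024)
pod.store(3000, 42)       # accelerator 0 writes into accelerator 2's memory
print(pod.load(3000))     # any accelerator reads it back directly
pod.atomic_add(3000, 1)   # e.g., a shared counter used for synchronization
```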
UALink can connect hundreds to thousands of GPUs together.
Ultra Ethernet: Scale Out.
Ultra Ethernet is an open-standard, high-performance networking technology developed by the Ultra Ethernet Consortium for AI and HPC workloads.
If you look at the picture on the right-hand side, we have already discussed UALink; now we are talking about Ultra Ethernet, which is highlighted in the picture. These links connect all the clusters together, which is called scale-out.
This establishes a high-bandwidth, multi-path, open-standard, highly configurable interface, which is very important for AI clustering. The Ultra Ethernet stack also introduces a new transport layer with enhanced congestion control and enhanced RDMA capabilities. Here we are talking about interconnecting up to millions of GPUs.
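To illustrate the flavor of those transport ideas (a toy model of my own, not the Ultra Ethernet Transport specification), the sketch below sprays one message's packets across multiple paths and adjusts a per-path congestion window on a simulated congestion signal:

```python
# Toy model of two transport ideas from the talk: spraying a message's packets
# across multiple network paths, and a simple additive-increase /
# multiplicative-decrease congestion window per path.
import random

class Path:
    def __init__(self, name: str):
        self.name = name
        self.cwnd = 4.0                          # packets allowed in flight

    def send(self, packet_id: int) -> bool:
        congested = random.random() < 0.05       # pretend ECN/loss signal
        if congested:
            self.cwnd = max(1.0, self.cwnd / 2)  # multiplicative decrease
        else:
            self.cwnd += 1.0 / self.cwnd         # additive increase
        return not congested

def spray(message_packets: int, paths: list[Path]) -> None:
    # Weight path choice by current window, so congested paths get less traffic.
    for pkt in range(message_packets):
        path = random.choices(paths, weights=[p.cwnd for p in paths])[0]
        ok = path.send(pkt)
        print(f"pkt {pkt} -> {path.name} "
              f"({'ok' if ok else 'marked'}), cwnd={path.cwnd:.1f}")

spray(message_packets=8, paths=[Path("path-A"), Path("path-B"), Path("path-C")])
```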
Conclusion: Token-to-Token Performance. UALink is used to connect accelerators together so that memory synchronization happens between them. This setup is optimized for AI workloads because of rapid token passing. Ultra Ethernet establishes the low latency and high throughput needed for rapid token exchange and scalability.
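As a closing illustration, token-to-token performance is straightforward to measure from the client side. In the sketch below (my own example; generate_stream is a hypothetical stand-in for any streaming LLM endpoint), we record the gap between consecutive tokens, which is the metric these fabrics ultimately improve:

```python
# Hedged sketch: measuring token-to-token latency for a streaming LLM.
# `generate_stream` is a hypothetical placeholder; in practice it would be
# a distributed LLM inference endpoint.
import time
from typing import Iterator

def generate_stream(prompt: str) -> Iterator[str]:
    for tok in ["Scaling", "prompt", "engineering", "needs", "fast", "fabrics"]:
        time.sleep(0.02)                  # simulated per-token compute + network
        yield tok

last = time.perf_counter()
gaps = []
for token in generate_stream("hello"):
    now = time.perf_counter()
    gaps.append(now - last)               # token-to-token latency sample
    last = now
print(f"mean token-to-token latency: {1000 * sum(gaps) / len(gaps):.1f} ms")
```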
Thank you for taking the time to watch my presentation.