Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi.
Welcome everybody.
Today I'm going to discuss observability in the evolution of AI
and AI network infrastructure, how network observability powers the
backbone of artificial intelligence systems, and why it matters for the future
of high-performance computing.
My name is Basra.
I graduated from a university in India.
I have worked at a couple of networking and observability companies, where I have seen
firsthand how this matters, from the enterprise internet all the way to
the AI networks that are prevalent nowadays.
In the 1960s, ARPANET, a packet-switching network, was formed to facilitate
research networking among research institutions and universities.
That was the foundation of the modern internet, with very
rudimentary monitoring tools: Interface Message Processor consoles, network
control center tools, NCP debugging tools, ping-like utilities, and so on.
From 1980 to 2000, the evolution of the Simple Network Management Protocol (SNMP) happened.
It became the de facto standard tool for monitoring
network performance and health metrics.
Then, in 2002 to 2015, most of the cloud-related
revolution happened.
In that 2002 to 2015 period, agent-based monitoring arrived, installed on
VMs or containers, like Zabbix or Datadog agents, and software-as-a-service
delivery reduced infrastructure overhead.
Then dashboards and visualizations, like New Relic and Datadog,
became common practice.
Real-time dashboards became very standard, with tools like Datadog and Stackdriver.
Alerting systems arrived: threshold-based and anomaly-detection alerts
like CloudWatch and Zabbix; integration with DevOps tools like PagerDuty, Slack,
and CI/CD systems; and multi-cloud monitoring,
in 2012 to 2015, with tools like Datadog and Stackdriver. From 2015 to now, we have the new AI-era tools.
These tools are developed on top of whatever the cloud environments already have,
and they are quite advanced in terms of network observability.
Datadog collects unified metrics, logs, and security signals with
AI-based detection. Grafana, Prometheus, and Loki form an
open-source stack for metrics, logs, and visualization.
OpenTelemetry is an open standard for collecting traces, metrics, and logs.
Arize AI does production model monitoring, drift detection, and explainability.
Dynatrace offers full-stack observability with deterministic AI.
And again, the cloud providers like AWS, GCP, and Azure have native
tools integrated into their cloud environments for observability.
So this is the history of the observability tools that have come along,
from the old internet until now.
Why does network observability matter for AI? Some of the very
common reasons are performance guarantees, failure domain
isolation, resource optimization, security assurance, et cetera.
Performance guarantees for AI involve setting and enforcing
measurable benchmarks for model accuracy, response time, scalability, and resource
utilization throughout the AI life cycle.
These guarantees ensure that AI systems meet defined service level
objectives under varying workloads and real-world conditions.
Establishing such guarantees is critical for building trust,
ensuring reliability, and aligning AI outputs with business or regulatory
expectations.
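As a rough illustration of what enforcing such guarantees can look like in practice, here is a minimal Python sketch that checks measured values against declared service level objectives. The objective names and thresholds are illustrative assumptions, not figures from this talk.

```python
# Minimal sketch: declaring and checking AI service-level objectives (SLOs).
# The objective names and thresholds below are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class SLO:
    name: str
    threshold: float
    higher_is_better: bool

    def is_met(self, measured: float) -> bool:
        # An SLO is met when the measured value is on the right side of its threshold.
        return measured >= self.threshold if self.higher_is_better else measured <= self.threshold

slos = [
    SLO("model_accuracy", 0.95, higher_is_better=True),
    SLO("p99_latency_ms", 200.0, higher_is_better=False),
    SLO("gpu_utilization", 0.85, higher_is_better=True),
]

measurements = {"model_accuracy": 0.97, "p99_latency_ms": 250.0, "gpu_utilization": 0.9}

for slo in slos:
    status = "OK" if slo.is_met(measurements[slo.name]) else "VIOLATED"
    print(f"{slo.name}: {status}")
```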
Failure domain isolation.
Failure domain isolation in AI refers to the practice of designing AI systems so that failures in one
component, such as data ingestion, model training, or inference do not
cascade and impact the entire pipeline.
By segmenting the system into independent, fault-tolerant
domains, teams can contain issues,
simplify debugging, and maintain overall system resilience.
This approach is essential for ensuring high availability and durability in
complex and distributed environments.
Resource optimization.
Resource optimization in AI focuses on efficiently managing
compute, storage, and memory resources across the AI life cycle,
from data processing and model training to deployment and inference.
Techniques such as model pruning, quantization, hardware acceleration,
and cloud scheduling help reduce costs while maintaining performance.
Effective resource optimization ensures scalability,
lowers environmental impact, and maximizes return on investment.
Security is very important in AI networks, because AI networks are always susceptible
and vulnerable to security threats.
Security assurance in AI involves implementing measures to protect
systems from threats such as data breaches, model theft,
adversarial attacks, and unauthorized access.
It includes securing training data, ensuring model integrity,
enforcing access controls, and regularly auditing system behavior.
Strong security assurance is critical for maintaining trust, compliance, and
the safe deployment of AI in sensitive and high-stakes environments.
There is also a visibility-at-scale challenge in network observability.
Full-stack visibility in AI refers to comprehensive monitoring and
understanding of the entire AI pipeline,
from data ingestion and pre-processing to model training, deployment, and
inference. It allows teams to track data quality, model performance, and
infrastructure usage in real time, ensuring transparency and accountability.
This visibility is crucial for detecting biases, managing drift,
and maintaining the reliability and ethical integrity of AI systems.
Distributed tracing.
Distributed tracing in AI provides visibility into the flow of
data and execution across different components of an AI system, such
as data pipelines, model training services, APIs, and inference engines.
It enables teams to track requests and processing steps end to end,
helping identify latency, bottlenecks, or failures across
distributed environments.
This traceability is crucial for debugging, optimizing performance
and ensuring accountability in complex AI workflows.
Real-time telemetry. This is an advanced version of monitoring. In AI,
it involves the continuous collection and monitoring of metrics, logs, and
events from AI systems as they operate.
It provides immediate insights into model performance, data quality,
system health, and user interactions.
This enables the rapid detection of anomalies, supports proactive
maintenance, and ensures AI systems remain reliable, efficient, and aligned with
expectations in dynamic environments.
Hardware-level metrics in AI refer to the monitoring of low-level system
indicators such as GPU and CPU utilization,
memory bandwidth, power consumption, temperature, and disk input/output.
These metrics are critical for understanding the resource demands
of AI workloads, optimizing performance, and preventing
hardware bottlenecks or failures.
By analyzing hardware-level data, teams can fine-tune model deployments,
improve efficiency, and ensure sustainable scaling of AI operations.
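As a small illustration, the following sketch polls some of the hardware-level indicators mentioned above. It assumes an NVIDIA GPU plus the pynvml and psutil Python packages; other vendors expose similar counters through their own tooling.

```python
# A minimal sketch of collecting the hardware-level metrics described above.
# Assumes the `pynvml` and `psutil` packages and an NVIDIA GPU.
import psutil
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

util = pynvml.nvmlDeviceGetUtilizationRates(handle)       # GPU and GPU-memory utilization (%)
temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000    # NVML reports milliwatts

print(f"GPU util: {util.gpu}% | GPU mem util: {util.memory}% | temp: {temp}C | power: {power_w:.1f}W")
print(f"CPU util: {psutil.cpu_percent(interval=1)}% | host mem: {psutil.virtual_memory().percent}%")
print(f"Disk read bytes: {psutil.disk_io_counters().read_bytes}")

pynvml.nvmlShutdown()
```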
Inside the modern network, look at port speed.
Once upon a time, a hundred gig was a revolution.
But nowadays, as I speak, there are systems being developed
with 800-gig port speeds; a single networking box has 51.2 terabits
per second of data processing capability, because that is the speed
required by AI systems, with an extraordinary volume of data packets processed daily
by large-scale AI infrastructure.
Ultra-low latency and near-instantaneous response are very
critical for synchronizing distributed training across the network.
Reliability, of course, and five-nines availability are
essential for uninterrupted model training and inference operations.
So the network has become very crucial, comprising speed, latency,
reliability, and throughput.
Now, observability. First, metrics.
They provide quantitative data on system performance and health.
AI-specific metrics include model accuracy, latency, throughput, data drift, and
resource utilization, which help track how well the AI system meets its objectives.
These metrics enable teams to quickly detect anomalies, optimize
model behavior, and maintain reliable, scalable AI deployments.
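For example, a minimal sketch of exposing such AI-specific metrics might look like the following, here using the Prometheus Python client purely as an illustrative choice; the metric names and values are made up.

```python
# Minimal sketch: exposing AI metrics (accuracy, drift, latency) for scraping.
# Assumes the `prometheus_client` package; metric names are illustrative.
import random, time
from prometheus_client import Gauge, Histogram, start_http_server

accuracy = Gauge("model_accuracy", "Current model accuracy")
drift = Gauge("data_drift_score", "Data drift score vs. training distribution")
latency = Histogram("inference_latency_seconds", "Inference latency")

start_http_server(8000)  # metrics served at http://localhost:8000/metrics

while True:
    with latency.time():                           # record per-request latency
        time.sleep(random.uniform(0.01, 0.05))     # stand-in for real inference work
    accuracy.set(0.95 + random.uniform(-0.02, 0.02))
    drift.set(random.uniform(0, 1))
```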
Logs provide detailed, timestamped records of discrete events within a system,
such as data processing steps, model training iterations, errors, and user
interactions. Logs offer rich, contextual information that helps diagnose
issues, trace failures, and understand the sequence of actions leading to
a problem. In AI, comprehensive logging is essential for auditing, debugging
complex workflows, and ensuring transparency throughout the AI life cycle.
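Here is a minimal sketch of structured, timestamped event logging for an AI pipeline, using only the Python standard library; the stages and field names are illustrative assumptions.

```python
# Minimal sketch: one JSON line per discrete event keeps logs machine-parsable
# for later auditing and debugging. Field names are illustrative.
import json, logging, sys, time

logger = logging.getLogger("ai_pipeline")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler(sys.stdout))

def log_event(stage: str, **fields):
    record = {"ts": time.time(), "stage": stage, **fields}
    logger.info(json.dumps(record))

log_event("data_processing", rows=120_000, dropped=42)
log_event("training", epoch=3, loss=0.127)
log_event("inference", request_id="req-123", latency_ms=38)
```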
Traces in AI capture the detailed end-to-end journey of requests or data as
they move through various components of an AI system, such as data ingestion,
model training, and inference services.
They help map the sequence and timing of operations, revealing
dependencies and pinpointing where delays or errors occur.
Tracing is crucial for diagnosing performance issues, understanding
system behavior, and ensuring smooth, reliable AI workflows
in distributed AI environments.
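Here is a minimal sketch of such end-to-end tracing using OpenTelemetry, which was mentioned earlier among the modern tools. The span names are illustrative, and the console exporter stands in for a real tracing backend.

```python
# Minimal sketch: nested spans map the sequence and timing of pipeline stages.
# Assumes the opentelemetry-sdk package; spans are exported to the console.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("ai.pipeline")

with tracer.start_as_current_span("handle_request"):
    with tracer.start_as_current_span("data_ingestion"):
        pass  # fetch and preprocess features
    with tracer.start_as_current_span("inference"):
        pass  # run the model and return predictions
```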
The programmable telemetry revolution.
Telemetry has become rich, with monitoring approaches like gRPC-based
streaming and various other products;
a lot of companies are developing telemetry tools. Traditional monitoring
focuses on collecting and analyzing predefined system metrics, like CPU usage,
memory consumption, disk, and network traffic, to ensure infrastructure and
applications are running smoothly.
It often relies on threshold-based alerts and periodic checks to detect
failures or performance issues. While effective for basic operational health,
traditional monitoring lacks the deep contextual insights needed for complex
modern AI or distributed systems.
Modern telemetry, such as gRPC-based streaming telemetry, goes beyond traditional
monitoring by continuously collecting rich, high-resolution data,
including metrics, logs, events, and of course traces, from all layers of a system in real time.
It leverages advanced analytics, machine learning and automation to
provide deeper insights into system behavior, detect anomalies, and predict
issues before they impact users.
This approach is essential for managing complex distributed AI
systems and dynamic cloud environments with greater accuracy and agility.
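The following sketch contrasts that idea with simple threshold polling: a continuous stream of high-resolution samples feeds a rolling anomaly check. The stream_interface_counters generator is a hypothetical stand-in for a real gRPC/gNMI subscription, and the z-score rule is only an illustrative analytic.

```python
# Minimal sketch: streaming telemetry feeding a rolling anomaly check,
# rather than a fixed threshold evaluated on periodic polls.
import itertools, random, statistics
from collections import deque

def stream_interface_counters():
    # Hypothetical stand-in: a real collector would hold a gRPC/gNMI subscription.
    while True:
        yield {"interface": "eth0", "rx_bps": random.gauss(4e9, 2e8)}

window = deque(maxlen=60)  # rolling window of high-resolution samples
for sample in itertools.islice(stream_interface_counters(), 600):
    window.append(sample["rx_bps"])
    if len(window) < window.maxlen:
        continue
    mean, stdev = statistics.mean(window), statistics.pstdev(window)
    if stdev and abs(sample["rx_bps"] - mean) > 3 * stdev:
        # A deviation a periodic threshold poll could easily miss between checks.
        print(f"anomaly on {sample['interface']}: {sample['rx_bps']:.0f} bps")
```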
Self-healing network architectures.
So networks should no longer need to wait for a human to heal them.
They have to self-heal so that they will be resilient, they will
be available, and they will sustain high throughput.
First, detect. Self-healing network architectures detect operational
issues by continuously monitoring network health through
real-time telemetry, including metrics, logs, and distributed traces.
They use automated anomaly detection, AI analytics, and predefined rules
to identify faults, performance degradations, or security threats
without human intervention.
Once detected, these systems can trigger automatic remediation actions, like
rerouting traffic, restarting services, or adjusting configurations,
to maintain optimal network performance and minimize downtime.
In self-healing network architectures, analysis plays a critical role
by processing the continuous stream of telemetry data to
identify patterns, anomalies, and root causes of network issues.
This involves AI and machine learning algorithms that correlate metrics,
logs, and traces across distributed components, enabling real-time
diagnosis of failures or performance degradations.
Effective analysis allows the system to prioritize incidents,
predict potential faults, and make informed decisions for automated recovery,
thereby minimizing human intervention and ensuring network resilience.
The execute phase involves automatically carrying out corrective actions once an
issue has been detected and analyzed.
This can include tasks like rerouting traffic around a faulty node,
restarting failed services, adjusting configurations, or
scaling resources dynamically.
Execution is driven by predefined playbooks, AI-powered decision
engines, or orchestration tools, enabling the
network to quickly recover from faults and maintain continuous, reliable
operation without manual intervention.
In self-healing networks, the decide phase involves evaluating the analyzed
data to determine the most effective remediation action. The decision-making
process leverages SLOs, policy rules, and historical incident knowledge
to select the best course of action, whether to isolate a faulty component,
trigger a failover, or apply a configuration change. Accurate and timely
remediations are crucial for minimizing downtime, preventing cascading failures,
and maintaining overall network stability.
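Putting the detect, analyze, decide, and execute phases together, here is a minimal control-loop sketch. The telemetry fields, thresholds, and playbook actions are illustrative assumptions, not a real remediation engine.

```python
# Minimal sketch of a detect -> analyze -> decide -> execute loop.
# Telemetry fields, thresholds, and playbook actions are illustrative.
import random, time

PLAYBOOK = {
    "link_degraded": "reroute_traffic",
    "service_crashed": "restart_service",
    "capacity_pressure": "scale_out",
}

def detect(telemetry):
    # Detect/analyze: flag an incident type from continuously collected metrics.
    if telemetry["packet_loss_pct"] > 1.0:
        return "link_degraded"
    if not telemetry["service_up"]:
        return "service_crashed"
    if telemetry["utilization_pct"] > 90:
        return "capacity_pressure"
    return None

def decide_and_execute(incident):
    # Decide: map the incident to a remediation; Execute: carry it out automatically.
    action = PLAYBOOK.get(incident, "escalate_to_human")
    print(f"incident={incident} -> action={action}")

for _ in range(5):
    sample = {
        "packet_loss_pct": random.uniform(0, 2),
        "service_up": random.random() > 0.1,
        "utilization_pct": random.uniform(50, 100),
    }
    incident = detect(sample)
    if incident:
        decide_and_execute(incident)
    time.sleep(0.1)
```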
These are the various flows that happen in AI networks.
Model training in AI is the process of teaching a machine learning
algorithm to recognize patterns in data by adjusting its parameters to
improve performance on a specific task.
Parameter serving in AI is the process of efficiently delivering trained model
parameters to inference systems to enable fast and scalable predictions.
Data ingestion is the process of collecting, importing, and
processing raw data from various sources into your system
for analysis or model training.
Inference in AI is the process where a trained model makes
predictions or decisions based on new, unseen data.
Of course, visibility is a big challenge, because to observe the
networks we need to have visibility, and that is still a challenge in some areas.
Hardware opacity challenges in AI systems refer to the
difficulty of understanding and optimizing how underlying hardware
like GPUs, TPUs, or specialized accelerators affects AI model performance, due to
limited visibility into hardware operations and complex interactions
between the software and hardware layers.
Hardware is made by different vendors.
They follow different protocols and different procedures,
so there is no common standard, and
digging into the hardware is a challenge.
Cross-domain correlation in AI systems refers to identifying and analyzing
relationships between data, events, or behaviors across different domains or
components, such as combining insights from user behavior, system logs, and network
metrics to improve model accuracy, detect anomalies, or enhance decision making.
Scale limitations in observability refer to the difficulties in collecting,
processing, and analyzing the massive volumes of telemetry data generated by
large, distributed AI systems.
As systems grow, traditional tools struggle with data storage, real-time
processing, and maintaining low-latency insights, making it harder to achieve
comprehensive visibility and timely troubleshooting. Overcoming
these limitations requires scalable architectures and intelligent data
sampling or aggregation techniques.
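As one concrete example of intelligent data sampling, here is a minimal reservoir-sampling sketch that reduces a large telemetry stream to a fixed-size uniform sample; the event fields and sizes are illustrative.

```python
# Minimal sketch: reservoir sampling keeps a uniform fixed-size sample of a
# telemetry stream too large to store or analyze in full.
import random

def reservoir_sample(stream, k):
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item  # replace with decreasing probability
    return reservoir

# 1,000,000 telemetry events reduced to a 1,000-event sample for offline analysis.
events = ({"span_id": i, "latency_ms": random.expovariate(1 / 20)} for i in range(1_000_000))
sample = reservoir_sample(events, 1_000)
print(len(sample))
```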
Now, the future of observable networks. AI-powered observability leverages
artificial intelligence and machine learning to automatically collect,
analyze, and interpret vast amounts of telemetry data, such as metrics, logs,
and traces, to detect anomalies, predict failures, and provide actionable insights.
This approach enhances traditional monitoring by enabling faster root
cause analysis, proactive issue resolution, and smarter decision making
in complex AI and distributed systems.
Chip-level telemetry refers to the monitoring and collection of detailed
performance and health data directly from the hardware chips, such as
CPUs, GPUs, or AI accelerators.
This includes metrics like power usage, temperature, clock speeds,
cache hits or misses, and hardware interrupts, providing deep visibility
into how the physical hardware operates and impacts AI workloads.
Such fine-grained telemetry is essential for optimizing performance,
detecting hardware faults early, and improving overall system reliability.
Intent-based observability in AI focuses on aligning monitoring and analysis
tools with the desired business or operational outcomes by interpreting
the system's intended behavior. Instead of just collecting raw data,
it uses AI to understand whether the system's actions meet predefined goals,
enabling proactive detection of deviations and automated adjustments.
This approach helps ensure AI systems remain aligned
with user expectations and business objectives in dynamic environments.
For example, instead of checking what the CPU utilization is, it's a business-based
policy: are we serving 4K quality to the important users in the Europe region?
That is the business objective,
and based on that we can take corrective actions in the observability system.
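A minimal sketch of that kind of intent check might look like the following, where the business objective is expressed directly instead of a CPU threshold. The session records, field names, and target ratio are illustrative assumptions.

```python
# Minimal sketch: evaluate the business intent "important users in the Europe
# region get 4K quality" instead of a raw CPU threshold. Data is illustrative.
sessions = [
    {"user_tier": "important", "region": "EU", "resolution": "4k"},
    {"user_tier": "important", "region": "EU", "resolution": "1080p"},
    {"user_tier": "standard",  "region": "US", "resolution": "4k"},
]

INTENT = {"user_tier": "important", "region": "EU", "target_resolution": "4k", "min_ratio": 0.99}

scoped = [s for s in sessions
          if s["user_tier"] == INTENT["user_tier"] and s["region"] == INTENT["region"]]
ratio = sum(s["resolution"] == INTENT["target_resolution"] for s in scoped) / max(len(scoped), 1)

if ratio < INTENT["min_ratio"]:
    print(f"intent violated: only {ratio:.0%} of important EU sessions at 4K; trigger corrective action")
```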
Digital twin networks are virtual replicas of physical network infrastructure that
simulate their behavior, performance, and interactions in real time.
They enable testing, monitoring, and optimizing of network operations
by mirroring live conditions without impacting the actual system.
So this technology helps in predictive maintenance, capacity planning,
and accelerating troubleshooting in complex networking environments.
So, on a closing note: network observability plays a
key role in planning and maintaining networks for high resiliency,
high throughput, ultra-low latency, and so on.
This network observability is very important in the coming years of AI-based
networks, which are very-low-latency, high-performance computing networks.
Thank you.