Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi.
Welcome everybody.
Today I'm going to discuss observability in the evolution of AI
and AI network infrastructure, how network observability powers the
backbone of artificial intelligence systems, and why it matters for the future
of high-performance computing.
My name is Basra.
I graduated from a university in India.
I have worked at a couple of networking and observability companies, where I have seen
firsthand how this matters, from the enterprise internet all the way to
the AI networks that are prevalent nowadays.
In the 1960s, ARPANET, a packet-switching network, was formed to facilitate
research networking among research institutions and universities.
That was the foundation of the modern internet, with very
rudimentary monitoring tools: Interface Message Processor consoles, network
control center tools, NCP debugging tools, ping-like utilities, and so on.
From 1980 to 2000, the evolution of the Simple Network Management Protocol (SNMP) happened.
It became the de facto standard tool for monitoring
network performance and health metrics.
Then, in 2002 to 2015, most of the cloud-related
revolution happened.
In that 2002 to 2015 period, agent-based monitoring arrived, installed on
VMs or containers, like Zabbix or Datadog agents, and software-as-a-service
delivery reduced infrastructure overhead.
Then dashboards and visualizations, like New Relic and Datadog,
became common practice.
Real-time dashboards became very standard, with tools like Datadog and Stackdriver.
Alerting systems arrived: threshold-based and anomaly-detection alerts
like CloudWatch and Zabbix; integration with DevOps tools like PagerDuty, Slack,
and CI/CD systems; and multi-cloud monitoring,
in 2012 to 2015, with tools like Datadog and Stackdriver. From 2015 to now, we have the new AI-era tools.
These tools are developed on top of whatever the cloud environments already have,
and they are quite advanced in terms of network observability.
Datadog collects unified metrics, logs, and security signals with
AI-based detection. Grafana, Prometheus, and Loki form an
open-source stack for metrics, logs, and visualization.
OpenTelemetry is an open standard for collecting traces, metrics, and logs.
Arize AI does production model monitoring, drift detection, and explainability.
Dynatrace offers full-stack observability with deterministic AI.
And again, the cloud providers like AWS, GCP, and Azure have native
tools integrated into their cloud environments for observability.
So this is the history of the observability tools that have come along,
from the old internet until now.
Why does network observability matter for AI? Some of the very
common reasons are performance guarantees, failure domain
isolation, resource optimization, security assurance, et cetera.
Performance guarantees for AI involve setting and enforcing
measurable benchmarks for model accuracy, response time, scalability, and resource
utilization throughout the AI life cycle.
These guarantees ensure that AI systems meet defined service level
objectives under varying workloads and real-world conditions.
Establishing such guarantees is critical for building trust,
ensuring reliability, and aligning AI outputs with business or regulatory
expectations.
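As a rough illustration of what enforcing such guarantees can look like in practice, here is a minimal Python sketch that checks measured values against declared service level objectives. The objective names and thresholds are illustrative assumptions, not figures from this talk.

```python
# Minimal sketch: declaring and checking AI service-level objectives (SLOs).
# The objective names and thresholds below are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class SLO:
    name: str
    threshold: float
    higher_is_better: bool

    def is_met(self, measured: float) -> bool:
        # An SLO is met when the measured value is on the right side of its threshold.
        return measured >= self.threshold if self.higher_is_better else measured <= self.threshold

slos = [
    SLO("model_accuracy", 0.95, higher_is_better=True),
    SLO("p99_latency_ms", 200.0, higher_is_better=False),
    SLO("gpu_utilization", 0.85, higher_is_better=True),
]

measurements = {"model_accuracy": 0.97, "p99_latency_ms": 250.0, "gpu_utilization": 0.9}

for slo in slos:
    status = "OK" if slo.is_met(measurements[slo.name]) else "VIOLATED"
    print(f"{slo.name}: {status}")
```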
Failure domain isolation.
Failure domain isolation in AI refers to the practice of designing AI systems so that failures in one
component, such as data ingestion, model training, or inference do not
cascade and impact the entire pipeline.
By segmenting the system into independent, fault-tolerant
domains, teams can contain issues,
simplify debugging, and maintain overall system resilience.
This approach is essential for ensuring high availability and durability in
complex and distributed environments.
Resource optimization.
Resource optimization in AI focuses on efficiently managing
compute, storage, and memory resources across the AI life cycle,
from data processing and model training to deployment and inference.
Techniques such as model pruning, quantization, hardware acceleration,
and cloud scheduling help reduce costs while maintaining performance.
Effective resource optimization ensures scalability,
lowers environmental impact, and maximizes return on investment.
Security is very important in AI networks, because AI networks are always susceptible
and vulnerable to security threats.
Security assurance in AI involves implementing measures to protect
systems from threats such as data breaches, model theft,
adversarial attacks, and unauthorized access.
It includes securing training data, ensuring model integrity,
enforcing access controls, and regularly auditing system behavior.
Strong security assurance is critical for maintaining trust, compliance, and
the safe deployment of AI in sensitive and high-stakes environments.
There is also a visibility-at-scale challenge in network observability.
Full-stack visibility in AI refers to comprehensive monitoring and
understanding of the entire AI pipeline,
from data ingestion and pre-processing to model training, deployment, and
inference. It allows teams to track data quality, model performance, and
infrastructure usage in real time, ensuring transparency and accountability.
This visibility is crucial for detecting biases, managing drift,
and maintaining the reliability and ethical integrity of AI systems.
Distributed tracing.
Distributed tracing in AI provides visibility into the flow of
data and execution across different components of an AI system, such
as data pipelines, model training services, APIs, and inference engines.
It enables teams to track requests and processing steps end to end,
helping identify latency, bottlenecks, or failures across
distributed environments.
This traceability is crucial for debugging, optimizing performance
and ensuring accountability in complex AI workflows.
Real-time telemetry. This is an advanced version of monitoring. In AI,
it involves the continuous collection and monitoring of metrics, logs, and
events from AI systems as they operate.
It provides immediate insights into model performance, data quality,
system health, and user interactions.
This enables the rapid detection of anomalies, supports proactive
maintenance, and ensures AI systems remain reliable, efficient, and aligned with
expectations in dynamic environments.
Hardware-level metrics in AI refer to the monitoring of low-level system
indicators such as GPU and CPU utilization,
memory bandwidth, power consumption, temperature, and disk input/output.
These metrics are critical for understanding the resource demands
of AI workloads, optimizing performance, and preventing
hardware bottlenecks or failures.
By analyzing hardware-level data, teams can fine-tune model deployments,
improve efficiency, and ensure sustainable scaling of AI operations.
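As a small illustration, the following sketch polls some of the hardware-level indicators mentioned above. It assumes an NVIDIA GPU plus the pynvml and psutil Python packages; other vendors expose similar counters through their own tooling.

```python
# A minimal sketch of collecting the hardware-level metrics described above.
# Assumes the `pynvml` and `psutil` packages and an NVIDIA GPU.
import psutil
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

util = pynvml.nvmlDeviceGetUtilizationRates(handle)       # GPU and GPU-memory utilization (%)
temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000    # NVML reports milliwatts

print(f"GPU util: {util.gpu}% | GPU mem util: {util.memory}% | temp: {temp}C | power: {power_w:.1f}W")
print(f"CPU util: {psutil.cpu_percent(interval=1)}% | host mem: {psutil.virtual_memory().percent}%")
print(f"Disk read bytes: {psutil.disk_io_counters().read_bytes}")

pynvml.nvmlShutdown()
```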
Inside the modern network, look at port speed.
Once upon a time, a hundred gig was a revolution.
But nowadays, as I speak, there are systems being developed
with 800-gig port speeds; a single networking box has 51.2 terabits
per second of data processing capability, because that is the speed
required by AI systems, with an extraordinary volume of data packets processed daily
by large-scale AI infrastructure.
Ultra-low latency and near-instantaneous response are very
critical for synchronizing distributed training across the network.
Reliability, of course, and five-nines availability are
essential for uninterrupted model training and inference operations.
So the network has become very crucial, comprising speed, latency,
reliability, and throughput.
Now, observability. First, metrics.
They provide quantitative data on system performance and health.
AI-specific metrics include model accuracy, latency, throughput, data drift, and
resource utilization, which help track how well the AI system meets its objectives.
These metrics enable teams to quickly detect anomalies, optimize
model behavior, and maintain reliable, scalable AI deployments.
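For example, a minimal sketch of exposing such AI-specific metrics might look like the following, here using the Prometheus Python client purely as an illustrative choice; the metric names and values are made up.

```python
# Minimal sketch: exposing AI metrics (accuracy, drift, latency) for scraping.
# Assumes the `prometheus_client` package; metric names are illustrative.
import random, time
from prometheus_client import Gauge, Histogram, start_http_server

accuracy = Gauge("model_accuracy", "Current model accuracy")
drift = Gauge("data_drift_score", "Data drift score vs. training distribution")
latency = Histogram("inference_latency_seconds", "Inference latency")

start_http_server(8000)  # metrics served at http://localhost:8000/metrics

while True:
    with latency.time():                           # record per-request latency
        time.sleep(random.uniform(0.01, 0.05))     # stand-in for real inference work
    accuracy.set(0.95 + random.uniform(-0.02, 0.02))
    drift.set(random.uniform(0, 1))
```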
Logs provide detailed, timestamped records of discrete events within a system,
such as data processing steps, model training iterations, errors, and user
interactions. Logs offer rich, contextual information that helps diagnose
issues, trace failures, and understand the sequence of actions leading to
a problem. In AI, comprehensive logging is essential for auditing, debugging
complex workflows, and ensuring transparency throughout the AI life cycle.
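Here is a minimal sketch of structured, timestamped event logging for an AI pipeline, using only the Python standard library; the stages and field names are illustrative assumptions.

```python
# Minimal sketch: one JSON line per discrete event keeps logs machine-parsable
# for later auditing and debugging. Field names are illustrative.
import json, logging, sys, time

logger = logging.getLogger("ai_pipeline")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler(sys.stdout))

def log_event(stage: str, **fields):
    record = {"ts": time.time(), "stage": stage, **fields}
    logger.info(json.dumps(record))

log_event("data_processing", rows=120_000, dropped=42)
log_event("training", epoch=3, loss=0.127)
log_event("inference", request_id="req-123", latency_ms=38)
```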
Traces in AI capture the detailed end-to-end journey of requests or data as
they move through various components of an AI system, such as data ingestion,
model training, and inference services.
They help map the sequence and timing of operations, revealing
dependencies and pinpointing where delays or errors occur.
Tracing is crucial for diagnosing performance issues, understanding
system behavior, and ensuring smooth, reliable AI workflows
in distributed AI environments.
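Here is a minimal sketch of such end-to-end tracing using OpenTelemetry, which was mentioned earlier among the modern tools. The span names are illustrative, and the console exporter stands in for a real tracing backend.

```python
# Minimal sketch: nested spans map the sequence and timing of pipeline stages.
# Assumes the opentelemetry-sdk package; spans are exported to the console.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("ai.pipeline")

with tracer.start_as_current_span("handle_request"):
    with tracer.start_as_current_span("data_ingestion"):
        pass  # fetch and preprocess features
    with tracer.start_as_current_span("inference"):
        pass  # run the model and return predictions
```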
The programmable telemetry revolution.
Telemetry has become rich, with monitoring approaches like gRPC-based
streaming and various other products;
a lot of companies are developing telemetry tools. Traditional monitoring
focuses on collecting and analyzing predefined system metrics, like CPU usage,
memory consumption, disk, and network traffic, to ensure infrastructure and
applications are running smoothly.
It often relies on threshold-based alerts and periodic checks to detect
failures or performance issues. While effective for basic operational health,
traditional monitoring lacks the deep contextual insights needed for complex
modern AI or distributed systems.
Modern telemetry, such as gRPC-based streaming telemetry, goes beyond traditional
monitoring by continuously collecting rich, high-resolution data,
including metrics, logs, events, and of course traces, from all layers of a system in real time.
It leverages advanced analytics, machine learning and automation to
provide deeper insights into system behavior, detect anomalies, and predict
issues before they impact users.
This approach is essential for managing complex distributed AI
systems and dynamic cloud environments with greater accuracy and agility.
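The following sketch contrasts that idea with simple threshold polling: a continuous stream of high-resolution samples feeds a rolling anomaly check. The stream_interface_counters generator is a hypothetical stand-in for a real gRPC/gNMI subscription, and the z-score rule is only an illustrative analytic.

```python
# Minimal sketch: streaming telemetry feeding a rolling anomaly check,
# rather than a fixed threshold evaluated on periodic polls.
import itertools, random, statistics
from collections import deque

def stream_interface_counters():
    # Hypothetical stand-in: a real collector would hold a gRPC/gNMI subscription.
    while True:
        yield {"interface": "eth0", "rx_bps": random.gauss(4e9, 2e8)}

window = deque(maxlen=60)  # rolling window of high-resolution samples
for sample in itertools.islice(stream_interface_counters(), 600):
    window.append(sample["rx_bps"])
    if len(window) < window.maxlen:
        continue
    mean, stdev = statistics.mean(window), statistics.pstdev(window)
    if stdev and abs(sample["rx_bps"] - mean) > 3 * stdev:
        # A deviation a periodic threshold poll could easily miss between checks.
        print(f"anomaly on {sample['interface']}: {sample['rx_bps']:.0f} bps")
```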
Self-healing network architectures.
So networks should no longer need to wait for a human to heal them.
They have to self-heal so that they will be resilient, they will
be available, and they will sustain high throughput.
First, detect. Self-healing network architectures detect operational
issues by continuously monitoring network health through
real-time telemetry, including metrics, logs, and distributed traces.
They use automated anomaly detection, AI analytics, and predefined rules
to identify faults, performance degradations, or security threats
without human intervention.
Once detected, these systems can trigger automatic remediation actions, like
rerouting traffic, restarting services, or adjusting configurations,
to maintain optimal network performance and minimize downtime.
In self-healing network architectures, analysis plays a critical role
by processing the continuous stream of telemetry data to
identify patterns, anomalies, and root causes of network issues.
This involves AI and machine learning algorithms that correlate metrics,
logs, and traces across distributed components, enabling real-time
diagnosis of failures or performance degradations.
Effective analysis allows the system to prioritize incidents,
predict potential faults, and make informed decisions for automated recovery,
thereby minimizing human intervention and ensuring network resilience.
The execute phase involves automatically carrying out corrective actions once an
issue has been detected and analyzed.
This can include tasks like rerouting traffic around a faulty node,
restarting failed services, adjusting configurations, or
scaling resources dynamically.
Execution is driven by predefined playbooks, AI-powered decision
engines, or orchestration tools, enabling the
network to quickly recover from faults and maintain continuous, reliable
operation without manual intervention.
In self-healing networks, the decide phase involves evaluating the analyzed
data to determine the most effective remediation action. The decision-making
process leverages SLOs, policy rules, and historical incident knowledge
to select the best course of action, whether to isolate a faulty component,
trigger a failover, or apply a configuration change. Accurate and timely
remediations are crucial for minimizing downtime, preventing cascading failures,
and maintaining overall network stability.
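Putting the detect, analyze, decide, and execute phases together, here is a minimal control-loop sketch. The telemetry fields, thresholds, and playbook actions are illustrative assumptions, not a real remediation engine.

```python
# Minimal sketch of a detect -> analyze -> decide -> execute loop.
# Telemetry fields, thresholds, and playbook actions are illustrative.
import random, time

PLAYBOOK = {
    "link_degraded": "reroute_traffic",
    "service_crashed": "restart_service",
    "capacity_pressure": "scale_out",
}

def detect(telemetry):
    # Detect/analyze: flag an incident type from continuously collected metrics.
    if telemetry["packet_loss_pct"] > 1.0:
        return "link_degraded"
    if not telemetry["service_up"]:
        return "service_crashed"
    if telemetry["utilization_pct"] > 90:
        return "capacity_pressure"
    return None

def decide_and_execute(incident):
    # Decide: map the incident to a remediation; Execute: carry it out automatically.
    action = PLAYBOOK.get(incident, "escalate_to_human")
    print(f"incident={incident} -> action={action}")

for _ in range(5):
    sample = {
        "packet_loss_pct": random.uniform(0, 2),
        "service_up": random.random() > 0.1,
        "utilization_pct": random.uniform(50, 100),
    }
    incident = detect(sample)
    if incident:
        decide_and_execute(incident)
    time.sleep(0.1)
```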
These are the various flows that happen in AI networks.
Model training in AI is the process of teaching a machine learning
algorithm to recognize patterns in data by adjusting its parameters to
improve performance on a specific task.
Parameter serving in AI is the process of efficiently delivering trained model
parameters to inference systems to enable fast and scalable predictions.
Data ingestion is the process of collecting, importing, and
processing raw data from various sources into your system
for analysis or model training.
Inference in AI is the process where a trained model makes
predictions or decisions based on new, unseen data.
Of course, visibility is a big challenge, because to observe the
networks we need to have visibility, and that is still a challenge in some areas.
Hardware opacity challenges in AI systems refer to the
difficulty of understanding and optimizing how underlying hardware
like GPUs, TPUs, or specialized accelerators affects AI model performance, due to
limited visibility into hardware operations and complex interactions
between the software and hardware layers.
Hardware is made by different vendors.
They follow different protocols and different procedures,
so there is no common standard, and
digging into the hardware is a challenge.
Cross-domain correlation in AI systems refers to identifying and analyzing
relationships between data, events, or behaviors across different domains or
components, such as combining insights from user behavior, system logs, and network
metrics to improve model accuracy, detect anomalies, or enhance decision making.
Scale limitations in observability refer to the difficulties in collecting,
processing, and analyzing the massive volumes of telemetry data generated by
large, distributed AI systems.
As systems grow, traditional tools struggle with data storage, real-time
processing, and maintaining low-latency insights, making it harder to achieve
comprehensive visibility and timely troubleshooting. Overcoming
these limitations requires scalable architectures and intelligent data
sampling or aggregation techniques.
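As one concrete example of intelligent data sampling, here is a minimal reservoir-sampling sketch that reduces a large telemetry stream to a fixed-size uniform sample; the event fields and sizes are illustrative.

```python
# Minimal sketch: reservoir sampling keeps a uniform fixed-size sample of a
# telemetry stream too large to store or analyze in full.
import random

def reservoir_sample(stream, k):
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item  # replace with decreasing probability
    return reservoir

# 1,000,000 telemetry events reduced to a 1,000-event sample for offline analysis.
events = ({"span_id": i, "latency_ms": random.expovariate(1 / 20)} for i in range(1_000_000))
sample = reservoir_sample(events, 1_000)
print(len(sample))
```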
Now, the future of observable networks. AI-powered observability leverages
artificial intelligence and machine learning to automatically collect,
analyze, and interpret vast amounts of telemetry data, such as metrics, logs,
and traces, to detect anomalies, predict failures, and provide actionable insights.
This approach enhances traditional monitoring by enabling faster root
cause analysis, proactive issue resolution, and smarter decision making
in complex AI and distributed systems.
Chip-level telemetry refers to the monitoring and collection of detailed
performance and health data directly from the hardware chips, such as
CPUs, GPUs, or AI accelerators.
This includes metrics like power usage, temperature, clock speeds,
cache hits or misses, and hardware interrupts, providing deep visibility
into how the physical hardware operates and impacts AI workloads.
Such fine-grained telemetry is essential for optimizing performance,
detecting hardware faults early, and improving overall system reliability.
Intent-based observability in AI focuses on aligning monitoring and analysis
tools with the desired business or operational outcomes by interpreting
the system's intended behavior. Instead of just collecting raw data,
it uses AI to understand whether the system's actions meet predefined goals,
enabling proactive detection of deviations and automated adjustments.
This approach helps ensure AI systems remain aligned
with user expectations and business objectives in dynamic environments.
For example, instead of checking what the CPU utilization is, it's a business-based
policy: are we serving 4K quality to the important users in the Europe region?
That is the business objective,
and based on that we can take corrective actions in the observability system.
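A minimal sketch of that kind of intent check might look like the following, where the business objective is expressed directly instead of a CPU threshold. The session records, field names, and target ratio are illustrative assumptions.

```python
# Minimal sketch: evaluate the business intent "important users in the Europe
# region get 4K quality" instead of a raw CPU threshold. Data is illustrative.
sessions = [
    {"user_tier": "important", "region": "EU", "resolution": "4k"},
    {"user_tier": "important", "region": "EU", "resolution": "1080p"},
    {"user_tier": "standard",  "region": "US", "resolution": "4k"},
]

INTENT = {"user_tier": "important", "region": "EU", "target_resolution": "4k", "min_ratio": 0.99}

scoped = [s for s in sessions
          if s["user_tier"] == INTENT["user_tier"] and s["region"] == INTENT["region"]]
ratio = sum(s["resolution"] == INTENT["target_resolution"] for s in scoped) / max(len(scoped), 1)

if ratio < INTENT["min_ratio"]:
    print(f"intent violated: only {ratio:.0%} of important EU sessions at 4K; trigger corrective action")
```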
Digital twin networks are virtual replicas of physical network infrastructure that
simulate their behavior, performance, and interactions in real time.
They enable testing, monitoring, and optimizing of network operations
by mirroring live conditions without impacting the actual system.
So this technology helps in predictive maintenance, capacity planning,
and accelerating troubleshooting in complex networking environments.
So, on a closing note: network observability plays a
key role in planning and maintaining networks for high resiliency,
high throughput, ultra-low latency, and so on.
This network observability is very important in the coming years of AI-based
networks, which are very-low-latency, high-performance computing networks.
Thank you.