Conf42 Large Language Models (LLMs) 2025 - Online

- premiere 5PM GMT

Observability for Large Language Models

Abstract

Unlock the secrets to scaling and maintaining Large Language Models! Discover how cutting-edge SRE practices can supercharge observability for AI systems. Learn how to ensure reliability, optimize performance, and keep these complex models running smoothly in production. Your AI future starts here!

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, welcome to Conf42 LLM 2025. My name is Ankush Sharma; I'm a tech engineering leader specializing in cloud infrastructure, SRE, and AI-driven innovation. My career spans Microsoft and fast-paced startups. In this talk we will focus on observability, SRE, and how these practices apply to LLM systems. The agenda: an introduction to AI and LLM trends, LLM and SRE fundamentals, observability in AI systems, the SRE role in managing LLMs, case studies, and some open discussion to conclude.

We are currently experiencing an exponential arc in AI. It started in 2012 with AlexNet, and we have steadily trended toward models like GPT-4. The field hopes to reach AGI, artificial general intelligence, within the next five years, and with continued progress over the next ten-plus years, perhaps artificial superintelligence. As for the current market, the LLM segment is projected to grow at a 37.2% rate by 2030.

So what is an LLM? LLMs are advanced AI systems trained on vast datasets to understand and generate human-like text. Their applications include content generation, customer support, research, and coding assistance; GPT-4 and PaLM are two examples. LLMs have transformed many industries, including healthcare, education, and software development.

A quick overview of SRE: site reliability engineering ensures reliability, scalability, and efficiency in software systems. Key practices include proactive monitoring, automation, and incident response. SRE's core guiding principles are SLIs (service level indicators), SLOs (service level objectives), and error budgets, but the high-level focus is how to balance reliability with innovation and speed.

Next is observability: how can we ensure observability in AI systems? Basically, observability means that system behavior is measurable and understandable.
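Since the talk leans on SLIs, SLOs, and error budgets, here is a minimal sketch of how an error budget falls out of an SLO target. The function names and the example numbers are illustrative, not from the talk.

```python
def error_budget_minutes(slo_percent: float, window_days: int) -> float:
    """Minutes of allowed unavailability implied by an SLO over a window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100.0)

def budget_remaining(slo_percent: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent (negative = overspent)."""
    allowed_failures = total_requests * (1 - slo_percent / 100.0)
    return 1.0 - failed_requests / allowed_failures

# A 99.9% SLO over a 30-day window allows roughly 43.2 minutes of downtime.
print(error_budget_minutes(99.9, 30))
# 10M requests at 99.9%: ~10,000 failures allowed; 2,500 seen leaves ~75% of budget.
print(budget_remaining(99.9, 10_000_000, 2_500))
```

The point of the error budget is the balance the talk mentions: while budget remains, you ship features; once it is spent, you prioritize reliability work.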
We have three key pillars of observability: metrics, logs, and traces. Metrics are continuous values such as CPU usage, memory utilization, and API response times. Logs are detailed, timestamped records of system events. Traces visualize the flow of requests through your various services to identify bottlenecks; traces are very helpful for developers.

There are challenges in observing LLMs. LLMs operate on enormous datasets, creating challenges in performance monitoring. LLMs also have dynamic behavior: they adapt based on their inputs, which makes their responses non-deterministic. There is a transparency problem: the black-box nature of LLMs complicates understanding their internal decision process. And there is scalability: monitoring and managing LLMs requires robust, scalable infrastructure.

So how can we implement observability in LLMs? It's a step-by-step process. Step one: define your key metrics; monitor prediction latency, throughput, and error rates. Step two: integrate advanced logging; collect logs for inputs, outputs, and model responses. Step three: use distributed tracing; map end-to-end interactions to pinpoint failures. Step four: deploy real-time dashboards, using Grafana or any other visualization tool.

Now, what is the role of SRE in managing LLMs? SRE responsibilities for LLMs include ensuring 24/7 availability through automated failover systems and developing runbooks for common incident scenarios. Automation is definitely key: automate scaling up resources during high-demand periods, which also tackles toil. Collaboration with the AI team is very important, providing feedback on model performance for iterative improvements.

We have multiple case studies to focus on. One is an AI chatbot, where we monitor response latency to ensure a good user experience.
In healthcare AI, observability can ensure compliance with medical guidelines and reduce false positives. In fraud detection, in a security or banking system, continuous monitoring of prediction accuracy identifies fraudulent transactions.

To conclude, my talk focused on observability and SRE. These two concepts are vital to ensuring the reliability and scalability of LLMs. As AI systems evolve, adopting innovative practices and prioritizing reliability will be key to building resilient and trustworthy AI systems that drive value across industries.

Finally, I have some book recommendations. For SRE, definitely the SRE handbook (Site Reliability Engineering), co-written by Niall Murphy; it is free on sre.google/books and very helpful. Another book, which I recently finished, goes beyond this and helps in implementing all the concepts we just learned. Thank you for listening to my talk. See ya.
...

Ankush Sharma

Engineering Leader


