Transcript
Hello everyone.
Welcome to Conf42 LLMs 2025.
I'm a tech engineering leader specializing in cloud infrastructure, SRE, and AI-driven innovation.
My career spans Microsoft and fast-paced startups.
For this talk, we will focus on observability, SRE, and how these are implemented in LLM systems.
The agenda will be: an introduction to AI and LLM trends, understanding LLMs, SRE fundamentals, observability in AI systems, the role of SRE in managing LLMs, and case studies.
And we'll conclude with some open discussion.
First, AI and LLM trends.
So we are right now experiencing an exponential arc in AI.
We started in 2012 with AlexNet, and slowly we are trending towards GPT-4.
We are expecting to reach AGI, which is artificial general intelligence, within the next five years.
Okay.
And hopefully, with this progress, in the next ten-plus years maybe we can also have artificial superintelligence.
Now, the current LLM market trend: we are expecting a 37.2% growth rate by 2030, and this is focused mostly on the LLM side of it.
Introduction to LLMs.
So what is an LLM? LLMs are advanced AI systems trained on vast datasets to understand and generate human-like text.
Their applications include content generation, customer support, research, and coding assistance, and a couple of examples of LLMs are GPT-4 and PaLM.
LLMs have transformed so many industries, including healthcare, education, and software development.
Overview of SRE. So, an introduction to SRE: SRE ensures reliability, scalability, and efficiency in software systems.
Key practices include proactive monitoring, automation, and incident response.
So SRE has some core principles: service level indicators (SLIs), service level objectives (SLOs), and error budgets.
All these are guiding principles of SRE, but the high-level focus is how we can balance reliability with innovation and speed.
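To make the SLO and error-budget idea a bit more concrete, here is a minimal sketch in Python, assuming an illustrative 99.9% availability SLO over a 30-day window; the numbers are my own example, not from the talk.

```python
# Toy error-budget arithmetic, assuming an illustrative 99.9% availability
# SLO over a 30-day window (both numbers are assumptions, not from the talk).

SLO_TARGET = 0.999              # 99.9% of events must be "good"
WINDOW_MINUTES = 30 * 24 * 60   # 30-day window = 43,200 minutes

error_budget_minutes = (1 - SLO_TARGET) * WINDOW_MINUTES
print(f"Allowed downtime per window: {error_budget_minutes:.1f} minutes")  # ~43.2

# An SLI is the measured ratio of good events to total events.
good_requests, total_requests = 999_400, 1_000_000
sli = good_requests / total_requests
budget_remaining = (sli - SLO_TARGET) / (1 - SLO_TARGET)
print(f"SLI = {sli:.4%}, error budget remaining: {budget_remaining:.0%}")
```

The point is simply that the error budget quantifies how much unreliability the team can still spend on shipping faster.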
Next part is observability.
So how can observability be ensured in AI systems? Basically, observability means that system behavior is measurable and understandable.
We have three key pillars of observability: metrics, logs, and traces.
Metrics are continuous values, like CPU usage, memory utilization, and API response times. Logs are detailed, timestamped records of system events.
Traces visualize the flow of requests through your various services to identify bottlenecks; traces are very helpful for developers.
Now, the challenges in observing LLMs. So LLMs operate on enormous datasets, creating challenges in performance monitoring.
And LLMs have dynamic behavior: LLMs adapt based on their inputs, making their responses non-deterministic.
Then there is transparency: the black-box nature of LLMs complicates understanding their internal decision process.
And scalability: monitoring and managing LLMs requires robust and scalable infrastructure.
So how can we implement observability in LLMs?
It's a step-by-step process.
Step one is that we need to define our key metrics: monitor prediction latency, throughput, and error rates.
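As a rough sketch of what step one could look like, here is how these metrics might be defined with the Prometheus Python client (prometheus_client); the metric names, labels, buckets, and the call_model stub are illustrative assumptions, not something from the talk.

```python
# Sketch: defining key LLM-serving metrics with the Prometheus Python client.
# Metric names, labels, buckets, and call_model() are illustrative assumptions.
from prometheus_client import Counter, Histogram, start_http_server

PREDICTION_LATENCY = Histogram(
    "llm_prediction_latency_seconds",
    "End-to-end latency of a single LLM prediction",
    buckets=(0.1, 0.25, 0.5, 1, 2, 5, 10),
)
REQUESTS_TOTAL = Counter(
    "llm_requests_total", "Total LLM requests served (throughput)", ["status"]
)

def call_model(prompt: str) -> str:
    # Hypothetical stand-in for the real LLM inference call.
    return f"echo: {prompt}"

def handle_request(prompt: str) -> str:
    with PREDICTION_LATENCY.time():                    # records latency automatically
        try:
            response = call_model(prompt)
            REQUESTS_TOTAL.labels(status="ok").inc()
            return response
        except Exception:
            REQUESTS_TOTAL.labels(status="error").inc()
            raise

if __name__ == "__main__":
    start_http_server(8000)                            # exposes /metrics for scraping
    print(handle_request("Summarize this incident report."))
```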
Step two is to integrate advanced logging: collect logs for inputs, outputs, and model behavior.
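Here is one possible sketch of step two, using Python's standard logging module to emit structured, timestamped records per interaction; the field names are assumptions, and in practice you may want to log sizes or hashes instead of raw text if the data is sensitive.

```python
# Sketch: structured, timestamped logging of one LLM interaction using the
# standard library; field names are illustrative assumptions.
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("llm.audit")

def log_interaction(prompt: str, response: str, model_version: str, latency_s: float) -> None:
    # Log sizes rather than raw text if prompts/responses are sensitive.
    logger.info(json.dumps({
        "ts": time.time(),
        "event": "llm_interaction",
        "model_version": model_version,
        "prompt_chars": len(prompt),
        "response_chars": len(response),
        "latency_s": round(latency_s, 3),
    }))

log_interaction("What is SRE?", "Site Reliability Engineering is ...", "demo-model-v1", 0.42)
```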
Once that is done, we use distributed tracing: map end-to-end interactions to pinpoint your failures.
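A minimal sketch of step three, assuming the OpenTelemetry Python SDK (opentelemetry-api and opentelemetry-sdk) with a console exporter; the span names, attributes, and the retrieval/inference split are illustrative.

```python
# Sketch: distributed tracing of an LLM request with OpenTelemetry.
# Span names, attributes, and the retrieval/inference split are illustrative.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))  # swap for OTLP in production
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("llm.service")

def answer(question: str) -> str:
    with tracer.start_as_current_span("llm.request") as span:
        span.set_attribute("llm.prompt_chars", len(question))
        with tracer.start_as_current_span("llm.retrieval"):
            context = "..."                                  # e.g. a vector-store lookup
        with tracer.start_as_current_span("llm.inference"):
            reply = f"stub answer to: {question} (context: {context})"  # stand-in for the model call
        span.set_attribute("llm.response_chars", len(reply))
        return reply

print(answer("Why did checkout latency spike?"))
```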
Okay, and then deploy real-time dashboards.
The dashboards can be Grafana or any other visualization tool we can use.
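If the metrics from step one are scraped by Prometheus, the Grafana panels could be backed by queries like the ones below; these PromQL expressions (shown here as Python strings) and panel names are illustrative assumptions.

```python
# Sketch: example Grafana panel queries against the Prometheus metrics above,
# expressed as PromQL strings; panel names and windows are illustrative.
DASHBOARD_PANELS = {
    "p95 prediction latency": (
        "histogram_quantile(0.95, "
        "sum(rate(llm_prediction_latency_seconds_bucket[5m])) by (le))"
    ),
    "throughput (req/s)": "sum(rate(llm_requests_total[5m]))",
    "error rate": (
        'sum(rate(llm_requests_total{status="error"}[5m])) '
        "/ sum(rate(llm_requests_total[5m]))"
    ),
}

for panel, query in DASHBOARD_PANELS.items():
    print(f"{panel}: {query}")
```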
Now, what's the role of SRE in managing LLMs?
So SRE responsibilities for LLMs include ensuring 24/7 availability through automated failover systems and developing runbooks for common incident scenarios.
Automation is definitely key: automate scaling up of resources during high-demand periods.
And this also involves tackling toil and similar work.
Collaboration with the AI team is very important, providing feedback on model performance for iterative improvements.
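To illustrate the autoscaling point above, here is a toy sketch of a scaling decision driven by p95 latency and queue depth; the thresholds and the scale_to hook are hypothetical, and in a real setup this would typically map onto something like a Kubernetes autoscaler.

```python
# Toy sketch: an autoscaling decision for LLM serving replicas driven by
# p95 latency and queue depth. Thresholds and scale_to() are hypothetical.
def desired_replicas(current: int, p95_latency_s: float, queue_depth: int,
                     latency_slo_s: float = 2.0, max_replicas: int = 20) -> int:
    if p95_latency_s > latency_slo_s or queue_depth > 100:
        return min(current * 2, max_replicas)   # scale up aggressively under load
    if p95_latency_s < 0.5 * latency_slo_s and queue_depth == 0:
        return max(current - 1, 1)              # scale down slowly when idle
    return current

def scale_to(replicas: int) -> None:
    # Hypothetical hook into the orchestration layer (e.g. a Kubernetes API call).
    print(f"scaling to {replicas} replicas")

scale_to(desired_replicas(current=4, p95_latency_s=3.1, queue_depth=250))
```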
We have multiple case studies to focus on.
One of the case studies is an AI chatbot: monitoring response latency to ensure a good user experience.
There is also healthcare AI, where you can ensure compliance with medical guidelines and reduce false positives.
And fraud detection, which can be in a security or banking system: continuous monitoring of prediction accuracy to identify fraudulent transactions.
So the conclusion of my talk is basically focused on observability and SRE. These two concepts, observability and SRE, are vital to ensure we have reliability and scalability of LLMs. As AI systems evolve, adopting innovative practices and prioritizing reliability will be the key to building resilient and trustworthy AI systems.
This drives value across industries.
And finally, I have a recommendation on books.
So for SRE, definitely the SRE book, Site Reliability Engineering, co-authored by Niall Murphy.
It's available for free at sre.google/books.
That is very helpful.
And another book, which I recently completed, is about observability engineering.
This book will definitely go beyond and help in implementing all the concepts that we just learned.
Thank you for listening to my talk.
Yeah.
See ya.