Transcript
This transcript was autogenerated.
Hello everyone.
Thank you for being here today.
I'll be presenting on the topic of scaling conversational AI: SRE challenges
and solutions for high availability.
CCAI systems are increasingly used to handle millions, even billions
of customer interactions every day.
These systems power virtual agents, automated chat responses,
voice recognition, and much more.
With this scale, reliability becomes absolutely critical.
In this session, we'll explore the unique Site Reliability Engineering
challenges in CCAI and proven strategies to overcome them, from
observability frameworks and automated remediation to progressive
rollouts and chaos testing.
I'm excited to share practical lessons and battle-tested techniques that can help
build more resilient AI-driven platforms.
The CCAI reliability challenge.
Let's start by understanding the scale and constraints we operate under.
In CCAI systems, we aim for 99.99% uptime.
That's just over four minutes of downtime per month.
This is a non-negotiable standard for enterprise systems, especially when we
are dealing with customer interactions.
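As a quick back-of-the-envelope check, here is how that availability target translates into a monthly error budget; a minimal sketch, not our production tooling:

```python
# Minimal sketch: translate an availability SLO into a monthly error budget.
SLO = 0.9999                      # 99.99% availability target
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month

error_budget_minutes = (1 - SLO) * MINUTES_PER_MONTH
print(f"Allowed downtime per month: {error_budget_minutes:.1f} minutes")
# -> Allowed downtime per month: 4.3 minutes
```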
The next key requirement is latency.
Our target is a maximum of 300 milliseconds response time.
This is incredibly tight for systems powered by heavy NLP models
and multi-service dependencies.
With automation, we've been able to reduce MTTR, mean time to recovery, by 73%,
which is a huge win in improving customer experience and operational efficiency.
Unlike traditional web apps, failures in CCAI are immediately visible.
They can frustrate customers, cause escalations, and directly
impact business KPIs.
What makes CCAI even more challenging is the unpredictable nature of traffic
and the complexity of upstream integrations like CRMs, speech systems, third-party
APIs, et cetera. Plus, the AI models themselves are not easy to interpret.
They are black boxes in many ways.
This all adds up to a unique reliability challenge, which we'll
be discussing in the next slide.
Traditional monitoring limitations.
Traditional monitoring tools were never built with AI systems in mind.
For instance, transformer-based NLP models, which power intent
recognition and response generation, are essentially black boxes.
We can't inspect their internal states easily.
Tools that work well with microservices often fail here.
Another big problem is load patterns.
CCAI systems don't follow typical usage trends.
You might have traffic spikes during product launches, seasonal sales,
or even random social media trends. Threshold-based alerting doesn't work
well because it can't predict or adapt to these irregular patterns.
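To make that limitation concrete, here is a minimal sketch contrasting a fixed threshold with a rolling-baseline check. The traffic numbers and the z-score approach are illustrative assumptions, not our actual alerting stack:

```python
# Minimal sketch: a static threshold vs. a rolling-baseline (z-score) check.
# Traffic numbers are hypothetical; a real baseline would be learned from history.
from collections import deque
import statistics

STATIC_THRESHOLD = 5_000                  # requests/min, tuned for a "normal" week

def static_alert(rpm: float) -> bool:
    return rpm > STATIC_THRESHOLD

def adaptive_alert(window, rpm: float, z_cutoff: float = 4.0) -> bool:
    mean = statistics.mean(window)
    stdev = statistics.stdev(window) or 1.0
    return (rpm - mean) / stdev > z_cutoff

window = deque([2_000, 2_100, 1_900], maxlen=60)   # quiet-period baseline

# A seasonal sale pushes sustained traffic to ~8k rpm: busy, but healthy.
for rpm in [7_800, 8_000, 8_100, 7_900]:
    print(static_alert(rpm), adaptive_alert(window, rpm))
    window.append(rpm)                    # the rolling baseline absorbs the new level
# The static check pages on every sample of the sale; the adaptive check
# fires once at the onset and then adapts to the new normal.
```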
Then there is the issue of complex dependency chains.
A single CCAI session might involve APIs, databases, voice engines,
third-party integrations, et cetera.
A small failure in one backend can ripple through the entire AI stack,
and traditional monitoring might not be able to detect this.
That's why we had to build a new AI observability model, which we'll
be discussing in the next slide.
CCAI observability framework.
To close these gaps, we developed a four-layer observability framework tailored
for CCAI systems.
One, business impact metrics, like conversation completion rates and CSAT,
customer satisfaction.
These tell us if users are getting what they need.
Two, model performance metrics.
We track intent accuracy, fallback rate, and entity extraction quality.
These give us insight into how well the NLP models are performing.
Three, system performance.
Here we look at latency, error rates, throughput, et cetera.
These are all the traditional SRE metrics.
Four, infrastructure health.
This is about CPU, memory, disk usage, and system saturation.
By combining all of these, we can map a symptom, like an increase in fallback
responses, to its root cause, whether it's model drift, a backend slowdown,
or an infrastructure issue.
This observability model helps bridge the gap between business
impact and technical telemetry.
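Here is a minimal sketch of how those four layers can be encoded so a symptom is traced downward toward a root cause; the metric names are illustrative, not our actual schema:

```python
# Minimal sketch of the four-layer metric taxonomy described above.
# Metric names are illustrative, not our actual schema.
LAYERS = [
    ("business_impact",    ["conversation_completion_rate", "csat"]),
    ("model_performance",  ["intent_accuracy", "fallback_rate", "entity_extraction_quality"]),
    ("system_performance", ["p99_latency_ms", "error_rate", "throughput"]),
    ("infrastructure",     ["cpu_utilization", "memory_utilization", "disk_usage", "saturation"]),
]

def layers_to_inspect(symptom_metric: str) -> list[str]:
    """Return the layer a symptom belongs to plus every layer beneath it,
    ordered top-down, so the symptom can be traced toward a root cause."""
    for i, (layer, metrics) in enumerate(LAYERS):
        if symptom_metric in metrics:
            return [name for name, _ in LAYERS[i:]]
    return [name for name, _ in LAYERS]

print(layers_to_inspect("fallback_rate"))
# -> ['model_performance', 'system_performance', 'infrastructure']
```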
Moving on to the next slide, automated remediation strategies.
Now, even with great observability, incidents will happen.
So our next priority is automated remediation:
reducing the recovery time when things go wrong.
Our process starts with anomaly detection using ML based pattern recognition.
This lets us spot unusual conversation flows or system metrics in real time.
Then we triage and classify the issue automatically based on patterns we've
learned over time, like intent recognition degradation or resource starvation.
Once classified, we trigger a remediation action.
This might involve scaling up, rerouting the traffic, throttling,
or even restarting specific services.
And finally, we have a feedback loop.
Each incident teaches the system something new, improving
future detection and resolution.
This automated approach has helped us reduce MTTR by more than 70%,
which means users barely notice when something goes wrong.
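A minimal sketch of that detect, classify, remediate, learn loop; the rules and actions below are simplified stand-ins for the ML-based pattern recognition described above:

```python
# Minimal sketch of the detect -> classify -> remediate -> learn loop.
# Thresholds, classes, and actions are simplified, illustrative stand-ins.
from dataclasses import dataclass

@dataclass
class Anomaly:
    metric: str
    value: float

def classify(anomaly: Anomaly) -> str:
    if anomaly.metric == "fallback_rate" and anomaly.value > 0.15:
        return "intent_recognition_degradation"
    if anomaly.metric == "memory_utilization" and anomaly.value > 0.90:
        return "resource_starvation"
    return "unknown"

REMEDIATIONS = {
    "intent_recognition_degradation": "route_to_fallback_model",
    "resource_starvation": "scale_out_and_restart_service",
    "unknown": "page_oncall",
}

def remediate(anomaly: Anomaly, incident_log: list[dict]) -> str:
    incident_class = classify(anomaly)
    action = REMEDIATIONS[incident_class]
    # Feedback loop: every incident is recorded so the classifier (or the
    # model behind it) can be improved from what actually happened.
    incident_log.append({"anomaly": anomaly, "class": incident_class, "action": action})
    return action

incident_log: list[dict] = []
print(remediate(Anomaly("fallback_rate", 0.22), incident_log))
# -> route_to_fallback_model
```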
Scaling strategies for peak loads.
Let's talk about scaling, particularly during peak loads.
Traffic spikes in CCAI systems are hard to predict, but not impossible.
We use predictive auto-scaling to anticipate these based on historical data and events.
Next, we use token-based rate limiting,
a fine-grained way to control the number of conversations being
processed, preventing overload while prioritizing high-value queries.
We also implement conversation sharding, distributing different conversations
across different clusters or shards, so no single resource pool gets overwhelmed.
And during extreme loads, we use model optimizations like
switching to lightweight models or reducing precision temporarily
to maintain responsiveness.
All of this helps keep latency under 300 milliseconds, even
during 10x traffic bursts.
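As a sketch of the token-based rate limiting idea, with bucket sizes, refill rates, and priority tiers as illustrative assumptions:

```python
# Minimal sketch: token buckets for admitting conversations by priority.
# Refill rates and capacities are illustrative assumptions.
import time

class TokenBucket:
    def __init__(self, refill_per_sec: float, capacity: int):
        self.refill_per_sec = refill_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# High-value conversations draw from a larger bucket, so they keep flowing
# during a 10x burst while lower-priority traffic is throttled first.
BUCKETS = {"high_value": TokenBucket(500, 1_000), "standard": TokenBucket(100, 200)}

def admit(priority: str) -> bool:
    return BUCKETS[priority].allow()

print(admit("high_value"))  # -> True while the bucket still has tokens
```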
Moving on to the next slide, common CCAI failure patterns.
Through our operational experience, we've identified several recurring failure
patterns unique to CCAI. First, intent classification degradation.
This often happens gradually.
A new product name or slang throws off the model and accuracy drops over time.
Next, latency amplification cascades. A small backend delay gets compounded
as the AI pipeline processes it, leading to exponential latency; a short
calculation after these failure patterns shows how quickly this adds up.
Then, resource starvation.
A single edge-case user query might consume excessive memory
or CPU, affecting all sessions.
And finally, dependency chain failures.
When a single external API or a backend service goes down, it can take
down the entire conversational flow.
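To see how quickly latency amplification compounds, here is a toy calculation; the stage latencies and retry behavior are hypothetical, not measurements from our stack:

```python
# Toy calculation: how a small backend delay compounds across an AI pipeline.
# Stage latencies and retry behaviour are hypothetical.
STAGES_MS = {"speech_to_text": 60, "intent_model": 80, "dialog_policy": 40,
             "backend_api": 50, "response_generation": 60}

def end_to_end_latency(extra_backend_delay_ms: float, retries: int) -> float:
    """Downstream retries or re-ranking multiply the extra backend delay."""
    base = sum(STAGES_MS.values())
    return base + extra_backend_delay_ms * (1 + retries)

print(end_to_end_latency(0, 0))    # 290 ms: within the 300 ms budget
print(end_to_end_latency(40, 2))   # 410 ms: a 40 ms delay plus retries blows the budget
```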
Knowing these patterns helps us build better detection, alerting,
and chaos testing frameworks, which we'll be discussing in the next slide.
Chaos engineering.
This brings us to chaos engineering, but with a twist.
One that's tailored for AI.
Our goal is to proactively test what happens when things go wrong
in the AI layer specifically, not just the infrastructure.
We simulate scenarios like corrupted conversation memory,
misclassified intents in sequential order, or even model component
failures due to memory overflow.
Now, the process involves forming hypotheses about failure modes,
designing targeted chaos experiments, executing them with
safety controls, and then analyzing the results and hardening the system.
This AI-focused chaos engineering helps us prevent unknown failure modes
from surprising us in production.
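As a minimal sketch of what such an experiment can look like; the session structure, blast radius, and dialog function are hypothetical placeholders:

```python
# Minimal sketch of an AI-focused chaos experiment: corrupt conversation
# memory for a small slice of synthetic sessions and check that the dialog
# layer degrades gracefully. Names and structures are hypothetical.
import copy
import random

def corrupt_memory(session_state: dict, drop_rate: float = 0.5) -> dict:
    """Chaos injection: randomly drop slots from the conversation memory."""
    corrupted = copy.deepcopy(session_state)
    corrupted["slots"] = {k: v for k, v in corrupted["slots"].items()
                          if random.random() > drop_rate}
    return corrupted

def run_experiment(sessions, dialog_fn, blast_radius: float = 0.05) -> float:
    """Safety control: only a small fraction of synthetic sessions is affected."""
    successes = 0
    for session in sessions:
        state = corrupt_memory(session) if random.random() < blast_radius else session
        # Hypothesis: the dialog layer asks for clarification instead of crashing.
        successes += dialog_fn(state) is not None
    return successes / len(sessions)

sessions = [{"slots": {"intent": "refund", "order_id": "123"}} for _ in range(1_000)]
dummy_dialog = lambda s: "Could you confirm your order number?" if not s["slots"] else "ok"
print(run_experiment(sessions, dummy_dialog))  # fraction of sessions that got a reply
```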
Progressive deployment models.
AI models evolve fast, but deploying them safely is also crucial.
So we use a progressive deployment model.
First, we do shadow testing, where a new model runs alongside production
but doesn't serve real users.
We compare results quietly.
Next, we move to a canary deployment, where a small slice of traffic is
routed to the new model and we monitor it closely. Then we gradually expand
traffic based on performance metrics and confidence levels, often adjusting
during business hours to minimize the impact if anything goes wrong.
We have automated rollback mechanisms in place.
Once we've validated everything, we move to a full deployment.
This staged rollout ensures we are innovating safely without
affecting the end-user experience.
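Here is a minimal sketch of how such a staged rollout with automated rollback might be expressed; the stage fractions, metric names, and thresholds are illustrative, not our actual pipeline:

```python
# Minimal sketch: shadow -> canary -> gradual expansion, with automated
# rollback on metric regression. Stages and thresholds are illustrative.
ROLLOUT_STAGES = [0.0, 0.05, 0.25, 0.50, 1.0]   # 0.0 = shadow (no live traffic)

def should_promote(candidate: dict, baseline: dict) -> bool:
    """Promote only if the candidate is at least as good as the baseline."""
    return (candidate["intent_accuracy"] >= baseline["intent_accuracy"] - 0.01
            and candidate["p99_latency_ms"] <= 300)

def next_stage(current_fraction: float, candidate: dict, baseline: dict) -> float:
    if not should_promote(candidate, baseline):
        return 0.0                                # automated rollback to shadow
    idx = ROLLOUT_STAGES.index(current_fraction)
    return ROLLOUT_STAGES[min(idx + 1, len(ROLLOUT_STAGES) - 1)]

print(next_stage(0.05,
                 {"intent_accuracy": 0.93, "p99_latency_ms": 280},
                 {"intent_accuracy": 0.92}))
# -> 0.25 (expand the canary)
```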
Moving on to the next slide.
SLIs and SLOs for cost optimization.
Let's now talk about how we balance performance with cost.
SLOs, or service level objectives, don't just help ensure reliability.
They also help us avoid over-provisioning. By carefully tracking metrics like
response latency, query throughput, resource utilization, and intent accuracy,
we can right-size our infrastructure.
For example, if a model performs just as well with 50% fewer GPUs during
off-peak hours, that's a massive cost saving.
Across the board, we've seen 30 to 40% cost optimization by aligning SLOs with
actual user needs and system capacities.
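As a small illustration of the right-sizing logic, with capacity and traffic numbers as hypothetical placeholders:

```python
# Minimal sketch: keep only as many model replicas as the latency SLO
# requires for the current load. Capacity numbers are hypothetical.
import math

REQS_PER_REPLICA_AT_SLO = 40   # measured throughput per GPU replica while meeting the SLO

def replicas_needed(current_rps: float, headroom: float = 1.2) -> int:
    return max(1, math.ceil(current_rps * headroom / REQS_PER_REPLICA_AT_SLO))

peak_rps, off_peak_rps = 800, 300
print(replicas_needed(peak_rps))      # 24 replicas at peak
print(replicas_needed(off_peak_rps))  # 9 replicas off-peak: ~60% fewer GPUs, same SLO
```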
Moving on to the next slide.
Building reliable CCAI.
Key takeaways.
To wrap up, here are the key takeaways for building reliable CCAI systems.
Observability must include AI-specific and business-level
metrics, not just CPU and memory.
Automation is key.
Self-healing systems drastically reduce downtime.
Use balanced SLOs to avoid unnecessary cost while
still maintaining a great experience.
Finally, progressive deployments. They are critical for rolling out changes
without breaking user flows.
These strategies help us build systems that are both innovative and reliable,
the foundation for enterprise-scale conversational AI.
Thank you all for your time and attention.
I hope this session gave you useful insight into the unique reliability
challenges of conversational AI systems and how SRE principles can address them.
I'm happy to take any questions or feel free to connect with me
later for deeper discussions.
Let's continue building AI that's not only smart, but also resilient,
scalable, and production-ready.
Thank you all.
Bye-bye.