Transcript
Good morning everyone.
My name is Yash Ri, and today I'll be discussing AI driven testing
for conversational AI systems.
I've spent the last decade working on voice assistants like Alexa and Siri, and
one challenge remains constant: testing conversational AI systems at scale.
In this talk, I'll show how we can leverage LLMs themselves as judges, test
case generators, and automation drivers
to redefine quality assurance for enterprise-grade AI.
So what's the challenge?
Conversational AI is unpredictable.
There's no finite set of inputs.
Every user, every phrasing, every language variant can shift intent.
For instance, users might say, 'Set an alarm for seven,' or 'Wake me up at seven
sharp,' or 'Get me up before sunrise.'
Traditional QA assumes you can enumerate test cases.
That model collapses here.
Manual review is expensive, inconsistent, and cannot keep up
with the pace of model updates.
So the testing gap keeps widening.
We build smarter bots, but then test them with legacy methods.
That's the problem we are trying to solve here.
So our solution is an intelligent end-to-end QA framework.
Our approach builds an AI powered QA loop.
It's built on three pillars: LLM as a judge,
automated evaluation that replaces manual review; generative test data,
AI-generated diverse test cases; and CI/CD integration, automation
embedded into every build cycle.
Together they create an end-to-end pipeline
where conversational AI can test, evaluate, and validate
itself continuously at scale.
Now, what is LLM as a judge?
Think of LLM as a judge like an automated senior QA engineer, one that never sleeps,
is never biased, and scales infinitely.
For example, given a chatbot's response, 'Your package will arrive next week,'
the LLM as a judge can analyze whether that's accurate
based on the context: was it actually supposed to arrive tomorrow?
It evaluates semantic accuracy, relevance, safety, and tone in
seconds across thousands of dialogues.
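To make that concrete, here's a minimal sketch of what a single judge call could look like. The prompt wording, the call_llm helper, and the score fields are illustrative assumptions, not a specific vendor API.

```python
import json

# Hypothetical helper: wire this to your actual model provider (OpenAI, Anthropic, etc.).
def call_llm(prompt: str) -> str:
    raise NotImplementedError("replace with a real LLM client call")

JUDGE_PROMPT = """You are a senior QA engineer reviewing one chatbot turn.
Context: {context}
User question: {question}
Bot response: {response}

Rate the response from 1-5 on each axis and return strict JSON:
{{"semantic_accuracy": n, "relevance": n, "safety": n, "tone": n, "rationale": "..."}}"""

def judge_turn(context: str, question: str, response: str) -> dict:
    """Ask the judge model for per-axis scores on a single dialogue turn."""
    raw = call_llm(JUDGE_PROMPT.format(context=context, question=question, response=response))
    return json.loads(raw)  # in practice, validate the JSON and retry on malformed output

# The delivery-date example from above:
# judge_turn(
#     context="Order #123 ships with next-day delivery; expected arrival: tomorrow.",
#     question="When will my package arrive?",
#     response="Your package will arrive next week.",
# )
# A well-calibrated judge should score semantic_accuracy low here.
```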
That's how we move QA from being manual and reactive to automated and continuous.
Core components of LLM as a judge.
There are four pillars: standardized prompts,
structured evaluation templates like 'rate this response for accuracy, tone, and
user satisfaction'; calibration datasets, aligning the judge model's scores
with human expert ratings; statistical validation, verifying correlation with
human reviewers; and multi-judge consensus,
where multiple models vote independently to minimize bias.
For example, you might have GPT-4, Claude, and Gemini all judge the same
dialogue, and only consensus counts.
That's quality control through diversity.
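Here's a rough sketch of that consensus step, assuming a generic call_judge helper per provider; the model names, axes, and pass threshold are placeholders, not the exact production setup.

```python
from statistics import median

# Hypothetical per-model judge call; each returns a dict of 1-5 axis scores.
def call_judge(model_name: str, context: str, question: str, response: str) -> dict:
    raise NotImplementedError("route to the corresponding provider's API")

JUDGE_MODELS = ["gpt-4", "claude", "gemini"]  # illustrative model names
AXES = ["semantic_accuracy", "relevance", "safety", "tone"]

def consensus_scores(context: str, question: str, response: str) -> dict:
    """Collect independent verdicts and keep the median score per axis,
    so a single outlier judge cannot swing the result."""
    verdicts = [call_judge(m, context, question, response) for m in JUDGE_MODELS]
    return {axis: median(v[axis] for v in verdicts) for axis in AXES}

def passes(scores: dict, threshold: float = 4.0) -> bool:
    # Only responses that every axis agrees are good count as a pass.
    return all(score >= threshold for score in scores.values())
```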
What LLM as a judge evaluates.
Our evaluation spans four axes.
Semantic accuracy: did the AI actually answer what was asked?
Contextual relevance: does it remember and respond in context?
Safety and compliance: no toxicity, no hallucinations.
User experience: tone, helpfulness, and naturalness.
These become quantifiable metrics you can monitor over time,
transforming conversation quality into measurable data.
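As a small illustration of how per-turn judge scores roll up into trackable metrics, here's a sketch; the field names and pass threshold are assumptions.

```python
from statistics import mean

AXES = ["semantic_accuracy", "relevance", "safety", "tone"]

def quality_metrics(judged_turns: list, pass_threshold: float = 4.0) -> dict:
    """Aggregate per-turn judge scores (1-5 per axis) into release-level
    metrics that can be plotted and alerted on over time."""
    averages = {f"avg_{axis}": round(mean(t[axis] for t in judged_turns), 2) for axis in AXES}
    passing = sum(all(t[a] >= pass_threshold for a in AXES) for t in judged_turns)
    return {**averages, "pass_rate": round(passing / len(judged_turns), 3)}

# Example:
# quality_metrics([{"semantic_accuracy": 5, "relevance": 4, "safety": 5, "tone": 4}])
# -> {"avg_semantic_accuracy": 5.0, ..., "pass_rate": 1.0}
```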
Ensuring reliable, bias-aware evaluation.
Of course automation doesn't mean abdication.
We test across demographics and scenarios to surface hidden biases.
We keep transparent scoring rubrics for auditability, and we use human
in the loop calibration periodically, ensuring our models stay aligned
with real world expectations.
Bias aware validation isn't just ethics, it's reliability insurance.
Generative AI for test data creation.
The other half of this framework is generative QA.
Manually authoring test cases is slow, and by definition, limited.
Generative AI can produce millions of realistic, diverse
utterances: paraphrases, adversarial phrasing, multilingual variants.
It can instantly create
'How much money do I have?' or 'Show me my account total,' or 'Tell me if I'm broke.'
This changes the QA paradigm.
Instead of discovering edge cases after deployment, we generate them proactively.
This results in faster discovery of weaknesses before they reach production.
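One simple way to do that proactive generation is to prompt a model for variants of a seed intent; the prompt text and the call_llm helper below are illustrative assumptions.

```python
# Hypothetical LLM call; replace with your provider's client.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("replace with a real LLM client call")

GENERATION_PROMPT = """Generate {n} distinct user utterances for the intent "{intent}".
Mix casual, formal, and indirect phrasings (e.g. "tell me if I'm broke" for a balance check).
Return one utterance per line."""

def generate_utterances(intent: str, n: int = 20) -> list:
    """Ask the generator model for n diverse phrasings of a single intent."""
    raw = call_llm(GENERATION_PROMPT.format(n=n, intent=intent))
    return [line.strip() for line in raw.splitlines() if line.strip()]

# e.g. generate_utterances("check_account_balance") might yield
# "How much money do I have?", "Show me my account total", "Tell me if I'm broke", ...
```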
What are the intelligent test data generation techniques?
We employ several generation strategies.
Multilingual data for global consistency. Semantic paraphrasing, for example:
'What's the weather in Paris?' becomes 'Do I need a jacket in Paris today?' or 'Do
I need an umbrella in Paris today?'
Same meaning, but different phrasing. Then there are adversarial inputs.
They can be edge cases like 'Book me the cheapest non-cheap flight.'
It's adversarial, right?
Or sarcasm: 'Sure, because airlines never delay.'
Now this is designed to break logic.
So an intelligent test data generation technique can
generate such edge cases as well.
Next is noisy utterances, real-world typos and speech errors, like a garbled
'Can you book a flight for tomorrow.' We can generate phrases and utterances like this
to test resilience to imperfect inputs.
The point isn't to trick the system, it's to make sure it can
handle how humans really speak.
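For the noisy-utterance strategy in particular, you don't even need a model; a small corruption pass over clean utterances can approximate typos and speech-to-text slips. This is a sketch under assumed corruption rules, not the exact method used in production.

```python
import random

def add_noise(utterance: str, rate: float = 0.1, seed=None) -> str:
    """Randomly drop, swap, or duplicate characters to mimic typos and ASR errors."""
    rng = random.Random(seed)
    chars = list(utterance)
    out = []
    i = 0
    while i < len(chars):
        if rng.random() < rate:
            op = rng.choice(["drop", "swap", "duplicate"])
            if op == "drop":
                i += 1
                continue
            if op == "swap" and i + 1 < len(chars):
                out.extend([chars[i + 1], chars[i]])
                i += 2
                continue
            out.append(chars[i] * 2)  # duplicate the character
        else:
            out.append(chars[i])
        i += 1
    return "".join(out)

# add_noise("can you book a flight for tomorrow")
# -> something like "can yuo book a flightt fr tomorow" (exact output varies with the seed)
```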
Schema driven prompting for controlled generation.
To avoid chaos, we use schema driven prompting, formal definitions for
what kind of test cases to generate.
Each schema specifies intents, entities, languages, tone, and difficulty.
This gives us automation with precision, the ability to
regenerate consistent test sets for regression testing across releases.
It's structured creativity, not randomness.
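Here's a minimal sketch of what such a schema and the prompt built from it might look like; the field names and prompt wording are illustrative, not the exact schema from the framework.

```python
from dataclasses import dataclass, field

@dataclass
class TestCaseSchema:
    """Formal definition of what kind of test cases to generate for one intent."""
    intent: str
    entities: list
    languages: list = field(default_factory=lambda: ["en"])
    tone: str = "casual"
    difficulty: str = "adversarial"  # e.g. "simple", "paraphrase", "adversarial", "noisy"
    count: int = 50

def build_generation_prompt(schema: TestCaseSchema) -> str:
    """Turn the schema into a deterministic prompt, so the same schema
    regenerates a comparable test set for regression runs across releases."""
    return (
        f"Generate {schema.count} {schema.difficulty} user utterances "
        f"for the intent '{schema.intent}' covering entities {schema.entities}, "
        f"in languages {schema.languages}, with a {schema.tone} tone. "
        "Return one utterance per line."
    )

# e.g. build_generation_prompt(TestCaseSchema(intent="book_flight", entities=["date", "destination"]))
```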
Automated annotation and labeling.
Once the data is generated, it needs labels.
We use the same generative models to produce expected intents and
entities, with multi-model consensus for reliability; low-confidence
cases get flagged for human review, creating a closed-loop system.
This means we can fully automate from data generation
to evaluation, dramatically accelerating QA throughput without losing quality.
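A rough sketch of that labeling loop, assuming a generic predict_label helper per model; the two-of-three agreement rule for confidence is a simplifying assumption.

```python
from collections import Counter

# Hypothetical per-model labeler: returns (intent, entities) for one utterance.
def predict_label(model_name: str, utterance: str) -> tuple:
    raise NotImplementedError("route to the corresponding provider's API")

LABELER_MODELS = ["gpt-4", "claude", "gemini"]  # illustrative model names

def annotate(utterance: str) -> dict:
    """Label an utterance with its expected intent and entities via multi-model
    consensus; anything without a clear majority is flagged for human review."""
    votes = Counter(predict_label(m, utterance) for m in LABELER_MODELS)
    (intent, entities), agreeing = votes.most_common(1)[0]
    return {
        "utterance": utterance,
        "intent": intent,
        "entities": list(entities),
        "needs_human_review": agreeing < 2,  # fewer than two of three models agree
    }
```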
CI/CD integration for continuous QA.
We plug all of this directly into the CICD pipeline.
A code commit triggers generation of fresh test cases.
LLM as a judge evaluates responses, regression detection flags any quality
drop, and deployment gating ensures only validated builds reach production.
This gives us continuous quality assurance.
It's like QA that runs 24/7, not just before a release.
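In practice, the deployment gate can be as simple as a script the pipeline runs after the judge pass; the metric names, thresholds, file name, and exit-code convention here are illustrative assumptions.

```python
import json
import sys

# Illustrative quality bars; tune these per product and release policy.
THRESHOLDS = {"avg_semantic_accuracy": 4.0, "avg_safety": 4.5, "pass_rate": 0.95}

def gate(metrics_path: str = "judge_metrics.json") -> int:
    """Return a non-zero exit code if any judged metric fell below its bar,
    so the CI/CD pipeline blocks the deployment."""
    with open(metrics_path) as f:
        metrics = json.load(f)
    failures = {
        name: (metrics.get(name, 0), bar)
        for name, bar in THRESHOLDS.items()
        if metrics.get(name, 0) < bar
    }
    for name, (value, bar) in failures.items():
        print(f"QUALITY GATE FAILED: {name}={value} < {bar}")
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(gate())  # non-zero exit blocks the build in CI
```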
What are the benefits?
The payoff is massive. Less manual effort:
the models handle the grind. Faster release cycles: QA runs
automatically. Wider coverage:
generative data explores the long tail. And continuous monitoring:
every interaction becomes a quality signal.
It's QA that scales with your development velocity, not against it.
Tactical implementation insights.
A few lessons from real world deployments.
Start with high impact user journeys.
That's where ROI is highest.
Define quantitative KPIs like accuracy and safety scores.
Keep humans in the loop, especially for emerging topics.
Version control your prompts and schemas.
Treat them like production code and regularly update your judge models to
stay aligned with LLM advancements.
This discipline is what turns research ideas into production pipelines.
Building a future-ready testing pipeline.
We are entering an era where AI tests AI, a self-sustaining QA ecosystem.
By combining automated evaluation with generative test creation, we
gain both scale and consistency.
It's not just about speed, it's about trust at scale, ensuring every
conversational AI deployment is safe, accurate, and user-centric.
This is how we build the next generation of reliable enterprise
grade conversational systems.
Thank you all, and if your teams are building or testing conversational AI,
now's the time to let AI help validate AI.
I would like to thank conf42.com for giving me this opportunity to
speak here. Have a good day.
Thank you so much.