Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone.
My name is Tejasvi Nuthalapati and I work as a lead software development engineer at
Amazon, and I'm super excited to be here with you today presenting my topic.
So let's get started.
The title of my talk is Human-in-the-Loop MLOps: Production Patterns That Boost Model Performance by 40% While Maintaining Human Control.
Now, that's definitely a mouthful, but I want to set the stage clearly: this session is not going to be abstract theory. It is about real architectures, measurable outcomes, and patterns that you can apply in your own environments.
My goal is simple: to show you that when we design MLOps systems in a way that balances automation with human judgment, we can not only improve performance, but also build systems that teams actually trust.
So if you've worked with machine learning in production, you know the pain of the extremes.
On one side, we have full automation. Everything is left to the model, and the pipelines look clean and fast. But underneath that polish, drift starts creeping in. False positives start to show up, edge cases pile up, and before long no one is really sure why the system is making the decisions it is making. Trust evaporates.
On the other side, we have manual bottlenecks. Every decision requires a human check. Accuracy might look good, but the system becomes slow and expensive. Reviewers burn out on repetitive work, and throughput collapses.
You heard that right: both of these extremes fail us. And yet, too often organizations think the solution is to just double down on whichever extreme they have already chosen, more automation or more humans. But the truth is, neither path is sustainable.
So let me talk to you about the third way. This is a new way of thinking: instead of humans versus machines, it is humans with machines. That's the third way. Think of it as designing systems where AI brings scale, speed, and pattern recognition, and humans bring judgment, context, and ethics. Each side does what it does best, and the pipeline treats both as first-class participants.
Here's a story that illustrates the point. I was researching a FinTech company that was drowning in false positives in its fraud detection system. Legitimate transactions were consistently being flagged, and customers were furious.
Initially, they tried tuning the models, but the improvements were tiny. What made the real difference was embedding structured human review into the pipeline. Analysts weren't just fixing mistakes on the side; their decisions were logged, versioned, and directly tied to the retraining cycles.
So what was the result? False positives dropped by 40%. That uplift didn't come from some breakthrough algorithm. It came from rearchitecting the system so that humans and machines were truly collaborating.
I'm going to talk to you about a few patterns that I have identified, so let's go through them one after the other. The first one is tiered autonomy, and it is the pattern that essentially enables this collaboration.
Let's go back to the fraud detection example. Low-value, low-risk transactions where the model has very high confidence can be auto-approved. Medium-risk cases, where the model is a little uncertain, should be routed to a human analyst who can bring context to the problem. And high-risk cases involving large sums of money should be escalated to full human review, but with AI-generated insights and supporting data attached to speed up decision making. From a technical standpoint, this can be implemented in multiple different ways.
You could use service meshes that handle routing, CI/CD branching that responds to confidence thresholds, or even Kubernetes operators that govern which requests require intervention.
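To make the routing concrete, here is a minimal sketch, in Python, of what tiered autonomy could look like. The tier names and thresholds are illustrative assumptions, not values from any particular deployment.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    AUTO_APPROVE = "auto_approve"        # low risk, high confidence
    ANALYST_REVIEW = "analyst_review"    # medium risk or uncertain model
    FULL_ESCALATION = "full_escalation"  # high risk, large sums of money


@dataclass
class RoutingDecision:
    tier: Tier
    reason: str


# Illustrative thresholds; in practice these are tuned per use case
# and revisited as the model and the risk profile evolve.
HIGH_CONFIDENCE = 0.95
LOW_CONFIDENCE = 0.70
LARGE_AMOUNT = 10_000.00


def route_transaction(amount: float, fraud_score: float,
                      confidence: float) -> RoutingDecision:
    """Place a scored transaction on the autonomy spectrum."""
    if amount >= LARGE_AMOUNT:
        return RoutingDecision(
            Tier.FULL_ESCALATION,
            "large sum: full human review, with AI insights attached")
    if confidence >= HIGH_CONFIDENCE and fraud_score < 0.10:
        return RoutingDecision(
            Tier.AUTO_APPROVE, "low risk and very high model confidence")
    if confidence < LOW_CONFIDENCE:
        return RoutingDecision(
            Tier.ANALYST_REVIEW, "model is uncertain: analyst adds context")
    return RoutingDecision(
        Tier.ANALYST_REVIEW, "medium risk: default to human judgment")
```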
But the technology is not the heart of the pattern. The heart of it is designing autonomy as a spectrum instead of a binary choice. In one production deployment, tiered autonomy reduced analyst workload by 60%.
Analysts were no longer wasting time on trivial cases; their expertise was concentrated on what matters most. Throughput increased, customer trust went up, and the system as a whole became much more reliable.
Now I'm going to talk about the second pattern, which is the observable ML model. This is essentially observability applied to machine learning.
So if you think about it, in traditional software engineering,
we would never deploy a service without logs, metrics and traces.
Yet with machine learning models, too many organizations throw them into production as opaque black boxes.
Imagine a recommendation system that suddenly starts suggesting irrelevant or outdated products. How would you even see that? Without visibility into confidence scores, feature drift, or reasoning traces, you won't realize there's a problem until customers start complaining. By then, trust has already eroded.
The solution is to bring observability principles into ML systems. That means structured decision logging. It means building dashboards that show confidence distributions, drift indicators, and feature stability. And it means explanations that are layered, so that engineers, business stakeholders, and even end users can understand decisions at the level they need.
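As a rough illustration of structured decision logging, here is one way each prediction could be emitted as a machine-parseable record that dashboards can later aggregate into confidence distributions and drift views. The schema and field names are assumptions made for this sketch.

```python
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("ml_decisions")


def log_decision(model_version: str, features: dict, prediction: str,
                 confidence: float, explanation: dict) -> None:
    """Emit one structured, queryable record per model decision."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,   # ties the decision to a specific model
        "features": features,             # inputs, for later drift analysis
        "prediction": prediction,
        "confidence": confidence,         # feeds confidence-distribution dashboards
        "explanation": explanation,       # layered reasoning for different audiences
    }
    logger.info(json.dumps(record))


# Example: one fraud-model decision becomes one structured log line.
log_decision(
    model_version="fraud-model-2024-06-01",
    features={"amount": 42.50, "country": "US", "merchant_risk": 0.12},
    prediction="approve",
    confidence=0.97,
    explanation={"top_feature": "merchant_risk", "direction": "lowers risk"},
)
```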
When models become observable, two things happen.
First, operators can catch drift and bias early before it becomes a crisis.
And second, trust grows. Stakeholders who might otherwise resist adoption start to see the system as something transparent and manageable, not as a black box.
This is a huge change in perspective if you think about it.
The next pattern, and to me the most important one, is the human-in-the-loop feedback pipeline. This is the one where I most want to drive the point home.
Right now, in many organizations, human feedback is treated as an afterthought. A human spots a mistake, files a ticket, and maybe that data makes its way back into retraining many months later. That process is far too slow, and by the time it completes, the system may have drifted even further. Instead, we should treat human feedback as first-class data. Every human correction should be logged and tied to the specific model version that produced the prediction, and that feedback should be automatically incorporated into retraining, so the system continuously learns.
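Here is a minimal sketch of what treating corrections as first-class data could look like, going back to the fraud analysts: each correction carries the model version that produced the original prediction and can be pulled straight into the next retraining cycle. The in-memory store, field names, and identifiers are placeholder assumptions.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import List


@dataclass
class HumanCorrection:
    """One human correction, versioned and tied to the model that erred."""
    example_id: str
    model_version: str      # which model produced the original prediction
    model_prediction: str
    human_label: str        # the analyst's correction
    reviewer_id: str
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())


class FeedbackStore:
    """Placeholder store; in production this would be a database or feature store."""

    def __init__(self) -> None:
        self._corrections: List[HumanCorrection] = []

    def record(self, correction: HumanCorrection) -> None:
        self._corrections.append(correction)

    def training_examples_for(self, model_version: str) -> List[dict]:
        """Pull corrections for the retraining job of a given model lineage."""
        return [asdict(c) for c in self._corrections
                if c.model_version == model_version]


# An analyst overturns a false positive; the correction flows into the
# next retraining cycle instead of sitting in a ticket queue.
store = FeedbackStore()
store.record(HumanCorrection(
    example_id="txn-20319",
    model_version="fraud-model-v7",
    model_prediction="fraud",
    human_label="legitimate",
    reviewer_id="analyst-42",
))
print(store.training_examples_for("fraud-model-v7"))
```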
Let's look at content moderation as an example.
When a moderator flags a post that the model misclassified, that feedback is not just noted; it is captured, versioned, and fed right back into the pipeline. Over time, the model becomes sharper and more aligned with real human judgment.
In one system I observed, this loop raised the production F1 score from 0.72 to 0.84 in just a few retraining cycles, roughly a 16% relative lift in measurable accuracy. That improvement didn't come from new features or new algorithms. It came from designing the pipeline so that human judgment flowed directly into the model's learning process.
And here's the key point: as models get stronger, human input doesn't become less valuable, it becomes more valuable, because humans are the only ones who can guide systems through the messy, ambiguous, ethically sensitive cases that no model will ever fully master.
The final pattern is scalable oversight, which deals with the reality of scale. As systems grow, you cannot have humans reviewing every single decision. That simply doesn't scale. But if you reduce human oversight too much, you expose yourself to risk. The answer, you guessed it, is adaptive oversight: you stratify risk levels, you use intelligent sampling, and you design escalation paths.
Take supply chain anomaly detection as an example. Instead of reviewing every alert, the team oversampled low-confidence cases and routed them to humans. By doing so, they caught 90% of novel anomalies, things the model had never seen before, while cutting overall human workload in half. That's what scalable oversight looks like.
Human attention grows sublinearly with system growth. You still get the quality and the trust that human review provides, but without overwhelming your people.
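One possible sketch of that kind of risk-stratified, intelligent sampling: every decision has some chance of human review, but low-confidence and high-risk cases are sampled far more heavily. The strata and review rates are illustrative assumptions, not numbers from the supply chain example.

```python
import random

# Illustrative review rates per stratum; real systems tune these so that
# human attention grows sublinearly with overall decision volume.
REVIEW_RATES = {
    "high_risk": 1.00,       # always reviewed
    "low_confidence": 0.50,  # oversampled: where novel anomalies tend to hide
    "routine": 0.02,         # light spot-checking for quality assurance
}


def stratum(confidence: float, risk_score: float) -> str:
    """Assign a decision to a risk stratum."""
    if risk_score >= 0.8:
        return "high_risk"
    if confidence < 0.7:
        return "low_confidence"
    return "routine"


def needs_human_review(confidence: float, risk_score: float) -> bool:
    """Sample this decision into the human review queue at its stratum's rate."""
    return random.random() < REVIEW_RATES[stratum(confidence, risk_score)]


# A confident, low-risk alert is rarely reviewed; an uncertain one usually is.
print(needs_human_review(confidence=0.95, risk_score=0.1))
print(needs_human_review(confidence=0.55, risk_score=0.3))
```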
So how do you measure success in human-in-the-loop systems? You don't stop at model accuracy. You have to measure the system, humans and machines together. That means tracking the reduction in false positives. It means measuring time to drift detection. It means calculating how quickly human feedback makes its way into retraining. And it even means surveying operator confidence, because a system that humans trust is one that they actually use effectively.
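As a rough illustration, the system-level metrics I just listed could be tracked with something as simple as the structure below; the exact fields, units, and thresholds are assumptions for this sketch, not prescribed targets.

```python
from dataclasses import dataclass


@dataclass
class HITLSystemMetrics:
    """System-level health metrics for a human-in-the-loop pipeline."""
    false_positive_reduction_pct: float       # vs. the pre-HITL baseline
    time_to_drift_detection_hours: float      # drift onset -> operator alerted
    feedback_to_retrain_latency_days: float   # correction logged -> model retrained
    operator_confidence_score: float          # e.g. 1-5 survey average


def meets_targets(m: HITLSystemMetrics) -> bool:
    """Example gating check; the thresholds are purely illustrative."""
    return (m.false_positive_reduction_pct >= 30.0
            and m.time_to_drift_detection_hours <= 24.0
            and m.feedback_to_retrain_latency_days <= 7.0
            and m.operator_confidence_score >= 4.0)


print(meets_targets(HITLSystemMetrics(
    false_positive_reduction_pct=40.0,
    time_to_drift_detection_hours=6.0,
    feedback_to_retrain_latency_days=3.0,
    operator_confidence_score=4.3,
)))
```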
And let me not forget this: across multiple organizations, these patterns have consistently driven around a 40% improvement in effective performance. Again, not because the models themselves got radically better, but because the collaboration between humans and machines was architected more intelligently.
Let's look a little farther ahead. The future of MLOps is not about squeezing humans out of the loop. It's about amplifying them.
Machines will continue to handle scale, speed and repetitive pattern
recognition, but humans will continue to provide oversight,
creativity, and ethical grounding.
When we design systems with that balance, we don't just get models that are a
few percentage points more accurate.
We get systems that organizations actually trust.
We get pipelines that operators can debug as well as improve.
And let's not forget, we get AI that end users feel confident depending on. So my closing thought is this: stop thinking of humans and AI as substitutes. Start thinking of them as collaborators. That is where the real 40% uplift comes from, and that is how we build the next generation of production MLOps systems: systems that are powerful, scalable, and trustworthy.
Thank you.