Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone.
My name is Tejasvi Nuthalapati and I work as a lead software development engineer at
Amazon, and I'm super excited to be here with you today presenting my topic.
So let's get started.
The title of my talk is Human-in-the-Loop MLOps: Production Patterns That Boost Model Performance by 40% While Maintaining Human Control.
Now, that's definitely a mouthful, but I want to set the stage clearly: this session is not going to be abstract theory. It is about real architectures, measurable outcomes, and patterns that you can apply in your own environments.
My goal is simple: to show you that when we design MLOps systems in a way that balances automation with human judgment, we can not only improve performance, but also build systems that teams actually trust.
So if you've worked with machine learning in production, you know the pain of the extremes.
On one side, we have full automation. Everything is left to the model, and the pipelines look clean and fast. But underneath that polish, drift starts creeping in. False positives start to show up, edge cases pile up, and before long no one is really sure why the system is making the decisions it is making. Trust evaporates.
On the other side, we have manual bottlenecks. Every decision requires a human check. Accuracy might look good, but the system becomes slow and expensive. Reviewers burn out on repetitive work, and throughput collapses.
You heard that right: both of these extremes fail us. And yet, too often organizations think the solution is to just double down on whichever extreme they have already chosen, more automation or more humans. But the truth is, neither path is sustainable.
So let me talk to you about the third way. This is a new way of thinking: instead of humans versus machines, it is humans with machines. That's the third way. Think of it as designing systems where AI brings scale, speed, and pattern recognition, and humans bring judgment, context, and ethics. Each side does what it does best, and the pipeline treats both as first-class participants.
Here's a story that illustrates the point. I was researching a FinTech company that was drowning in false positives in its fraud detection system. Legitimate transactions were consistently being flagged, and customers were furious.
Initially, they tried tuning the models, but the improvements were tiny. What made the real difference was embedding structured human review into the pipeline. Analysts weren't just fixing mistakes on the side; their decisions were logged, versioned, and directly tied to the retraining cycles.
So what was the result? False positives dropped by 40%. That uplift didn't come from some breakthrough algorithm. It came from rearchitecting the system so that humans and machines were truly collaborating.
I'm going to talk to you about a few patterns that I have identified, so let's go through them one after the other. The first one is tiered autonomy, and it is the pattern that essentially enables this collaboration.
Let's go back to the fraud detection example. Low-value, low-risk transactions where the model has very high confidence can be auto-approved. Medium-risk cases, where the model is a little uncertain, should be routed to a human analyst who can bring context to the problem. And high-risk cases involving large sums of money should be escalated to full human review, but with AI-generated insights and supporting data attached to speed up decision making. From a technical standpoint, this can be implemented in multiple different ways.
You could use service meshes that handle routing, CI/CD branching that responds to confidence thresholds, or even Kubernetes operators that govern which requests require intervention.
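To make the routing concrete, here is a minimal sketch, in Python, of what tiered autonomy could look like. The tier names and thresholds are illustrative assumptions, not values from any particular deployment.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    AUTO_APPROVE = "auto_approve"        # low risk, high confidence
    ANALYST_REVIEW = "analyst_review"    # medium risk or uncertain model
    FULL_ESCALATION = "full_escalation"  # high risk, large sums of money


@dataclass
class RoutingDecision:
    tier: Tier
    reason: str


# Illustrative thresholds; in practice these are tuned per use case
# and revisited as the model and the risk profile evolve.
HIGH_CONFIDENCE = 0.95
LOW_CONFIDENCE = 0.70
LARGE_AMOUNT = 10_000.00


def route_transaction(amount: float, fraud_score: float,
                      confidence: float) -> RoutingDecision:
    """Place a scored transaction on the autonomy spectrum."""
    if amount >= LARGE_AMOUNT:
        return RoutingDecision(
            Tier.FULL_ESCALATION,
            "large sum: full human review, with AI insights attached")
    if confidence >= HIGH_CONFIDENCE and fraud_score < 0.10:
        return RoutingDecision(
            Tier.AUTO_APPROVE, "low risk and very high model confidence")
    if confidence < LOW_CONFIDENCE:
        return RoutingDecision(
            Tier.ANALYST_REVIEW, "model is uncertain: analyst adds context")
    return RoutingDecision(
        Tier.ANALYST_REVIEW, "medium risk: default to human judgment")
```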
But the technology is not the heart of the pattern. The heart of it is designing autonomy as a spectrum instead of a binary choice. In one production deployment, tiered autonomy reduced analyst workload by 60%.
Analysts were no longer wasting time on trivial cases; their expertise was concentrated on what matters most. Throughput increased, customer trust went up, and the system as a whole became much more reliable.
Now I'm going to talk about the second pattern, which is the observable ML model. This is essentially observability applied to machine learning.
So if you think about it, in traditional software engineering,
we would never deploy a service without logs, metrics and traces.
Yet with machine learning models, too many organizations throw them into production as opaque black boxes.
Imagine a recommendation system that suddenly starts suggesting irrelevant or outdated products. How would you even see that? Without visibility into confidence scores, feature drift, or reasoning traces, you won't realize there's a problem until customers start complaining. By then, trust has already eroded.
The solution is to bring observability principles into ML systems. That means structured decision logging. It means building dashboards that show confidence distributions, drift indicators, and feature stability. And it means explanations that are layered, so that engineers, business stakeholders, and even end users can understand decisions at the level they need.
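As a rough illustration of structured decision logging, here is one way each prediction could be emitted as a machine-parseable record that dashboards can later aggregate into confidence distributions and drift views. The schema and field names are assumptions made for this sketch.

```python
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("ml_decisions")


def log_decision(model_version: str, features: dict, prediction: str,
                 confidence: float, explanation: dict) -> None:
    """Emit one structured, queryable record per model decision."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,   # ties the decision to a specific model
        "features": features,             # inputs, for later drift analysis
        "prediction": prediction,
        "confidence": confidence,         # feeds confidence-distribution dashboards
        "explanation": explanation,       # layered reasoning for different audiences
    }
    logger.info(json.dumps(record))


# Example: one fraud-model decision becomes one structured log line.
log_decision(
    model_version="fraud-model-2024-06-01",
    features={"amount": 42.50, "country": "US", "merchant_risk": 0.12},
    prediction="approve",
    confidence=0.97,
    explanation={"top_feature": "merchant_risk", "direction": "lowers risk"},
)
```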
When models become observable, two things happen.
First, operators can catch drift and bias early before it becomes a crisis.
And second, trust grows. Stakeholders who might otherwise resist adoption start to see the system as something transparent and manageable, not as a black box.
This is a huge change in perspective if you think about it.
The next pattern, and to me the most important one, is the human-in-the-loop feedback pipeline. This is the one where I most want to drive the point home.
Right now, in many organizations, human feedback is treated as an afterthought. A human spots a mistake, files a ticket, and maybe that data makes its way back into retraining many months later. That process is far too slow, and by the time it completes, the system may have drifted even further. Instead, we should treat human feedback as first-class data. Every human correction should be logged and tied to the specific model version that produced the prediction, and that feedback should be automatically incorporated into retraining, so the system continuously learns.
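Here is a minimal sketch of what treating corrections as first-class data could look like, going back to the fraud analysts: each correction carries the model version that produced the original prediction and can be pulled straight into the next retraining cycle. The in-memory store, field names, and identifiers are placeholder assumptions.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import List


@dataclass
class HumanCorrection:
    """One human correction, versioned and tied to the model that erred."""
    example_id: str
    model_version: str      # which model produced the original prediction
    model_prediction: str
    human_label: str        # the analyst's correction
    reviewer_id: str
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())


class FeedbackStore:
    """Placeholder store; in production this would be a database or feature store."""

    def __init__(self) -> None:
        self._corrections: List[HumanCorrection] = []

    def record(self, correction: HumanCorrection) -> None:
        self._corrections.append(correction)

    def training_examples_for(self, model_version: str) -> List[dict]:
        """Pull corrections for the retraining job of a given model lineage."""
        return [asdict(c) for c in self._corrections
                if c.model_version == model_version]


# An analyst overturns a false positive; the correction flows into the
# next retraining cycle instead of sitting in a ticket queue.
store = FeedbackStore()
store.record(HumanCorrection(
    example_id="txn-20319",
    model_version="fraud-model-v7",
    model_prediction="fraud",
    human_label="legitimate",
    reviewer_id="analyst-42",
))
print(store.training_examples_for("fraud-model-v7"))
```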
Let's look at content moderation as an example.
When a moderator flags a post that the model misclassified, that feedback is not just noted; it is captured, versioned, and fed right back into the pipeline. Over time, the model becomes sharper and more aligned with real human judgment.
In one system I observed, this loop raised the production F1 score from 0.72 to 0.84 in just a few retraining cycles, roughly a 16% relative lift in measurable accuracy. That improvement didn't come from new features or new algorithms. It came from designing the pipeline so that human judgment flowed directly into the model's learning process.
And here's the key point: as models get stronger, human input doesn't become less valuable, it becomes more valuable, because humans are the only ones who can guide systems through the messy, ambiguous, ethically sensitive cases that no model will ever fully master.
The final pattern is scalable oversight, which deals with the reality of scale. As systems grow, you cannot have humans reviewing every single decision. That simply doesn't scale. But if you reduce human oversight too much, you expose yourself to risk. The answer, you guessed it, is adaptive oversight: you stratify risk levels, you use intelligent sampling, and you design escalation paths.
Take supply chain anomaly detection as an example. Instead of reviewing every alert, the team oversampled low-confidence cases and routed them to humans. By doing so, they caught 90% of novel anomalies, things the model had never seen before, while cutting overall human workload in half. That's what scalable oversight looks like.
Human attention grows sublinearly with system growth. You still get the quality and the trust that human review provides, but without overwhelming your people.
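One possible sketch of that kind of risk-stratified, intelligent sampling: every decision has some chance of human review, but low-confidence and high-risk cases are sampled far more heavily. The strata and review rates are illustrative assumptions, not numbers from the supply chain example.

```python
import random

# Illustrative review rates per stratum; real systems tune these so that
# human attention grows sublinearly with overall decision volume.
REVIEW_RATES = {
    "high_risk": 1.00,       # always reviewed
    "low_confidence": 0.50,  # oversampled: where novel anomalies tend to hide
    "routine": 0.02,         # light spot-checking for quality assurance
}


def stratum(confidence: float, risk_score: float) -> str:
    """Assign a decision to a risk stratum."""
    if risk_score >= 0.8:
        return "high_risk"
    if confidence < 0.7:
        return "low_confidence"
    return "routine"


def needs_human_review(confidence: float, risk_score: float) -> bool:
    """Sample this decision into the human review queue at its stratum's rate."""
    return random.random() < REVIEW_RATES[stratum(confidence, risk_score)]


# A confident, low-risk alert is rarely reviewed; an uncertain one usually is.
print(needs_human_review(confidence=0.95, risk_score=0.1))
print(needs_human_review(confidence=0.55, risk_score=0.3))
```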
So how do you measure success in human-in-the-loop systems? You don't stop at model accuracy. You have to measure the system, humans and machines together. That means tracking the reduction in false positives. It means measuring time to drift detection. It means calculating how quickly human feedback makes its way into retraining. And it even means surveying operator confidence, because a system that humans trust is one that they actually use effectively.
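As a rough illustration, the system-level metrics I just listed could be tracked with something as simple as the structure below; the exact fields, units, and thresholds are assumptions for this sketch, not prescribed targets.

```python
from dataclasses import dataclass


@dataclass
class HITLSystemMetrics:
    """System-level health metrics for a human-in-the-loop pipeline."""
    false_positive_reduction_pct: float       # vs. the pre-HITL baseline
    time_to_drift_detection_hours: float      # drift onset -> operator alerted
    feedback_to_retrain_latency_days: float   # correction logged -> model retrained
    operator_confidence_score: float          # e.g. 1-5 survey average


def meets_targets(m: HITLSystemMetrics) -> bool:
    """Example gating check; the thresholds are purely illustrative."""
    return (m.false_positive_reduction_pct >= 30.0
            and m.time_to_drift_detection_hours <= 24.0
            and m.feedback_to_retrain_latency_days <= 7.0
            and m.operator_confidence_score >= 4.0)


print(meets_targets(HITLSystemMetrics(
    false_positive_reduction_pct=40.0,
    time_to_drift_detection_hours=6.0,
    feedback_to_retrain_latency_days=3.0,
    operator_confidence_score=4.3,
)))
```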
And let me not forget this: across multiple organizations, these patterns have consistently driven around a 40% improvement in effective performance. Again, not because the models themselves got radically better, but because the collaboration between humans and machines was architected more intelligently.
Let's look a little farther ahead. The future of MLOps is not about squeezing humans out of the loop. It's about amplifying them.
Machines will continue to handle scale, speed and repetitive pattern
recognition, but humans will continue to provide oversight,
creativity, and ethical grounding.
When we design systems with that balance, we don't just get models that are a
few percentage points more accurate.
We get systems that organizations actually trust.
We get pipelines that operators can debug as well as improve.
And let's not forget, we get AI that end users feel confident depending on. So my closing thought is this: stop thinking of humans and AI as substitutes. Start thinking of them as collaborators. That is where the real 40% uplift comes from, and that is how we build the next generation of production MLOps systems: systems that are powerful, scalable, and trustworthy.
Thank you.