Conf42 MLOps 2025 - Online

- Premiere: 5 PM GMT

MLOps at Scale


Abstract

As ML models power mission-critical applications, scalable MLOps is key to reliability and efficiency. This talk explores automating ML pipelines, CI/CD for ML, model monitoring, and governance. Learn best practices to streamline deployment and ensure responsible AI at an enterprise scale.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, thank you for joining. My talk today is on MLOps at scale. Before we dive into MLOps, let me quickly introduce myself. I've been in the data analytics and machine learning space for over 15 years, building large-scale platforms, predictive models, and intelligent systems across industries. Over the years, I've seen data evolve from simple dashboards to mission-critical AI systems that influence billions of decisions every single day. In my career, I have led teams that designed end-to-end ML pipelines, architected petabyte-scale data warehouses, and deployed models powering finance, HR, and other use cases. More recently, my focus has expanded to the world of agentic AI: building autonomous AI agents that don't just predict, but can act, collaborate, and make decisions with other systems. Across all of this, one lesson has stayed constant: building a model is easy, but operationalizing it at scale is the real challenge. That's why MLOps, and now agentic AI ops, are so critical. They ensure models are reliable, auditable, and scalable, turning experimental prototypes into real business solutions. That's why I'm passionate about MLOps: it is the foundation that allows enterprises to move from experimentation to impact safely, reliably, and at scale.

Let me move on to the next slide. I want to give a quick disclaimer: all the views and opinions shared in this presentation are my own. They don't represent or reflect the views of Amazon or AWS. I'm sharing this purely out of my passion for MLOps and AI.

All right. I want to start with a simple idea: a model in a Jupyter notebook is an experiment; a model in production, monitored and governed at scale, is a business solution. Think about how critical this is at companies like Amazon, Netflix, or major banks. Machine learning systems drive billions of decisions: recommendations, fraud detection, supply chain optimization, even workforce planning. If these models fail, the business feels it instantly. And now we are entering the era of large language models. They're powerful, but they're also unpredictable. MLOps is what keeps these systems reliable and trustworthy. My goal today is to show how automation, CI/CD, monitoring, and governance ensure exactly that.

Let's talk a little bit about the challenges. Traditional ML projects often fail because pipelines are brittle. Maybe a column name changes in the source data, and suddenly the feature store breaks. Deployments are often manual: someone copies a model file, pushes it into production, and then prays it works. And then there is monitoring, or rather the lack of it. Many enterprises have models running in production right now that haven't been retrained in months or even years. That's a big problem. Let me give you an example: a credit risk model at a financial firm was trained on pre-2020 data. When COVID hit, consumer behavior changed overnight: job losses, loan defaults, government relief checks, and so on. But the model wasn't monitored, and for months it misclassified risk, leading to massive financial losses. Now let's bring in LLMs. They add even bigger risks: cost explosion, hallucination, and compliance gaps. So if MLOps was important before, in the LLM world it's absolutely mission-critical.
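To make the brittle-pipeline problem concrete, here is a minimal sketch of the kind of schema check that catches a renamed or missing column before it silently breaks a feature store. The column names and types are hypothetical placeholders (the talk doesn't prescribe a specific tool), and the check uses only plain pandas.

```python
import pandas as pd

# Hypothetical schema for an incoming sales extract; in a real pipeline this
# would live in version control next to the feature definitions.
EXPECTED_SCHEMA = {
    "customer_id": "int64",
    "order_total": "float64",
    "order_date": "datetime64[ns]",
}

def validate_schema(df: pd.DataFrame) -> None:
    """Fail fast if required columns are missing or their types have drifted."""
    missing = set(EXPECTED_SCHEMA) - set(df.columns)
    if missing:
        raise ValueError(f"Missing expected columns: {sorted(missing)}")
    mismatched = {
        col: str(df[col].dtype)
        for col, expected in EXPECTED_SCHEMA.items()
        if str(df[col].dtype) != expected
    }
    if mismatched:
        raise TypeError(f"Unexpected column types: {mismatched}")

if __name__ == "__main__":
    batch = pd.DataFrame({
        "customer_id": [1, 2],
        "order_total": [19.99, 5.50],
        "order_date": pd.to_datetime(["2024-01-01", "2024-01-02"]),
    })
    validate_schema(batch)  # raises immediately if the source schema changes
```

Failing loudly at ingestion time is far cheaper than letting a silently broken feature flow into training or serving.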
All right, let's begin with: what is MLOps? MLOps is about taking the best of DevOps (continuous integration, continuous delivery, infrastructure automation) and applying it to machine learning. But there's one key difference: DevOps deals with code, while MLOps deals with code, data, and models. Let's break it down. Data pipelines feed features into the model. The model itself is trained, validated, and versioned. Deployment must be automated across environments, and monitoring keeps track of the model's performance over time. And now, with large language models, we need to extend this thinking to what some call LLMOps, where we manage not just training and deployment, but also prompt pipelines, retrieval augmentation, and fine-tuning. The principles remain the same: automate, monitor, and govern.

I want to talk about MLOps in the large language model era. As we shift into the age of large language models, some people ask: do we still need MLOps? I think the answer is a big yes. If anything, we need it even more. Why? Because the complexity of operating large language models is even higher than classical machine learning. Just like we manage feature pipelines in machine learning, with large language models we now manage prompt pipelines. Prompts evolve, chains of prompts need testing, and they must be version controlled. Fine-tuning a large language model is very expensive; without an automated evaluation pipeline, you risk wasting millions retraining models unnecessarily. Monitoring also becomes broader. We don't just watch accuracy or latency; we track hallucination, toxicity, bias, and grounding. For example, if a large language model chatbot in healthcare starts producing advice not grounded in medical sources, we need guardrails to catch that instantly. Finally, governance. With machine learning we tracked datasets and model versions; with LLMs we must also track data lineage, compliance, and human feedback loops. MLOps gives us the foundation to make LLMs reliable, responsible, and scalable in an enterprise context. Without it, large language model adoption will stall, because enterprises won't trust them in mission-critical use cases.

Let's move on to the MLOps lifecycle. The MLOps lifecycle looks like a loop, not a straight line. It begins with data ingestion and pre-processing: validating data, detecting schema changes, applying feature engineering, and so on. Then comes model training and fine-tuning, often tracked in MLflow or SageMaker with hyperparameters and metrics. Next is validation, and this isn't just accuracy; it's fairness, robustness, and interpretability. Then comes deployment. This could mean exposing an API, containerizing the model, or serving it via a feature store. Finally, monitoring and the feedback loop close the entire cycle. We monitor accuracy, drift, and latency, and if something falls out of bounds, retraining triggers automatically. A real-world example: a retailer's recommendation models retrain daily with new browsing and purchase data. They're validated against test sets, rolled out with shadow deployments, and monitored for drift. For large language models, this lifecycle also includes prompt chain testing, grounding checks, and human-in-the-loop review to ensure the output remains useful and safe.
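As a minimal sketch of that feedback loop, the check below uses the population stability index (PSI) as the drift metric and a hypothetical trigger_retraining() hook. This illustrates the control flow only; it is not tooling the talk prescribes, and real deployments would usually rely on a managed monitoring service.

```python
import numpy as np

def population_stability_index(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Compare the live feature distribution against the training-time baseline."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Avoid log(0) and division by zero for empty buckets.
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

def trigger_retraining() -> None:
    # Hypothetical hook: in practice this might start an Airflow DAG run
    # or a SageMaker training job.
    print("Drift out of bounds -> kicking off the retraining pipeline")

PSI_THRESHOLD = 0.2  # a common rule of thumb; tune it for your use case

baseline = np.random.normal(0.0, 1.0, 10_000)  # feature values seen at training time
live = np.random.normal(0.5, 1.0, 10_000)      # feature values seen in production

if population_stability_index(baseline, live) > PSI_THRESHOLD:
    trigger_retraining()
```

The same pattern generalizes: compute a drift or quality metric on a schedule, compare it against a bound, and automate the response instead of waiting for someone to notice the model has gone stale.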
All right, automation. I think automation is the first pillar. Without it, a machine learning model becomes artisanal: slow, inconsistent, and fragile. Take a retail company running pricing optimization. If a data engineer has to manually rerun the pre-processing scripts every time new sales data arrives, they'll never keep up with changing market conditions. By automating the pipelines with Airflow or Kubeflow, retraining happens daily and reproducibly. For LLMs, automation is equally crucial. As I mentioned before, imagine a chatbot trained on enterprise documentation. Automation ensures that when new docs are published, the retrieval index is refreshed, prompts are validated, and evaluation tests are triggered; no one has to manually intervene. Automation ensures that models are not one-off science experiments: they are living systems.

Next, CI/CD for ML and large language models. CI/CD brings agility and discipline to machine learning. We version not just code, but also datasets and models; tools like DVC or MLflow make this possible. Every new dataset goes through regression tests, bias tests, and unit tests for features. For example, at Netflix, thousands of models are deployed using CI/CD pipelines, and if one experiment fails, it can be rolled back instantly. For large language models, CI/CD includes prompt versioning. Say you have a customer service chatbot: each new prompt template gets tested against historical customer interactions. Shadow deployments let you compare the new version against the old before fully rolling it out. And just like with microservices, rollback strategies are essential, because the cost of a bad model or prompt in production can be huge.

Moving on to governance and responsible AI. Governance is all about trust and compliance. Enterprises need to answer: which model made this decision? What data was it trained on? Can we prove it wasn't biased? In financial services, regulators may demand proof of why a loan was denied. With MLOps governance, you can show the dataset, the model version, even the code used. LLMs take this a little further. We must track data lineage (where the training or fine-tuning data came from) and red-teaming results (what failure modes were tested), and regulations like GDPR or the EU AI Act are very explicit about AI transparency. Responsible AI is not optional; it's the price of doing business with AI models and LLMs.

Monitoring is where the battle is won or lost. For classical ML, we monitor accuracy, latency, drift, and infrastructure cost. For example, a fraud detection model at a bank looked great at launch but started missing new fraud patterns after six months. Continuous monitoring detected the drift early and triggered retraining, saving millions. LLMs introduce new monitoring challenges: hallucination rate, toxicity, and grounding (are answers tied back to real data sources?). Some enterprises now run red teaming in production, monitoring outputs for unsafe responses and automatically flagging them for human review. Without monitoring, large language models can quickly become liabilities.

I want to talk about some of the best practices in the large language model era. So, what works well? Use infrastructure as code, so every environment is reproducible. Adopt feature stores for ML and emerging prompt stores for LLMs. Automate bias testing, fairness checks, and safety evaluation as part of CI/CD. Build cross-functional teams, because reliable AI requires not just data scientists, but engineers, compliance officers, and business leaders.
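To illustrate what automating bias and regression checks inside CI/CD can look like, here is a pytest-style gate. The metric names, thresholds, and the load_candidate_metrics() / load_baseline_metrics() helpers are hypothetical placeholders rather than anything the talk specifies; in a real pipeline they would read evaluation results produced earlier in the run (for example, logged to MLflow).

```python
# test_model_gate.py: illustrative CI gate for a candidate model.
# All names and thresholds are placeholders for the sake of the example.

def load_candidate_metrics() -> dict:
    # Hypothetical helper: would normally fetch the candidate's evaluation report.
    return {"auc": 0.91, "demographic_parity_gap": 0.03}

def load_baseline_metrics() -> dict:
    # Hypothetical helper: would normally fetch metrics for the model in production.
    return {"auc": 0.89}

def test_candidate_beats_baseline():
    candidate = load_candidate_metrics()
    baseline = load_baseline_metrics()
    # Regression check: never promote a model that underperforms the live one.
    assert candidate["auc"] >= baseline["auc"], "Candidate underperforms the production model"

def test_fairness_within_bounds():
    candidate = load_candidate_metrics()
    # Fairness check: block the deployment if the disparity exceeds policy.
    assert candidate["demographic_parity_gap"] <= 0.05, "Fairness threshold exceeded"
```

Run as part of the pipeline (for example with `pytest test_model_gate.py`), a failing check stops promotion automatically, which is exactly the kind of discipline the best practices above call for.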
Moving on to my final slide: my closing and call to action. To close, let me emphasize this: MLOps is not outdated, it is evolving. It is the foundation that allows enterprises to scale both classical machine learning and modern large language models responsibly. Without automation, CI/CD, monitoring, and governance, models will fail silently, cost too much, and expose you to regulatory risk. With them, AI becomes reliable, trustworthy, and scalable. So whether you're deploying a customer churn model, a recommendation engine, or a generative AI assistant, remember: MLOps is the bridge between innovation and trust. With that, I want to thank everyone for taking the time to listen to my talk. Thank you very much.
...

Naveen Edapurath Vijayan

Head of Engineering, AWS Central Economics @ AWS

Naveen Edapurath Vijayan's LinkedIn account


