Machine Learning in Production: An Intro to MLOps

Abstract

Reliably deploying and maintaining machine learning applications is complex. There’s a dizzying array of tools and they look different from the usual DevOps tools.

To apply SRE skills to ML, we need to understand the specific challenges of ML build-deploy-monitor workflows. We’ll use reference examples to understand the cycle in terms of data prep, training, rollout and monitoring. We’ll see that some key challenges relate to training models from slices of large and varying data domains, a problem alien to the mainstream DevOps world.

Summary

This session is an introduction to running machine learning in production. The MLOps scene is complex and new. It's distinct from mainstream DevOps. Challenges vary by use case. Some use cases have especially advanced challenges.
MLOps is different from DevOps because machine learning captures patterns from data rather than codifying rules explicitly. This makes machine learning more applicable to problems that center on data. These differences have implications for how we can best build, deploy, and run machine learning systems.
A machine learning build journey starts with some data and maybe a question. Get data, clean it, experiment with it, train a model, package the model into something that can serve predictions. There are tools pitched at each stage of this lifecycle, starting with data storage and prep. Platforms can save the time and effort of stitching together several different tools.
An ML workflow starts with data. It can be very large and typically needs to be cleaned and processed. The trained model can be integrated into a running app to serve real-time predictions. Monitoring can need more than one metric depending on the use case.
In many organizations right now, the complexity is compounded by challenges from organizational silos. There are special challenges for MLOps that are not a normal part of DevOps, at least not right now. This is new territory.
The talk goes first into training, then serving, and finally rollout and monitoring. There are tools pitched particularly at the training space. Serving solutions often come with support for rollout and some support for monitoring as well.
This idea of taking the live request and asynchronously sending it elsewhere can also be useful for some monitoring use cases. One thing we might need to detect is an outlier. This is when there's the occasional data point which is significantly outside of the training data distribution. If your use case has risk associated with those, then you might want to detect and track outliers.
These monitoring and prediction quality concerns also feed into governance for machine learning. Beyond detecting data drift and outliers, detectors might also be applicable to adversarial attacks. Recording predictions so they can be reviewed later relates to the topic of explainability.
MLOps enables ML workflows. It provides tools and practices for enabling training runs and experiments that are very data intensive. There are tools and approaches for monitoring models running in a production environment. That's my perspective on the field of MLOps, at least as it is right now.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Machine learning in production. This session is an introduction to running machine learning in production, which is being called MLOps. I'm Ryan Dawson and I'm an engineer working on MLOps solutions at Seldon. The MLOps scene is complex and new. It's distinct from mainstream DevOps, so we'll start by comparing MLOps to DevOps. To understand why it's so different, we need to understand how data science is different from programming. We'll find out that the difference centers on how data is used. When we're clear about that difference, then we'll look at how the build-deploy-monitor workflows for DevOps differ from MLOps. From there, we'll be able to go deeper on particular steps in the MLOps build-deploy-monitor workflow. I'll try to explain that MLOps challenges vary by use case, and that some use cases have especially advanced challenges. Lastly, I'll go into some of the advanced challenges and how they relate to the topic of governance for running machine learning. So before we try to understand MLOps, let's make sure we're clear about DevOps. As I see it, DevOps is all about making the build-deploy-monitor workflow for applications as smooth as possible. It tends to focus on CI, CD and infrastructure. SRE, or site reliability engineering, as I see it, is an overlapping role, but with a bit more focus on the monitoring stage of the workflow. This whole workflow is a key enabler for software projects. Fortunately, there are some great tools in the space that have become pretty well established across the industry, tools like Git, Jenkins, Docker, Ansible, Prometheus, et cetera. MLOps is in a very different space right now. There are surveys suggesting that 80% to 90% of machine learning models never make it to live, and at least part of that is due to the complexity of running machine learning in production.
There's a famous paper called Hidden Technical Debt in Machine Learning Systems, and it explains all the effort that goes into running production-grade machine learning systems. It has a diagram with boxes showing the relative size of different tasks, and there's this tiny little box for ML code and really big boxes for data collection, data processing, runtime infrastructure and monitoring. The Linux Foundation for AI have tried to help by producing a diagram of the whole MLOps tool landscape. It's great, but it has loads of tools in loads of sections, and even the section titles won't make much sense for newcomers to MLOps. So let's try to understand more about the fundamentals of MLOps and where it's coming from. Fundamentally, MLOps is different from DevOps because machine learning is different from programming. Traditional programming codifies rules explicitly, rules that say how to respond to inputs. Machine learning does not codify rules explicitly. Instead, rules are set indirectly by capturing patterns from data and reapplying the extracted patterns to new input data. This makes machine learning more applicable to problems that center on data, especially focused numerical problems. So with traditional programming, we've got applications that respond directly to user inputs, such as terminal systems or GUI-based systems. You code these by starting with hello world and adding more control structures. Data science problems fall into classification problems and regression problems. Classification problems put data into categories. An example would be, is this image a cat or not a cat? Regression problems look for numerical output, for example, predicting sales revenue from how advertising spend is directed. The hello world of data science is the MNIST dataset, which is a dataset of handwritten digits. And the problem is to categorize each handwritten sample correctly as the number that it represents.
When I think of machine learning as capturing patterns from data, I think about line fitting. For regression problems, you basically have data points on a graph, and you draw a line through the data points and try to get the line as close to as many of the data points as possible. The distance from each data point to the line is called the error, and you keep adjusting the equation of the line to minimize the total error. The coefficients of the equation of the line correspond to the weights of a machine learning model, and you then use that to make new predictions. Of course, the machine learning training process is more complex than the way I'm explaining it. For example, there's more to the process of adjusting the weights than just trying to get the line to fit the data. It's done programmatically by using an algorithm called gradient descent. Essentially you randomly pick a way to shift the line, but it's only pseudorandom, as it will take a step in a given direction and then check whether that reduced the error before deciding whether to keep going that way or go in a different direction. That step size can be tweaked and you can get different results, so the overall process is tunable. So basically, data scientists are looking for patterns in data and trying to find which methods are best for capturing those patterns in models. This is an exploratory process, and the tools data scientists use reflect this. Jupyter notebooks, for example, are great for playing around with slices of data and visualizing patterns. These differences between programming and machine learning have implications for how we can best build, deploy, and run machine learning systems. So let's get into more detail about how different these build-deploy-monitor journeys are. Let's go on an imaginary development journey. We can start with a user story.
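
As a rough illustration of the line fitting described above, here is a minimal Python sketch. It is not from the talk: the function names, the sample points and the default step size are assumptions. It also follows the error gradient directly on each step, which is how gradient descent is usually written, rather than trying a shift and checking afterwards.

```python
def fit_line(points, step_size=0.01, steps=5000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0  # the "weights" of this very small model
    n = len(points)
    for _ in range(steps):
        # How the mean squared error changes as w and b change.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in points) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in points) / n
        # Move against the gradient; step_size is the tunable part.
        w -= step_size * grad_w
        b -= step_size * grad_b
    return w, b


def total_error(points, w, b):
    """Sum of squared distances between each point and the line."""
    return sum((w * x + b - y) ** 2 for x, y in points)


# Hypothetical data lying exactly on the line y = 2x + 1.
points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = fit_line(points)
```

Changing `step_size` or `steps` changes how close `w` and `b` get to the underlying line, which is the sense in which the process is tunable.
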
Let's say we're building a calculator and our user story says that our lazy users want to put numerical operations into a screen so they don't have to work out the answers. We could write a Java program to satisfy the story, compile it, and distribute it as a binary. But this is 2020, so we'll more likely package the code to run as a web server so that users will interact with it via a browser. Most likely we'll also dockerize the web app and run it on some cloud infrastructure. Now let's think of a machine learning build journey. This is more likely to start with some data and maybe a question. Let's say we've got data on employees and their experience and skills and salaries, and we want to see whether we could use it to benchmark salaries for other employees during a pay review. Let's assume the data is already available and clean, though this is a pretty big assumption. But let's assume we've got good data and we can create a regression model that maps employee experience to pay, maybe using scikit-learn. So we train the model and then it can be used to make a prediction for any given employee about what the salary benchmark would be. So let's say we give our predictions for a particular set of employees to the business and they're happy with that. So happy that they want to use it again next year or more regularly. Then our situation changes, because then what we want isn't just a prediction but a predict function, as we might not want to have to rerun the training process every time the business has some new employees to check. This problem would be magnified if another department says that they want to make predictions too. Actually, that would add extra complication, as even if we know that the patterns from our training data are applicable to our department, we don't necessarily know about the new department. But let's assume that it is applicable. Then our main problem is a problem of scaling.
How do we make all these predictions without burning ourselves out? Probably we're going to be interested in using the machine learning model in a web app. So maybe we add a REST API around our Python code and look to run it as a web application. We might naturally package it in a Docker container like we would for a traditional web app. This is a valid and common approach, but it's just one approach. With machine learning, it does present a challenge about how to dockerize the predict function without including the training data in the Docker image. So it's also common to separate the model from the data by taking the Python variable for the model and serializing that to a file using Python pickling. Then the file can be loaded into another running Python application server. So if we load the model into a suitable Python web server app, then we can serve predictions that way. This varies a little from framework to framework, and can vary quite a lot if the language is not Python. But basically this is a good picture for the machine learning lifecycle. Get data, clean it, experiment with it, train a model, package the model into something that can serve predictions. And there are tools pitched at each stage of this lifecycle. For data storage and prep, there's tools like S3 and Hadoop. Training can use a lot of compute resource and take a long time, so there's tools that help with running long-running training jobs, and also tools for tracking the operations performed during training. There are tools specifically aimed at helping make batch predictions on a regular cycle, say for just getting predictions every month or whatever the cycle is that the business works to. Or predictions could be needed at any time, and then there's tools for real-time serving of predictions using a REST API. Some real-time serving tools are specific to the framework and some are more general.
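
The separation of model from training data described above can be sketched in a few lines of Python. This is a hedged illustration, not code from the talk: the salary figures, the `train` and `predict` helpers and the `model.pkl` file name are all hypothetical, and the "model" is just a dictionary of two weights rather than a scikit-learn object.

```python
import os
import pickle
import tempfile


def train(rows):
    """Hypothetical training step: least-squares fit of salary = base + rate * years."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    rate = (sum((x - mean_x) * (y - mean_y) for x, y in rows)
            / sum((x - mean_x) ** 2 for x, _ in rows))
    return {"base": mean_y - rate * mean_x, "rate": rate}


def predict(model, years):
    """The predict function only needs the weights, not the training data."""
    return model["base"] + model["rate"] * years


# Training side: made-up (years of experience, salary) pairs.
training_rows = [(1, 32500.0), (3, 37500.0), (5, 42500.0)]
model = train(training_rows)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")
    with open(path, "wb") as f:
        pickle.dump(model, f)  # serialize the model variable to a file
    # Serving side: a different process would only need this file.
    with open(path, "rb") as f:
        served_model = pickle.load(f)

prediction = predict(served_model, 4)
```

In a real setup the file would be written by the training job and read by a separate web server process, so the training rows never need to be shipped with the serving image.
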
I personally work on Seldon Core, which is a framework-agnostic open source serving tool. The Seldon team also collaborates on another tool called KFServing. Both of these are part of the Kubeflow ecosystem, which is an end-to-end platform. That's another space of tools, end-to-end platforms that try to join up the whole journey. Platforms can save you the effort of stitching together several different tools, but platforms can also be opinionated, so they don't necessarily fit every use case. I'm listing these types of tools because I think it helps to divide the machine learning lifecycle up like this into data prep, training and serving. This helps us make sense of the complex landscape of MLOps tools out there, as we can then put them into categories mapped to the lifecycle. There's also the monitoring part of the lifecycle, but we'll get to that later. For now, the key point to see is that MLOps is different from DevOps, mostly because of the role of data. In particular, models are built by extracting patterns from data using code, so the training data is a key part of the model. The training data volumes can be large, and that leads to complexity in storing and processing the data, which there are specialized tools to help with. You also get different toolkits for building machine learning models, which results in models in different formats and adds some complexity to the space of tools for getting predictions out of models, a space called serving. So the complexity of the way the ML build-deploy-monitor lifecycle uses data has knock-on effects on the tool landscape. We've not talked about the post-deployment stage yet, but there's also complexity there. For example, you can sometimes need to retrain your running model, not because of any bugs in it, but because the data coming in from the outside world changes. Think, for example, of how fashion is seasonal.
Let's say you've got a model trained to recommend clothes for an online fashion store, and you trained it based on purchases made in winter. Then it might perform great in winter and make lots of money. But when it comes to summer, it's still going to be recommending coats when people are looking for summer clothes. So you would need to be regularly updating the model with new data and ideally checking that it's leading to sales. That's a complexity you don't normally get with traditional software. These complexities about handling data ripple all the way through the whole MLOps lifecycle. We've talked about this at a high level so far, but let's now think about the individual steps of the workflow and the tools used in them. So let's just remind ourselves of the workflow steps with traditional DevOps. We'll start with the user story specifying a business need. From that a developer will write code and submit a pull request. Hopefully tests will run automatically on the pull request. Somebody will review it and merge it. Once it gets merged to master, our pipeline will build a new version of the app and deploy that to the test environment. Perhaps further tests will be run and it'll get promoted to the next environment, where there might be more in-depth tests, and then it'll go to production. And in production we'll monitor for anything going wrong, probably in the form of stack traces or error codes. The pipeline producing these builds and running the tests will most likely be a CI system like Jenkins. The driver for the pipeline will most likely be a code change in Git. The artifact we'll be promoting will probably be an executable inside a Docker image. ML workflows are different. The driver for automation might be a code change, or it might be new data, and the data probably won't be in Git, as Git isn't a great store for data getting into the gigabytes. The workflows are more experimental and data driven.
You start with a dataset and need to experiment to find usable patterns in the dataset that can be captured in a model. When you've got a model, then it might not be enough to just check it for pass/fail conditions and monitor for errors like you would with traditional software. You'd likely have to check how well it performs against the data in numerical terms. There can be quite a bit of variation with ML workflows. One major point of variation is whether the model is trained offline or online. With online learning, a model is constantly being updated by adjusting itself through each new data point that it sees. So every prediction it makes also adjusts the model. Whereas with offline learning, the training is done separately from prediction. You train the model and deploy it, and when you want to update the model, you need to train a new one. We have to pick somewhere to focus, and offline learning is probably the more common case. So let's focus on offline training workflows. As we've talked about already, an ML workflow starts with data. It can be very large and typically needs to be cleaned and processed. A slice of that data can be taken so that the data scientists can work with it locally to explore the data on their own machine. When the data scientist has started to make some progress, then they might move to a hosted training environment to run some longer-running experiments on a larger sample of the data. There will likely be collaboration with other data scientists, most likely using Jupyter notebooks. The artifact produced will be a model, commonly a model that's pickled or serialized to a file. That model can be integrated into a running app to serve real-time predictions through HTTP. There will probably be a consumer of those predictions, which may be another app, perhaps a traditional web app. So you may need to integration test the model against the consumer.
And when you roll out the model to production, you want to monitor the model by picking some metrics that represent how well it's performing against the live data. The rollout and monitoring phases of the workflow can be linked. An example might help to understand this. Say we've got an online ecommerce store. A common way to roll out new versions of a model is an A/B test. With an A/B test, you'd have a live version that's already running, and that's called the control. And then you run other versions alongside it. Let's call them version A and version B. So we're running three versions of the model in parallel, each trained a bit differently, to see which gives the best results. You can do that by splitting the traffic between the versions. To minimize the risk, we'd send most of the traffic to the control version. A subset of the traffic will go to A and to B, and we'll run that splitting process for a while until we've got a statistically significant sample. Let's say that variation A has the highest conversion rate, so a higher proportion of the recommendations lead to sales. That's a useful metric, and it might be enough for us to choose variation A, but it might not be the only metric. These situations can get complex. For example, it might be that model A is recommending controversial products. So some customers might really like the recommendations and buy the products, but other customers are really put off and they just go to a different website. So there are trade-offs to consider, and monitoring can need more than one metric depending on the use case. So we're seeing that MLOps is complex, and in many organizations right now, the complexity is compounded by challenges from organizational silos. You can find data scientists that work just in a world of Jupyter notebooks and model accuracy on training data, and that work then gets handed over to a traditional DevOps team with the expectation they'll be able to take this work and build it into a production system.
Without proper context, the traditional DevOps team is likely to look at those notebooks and just react like, what is this stuff? In a more mature setup, you might have better understood handoffs. For example, you might have data engineers who deal with obtaining the data and getting it into the right state for the data scientists. Once the data is ready for the data scientists, then they can take over and build the models. And from there, data science will have an understood handoff to ML engineers. And the ML engineers might still be a DevOps team, but a DevOps team that knows about the context of this particular machine learning application and knows how to run it in production. This is new territory. There are special challenges for MLOps that are not a normal part of DevOps, at least not right now. Now that we've got a high-level understanding of where MLOps is coming from, we can next go into more detail on particular MLOps topics. So let's take these in order and go first into training, then serving, and finally rollout and monitoring. So there's tools that are pitched particularly at the training space. To name a few examples, there's Kubeflow Pipelines, MLflow and Polyaxon. These are all about making it easy to run long-running training jobs on a hosted environment. Typically, that means providing some manifest that specifies which steps are to be done and in which order. That's a manifest for a training pipeline. As an example, a pipeline might have as its first step an action to download data from wherever it's stored. Then the data gets split into training and validation data. The training data will then be used to train the model, and the validation data will be used as a check on the quality of the model's predictions. When we check the quality of the predictions, we'll want to record those checks somewhere and ideally also have an automated way to decide whether we should consider this a good model or not.
If we do consider it a good model, then we'll probably want to serialize it so that the serialized model is available for promotion to a running environment. This is probably sounding rather like continuous integration pipelines. It is similar, but also different. The difference can be seen in the specialized tools dedicated to training. One tool for handling training is Kubeflow Pipelines. In Kubeflow Pipelines, you can define your pipeline with all its steps, and also visualize it and watch it progress and see any steps that fail. But the pipelines aren't only called pipelines, they're also called experiments, and they're parameterized, so there are options in the UI where you can enter parameters. Remember I mentioned before that the process can be tunable. There are tunable parameters on training, such as the step size, so you can kick off runs of the same pipeline in parallel using different parameters to see which parameters might result in the best model. Kubeflow Pipelines is not alone in having this idea of being able to kick off runs of an experiment with different parameters. MLflow, for example, uses the same terminology and has a similar interface. So there's similarity here with traditional CI systems, as the training platforms execute a series of steps and an artifact gets built. But it's different, as you've also got this idea of running experiments with different parameters to see which is best. That means you have to have a definition of what counts as the best model, whereas traditionally with continuous integration you would just be building from master, and if it passes the tests, then you're good to promote. But so long as you can automate what counts as the best model, then your training can build an artifact for promotion, much like with CI.
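
To make the idea of a parameterized experiment concrete, here is a small Python sketch of the steps just described: split the data, train once per parameter value, score each run on validation data, and decide automatically whether the best run is good enough to promote. It is only an illustration under assumptions; `run_experiment`, the step sizes and the `max_error` threshold are hypothetical and have nothing to do with the Kubeflow Pipelines or MLflow APIs.

```python
def train(points, step_size, steps=2000):
    """Fit y = w * x + b by gradient descent; step_size is the tunable parameter."""
    w, b = 0.0, 0.0
    n = len(points)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in points) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in points) / n
        w -= step_size * grad_w
        b -= step_size * grad_b
    return w, b


def mean_squared_error(points, model):
    w, b = model
    return sum((w * x + b - y) ** 2 for x, y in points) / len(points)


def run_experiment(data, step_sizes, max_error):
    """Run the same pipeline once per step size and pick the best run."""
    # Step 1: split into training and validation data (alternate rows).
    train_data, validation_data = data[::2], data[1::2]
    runs = []
    for step_size in step_sizes:
        # Step 2: train with this run's parameter.
        model = train(train_data, step_size)
        # Step 3: record the quality check on held-out data.
        error = mean_squared_error(validation_data, model)
        runs.append({"step_size": step_size, "model": model, "error": error})
    # Step 4: automated definition of "best", plus a gate for promotion.
    best = min(runs, key=lambda run: run["error"])
    best["promote"] = best["error"] <= max_error
    return runs, best


# Hypothetical data on the line y = 2x + 1.
data = [(float(x), 2.0 * x + 1.0) for x in range(8)]
runs, best = run_experiment(data, step_sizes=[0.0001, 0.001, 0.01], max_error=0.01)
```

Here the runs happen one after another; a training platform would typically schedule them in parallel and keep the recorded metrics for each run.
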
And sometimes these training systems do have integrations available to CI systems. Let's say we've got a way of building our model and we want to be able to serve it. So we want to make predictions available in real time via HTTP, perhaps using a REST API. We might use a serving solution, as there's a range of them out there, some that are particular to a machine learning toolkit, such as TensorFlow Serving for TensorFlow or TorchServe for PyTorch. There's also serving solutions provided by cloud providers, as well as some that are more toolkit agnostic. For example, there's the toolkit-agnostic open source offering that I work on, Seldon. Typically, serving solutions use the idea of a model being packaged and hosted, perhaps in a storage bucket or a disk location, so the serving solution can then obtain the model from that location and run it. Serving solutions often come with support for rollout and some support for monitoring as well. As an example of a serving solution, I'll explain the concept behind Seldon and how it's used. Seldon is aimed in particular at serving on Kubernetes, and the models are served by creating a Kubernetes custom resource. The manifest of the custom resource is designed to make it simple to plug in a URI to a storage bucket containing a serialized model. So at a minimum, you can just put in the URI to the storage bucket and specify which toolkit was used to build the model. Then you submit that manifest to Kubernetes and it will create the lower-level Kubernetes resources necessary to expose an API and serve the model's HTTP traffic. There's also a Docker option to serve a model from a custom image. I'm emphasizing the serialized or pickled models in this talk, mostly because it's common to see those with serving solutions, and it's not very common outside of the MLOps space.
The serving stage links into rollout and monitoring. I've talked a little bit already about A/B testing as a rollout strategy. With that strategy, the traffic during the rollout is split between different versions, and you monitor that over a period of time until you've got enough data to be able to decide which is best. There's a simpler rollout strategy, which also involves splitting the traffic between different versions. With the canary strategy, you split traffic between the live version of a model and a new version that you're evaluating. But typically with a canary, you just have one new model, and you evaluate it over a shorter period of time than with the A/B test. It's more of a sanity check than an in-depth evaluation, and you just promote if everything looks okay. Another strategy is shadowing. With shadowing, all of the traffic goes to both the new and the old model, but it's only the responses from the older model, the live model's responses, that are used and which go back to the consumer. The new model is called the shadow version, and its responses are just stored. They don't go back to any live consumers. The reason for doing this is to monitor the shadow and compare it against the live version, so it makes sense to be storing the shadow's output for later evaluation. Serving solutions have some support for rollout strategies. In the case of Seldon, for example, you can create a Kubernetes manifest with two sections, one for the main model and one for the canary. The traffic will automatically be split between these two models. By default, Seldon will split traffic evenly between the models. In the manifest, you can set a field against each model called traffic. That field takes a numeric percentage that tells Seldon how much of the traffic each model should get. All of these rollout strategies involve gathering metrics on running models. With Seldon, there's out-of-the-box integration available for Prometheus, and some Grafana dashboards are provided.
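
The canary and shadow strategies described above can be sketched in plain Python. This is not how Seldon implements them and it is not a Seldon manifest; the `pick_version` and `serve_with_shadow` helpers, the 90/10 split and the toy models are all assumptions made for illustration.

```python
import random


def pick_version(versions, rng):
    """Canary-style routing: choose a version according to its traffic percentage."""
    names = [version["name"] for version in versions]
    traffic = [version["traffic"] for version in versions]
    return rng.choices(names, weights=traffic, k=1)[0]


def serve_with_shadow(request, live_model, shadow_model, shadow_log):
    """Shadowing: both models see the request, only the live response is returned."""
    shadow_log.append((request, shadow_model(request)))  # stored for later comparison
    return live_model(request)


# Canary: 90% of requests to the main model, 10% to the new one.
versions = [{"name": "main", "traffic": 90}, {"name": "canary", "traffic": 10}]
rng = random.Random(0)  # seeded so the example is repeatable
counts = {"main": 0, "canary": 0}
for _ in range(10_000):
    counts[pick_version(versions, rng)] += 1


# Shadow: two toy models that disagree slightly.
def live_model(x):
    return x * 2


def shadow_model(x):
    return x * 2 + 1


shadow_log = []
response = serve_with_shadow(5, live_model, shadow_model, shadow_log)
```

The consumer only ever sees `response` from the live model, while `shadow_log` accumulates what the shadow would have answered.
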
These cover general metrics like frequency of requests and latency. You may also want to monitor for metrics that are specific to your use case, and there are defined interfaces so that extra metrics can be exposed in the usual Prometheus way. I mentioned earlier that in the shadow use case, you might want to be recording the predictions that the shadow is making so that you can compare its performance against the live model. This can be handled through logging all of the requests and responses to a database. There are other use cases as well where recording all predictions can be useful, for example, if you're working in a compliance-heavy industry and an auditor requires a record of every prediction that's been made. In the shadow use case, you'd then use that database to run queries against the data and compare the shadow's performance against that of the live model. In the case of Seldon, there's an out-of-the-box integration which provides a way to asynchronously log everything to Elasticsearch so that everything can then be made available for running queries on later, but without slowing down the request path of the live models. This idea of taking the live request and asynchronously sending it elsewhere can also be useful for some monitoring use cases, and not just for audit. In particular, there are some advanced monitoring use cases that relate to the data that's coming into the live model and how well it matches the training data. If the live data doesn't fall within the distribution of the training data, then you can't be sure that your model will perform well on that data. Your model is based on patterns from the training data, so data that doesn't fit that training distribution might have different patterns. One thing we can do about this is to send all the request data through to detector components that will look for anything that might be going wrong so that we can flag those predictions if we need to.
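
A minimal sketch of the asynchronous logging idea, assuming a background thread and an in-memory list standing in for the database or Elasticsearch index. The `start_logger` and `handle_request` names are hypothetical and this is not Seldon's request logger; it only shows the request path handing the record to a queue instead of waiting for the write.

```python
import queue
import threading


def start_logger(store):
    """Start a background worker that drains the queue into the store."""
    log_queue = queue.Queue()

    def worker():
        while True:
            item = log_queue.get()
            if item is None:  # sentinel value: stop the worker
                break
            store.append(item)  # a real system would write to a database here

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return log_queue, thread


def handle_request(model, request, log_queue):
    """Serve the prediction and queue the request/response pair for logging."""
    response = model(request)
    log_queue.put({"request": request, "response": response})  # returns immediately
    return response


def toy_model(x):
    return x + 1


store = []
log_queue, worker_thread = start_logger(store)
responses = [handle_request(toy_model, request, log_queue) for request in range(3)]

# Shut down cleanly so that everything queued has been written.
log_queue.put(None)
worker_thread.join()
```

The same queue could feed a detector component as well as the store, since both only need a copy of the live request.
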
So let's drill a little bit further into what we might need to detect. One thing we might need to detect is an outlier. This is when there's the occasional data point which is significantly outside of the training data distribution, even though most of the data does fall within the distribution. Sometimes models express their predictions using a score. So, for example, classifiers often give a probability of how likely a data point is to be of a certain class. You might expect the model to give a lower probability on everything when the data points are outliers. Unfortunately, it doesn't work that way. For outliers, sometimes models can give very high probabilities for data points that they're getting completely wrong. This is called overconfidence. So if your live data has outliers and your use case has risk associated with those, then you might want to detect and track outliers. Depending on your use case, you might choose to make it part of your business logic, for example, to handle outlier cases differently, perhaps scheduling a manual review on them. Worse than the outlier case is when the whole data distribution is different from the training data. It can even start out similar to the training data and then shift over time. Think, for example, of the fashion recommendation example that we mentioned earlier. The model was trained on data from winter, and then you continue using it into the summer. Then it's recommending coats when it should be recommending t-shirts. If you have a component that knows the distribution of the training data, then you can asynchronously feed all of the live requests into that component. That component will keep a watch, so you can use it to set up notifications in case the distribution shifts. You could then use that notification to decide if you need to train a new version of the model using updated data.
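
As a hedged sketch of a component that "knows the distribution of the training data", the Python below flags single outliers with a z-score and flags drift when the mean of a recent window of live values moves too far from the training mean. These are deliberately simple stand-in techniques for a single numeric feature; the class name, thresholds, window size and data are assumptions, and real detectors are usually more sophisticated.

```python
import math
import statistics
from collections import deque


class DistributionDetector:
    """Remembers the training mean and spread of one numeric feature."""

    def __init__(self, training_values, z_threshold=3.0, window=30):
        self.mean = statistics.fmean(training_values)
        self.stdev = statistics.stdev(training_values)
        self.z_threshold = z_threshold
        self.recent = deque(maxlen=window)  # sliding window of live values

    def is_outlier(self, value):
        """A single point far from the training distribution."""
        return abs(value - self.mean) / self.stdev > self.z_threshold

    def observe(self, value):
        """Record a live value; return True if the recent window looks drifted."""
        self.recent.append(value)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough live data yet
        window_mean = statistics.fmean(self.recent)
        standard_error = self.stdev / math.sqrt(len(self.recent))
        return abs(window_mean - self.mean) / standard_error > self.z_threshold


# Hypothetical training data: values 0..20, so the mean is 10.
detector = DistributionDetector([float(v) for v in range(21)])

typical_is_outlier = detector.is_outlier(12.0)
extreme_is_outlier = detector.is_outlier(40.0)

# Live traffic that matches training, then a shift to values around 16.
steady_flags = [detector.observe(10.0) for _ in range(30)]
shifted_flags = [detector.observe(16.0) for _ in range(30)]
```

Note that 16.0 is not an outlier on its own; it is only the sustained shift across the window that trips the drift flag, which is the distinction between the two cases in the talk.
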
Or perhaps you've got other metrics that let you track model performance, and if they're still showing as good, you might just choose to check those metrics more frequently while you look more closely into what's happening with the live data distribution. These monitoring and prediction quality concerns also feed into governance for machine learning. It's a big topic and we can't go into everything in detail, but I want to give an impression of the area, so I'll at least mention a few things. I've talked about detection for data drift and outliers. Another thing detectors might be applicable for is adversarial attacks. These are when manipulated data is fed to a model in order to trick the model. Think, for example, of how face recognition systems can sometimes be tricked by somebody wearing a mask. That's a big problem in high-security situations, and there are analogous attacks that have appeared for other use cases. I also mentioned that in high-compliance situations you might want to record all of the predictions in case you need to review them later. This can also be relevant for dealing with customer complaints. This relates to the topic of explainability. For example, if you've got a system that makes decisions on whether to approve loans and you deny somebody a loan, then you might want to be able to explain why you denied them the loan. You'll want to be able to revisit exactly what was fed into the model. The explainability part is a data science challenge in itself, but it links into MLOps because you'll need to know what data to get explanations for and what model was being used to make the original loan decision. The topic of explainability also relates to concerns about bias and ethics. Let's imagine that your model is biased and is unfairly denying loans to certain groups. You'll have a better chance of discovering that bias if you can explain which data points are contributing most towards its decisions.
There's also a big governance question around being able to say exactly what was trained and when. In traditional DevOps, it's a familiar idea that we'd want to be able to say which version of the software was running at a given point in time and what code it was built from, so that we can delve into that code and build it again if we need to. This can be much more difficult to achieve with MLOps, as it would also require being able to get access to all the data that was used to train the model, and likely also being able to reproduce all of the transformations that were performed on the data and the parameters that were used in the training run. Even then, there can be elements of randomness in the training process that can scupper reproducibility if you don't plan for them. So let's finish up by summarizing what we've learned. MLOps is a new terrain. ML workflows are more exploratory and data driven than traditional dev workflows. MLOps enables ML workflows. It provides tools and practices for enabling training runs and experiments that are very data intensive and which use a lot of compute resources. It provides facilities for tracking artifacts produced and operations on data during those training runs. There are MLOps tools specifically for serving machine learning models and specialized strategies for safely rolling out new models for serving in a production environment. There are also tools and approaches for monitoring models running in a production environment and checking that model performance stays acceptable, or at least that you find out if something does go wrong. So that's my perspective on the field of MLOps, at least as it is right now. Thanks very much for listening.

Conf42 Site Reliability Engineering 2020 - Online

August 27 2020 - premiere 5PM GMT

Ryan Dawson

Core Member @ Seldon Open Source Team
