Conf42 Python 2021 - Online

Machine Learning Engineering Done Right: Designing and Building Complex Intelligent Systems and Workflows with Python

Abstract

Designing and building systems that involve Machine Learning and Data Science requirements is not an easy task, and managing the complexity of intelligent systems requires careful planning and execution. In this talk, I will share different strategies and solutions for designing, building, deploying, and maintaining complex intelligent systems and workflows. I will discuss how concepts like Metaprogramming, Infrastructure as Code, Continuous Integration and Deployment, and Architecture Patterns work in the real world and how they are used in a practical setting.

We will talk about how to use Python with different tools and services to perform machine learning experiments, ranging from fully abstracted to fully customized solutions. These include performing automated hyperparameter optimization and bias detection when dealing with intermediate requirements and objectives. We will also show how these are done with different ML libraries and frameworks such as Scikit-learn, PyTorch, TensorFlow, Keras, MXNet, and more. In addition, I will share some of the risks and common mistakes Machine Learning Engineers must avoid to help bridge the gap between reality and expectations. While discussing these topics, we will show how containerization and serverless engineering help solve our technical requirements.

While discussing these concepts, tools, frameworks, and techniques, we will provide several examples and recipes on how these ML workflows and systems solve different business requirements (e.g., finance, digital transformation, automation, sales).

Summary

  • Joshua Arvin Lat is the chief technology officer of NuWorks Interactive Labs. He is also the author of a machine learning and machine learning engineering book on AWS called Amazon SageMaker Cookbook. Today, we will talk about ten things to do when designing and building complex intelligent systems with Python.
  • The goal here is for us to count the number of apples in this slide. If you guess it correctly, I may give you a prize. Sometimes when we are dealing with technical requirements, we become too focused on what we're doing. Being able to listen to the needs of the customers is the number one priority as a professional.
  • There are different ways to use Python. The second topic is knowing when to write production-level Python code. The third is enforcing practical Python coding guidelines for your team.
  • This is very important for ML engineering managers and data science leaders. Having rules allows the people and professionals in your team to have a consistent way to accomplish their work. The fourth is writing testable Python code.
  • The next one is utilizing continuous integration and deployment pipelines. Number six is making the most out of ML frameworks and ML platforms. There are actually three options, not just two. Being familiar with one, two, three, or more ML platforms helps your company accomplish its goals much faster.
  • You can also make the most of, say, SageMaker Clarify to automate bias detection. The same goes for ML explainability: the better we can explain a model, the easier it is to get an organization to use it.
  • Automated hyperparameter optimization: hyperparameters are configuration parameters that you set before a training job runs. Using transient ML instances to run your training jobs is super important when managing cost. When securing machine learning environments, it's critical to take care of both the process and tech sides of things.
  • The principle of least privilege also applies to resources in the cloud. A model from an untrusted source may run arbitrary malicious code when loaded. This risk can be limited by restricting the permissions of the resources that load the model.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Good day, everyone. The topic for today would be machine learning engineering with Python. The title of my talk is Machine Learning Engineering Done Right: Designing and Building Complex Intelligent Systems and Workflows with Python. So first I will introduce myself. I am Joshua Arvin Lat and I am the chief technology officer of NuWorks Interactive Labs. I am also an AWS Machine Learning Hero, and I'm one of the subject matter experts who helped contribute to the AWS Certified Machine Learning Specialty exam. So if you were to take that exam, most likely one of the questions there was probably from me. I'm also the author of a machine learning and machine learning engineering book on AWS called Amazon SageMaker Cookbook. Amazon SageMaker is a machine learning service and platform from AWS where you can perform experiments and deployments. You can use your favorite machine learning and deep learning framework with SageMaker, and you can make the most of its many capabilities to get your machine learning experiments and deployments to succeed. Today, we will talk about ten things. The first one would be understanding the needs of the business and the customers when dealing with machine learning and machine learning engineering requirements. The second one would be knowing when to write production-level Python code. The third one would be enforcing practical Python coding guidelines for your team. The fourth one would be using Python design patterns and metaprogramming techniques. The fifth one would be utilizing continuous integration and deployment pipelines. The sixth one would be making the most out of ML frameworks and ML platforms. The seventh one would be working with automated ML bias detection and ML explainability capabilities. The eighth one would be reaping the benefits of cloud computing for automated hyperparameter optimization jobs; of course, we'll explain what HPO is when we get to that slide. Number nine would be optimizing cost by using transient ML instances for training models, and later we'll look at a quick example of fine-tuning BERT models with SageMaker. And number ten would be securing machine learning environments. So without further ado, let's have a quick game. You can see a bunch of apples. In this game, if you guess correctly, I may give you a prize. So how does this game work? Within the next ten to fifteen seconds, the goal is to count the number of apples on this slide. Again, within the next ten to fifteen seconds, I want you to count the number of apples on this slide. Timer starts now, with a quick countdown: ten, nine, eight, seven, six, five, four, three, two, and one. All right, time's up. Again, the goal was to count the number of apples. If you answered, let's say, 18, drumroll, please. That's incorrect. Unfortunately, that's not the correct answer. How about 20? 20 apples? Unfortunately, that's also incorrect. So what's the correct answer here? The correct answer is that it's not possible to count the number of apples on this slide. Sad news for all of us. So the question is, why? The first thing here, if you look at the screen, is that we cannot see the apples underneath the first layer of apples. And the same goes for our day-to-day jobs.
Sometimes when we are dealing with technical requirements, when we're using these awesome tools and frameworks to do our jobs, the problem is that we become too focused on what we're doing, and we tend to forget what the business and the customers need. So the technique here is to listen first and understand the context, because we may be able to provide the best solution without any coding work at all. There may be times when we can just use a specific AI or machine learning service where, with ten lines of Python code, you would be able to solve the customer's problem. Being able to listen to the needs of the customers and the needs of the business is the number one priority you have to think about as a professional. You do not have to be a manager or a boss to care about these things, because if you're working on something, you need to make sure that your customers are winning and the business is winning as well. All right, so the second topic would be knowing when to write production-level Python code. For those of us who have been working with Python for the past couple of years, you are probably aware that there are different ways to use Python. Let's say that you are a data scientist and you want to explore the data and show a couple of charts describing the properties and relationships of the data points in your dataset. That would fall under a machine learning experiment, and you may use tools like Jupyter Notebook to demonstrate and show the output of your Python code. There you may not need engineering techniques for your Python code; even if you're not following a certain set of rules, that's okay, because it's just for demonstration purposes. But when you need to work on systems, then you definitely have to follow engineering techniques and guidelines. For example, if you were to build a machine learning prediction endpoint using Flask and Python, then you would need to follow, let's say, PEP 8 or other coding guidelines, as well as apply the engineering techniques to make sure that your website or endpoint is always up and running and returns a response in less than one second, for example. A minimal sketch of such an endpoint, under a few assumptions, is shown below. Making sure that your Python code is clean is essential when you're working on engineering tasks.
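Here is a minimal sketch of that kind of Flask prediction endpoint. It assumes a scikit-learn model serialized with joblib at a hypothetical path; the input format, port, and field names are illustrative, not from the talk.

```python
# A minimal production-style prediction endpoint, assuming a scikit-learn
# model serialized at model.joblib (hypothetical path and input format).
import logging

import joblib
from flask import Flask, jsonify, request

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

app = Flask(__name__)
model = joblib.load("model.joblib")  # load once at startup, not per request


@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(force=True)
    features = payload["features"]  # assumed shape: {"features": [f1, f2, ...]}
    prediction = float(model.predict([features])[0])  # assumes numeric output
    logger.info("prediction=%s", prediction)  # keep a trace of every response
    return jsonify({"prediction": prediction})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

Loading the model once at startup, rather than on every request, is part of what keeps the endpoint responding in well under a second.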
The third topic is enforcing practical Python coding guidelines for your team, so let's talk about how you manage teams. This is very important for ML engineering managers or data science leaders. If you have a data science team, or a team focused on building machine learning engineering platforms or endpoints, then this is for you. What worked for me in the past, when we were building a machine learning endpoint for a product, was realizing that things get tricky when multiple developers and engineers are involved. Before you can perform meaningful code reviews, it's better to set standards for the company. If you're a CTO, this is going to be one of your roles, because having rules allows the people and professionals in your team to have a consistent way to accomplish their work. If you have rules and standards, your people will perform their jobs better. One rule that has definitely helped me in the past is the 20-line rule: a maximum number of lines inside a Python function or method. Say you have a function called load_model. If the number of lines in that function exceeds 20, say 25 lines, then you divide that function into three or four sub-functions. This keeps your code cleaner and more organized. The second guideline is following PEP 8, or a similar set of guidelines, when using Python. Having something like that will definitely help your team, whether or not you're building a machine learning platform; so if you're using Python, take a look at PEP 8. The third guideline is avoiding careless try/except blocks. Why? The goal is to detect errors as early as possible. The problem with try/except blocks, if you're not careful, is that when you simply wrap a transformation or a transaction in one, the error sometimes disappears, and you lose the ability to debug problems in production endpoints and environments. Say you have 10,000 transactions, and for some reason your logs only show 9,950 records. What happened to the other 50 records? What went wrong? You need some way to know what happened to those 50 transactions. If you don't have logs and you used a try/except block to prevent the endpoint from failing, you have no way to debug what went wrong, and those records and transactions may be lost. The fourth topic is writing testable Python code. When you're building systems, it's important to remember that it is an iterative process. You're not just writing one big block of code; you want to write functions, methods, and classes that let you easily debug the code, say from a console. It's not just about having a web application running; it's also about having a console to easily inspect how a function behaves. Even if you're not practicing automated testing inside your company, at least make your code testable. The next topic is using Python design patterns and metaprogramming techniques. I won't cover all the different Python design patterns and metaprogramming techniques here, but I'll mention some recommended goals and techniques you can use in your company. One example is writing your own convenience library that wraps and abstracts certain operations. This is especially useful when you're working with a larger team and using a lot of tools and SDKs. Say you have a senior engineer and a junior engineer. The senior engineer can prepare a convenience library that works something like an ORM, similar to SQLAlchemy, where Python classes and objects help you perform your job better. The junior or mid-level developers then don't need to care about the internal details or the abstracted automation parts when working with your convenience library. You can use design patterns and metaprogramming techniques to speed up the work and hide the unnecessary details from your other developers and engineers. Of course, do this only when it makes sense. If you're going to spend three weeks building this and your project is going to last four weeks, that may not be the best use of your time. But if you have a super amazing engineer in your team who can build something like this in two days, and you can then make the most of those two days' worth of work for the next three weeks, that's a good use of time. A minimal sketch of this kind of wrapper is shown below.
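As a minimal sketch of the "convenience library" idea: a thin wrapper a senior engineer might write so teammates never touch the raw SDK directly. The ModelStore class and its S3 layout are hypothetical examples, not something from the talk.

```python
# A thin wrapper that hides boto3 and the bucket/key layout behind two
# simple methods. ModelStore and its naming scheme are hypothetical.
import boto3


class ModelStore:
    """Hides the storage details of model artifacts behind a tiny API."""

    def __init__(self, bucket: str, prefix: str = "models"):
        self._s3 = boto3.client("s3")
        self._bucket = bucket
        self._prefix = prefix

    def _key(self, name: str) -> str:
        # One place in the codebase decides how artifacts are named.
        return f"{self._prefix}/{name}.tar.gz"

    def upload(self, name: str, local_path: str) -> None:
        self._s3.upload_file(local_path, self._bucket, self._key(name))

    def download(self, name: str, local_path: str) -> None:
        self._s3.download_file(self._bucket, self._key(name), local_path)


# Usage: juniors call store.upload("bert-v2", "model.tar.gz") and never
# think about buckets, keys, or the boto3 client underneath.
store = ModelStore(bucket="my-ml-artifacts")
```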
The next topic is utilizing continuous integration and deployment pipelines. At the start, you'll be working on these things manually: okay, I need to copy my model, put it inside a container or something, and then deploy it inside AWS Lambda. AWS Lambda is a service where you can write Python code and deploy it as a function-as-a-service; the advantage is that you only pay for what you use. But enough about AWS Lambda, let's talk about this topic. When you're building something, it usually takes three or four steps to come up with a deployment package. Of course, you want that deployment package to be final, tested, and working. And when you're performing multiple deployments per week, with a lot of users already on your system, you should find a way to make sure the deployment package is stable; and if we detect that something is wrong with that package, we should be able to roll back and revert to a previous one. Knowing about continuous integration and deployment pipelines, and the alternatives similar to them, will help your team handle these types of requirements better. This is going to be super helpful, especially when your team is growing and you want to enforce standards. What happens here is that when one of your Python engineers pushes some code to a repo, the integration pipeline activates and runs some tests (for example, a small pytest check like the one sketched below), and then at some point there may be a manual approval step where the engineering manager reviews the results, clicks yes, and the deployment is performed.
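Here is a small sketch of the kind of automated check such a pipeline can run on every push; normalize_features is a hypothetical helper in your codebase, not something from the talk.

```python
# test_preprocessing.py - a tiny pytest suite a CI pipeline can run on
# every push. normalize_features is a hypothetical project helper that
# scales a list of numbers into the [0, 1] range.
import pytest

from preprocessing import normalize_features  # hypothetical module


def test_normalize_features_scales_to_unit_range():
    result = normalize_features([0.0, 5.0, 10.0])
    assert min(result) == 0.0
    assert max(result) == 1.0


def test_normalize_features_rejects_empty_input():
    # Failing loudly on bad input beats silently returning garbage.
    with pytest.raises(ValueError):
        normalize_features([])
```

A pipeline stage then just runs `pytest`, and a failing test blocks the deployment before any user sees the problem.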
All right, so we're halfway through. Number six is making the most out of ML frameworks and ML platforms. There are actually three options here, not just two. The third option is using existing AI and ML services, where with five to ten lines of code you get text-to-speech, or text extraction from images. But for now, for the sake of simplicity, let's talk about two things: building everything from scratch, and using frameworks and platforms. As developers and engineers, we always have the tendency to build everything from scratch. When you're about to learn an existing framework, there's always a tendency to say, oh, that's going to take me one to two weeks to learn, whether it's TensorFlow, PyTorch, or MXNet. And that's probably true. Sometimes the examples on the Internet don't work right away, or sometimes you simply enjoy coding. When you're learning programming, machine learning, and machine learning engineering, it's fine to try building these things yourself. But when you have to work with a team, and in a company where real things happen, like people resigning or being replaced, and you have to work on existing platforms and engineering systems, then you have to know that it's more practical in the long run to use machine learning frameworks and ML platforms. It may not always be the case, but being able to do both is the first step, and knowing when to use which is the second. Because if you're going to build everything from scratch, all the requirements and potential hidden features may not be supported in your custom code, and it might take you longer to build. Being familiar with one, two, three, or more ML frameworks, tools, and platforms will not just help you, it will help your company accomplish its goals much faster. If you were to use an ML platform, say SageMaker, you can also make use of its existing capabilities and features. For one thing, when you're running machine learning workflows and workloads in the cloud, you'll realize that some experiments will require bigger machines, and sometimes not just one but two, three, or more. If you were to build this yourself, it might take you two to three months to build something flexible enough to evolve to more complex use cases. But with an ML platform, learning it might take two to three days, and using it an additional day. Three days all in all, instead of building everything from scratch for two to three months, only to realize, oh, there's no debugger, there's no model monitor, and none of the other high-tech features that the platform or framework already provides. Say we want to modify the number of computers, servers, or instances used for training the model and performing hyperparameter tuning jobs. Here you simply set a parameter to six, and you get six ML instances. And if you want just one instance for model deployment, for the inference endpoint, you can specify that as well with just a few lines of code, as in the sketch below. The advantage is that the infrastructure is abstracted, and you just use Python and the objects and classes in the provided SDK to access and manage the resources. In addition, when using these ML platforms and frameworks, you will find a lot of documentation online for getting them to work in different types of environments. If you were to build things using your own custom code, the disadvantage is that the errors are also custom: when you look for the solution on, say, Stack Overflow, you may not find it right away unless you are very experienced.
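Here is a hedged sketch of that abstraction, based on the SageMaker Python SDK; the image URI, role, instance types, and S3 paths are placeholders, not values from the talk.

```python
# How an ML platform abstracts the infrastructure: a SageMaker Python SDK
# sketch. All angle-bracketed values are placeholders you would fill in.
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<training-image-uri>",
    role="<execution-role-arn>",
    instance_count=6,               # six ML instances for training, as on the slide
    instance_type="ml.m5.xlarge",
    output_path="s3://<bucket>/output",
)
estimator.fit({"train": "s3://<bucket>/train"})  # launches the training job

# One instance is enough for the inference endpoint:
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
)
```

Changing `instance_count` from six to one, or to twelve, is the entire "infrastructure change"; the SDK provisions and tears down the machines for you.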
You can also make the most of AWS Lambda and serverless here. If you are just calling an endpoint four times per day, why have a dedicated instance for it? With Lambda you can technically get this almost for free: if you're only using AWS Lambda for a few seconds a day and it's under the free tier, it's much, much cheaper than having an ML instance running there with your deployed model. You can use AWS Lambda with scikit-learn, you can use it with TensorFlow, and you can even deploy a Facebook Prophet model inside a Lambda function. You can combine that with API Gateway, a service that lets you deploy an HTTP endpoint and have that endpoint trigger the AWS Lambda function you have prepared. There are also a lot of deployment solutions out there. It's super important to know how to build these from scratch, but there are ways to speed up solving these types of problems in just a couple of hours. If something would take you four weeks to build, maybe you can do the same thing in two to three weeks, especially if your team is already using a given platform or set of tools. The first option is deploying a model in an EC2 instance. That's one of the most customizable options out there: if you want to build everything from scratch, you can deploy it inside an EC2 instance, or the equivalent on a different platform. The second is deploying the model in a container running in an EC2 instance. The third is using a built-in algorithm for training and then deploying to a SageMaker endpoint with, say, ten lines of code; this is very helpful when you're preparing proof-of-concept work for your boss before your machine learning project gets approved. The fourth is using custom containers: building your own Docker container images and deploying them to a SageMaker endpoint with just a couple of lines of code. The advantage is that you can still make the most of an existing platform's features, say Model Monitor, to help you detect model drift. So what is model drift? Most machine learning practitioners only know how to train, build, and deploy a model, but in reality a model deployed in production may degrade over a couple of weeks or months. Being able to detect model drift and replace that model is essential, and knowing that models really do degrade over time is an essential insight for veteran machine learning practitioners. A model can also be deployed inside a Lambda function, as shared earlier, and we can also use Lambda to trigger a SageMaker endpoint. If you have deployed a model in a SageMaker endpoint, you can use AWS Lambda to perform some custom steps first before triggering the SageMaker endpoint, giving you flexibility, especially if you need to pre-process your data before performing the prediction; a minimal sketch of this follows below. Also, you can use API Gateway mapping templates with SageMaker so that there's no Lambda function between the two services: you just use VTL to map the input data to the SageMaker endpoint directly. And you can also deploy the model inside Fargate, a service that helps you run containers and container images in AWS. Feel free to apply these concepts when you're using other platforms as well, because they most likely have similar services.
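Here is a minimal sketch of a Lambda function that pre-processes the request and then triggers a SageMaker endpoint; the endpoint name, field names, and CSV payload format are assumptions for illustration.

```python
# A Lambda handler that pre-processes input and invokes a SageMaker
# endpoint. Endpoint name and feature fields are hypothetical.
import json

import boto3

runtime = boto3.client("sagemaker-runtime")
ENDPOINT_NAME = "my-model-endpoint"  # placeholder


def lambda_handler(event, context):
    body = json.loads(event["body"])
    # Custom pre-processing before the prediction, e.g. ordering features:
    features = [body["age"], body["income"]]
    payload = ",".join(str(value) for value in features)

    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="text/csv",    # assumes the model container accepts CSV
        Body=payload,
    )
    prediction = response["Body"].read().decode("utf-8")
    return {"statusCode": 200, "body": json.dumps({"prediction": prediction})}
```

API Gateway in front of this function gives you the public HTTP endpoint, and you pay only for the milliseconds the handler actually runs.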
When you're using SageMaker, I would like to add that there are also combinations: for more complex use cases, you can technically deploy multiple models inside a single machine learning instance, and there are different use cases for that. You can also deploy SageMaker multi-container endpoints, where one endpoint can have multiple containers with different models. Say you're using a custom model built with a specific deep learning framework; you can put that model inside its own container, and if you have four such containers, you can deploy them all inside a single endpoint and simply select which container to use when performing inference. That's very useful when you're trying to compare different models in a production environment. The next one is setting up A/B testing using production variants in SageMaker. Say you have a model deployed in an endpoint. You can do an 80/20 split, where 80% of the traffic is handled by one model and a new model handles the other 20%, and then compare the performance of the two models before replacing the first one; if the second model performs better than the first, that's the time to replace it (a sketch of this setup follows after this paragraph). You can also deploy a model inside a Lambda function with containers. About five months ago, AWS released a feature where, in addition to writing Python code inside a Lambda function, you can use your own custom container images to load your model. This is very helpful when you're using deep learning frameworks and trying to get them to work with AWS Lambda. Here you get the best of both worlds: you use your customization capabilities, especially your DevOps skills, to prepare the custom container image that loads the model, and it works with AWS Lambda where you pay only for what you use. If you're just using it for three seconds per day, then you only pay for three seconds per day, which is super cool. You can also use the data science libraries with SageMaker; with those, you can easily build machine learning workflows using AWS Step Functions. Here you can automate the entire process, and if you want to perform model retraining, you can do something like this: once you have uploaded your new data to a bucket in a storage service, say S3, the automation workflow automatically triggers the training step, then evaluation, then deployment. And if the new model is performing better than your previous model, you can automatically replace your existing model in production. Pretty cool, because you're not stuck with manual steps; you can just leave this running, especially if you want to work on other machine learning projects.
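Here is a hedged sketch of that 80/20 production-variant setup using boto3; the model, config, and endpoint names are placeholders.

```python
# A/B testing with SageMaker production variants: 80% of traffic to the
# current model, 20% to the challenger. All names are placeholders.
import boto3

sm = boto3.client("sagemaker")

sm.create_endpoint_config(
    EndpointConfigName="ab-test-config",
    ProductionVariants=[
        {
            "VariantName": "current-model",
            "ModelName": "model-a",          # existing production model
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.8,     # 80% of the traffic
        },
        {
            "VariantName": "challenger-model",
            "ModelName": "model-b",          # the new candidate model
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.2,     # 20% of the traffic
        },
    ],
)
sm.create_endpoint(
    EndpointName="ab-test-endpoint",
    EndpointConfigName="ab-test-config",
)
```

The weights are relative, so shifting traffic later is just an update to the variant weights rather than a redeployment.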
You can also make the most of, say, SageMaker Clarify to automate bias detection. Here you have your data and maybe your model, and you pass those as parameters to your SageMaker Clarify jobs. So what's bias? ML bias is something you're probably aware of if you've been working in the industry for some time. When you're deploying a model and using it in production, it's not just about making the right prediction; it's also about following the guidelines and making sure your model is not biased towards certain groups. We won't cover this in detail here, but it's important to note that there are a lot of metrics you can check when working with bias, say class imbalance, DPPL, or treatment equality, and you can fix your data after these metrics reveal that it has issues. There is some sample Python code for using SageMaker Clarify: you just specify, say, the instance count and instance type, pass in your data, and maybe a few configuration parameters. Instead of trying to learn how to detect bias and implement all these formulas yourself, why not use a tool that gives you the metrics right away? If you were to check for, say, class imbalance, with ten to fifteen lines of code and ten minutes of waiting, you would get a report showing whether there's class imbalance in your dataset. The same goes for ML explainability. So what is ML explainability? If you are working with more complex algorithms and models, the output may not be easily explainable, and the better we can explain a model, the easier it is to get an organization to actually use that model or algorithm. In ML explainability, you can use SHAP values to explain your model. How do we interpret this? Out of, say, four features in our training dataset, we might find that only two actually contribute to the final outcome of the model's prediction: with features A, B, C, and D, only A and B actually contribute to the prediction. That's an example of what you get if you use SageMaker Clarify to compute the SHAP values after passing in your data. The next one is a really exciting topic: automated hyperparameter optimization. So what is automated hyperparameter optimization? The first step is understanding what hyperparameters are: hyperparameters are configuration parameters that you set before a training job runs. One training experiment, one set of hyperparameters. When you're creating models, it's critical to realize that after one experiment, we're not really sure whether that model is the best model for the problem. The technique is to change the hyperparameter values, run the experiments again, and compare the evaluation metrics with those of the previous model. Of course, doing this by hand is very time-consuming. So how do we solve this more practically? With cloud computing, you can easily spin up a lot of resources, say for a few minutes each, and run one training experiment on each ML instance. After fifteen minutes, you might have fifteen different experiments and fifteen different models, and end up with a fine-tuned model that has the best metric values of all the models produced by the tuning job; a sketch of such a tuning job is shown below.
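Here is a hedged sketch of an automated hyperparameter tuning job with the SageMaker Python SDK. The estimator, objective metric, regex, and ranges are all illustrative assumptions, not values from the talk.

```python
# Automated hyperparameter optimization with the SageMaker Python SDK.
# Angle-bracketed values and the metric regex are placeholders.
from sagemaker.estimator import Estimator
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner

estimator = Estimator(
    image_uri="<training-image-uri>",
    role="<execution-role-arn>",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:accuracy",
    metric_definitions=[
        # How the tuner parses the metric from the training logs:
        {"Name": "validation:accuracy", "Regex": "validation-accuracy=([0-9\\.]+)"}
    ],
    hyperparameter_ranges={
        "learning_rate": ContinuousParameter(0.001, 0.1),
        "dropout": ContinuousParameter(0.1, 0.5),
    },
    max_jobs=15,          # fifteen experiments in total...
    max_parallel_jobs=5,  # ...with up to five running at the same time
)
tuner.fit({"train": "s3://<bucket>/train"})

# Deploy the model with the best objective metric value:
predictor = tuner.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")
```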
In a similar fashion, you can perform automated hyperparameter tuning across different model families. Say you have a custom algorithm using the Apache MXNet deep learning framework, and a second model family using the Linear Learner built-in algorithm. You can then run a single hyperparameter tuning job where the first model family uses one set of hyperparameter ranges and the second family uses a different set. Those training jobs run, and the best model is used in the final model deployment step. On optimizing cost by using transient ML instances for training models: we can make the most of transient ML instances, where an ML instance runs for, say, ten minutes and then turns off automatically. This is very helpful when you're training or fine-tuning existing models that need a lot of resources. An example would be working with BERT models: say you have Hugging Face and BERT, and you need p2.xlarge instances, which are super expensive. If you only run such an instance for two to three minutes, that's much better than running the same large instance for three hours. So using transient ML instances for your training jobs is super important when managing cost. Finally, when securing machine learning environments, it's critical that you take care of both the process side and the tech side of things. Knowing about the principle of least privilege is important because, when you're preparing your environments, you have to set up and manage the security configuration first and make sure from the beginning that it's properly in place, so you can leave your engineers to work without having to worry about security every day. Set the rules, set the guidelines, set the restrictions so that they can only perform what they should be doing, and this does not apply only to humans: it also applies to resources in the cloud. Here is an example of a potential risk when using a library. Say a library allows you to load and save models, but you load a model from an untrusted source, and that model runs arbitrary malicious code when loaded; technically, your system has been compromised (a minimal illustration of this risk is sketched below). What can you do? You can limit the permissions of the resources loading the model. Say you have a container using Python that loads this model from an unauthorized source; you can restrict that resource to only perform certain actions. In case one, if that resource has super-admin permissions and the model is loaded there, the problem is that the malicious code can perform super-admin actions. In case two, if the resource loading the model has limited permissions, then the malicious code can only perform a limited set of actions as well, so at least you limit the damage when an accident happens. So that's pretty much it. Thank you again for listening to my talk. You've learned a lot in this short session, so make sure to use that knowledge in your day-to-day machine learning life. Thank you again, and feel free to reach out to me via email or LinkedIn. I hope you learned something from my talk.
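To make the untrusted-model risk above concrete, here is a minimal sketch showing why loading a pickled "model" from an unknown source is dangerous: Python pickles can execute arbitrary code at load time. The class below is a deliberately harmless stand-in for a malicious payload.

```python
# Why you should never unpickle a model from an untrusted source:
# unpickling can run arbitrary code. This payload only echoes a message,
# but an attacker's version could do anything the process is allowed to.
import pickle


class NotReallyAModel:
    def __reduce__(self):
        # Called during unpickling; returns a callable plus its arguments.
        import os
        return (os.system, ("echo 'arbitrary code just ran on your machine'",))


payload = pickle.dumps(NotReallyAModel())

# The "victim" side: simply loading the "model" executes the command above.
pickle.loads(payload)

# Mitigation from the talk: run model-loading code under least-privilege
# permissions so a compromised load can only do limited damage.
```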
...

Joshua Arvin Lat

CTO @ NuWorks Interactive Labs
