Conf42 Machine Learning 2021 - Online

E2E ML Platform on Kubernetes with just a few clicks

Abstract

Kubeflow is a machine learning toolkit for Kubernetes where users can develop, deploy, and manage ML workflows in a scalable and portable manner. Deploying and maintaining it can be a bit tricky, since Kubeflow is composed of many components, such as notebooks and pipelines, each with their own potential configurations. This makes the barrier to entry to Kubeflow very high and makes it difficult for teams to adopt it.

To help alleviate some of these deployment woes, the Kubeflow Operator was created. It automates the deployment, monitoring, and management of Kubeflow as a whole. In this session, users will learn how they can best leverage the Kubeflow Operator to quickly get Kubeflow up and running on their Kubernetes clusters.

Summary

  • In this talk, we'll be talking about Kubeflow and how you can get an end-to-end machine learning platform with just a few clicks. Most of the effort is still spent on that small, tiny block in the middle: the machine learning code. But now, more and more, it is a core part of every business.
  • MLOps is the ability to apply DevOps principles to machine learning applications. If you're building software in 2021, I'm hoping you have embraced some of the DevOps practices. MLOps is trying to bring those same principles into the lifecycle of your machine learning models.
  • Mofi is a software engineer and developer advocate at IBM. Most recently, he has been contributing to the Kubeflow upstream project in the manifests and deployment special interest group. If you have any questions after the conference, you can reach out to him.
  • An end-to-end machine learning platform covers everything from the start of the data to actually turning the model into something useful in our application and our business. The pros of using one of the major cloud providers would be that it is fully managed and works well with other cloud services.
  • Kubeflow is an open source project that contains a curated set of tools and frameworks for machine learning workflows on Kubernetes. Kubeflow is scalable, composable, portable, open source, industry supported, and multi-tenant.
  • Kubeflow is an open source project and it's rapidly growing. The goal is to improve accountability for maintaining components, and the restructured manifests increase modularity. We have improved some of the Kubeflow deployment strategy, so first-time users should have a much smoother experience.
  • The demo Kubeflow deployment on IBM Cloud uses App ID as its authentication mechanism. You can define your pipelines using a Python DSL. On the same Kubernetes cluster you could have multiple team members working simultaneously. In pipelines we can make use of frameworks like PyTorch, MXNet, and XGBoost to build our models in many ways.
  • Thank you so much for joining me in this session. Go to kubeflow.org to learn more about Kubeflow and get started with it. If you have any questions, you can reach out to me at @moficodes on any of the social media.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Welcome to Conf42 Machine Learning. In this talk, we'll be talking about Kubeflow and how you can get an end-to-end machine learning platform with just a few clicks. My name is Mofi, I'm a software engineer at IBM and a contributor to the Kubeflow project. If you want to follow along and find the slides, the link is on this slide: tiny.cc, e2e ML k8s. So, not too long ago, as we can all remember, machine learning and AI was a novelty, right? Companies were spending a lot of time and a lot of money researching machine learning ideas, and it was not necessarily a core part of your business. Some companies were doing it well, some companies were doing it a little more sporadically. At IBM, where I work, even ten years ago machine learning was a huge research topic and we were spending hundreds of millions of dollars every year. But it wasn't directly making any business impact. It was more the theoretical side of business, where you have a lot of data, you just don't know what to do with it, and you try different things out to gather some information from the data. Well, if you look at the whole machine learning ecosystem, there are a lot of different things that need to happen for machine learning code to become useful. But most of the effort is still spent on that small, tiny block in the middle: the machine learning code. Engineers and data scientists love their machine learning code. We write a lot of code in Python, R, TensorFlow and all of these things, and we spend a lot of time in that middle, but to make your machine learning code valuable, you actually need to spend a lot more time on the things around it, right? You need to spend time on data verification and analysis, you need to serve the model you create, you need to monitor the model so that you know the thing it's doing is the right thing to do. So now, more and more, machine learning is a core part of every business, and just the machine learning code is not good enough to serve the underlying business that is trying to improve by using machine learning and AI.

Data is everywhere. We generate more data now than ever, and understanding what that data is and what that data means is more important than ever. It's estimated that in the next ten to fifteen years, our data ingestion and data creation will quadruple or grow exponentially. But if we don't understand what that data means to us, we are just wasting time and money by generating more data; we're not actually getting anything valuable out of it. So with that, because machine learning is becoming more and more mainstream and more and more a core part of our business, comes the rise of MLOps. If you are not familiar with what MLOps is, well, you can ask: what is MLOps? MLOps is the ability to apply DevOps principles to machine learning applications, and this is the definition given by the MLOps Special Interest Group from the CD Foundation. What MLOps is trying to solve is that right now, machine learning, even to this day in a lot of companies and organizations, is a sporadic thing. It's done by a data scientist, it's almost treated like research, it's done almost in an educational capacity, and we want to make that into a core part of your business. If you're building software in 2021, I'm hoping you have embraced some of the DevOps practices.
Where you have your code, your code is being continuously tested, your code is being integration tested, then your code is going through some sort of deployment, a continuous delivery model, where your code goes to your version control, then gets tested, gets built, gets pushed to production, gets tested again, and you have rollback and all of those features. And MLOps is trying to bring those same principles into the lifecycle of your machine learning models as well. Well, with that introduction: I am Mofi. I'm a software engineer and developer advocate at IBM, and I mostly do container things. So if you haven't figured it out by now, I am not from the world of machine learning. I'm not coming into this from the perspective of a data scientist; I come from the world of infrastructure and containers. Most recently, I have been contributing to the Kubeflow upstream project in the manifests and deployment special interest group. And if you need to find me on social media later, I can be found at @moficodes on any of the social media above, mostly Twitter. So if you have any questions after the conference, feel free to reach out to me at @moficodes on Twitter.

The title of the talk is about an end-to-end machine learning platform, so what does that even mean, and why do we care about an end-to-end machine learning platform? Without going deeper into what it means, let's just talk about what we want an end-to-end machine learning platform to have. When we say end-to-end machine learning platform, we're talking about something that covers everything from the start of the data to actually using the resulting model for something useful in our application, in our business. The very first step, or one of the earlier steps, is data ingestion, right? You have data being produced, either by you collecting data from your application, or users just sending you data, or you collecting data by some other means. And this is the part a lot of companies are still stuck at: there are some graphs, and machine learning and data analysis actually stops for a lot of people at that stage. But if you want to do anything useful with the data, if you want to build some intelligence around the data, you need to take that data and clean it up, transform that data into something usable, and validate the data so that you know this data makes sense within the larger structure of the whole dataset. Then you are preparing the data for training: you are building models, you're validating models, and training that model, maybe with distributed training, to create the model that you can use. This is where we see a lot of the gaps between the theoretical structure of machine learning and the business use cases that we have. Oftentimes data scientists are building these models, running this training on their own machines, and then testing that it works. This is great, but the true value of machine learning only happens when we are serving that model, rolling it out into our application, using that intelligence, and then continuously monitoring that model's performance, logging it, and making that a whole loop. Right? We're not just stopping at "oh yeah, I have built a model, great, now our problems are solved." We actually now have to say: we have built a model, we're serving this model.
We are getting information from serving it, the data we are getting makes sense, the performance makes sense, and we're continuously improving that model. So an end-to-end machine learning platform ideally would give us all of these things, right? And some parts it will probably do better than others. You also want to make sure that we have a way to continuously improve. So end-to-end doesn't just mean that we have all the features and we're good to go; I think end-to-end should give us a way to continuously improve at each of those steps. There are many commercial end-to-end machine learning offerings out there, and almost all major cloud providers have one. Other than that, you also have a bunch of third-party companies providing a machine learning platform as a service you can use. For example, you have Google Cloud AI Platform. I happen to work for IBM; we have a couple, we have IBM Cloud Pak for Data and Watson Studio. AWS has SageMaker and many other services, and you have the Azure machine learning platform. And again, as I said, there are many more third-party companies providing this as a service.

The pros of using an end-to-end machine learning platform from one of the cloud providers would be that it is fully managed, you don't have to run it yourself, and all the bells and whistles are included with it. It works well with other cloud services. So if you are already a customer of Azure, using Azure's machine learning service will work pretty well with the other Azure services that you have, like Azure Pipelines and Azure Kubernetes; all the other things will kind of work together. And the same goes for any other cloud provider. And because you are already with a major cloud provider, your machine learning platform itself is also cloud scale; you have an easier time just scaling everything up. You also have enterprise support, right? You are working with a cloud provider and you are on an enterprise plan, so you definitely have a lot more support with it. If something goes wrong, you have someone to talk to or get support from. Some of the cons, I would say, of an end-to-end machine learning platform, if you are going with a cloud provider, are that it's going to be expensive, obviously; it's a pretty hefty cost you have to consider if you want to go down that route. You definitely have the chance of getting vendor locked in. One of the reasons that would happen is that you are not necessarily doing machine learning, you're doing AWS machine learning. Whatever vendor you go with, you will have a version of "okay, we are doing this the way Google thought it should work." Then you build your infrastructure around it, and at some point you find out that you can't change it without major rework if you want to move to someone else. So you have this problem of vendor lock-in. And because your environment is managed by the cloud, it is not always super clear how you could have a local environment that replicates exactly how the cloud environment works. Oftentimes some of these components are not even open source, so you have no idea exactly how they work under the hood. So you have to do everything on the cloud, or you can do something on your local machine, but then you run the risk of not having a matching environment between cloud and local.
Obviously, many of those tools are not open source with these vendors, so you don't really have a way of doing it on your own, or having a look at the source code, or improving the code if you see anything problematic; you are dependent on the cloud provider to continuously make that improvement. And code and models are, most of the time, not portable, because you're building them in the cloud with their proprietary software. So it's going to be difficult for you to take your code and model and run them somewhere else. It's not true for every provider and not every cloud, but there is a high chance in a managed environment that this could happen. So why not DIY, right? A lot of open source tools are out there. TensorFlow is open source. There are a bunch of other things; if you look at the whole end-to-end picture, many of the tools that you need are out there and open source. So you could definitely do something yourself using open source, or it is also possible for you to write everything from scratch. And it is not the first time the industry has done so. Companies like Uber, Netflix, Airbnb, Lyft and many more have actually rolled out their own solutions internally. I think Uber has a platform called Michelangelo, and that covers very Uber-specific problems in a very Uber-specific way; it solves those machine learning problems by building an entire platform internally at Uber. So if you are going down that route, some of the pros are that you have full control over the platform, right? You are building the platform for your company and your exact needs, so everything you build would be custom made for you, because it's owned by you. Usually you would not have any vendor lock-in; you decide exactly how you run it, even if you need to switch cloud providers. Because you own the software and the platform itself, you just need to pay for the infrastructure if you're not running in your own data center, and it's customized to your needs. So at least initially you would feel like this is the perfect solution, because you made that solution for the problem at hand. But some of the cons are that it's expensive. Although it might be less expensive because you are not paying for the service itself, the engineering hours you would spend building something from the ground up like this would be pretty expensive to manage over time. And you are on the hook if something goes wrong. This is a service you made; you have to make sure it is up to date with the latest things, and as things change and progress, you will have to keep spending engineering hours to make sure it stays up to date. So as time passes, the platform becomes harder and harder to manage: you built a platform to do some machine learning for your business, you end up managing the machine learning platform, and now you don't have time to maintain your business. So there is some difficulty with DIY as well.

Now let's take a step back and think about what we would want from a perfect end-to-end machine learning platform. Number one, it should be built on scalable infrastructure: something that can scale to whatever we need it to scale to. It should use existing tools data scientists already use; we don't want to create something so new that our data scientists have a huge learning curve and have to learn entirely new things.
Again, ideally, and at least in my opinion, it should be open source, so we know the community itself is improving and taking forward the project that we use and depend on. It should be supported by the industry; we want to make sure we're not the only ones using it. That's good for us for getting long-term support, and it's also good for us for getting long-term talent that knows these tools. Enterprise support options: of course you should have the option to DIY, but you also want to make sure that when you are ready to go the route of "I want to just pay some company to deal with some of the management issues," you have that option as well. And finally, it should be portable. Your machine learning models and your code should be portable, for you to take anywhere you want to take them. Well, a lot of these things, and I probably gave it away early on: is Kubeflow the tool that covers all of these end-to-end machine learning needs that we have? I would like to think that it is. Right now we're going to talk about how Kubeflow fits all of these criteria that I want in my end-to-end machine learning platform. So it's an open source project that contains a curated set of tools and frameworks for machine learning workflows on Kubernetes. And Kubernetes is a keyword here, because that gives us a bunch of the features that we see here. Because it's running on Kubernetes, Kubeflow is scalable, composable, portable; it is open source, it is industry supported, and it's multi-tenant. So you can have the same Kubeflow environment for the entire team, give them individual spaces, and run your experiments in an environment that is going to be very close to what the final destination of that product or project is.

Scalable: it's built on top of Kubernetes, so scalability is built in. You get scalability of the pods, and you also get scalability of the nodes. So depending on your installation of Kubernetes, whether you are doing it yourself or using a managed Kubernetes, you can scale your cluster to any reasonable number of pods. And also, because it's built on top of Kubernetes, and Kubernetes has an existing pool of skilled individuals who know how the infrastructure works, either you are already using Kubernetes for other things in your company, or it is fairly easy to find folks with Kubernetes skills. So you can also think of using Kubernetes as a means of scaling the teams that need to use this platform, right? You can easily find talent, whereas if you are building something very custom in house, you would have a harder time finding people who just know the system: you have to hire someone and train them on the system, so you're spending a lot of cycles on building skills up, where with this system you already have people who know Kubernetes. And hopefully, by virtue of Kubeflow being an open source project, a lot of people would also know Kubeflow as a system. Composable: going back to this slide, Kubeflow has ways to manage each of those steps by using different tools under the Kubeflow ecosystem. We're going to look at a few of these today, but composable basically means you can use different parts of the tools under Kubeflow to create this system that covers all of the things that we want in an end-to-end machine learning platform. Portable: with Kubeflow, you can go from local to a training environment, or from a training environment to the cloud, or from cloud to cloud.
And your Kubeflow environment underneath stays pretty much the same. We like to think that when you're running experiments on your local machine, versus when you're running your training, versus when you're running your cloud deployment, it all looks the same, right? That is what we want to happen. But usually what ends up happening is that our experiment environment is much smaller in scope: we are just running maybe a Jupyter notebook or one Python file. Then in our training environment we are running that, but on a much beefier machine with GPUs and other resources. Finally we go to the cloud, and now we are dealing with a lot more things: we are dealing with IAM permissions, we are dealing with models, we are dealing with canary deployments, rollbacks, rollouts. So our environments end up looking a lot different when we are just doing it ourselves. And although we would like to think "okay, we have deployed it ourselves and we have tested it, the model works," every single minor difference between each of our environments can end up leading us into outages, because we probably haven't covered the differences between our experiment stage and our staging, and between our staging and our cloud. And each of those differences ends up turning into something bigger later on, because machine learning is no longer a novelty. Machine learning is a core part of our business, and machine learning is no longer just research. We have to use machine learning now to get true insight into our business, to be able to stay ahead of the curve. Machine learning used to be something you used almost to get an advantage; now machine learning is what you have to use just to keep up, because everyone else is using it too. So it's part of the core business. We quality control our software; we make sure that our software is not regressing from version to version. Then we need to actually quality control our machine learning artifacts as well. We can't just build the model on our laptop and deploy it into production by copying some files over. That's not how we do software, and that's not how we can do machine learning either. So with Kubeflow, you can have a local environment of Kubeflow just on your laptop, or in a dev Kubernetes cluster somewhere. You could have a training environment with GPUs also running Kubeflow. And finally, you can have the deployment Kubeflow in the cloud. Now you have limited the number of differences between your experiment, training and cloud environments, because all of them are using Kubeflow underneath. This makes your environments pretty much identical to each other, and thus makes your setup portable.

These are some of the Kubeflow components we usually talk about. First of all, you have the platforms: your clouds, which could be on-prem Kubernetes, or local Kubernetes, or any of the cloud providers. On top of that, you have the Kubeflow applications. There are a lot of names here; we're not going to be talking about all of them today, but some of the key things are here. We have Istio as the network layer. We have Argo or Tekton as pipelines, so if you want to build Kubeflow pipelines, you use one of those. For machine learning tools, you have Jupyter notebooks, MPI, MXNet, TensorFlow, PyTorch, XGBoost, and all these other things. So, another view of the components:
The very first thing we have in Kubeflow is a dashboard that lets us look at all the things that are in our Kubeflow environment. Next we have Jupyter notebooks, and as of the latest version of Kubeflow we also have a bunch of other servers there as well, like code-server or RStudio. We have some of the machine learning frameworks like TensorFlow, XGBoost, PyTorch. For pipelines we have a choice of Tekton or Argo. For serving we have Seldon or KFServing. For machine learning metadata we have MLMD. For a feature store we can use Feast. And for monitoring, because it's running on top of Istio, we can make use of Prometheus and Grafana dashboards to look at what's happening in the cluster, as well as monitor our models as the traffic routing is happening. So if you want to deploy Kubeflow today, you can head to the manifests repository and use kustomize and kubectl. That's the latest, most recent instruction on how to install Kubeflow, and you can use that to install Kubeflow on your Kubernetes cluster or on your local machine with Minikube. The manifests repository is structured in a way now where it's easy to find what are the extra apps, what are the common components in Kubeflow, and what are the contributions from the community to Kubeflow. Until about two weeks ago, when the 1.2 release was the main release of Kubeflow, the repository was a little bit more cluttered: we had a lot more things at the top level, so it was difficult to navigate. But with the new 1.3 release, we have improved some of the Kubeflow deployment strategy by changing the layout of the repository. So again, if you looked at Kubeflow before, this is what it used to look like, the structure on the left. With the latest release we went through and changed much of the structure by restructuring things. It still does the same exact thing, but it's restructured to do things a little bit more cleanly. So how can this help? The goal is to improve accountability for maintaining components. The manifests increase modularity: you can pick and choose the tools you want to install pretty easily, and we want to make sure the deployment experience is smoother, so first-time users should have a much smoother experience. If you had tried Kubeflow at an earlier time, you can also use the Kubeflow Operator. So you have the Operator that you can make use of, and a Kubernetes operator is built so that we have an easier time deploying Kubernetes resources. I'm going to just skip through the operator part for a second, but you can use the Kubeflow Operator to deploy Kubeflow to Kubernetes or OpenShift, and you have documentation about it here.

One thing I want to mention: Kubeflow is an open source project, and it's not all sunshine and rainbows. Some of the difficulty with Kubeflow is that, because it's open source and rapidly growing, there are definitely growing pains, because a bunch of the underlying components are also open source and have their own release cycles. So you have some challenges there, right? Kubeflow has many moving parts; each component has its own release cycle and upgrade path. So if you're maintaining Kubeflow as a distribution for your company, it is, at least at this point, quite a big challenge. And we, as the Kubeflow team, are trying very hard to make sure that doesn't usually happen.
For the most part, as an individual end user of Kubeflow, you don't really see a lot of these problems, but as maintainers we see it quite a lot: an underlying component is updating, and we have to update everything to make sure we're on the latest version and using the latest and greatest of those underlying changes. Each of the Kubeflow distributions, from Azure, from IBM, from AWS, from GCP, has small differences that add up in the overarching Kubeflow deployment. So if you are using Kubeflow on Minikube and you want to move your Kubeflow deployment to AWS, say, it might not be the exact one-to-one change we would like it to be, but for the most part it is still very similar to how you would use Kubeflow on your local machine versus in the cloud. The future: Kubeflow 1.3. Well, I'm saying "will," but it is already here; it was released about one week ago. And all the distributions, like IBM and AWS and GCP, are right now testing to validate that everything works on the newest release. So if you are looking at Kubeflow in about a week's time, you should be able to go and use Kubeflow 1.3 on your favorite cloud platform. And because the manifests repo has been restructured, it's much easier to navigate. Okay, so some references: if you want to try out Kubeflow, please go to the Kubeflow manifests repository, and if you want to join the community, the Slack, or the mailing lists, you should go to the community repository. You can also learn more about the operator framework and how the Kubeflow Operator can improve the experience of installing Kubeflow.

But before we finish, I want to quickly show you the Kubeflow environment. I have a Kubeflow deployment on IBM Cloud and I'm using App ID as the authentication mechanism; by default, Kubeflow comes with Dex for authentication. So I am authenticating against the Kubeflow environment, and once I have authenticated I am here in my Kubeflow. This is the Kubeflow dashboard you would see the first time you come in, and on the left side you can see some of the tools Kubeflow has. I have some notebooks created in the notebook servers; as of Kubeflow 1.3, as we said, we have VS Code (code-server) here, as well as RStudio, as well as JupyterLab. And this is namespaced per user. Right now I am logged in with my email here; I can also log in with a different email account, with my Google account. Once I log in, it will ask me to create a new namespace, and once I do that, I am on the same cluster but in a different namespace. So if I go to notebooks, the other notebooks were for the other user and they're not here. So on the same Kubernetes cluster you could have multiple team members working simultaneously, next to each other. I'll actually go back to the other user, because I had something else to show on that user. Here I also have experiments that I can run. I have run one experiment; it's a simple pipeline that does a coin flip and tests a condition based on the result. You can define your pipelines using a Python DSL, and it will run that pipeline with KFP Tekton. You can use that to build your model, then use KFServing to serve that model, and get the full pipeline: data ingestion, data validation, then create the model, then serve the model, and then use Istio to monitor the information as your traffic gets routed to the model and served by the model.
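To give a feel for that Python DSL, here is a minimal sketch of a coin-flip pipeline of the kind shown in the demo, written against the Kubeflow Pipelines SDK (kfp v1.x). The step functions, base image, and file names are hypothetical placeholders, not the exact code used in the talk.

```python
# Minimal sketch of a coin-flip pipeline using the KFP v1 Python DSL.
# Assumes `pip install kfp`; the step names and images below are illustrative only.
import kfp
from kfp import dsl
from kfp.components import create_component_from_func


def flip_coin() -> str:
    """Hypothetical step: return 'heads' or 'tails' at random."""
    import random
    return random.choice(["heads", "tails"])


def announce(result: str):
    """Hypothetical step: print which branch the pipeline took."""
    print(f"The coin landed on {result}")


# Wrap the plain Python functions as lightweight pipeline components.
flip_coin_op = create_component_from_func(flip_coin, base_image="python:3.9")
announce_op = create_component_from_func(announce, base_image="python:3.9")


@dsl.pipeline(name="coin-flip", description="Flip a coin and branch on the result.")
def coin_flip_pipeline():
    flip = flip_coin_op()
    # dsl.Condition gates each downstream step on the coin-flip output.
    with dsl.Condition(flip.output == "heads"):
        announce_op(result="heads")
    with dsl.Condition(flip.output == "tails"):
        announce_op(result="tails")


if __name__ == "__main__":
    # Compile to an Argo workflow YAML; on a Tekton-backed install you would use
    # the kfp-tekton compiler (kfp_tekton.compiler.TektonCompiler) instead.
    kfp.compiler.Compiler().compile(coin_flip_pipeline, "coin_flip.yaml")
```

The compiled file can be uploaded through the Pipelines UI shown in the demo, or submitted programmatically with something like `kfp.Client(host=...).create_run_from_pipeline_func(coin_flip_pipeline, arguments={})`, where the host URL depends on how your Kubeflow instance is exposed.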
You also have Katib for hyperparameter optimization; if you want to do some of that, you have that option as well. We also have PyTorch, MXNet and XGBoost installed here, so in pipelines we can make use of those frameworks to build our models in many ways. Thank you so much for joining me in this session. If you have any more questions or would like to learn more, you can go to kubeflow.org to learn more about Kubeflow and get started with it. If you have any questions for me, you can reach out to me at @moficodes on any of the social media. It's @moficodes. So thank you once again to the conference organizers for giving me this opportunity. With that, I thank you all. And until next time.
...

Mofizur Rahman

Developer Advocate @ IBM



