Conf42 Kube Native 2023 - Online

Applying DevOps practices in Data and ML Engineering

Abstract

Much more efficient ML and Data Engineering can be achieved by:
- reducing dependencies between teams
- enabling everyone to focus on work that requires their core skills
- automating and abstracting data infrastructure and DevOps processes as much as possible
And we will demo how.

Summary

  • Antoni Ivanov: Today we talk about applying DevOps practices in data engineering and data as a service. Have you watched Moneyball? It is a great movie and book about the Oakland Athletics baseball team. Today, analytics and data are actively used by all successful teams.
  • Data is actively useful and is used to create data applications. A data application is still an application. How do multiple applications communicate with each other? An API is the standard through which any application communicates.
  • An API is how two applications communicate. We need an API between our data application and downstream applications like BI tools or other software applications and services. Yet there are no tools and no established conventions for creating APIs for data.
  • These APIs for data can also be thought of as data contracts. How do we map the data journey to the DevOps cycle? It's fairly intuitive: once the data has been ingested, transformed, and stored, it's ready to be deployed for use in reports.
  • Versatile Data Kit is an open source framework for data engineers to create end-to-end data pipelines. It plugs in between the data infrastructure and the data applications, allowing a lot of control over the DevOps cycle. Then we look at the operations, monitoring, and deployment parts.
  • In summary, what do we mean by DevOps for data? One aspect is APIs for data. The whole pipeline from sources to insights should be automated using the best DevOps practices. This is a direction where a lot more research and work remains to be done in Versatile Data Kit.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello, my name is Antoni Ivanov, and today I'm going to talk about applying DevOps practices in data engineering and data as a service. Before we dive in, let's talk a little bit about Moneyball. Have you watched Moneyball? It is a really great movie and book about the Oakland Athletics baseball team and its general manager Billy Beane. The story of Moneyball is a real one. It starts in the early 2000s with the Oakland Athletics, a Major League Baseball team faced with budget constraints that severely limited its ability to compete for top talent. So they were at the bottom. Beane cooperated with Paul DePodesta, a Harvard graduate in economics, to apply statistical analysis to player recruitment. That means he was looking at players that most of the other scouts were missing, because he was using statistics instead of intuition to make his decisions. Not only that, but during games he prioritized on-base percentage over batting average, understanding that getting on base was often more valuable than hitting home runs. This totally changed the team. In just one year, in the 2002 season, Oakland went on to win 20 consecutive games, setting an American League record. This created a paradigm shift in all of baseball, where analytics and data are now actively used by all successful teams. This just goes to prove that good decisions, correct decisions, are generally data-driven decisions, and intuition-based decisions generally don't make it. Today we're going to go quickly over the agenda. We're going to talk about data applications and APIs for data, what SLOs and SLAs for data are, the DevOps cycle for data, and the open source product that I work on, Versatile Data Kit, and how it tries to solve some of these challenges for data engineers. So what are applications? There are many different types of applications, right? We have ecommerce applications where people want to buy things. Of course there are a lot of mobile applications, iPhone, Android, and different IT systems that are used to track different business processes, like accommodation systems. All those applications have one thing in common: every single application generates data. Huge amounts of data. Things like databases which we use to store data about customers, or log files that we use to store all the operational data, clickstreams with events about what is happening and how the customer is using the product, and metrics. Operational metrics help us maintain our services. One thing that most people don't realize is that this data is actively useful and actually being used to create data applications. Data analytics, for example: we use all kinds of usage data to better understand how our customers interact with our products in order to make them better. We use business intelligence to understand how our product behaves, as well, to make better data-driven decisions. Data science teams build machine learning models so that they can recommend to customers the features they need. Finance teams also need forecasting models, and so on. Those are all things that rely on this kind of data, let's call it data exhaust, that's produced out of those applications. A data application is also an application; it simply focuses on data, so it's structured slightly differently. So what is the usual data journey? I'm going to use, for example, building an ecommerce app. The primary function of an ecommerce app is to enable customers to buy and sell products, right?
They have a product catalog, a shopping cart, and they process transactions. And this kind of data is saved in different systems: events are usually saved in Kafka or another message queue, and the rest in databases and logs. So if I want to create data applications, I need to ingest the data. That's what we call data ingestion. All these kinds of data sources need to be ingested into some kind of analytics system so they can be aggregated and joined with each other. After the data is ingested, we need to ensure that it's accurate, consistent, and up to date. That's why we need to transform it into a format that other teams can now start to use. Our data application is, in turn, going to be used to create other data applications. So we need a good, stable model, with quality tests. And then, if we want to work on ML and train models, we need to do feature engineering, data labeling, data partitioning, and so on, to create our ML model. And this is used by all kinds of other applications. This could be BI applications used to make data-driven decisions and reports, or other data-driven products like a recommendation service. All of this, of course, relies on a lot of data infrastructure that needs to be built. And this whole thing is what we call a data application. That's the data journey. The data teams are the ones that are generally responsible for the data applications, and some kind of, let's call it, infra operations team is the one that provides infrastructure to the data teams. But how do multiple applications communicate with each other? Regardless of whether we are talking about data applications or not, an API is the standard through which any applications communicate with each other. What are the API components? First, we have the interface and contract, which includes endpoint definitions and the request and response formats, whatever they are. This sets how the API is going to be used, right? Of course we need to make sure that the API is secure, with an authentication mechanism, so only people who have the right to access the endpoints have access. It's important that the API is usable: it's easy and intuitive, it provides client libraries, and its features are easy to use and integrate into other applications. And of course it needs to be monitored and operated: logging, monitoring, tracking usage, managing traffic, and ensuring smooth operations are critical. The same thing goes for an API for data, but one of the main questions is: what is really the usual interface and contract for an API for data? Continuing with our ecommerce app example, we have two data sources, OLTP databases and an S3 service, which contain different types of product information. We need to ingest this data, to copy it into our data lake. Lately this often means databases like Snowflake, Redshift and so on, not necessarily just blob storage. And now we also want to expose a more processed, combined product view that we guarantee contains all the necessary product data. Let's call it our product data model, which in this case is built from our two data sets, but you can imagine it could get much more complex. So where is the API here? Well, an API is how two applications communicate. We have, of course, an API between the source applications and our data application, and we need an API between our data application and downstream applications like BI or other software applications and services. Let's focus on the right part. Today we don't actually have good APIs for data, neither here nor between the sources and the raw data. So how can we build those?
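To make that combined product view a bit more concrete before we get to the API itself, here is a minimal sketch of the kind of transformation involved. All table and column names are hypothetical, and the SQL is generic warehouse SQL executed through any DB-API compatible cursor; it only illustrates the shape of such a step, not a specific implementation.

```python
# Illustrative sketch only: build a combined "product" view from two ingested
# sources (an OLTP product catalog and product details copied from S3).
# All table/column names are hypothetical.
CREATE_PRODUCT_VIEW = """
CREATE OR REPLACE VIEW analytics.product AS
SELECT
    c.product_id,
    c.name,
    c.price,
    d.description,
    d.category
FROM raw.catalog_products   AS c   -- ingested from the OLTP database
JOIN raw.s3_product_details AS d   -- ingested from the S3 service
  ON c.product_id = d.product_id
"""


def build_product_view(cursor) -> None:
    """Run the transformation against any DB-API compatible warehouse cursor."""
    cursor.execute(CREATE_PRODUCT_VIEW)
```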
I'll give an example using the right part, our product data model, but the same goes for the APIs for data needed on the left part. So let's say that we want to create this kind of table, or entity, of products which contains this kind of information. What do we consider the API of this product data set? Well, there's the schema, which is very important. In this case this could be the different column names and the type of each column, coming from the database schema definitions. What else? Data semantics. Each column has certain semantics, like name being a non-null string representing the latest official product name. Notice that we need to be as specific as possible. And finally we need a way to access it. Data access could be tables in a database accessed using SQL, or we could access it through a Python API, that is, reading Parquet or Arrow data formats. So if we go back to our data journey, where do we need APIs for data? Wherever applications communicate. So, as we established, between the data sources and our main data application, and between our data application's output, be it the data model or the ML model, and the BI tools and data-driven programs that want to extract value and insights out of this data. But we can also, of course, think about the data application itself. It's a huge thing and we probably want to break it into subcomponents. So we probably need internal APIs even within the different components of our data application, like between the data that's being ingested into the raw data lake and the dimensional model, or between the dimensional model and the ML model that's being created by training on the data. And this is currently missing. There are no tools, there are no conventions for creating APIs for data. And that's one of the biggest gaps that we have in data engineering, one that doesn't generally exist in software engineering. The tricky part here is that the data sources are generally under the control of the application developers. That is, if you are developing any application, a service, a mobile app, an ecommerce app, whatever, you are outputting this kind of data into different database entries in your OLTP database, and those are going to be used in a data application. So we need to actually put them under an API, to consider them also an API. What else do we need for this API? Well, we need to have service level objectives, right? Metrics which guarantee certain promises to the API users. And that's where service level agreements also come in. Generally those come from the data semantics, to a large extent. Once we know the data semantics, we can create rules, and those rules would calculate our data accuracy SLOs. Then we have other, more standard SLOs for data, like availability SLOs: how often the data is queryable. We also, of course, have freshness SLOs, which state, for example, that the data needs to be available in the analytics system within one hour. It depends; it's important that our freshness SLO is the right one: for some use cases hours are okay, for others seconds are needed. These APIs for data and SLOs for data can also be thought of as creating a data contract. There is a lot being said about data contracts in the industry right now. I encourage everyone to go check out the blog posts by Chad Sanderson to learn much more about it. They're pretty great; he dives deeply into the topic.
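As a rough illustration of what such a data contract could look like in practice, here is a minimal sketch that encodes the three pieces just described, the schema, one semantic rule, and a freshness SLO, for the hypothetical analytics.product table. The column names, rule, and one-hour threshold are assumptions made for the example, not part of any standard.

```python
# A rough sketch of a data contract check for the hypothetical analytics.product
# table: schema, one semantic rule, and a freshness SLO. Names, rules and the
# one-hour threshold are illustrative assumptions.
from datetime import datetime, timedelta, timezone

EXPECTED_COLUMNS = {          # schema part of the contract: column -> type
    "product_id": "BIGINT",
    "name": "VARCHAR",        # semantics: latest official product name, never NULL
    "price": "DECIMAL",
    "updated_at": "TIMESTAMP",
}

FRESHNESS_SLO = timedelta(hours=1)  # data must land within one hour


def check_contract(cursor) -> list:
    """Return a list of contract violations, empty if the contract holds."""
    violations = []

    # 1. Schema check against the warehouse catalog (generic information_schema).
    cursor.execute(
        "SELECT column_name FROM information_schema.columns "
        "WHERE table_schema = 'analytics' AND table_name = 'product'"
    )
    actual_columns = {row[0] for row in cursor.fetchall()}
    for column in EXPECTED_COLUMNS:
        if column not in actual_columns:
            violations.append(f"missing column: {column}")

    # 2. Semantic rule: product name must never be NULL.
    cursor.execute("SELECT COUNT(*) FROM analytics.product WHERE name IS NULL")
    if cursor.fetchone()[0] > 0:
        violations.append("semantic rule violated: NULL product names found")

    # 3. Freshness SLO: the newest row must be recent enough
    #    (assumes the warehouse returns timezone-aware UTC timestamps).
    cursor.execute("SELECT MAX(updated_at) FROM analytics.product")
    newest = cursor.fetchone()[0]
    if newest is None or datetime.now(timezone.utc) - newest > FRESHNESS_SLO:
        violations.append("freshness SLO violated: data is older than one hour")

    return violations
```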
The third part we want to discuss is the DevOps cycle for data. For simplicity, let's flatten out the DevOps cycle. We know the standard DevOps cycle is one loop, right: plan, code, build, test, release, deploy, operate, monitor. So how do we map the data journey to the DevOps cycle? It's fairly intuitive, right? The first thing you need to do is to plan. You need to go and discover the data, explore it, find where it is, find for example where the product info is kept. Is it lying in S3? Is it kept in a database? And try to access it. Then you need to ingest it, actually code the ingestion, build the data ingestion pipelines, and probably transform it. Transformation logic is another type of code that needs to be written, usually using SQL or Python or something like that, and built into a data pipeline. Those tasks roughly map to the plan, code, and build stages. Once the data has been ingested, transformed, and stored, it's ready to be deployed for use in reports, data visualizations, machine learning models or other applications, be they data applications or normal software applications. So we have the whole DevOps flow: we need to test the data, release it, and deploy it, similarly to how we do it in the standard DevOps cycle. And of course we need to maintain it. It must be consistently managed to ensure it stays accurate, secure, and available, similar to how applications are operated and monitored in a DevOps environment. Let's call this managing the data. Now that we have covered all the aspects of how we apply DevOps to data, APIs, SLOs, and the DevOps cycle, I want to introduce Versatile Data Kit. It's an open source framework which provides a self-service environment for data engineers to create end-to-end data pipelines in a code-first, decentralized, fully automated way. Its focus is more on the DevOps part and the data journey part, but let's see. It provides really two things. One is the SDK, or a framework almost; you can think of it as something like Spring for data. It allows you to develop data jobs using SQL or Python, so you have methods to extract and ingest raw data, you can do things like parameterized transformations, and it allows us to deploy and monitor the data applications that you're building, using a control service and an operations UI. We call these data applications data jobs here. Looking at the data journey, Versatile Data Kit mostly fits the ingestion and transformation parts. It can also be used to train ML models, and generally you can use it to export data as well. Versatile Data Kit is a framework, so it allows you to write the code that you need to ingest, transform, train, and export the data, and it integrates with different databases, like Snowflake, Impala, Postgres, MySQL, or different compute engines, like Spark or Flink, to do the heavy-lifting computation. A lot of the focus has been on simplifying and hiding the complexity of data infrastructure and the general DevOps work from the data team as much as possible. So it basically lets you plug in between the data infrastructure and the data applications, giving the infra operations team a lot of control both over the DevOps cycle and over the monitoring aspects and the infrastructure, so the data teams can focus only on the data and not worry that much about the software part.
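To give a feel for what such a data job looks like, here is a small sketch of a single VDK Python job step, following the run(job_input) convention from the Versatile Data Kit documentation. The table names and payload are invented for the example, and the exact signatures are best checked against the VDK docs rather than taken from here.

```python
# Sketch of a VDK data job step (e.g. a file like 20_ingest_and_transform.py
# inside a data job directory). Table names and payload are hypothetical.
from vdk.api.job_input import IJobInput


def run(job_input: IJobInput) -> None:
    # Ingest a record into the raw layer; VDK routes it to the configured target.
    job_input.send_object_for_ingestion(
        payload={"product_id": 1, "name": "Widget", "price": 9.99},
        destination_table="raw_catalog_products",
    )

    # A transformation step against the warehouse configured for this job,
    # refreshing a summary table used by reports.
    job_input.execute_query(
        """
        INSERT INTO analytics_product_sales_summary
        SELECT product_id, SUM(amount) AS total_amount
        FROM raw_sales
        GROUP BY product_id
        """
    )
```

Locally a job like this is typically executed with the vdk run command against the job directory and later deployed for scheduled execution through the control service with the vdk deploy command.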
For example, let me show a quick demo of what I mean. Let's say that we have these kinds of databases and the data teams are developing their data application, running SQL queries, and maybe they're running such big SQL queries that they're breaking the database. That's a problem. So what do we do? We have to either block the data team or ask them to stop. It would be much nicer if, instead of finding out when those queries land in production, we found out at development time, as early as possible. And that's what you can do with Versatile Data Kit. Let's say the data team is here developing their job, running these kinds of SQL queries, counts from an employees table and so on, using all kinds of different methods. What can we do, without the data team having to change anything? Of course, they will know about it. The platform team, or operations team, or central data team could develop this kind of VDK query validation. It could be as simple as comparing the size of a query, or it could be much more complex, doing some kind of analytics on the query. It doesn't matter, because VDK allows you to intercept both the data and the metadata in plugins. You can intercept each query statement and decide what to do with it through a plugin. In this case, let's say that we reject all queries bigger than 1000 expressions. While the data user or data team is developing their job locally, they'll get the error immediately; the query wouldn't even leave their work environment. I think there's a huge benefit in being able to catch this so quickly. This type of query validation can be used to enforce data quality rules and data APIs as early as possible, at the query level, while the query is being developed and the data model is being created, and that's pretty powerful.
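Here is a rough sketch of what such a query validation plugin could look like. The hook names come from my reading of the VDK plugin framework (hookimpl markers and a db_connection_validate_operation hook), and the 1000-character limit stands in for the talk's "1000 expressions" threshold; treat it as illustrative rather than the exact demo code.

```python
# Illustrative sketch of a VDK plugin that rejects oversized queries before
# they reach the database. Hook names are based on the VDK plugin framework;
# verify against the project documentation before relying on them.
from vdk.api.plugin.hook_markers import hookimpl

MAX_QUERY_LENGTH = 1000  # stand-in threshold for the policy in the demo


class RejectLargeQueriesPlugin:
    @hookimpl
    def db_connection_validate_operation(self, operation: str, parameters) -> None:
        # Called for every SQL statement a data job tries to execute, both
        # locally during development and in the deployed environment.
        if len(operation) > MAX_QUERY_LENGTH:
            raise ValueError(
                f"Query rejected: {len(operation)} characters exceeds the "
                f"limit of {MAX_QUERY_LENGTH} set by the platform team."
            )


@hookimpl
def vdk_start(plugin_registry, command_line_args) -> None:
    # Register the validation hooks when the VDK SDK starts up.
    plugin_registry.load_plugin_with_hooks_impl(
        RejectLargeQueriesPlugin(), "reject-large-queries"
    )
```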
Now let's look more at the operations, monitoring and deployment part, not just the developer part. So how can we help with the deployment part? Let's go back to our DevOps cycle. What we want to do is enable platform operators or a centralized data team to have more control over how the data applications, the data jobs, are being built and even tested, and how they are being released and deployed. Because there are a lot of concerns: you have to think about versioning, about containerization, about adding metrics, metadata and so on, about creating Docker images, creating cron jobs or Kubernetes objects. All of this should be completely hidden from data engineers and data users, because you want them to focus on the core data modeling and on applying business transformations and insights to the data, and not to worry about all these software engineering and DevOps concerns; that's very important for them. So data teams focus on planning and coding the data application, and of course they need some way to monitor their data, while the, let's call it, IT team, this small DevOps team, can establish policies through the extensibility mechanisms that VDK provides. What policies? Well, let's look at a very simple example. By default, VDK installs the dependencies and packages the data team's job, ready for automatic execution in the cloud environment; in VDK's case that means Kubernetes. But let's say that we want to add some centralized system tests for all jobs that verify certain quality, or a data API contract, or send metrics. We can do this quite simply. The way plugins are built into the control service is through Docker images. Basically, we can extend the Versatile Data Kit job builder image; in it we can run system tests or remove execution privileges. And this script is run during the build and test phase. So before anything is released and deployed, after the job is built, they can run these kinds of tests and change the job scripts, this way guaranteeing a certain level of quality. If the system tests fail, the job won't be deployed; and we get security by removing unnecessary permissions. In summary, what do we mean by DevOps for data? Well, one aspect is APIs for data, right? An API for data means we need the data schema and the data semantics defined. We need some kind of validation and documentation so that people can explore the data. They need to be able to access it easily, find out where it is, explore it, and it should be well documented, the same as a normal API. We need SLOs and SLAs for data, so we need to be able to collect metrics about the data, things like freshness, things like unique values, and so on. And we need the data to follow the DevOps cycle. Basically, the whole pipeline from sources to insights should be automated using the best DevOps practices, which means the ability to plan and code the ingestion and transformation logic, the ability to deploy that logic, and to manage and operate it. Versatile Data Kit actively helps with a lot of those, especially the DevOps for data part. Through plugins it could also help, as we saw, to create and enforce APIs for data. This is a direction where there is a lot more research and a lot more work to do in Versatile Data Kit. So if you found anything interesting here and you'd like to learn more, just get in contact with us. You can do this by going to the GitHub repo and starting a discussion, contacting our Slack channel, writing an email, or contacting us on LinkedIn. There is a lot of work to do to make sure that data engineering practices are able to adapt and adopt good software engineering and DevOps practices. A lot of what was discussed here is what it would be nice for data engineering to already have, and it's still missing; that's why Versatile Data Kit was started, to try to bridge the gap. We know that we have a lot of work to do, and your input, anybody's input, will be extremely valuable. I hope you get in contact with us, and thank you again for listening to these 30 minutes of me talking. Have a nice day or night, or whichever part of the day you're in.
...

Antoni Ivanov

Staff Engineer @ VMware



