Conf42 DevOps 2023 - Online

DataOps as a Service

Abstract

DevOps revolutionized software engineering by adopting agile, lean practices and fostering collaboration. The same need exists in data engineering. In this talk, Antoni will go over how to adopt the best DevOps practices in the space of data engineering, and the challenges in adopting them considering the different skill sets and needs of data engineers. - What is the API for data? - What types of SLOs and SLAs do data engineers need to track? - How do we adapt and automate the DevOps cycle - plan, code, build, test, release, deploy, operate, and monitor - for data? Those are challenging questions, and the data engineering space does not have a good answer yet. Antoni will demonstrate how a new open-source project, Versatile Data Kit, answers those questions and helps introduce DevOps practices in data engineering.

Summary

  • DevOps revolutionized software engineering by adopting agile, lean practices and fostering collaboration. The same needs to happen in data engineering: we can learn from DevOps, and adapt and adopt it for data to make DataOps.
  • DataOps promises to ensure the most efficient creation of business value from data. If we're going to talk about DataOps, we should mention DevOps. Its goal is to ensure the most efficient software development from an idea to a reality to a software product. We will show how we do that today.
  • There is often no clear separation or clarity between the teams' responsibilities. We need to start treating data as a product, and not just as a side effect along the way from the source to the end. Versatile Data Kit is a framework aimed to help both the data teams and the infrastructure teams. It aims to abstract the DevOps journey and the data journey.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Let's talk DataOps. DevOps revolutionized software engineering by adopting agile, lean practices and fostering collaboration. And we know the same needs to happen in data engineering. And today that's what I'd like to talk about: what we can learn from DevOps, and how we can adapt and adopt it for data and make DataOps.
Now, data is in our everyday lives, everything around us. Music, movies, healthcare, shopping, travel, school, university, everything relies on it. And there is no exaggeration in saying that every company needs to be a data company. And those that are not, they are not very successful, not for long. And it's not a secret that the efficiency of using data is still pretty bad. We are nowhere near as efficient at creating data products as we are at creating software products. There are way too many examples of failed data projects. Gartner and other similar statistics come out regularly showing that easily 60, 70, 80% of data and AI projects fail to reach production. That's pretty bad.
DataOps promises to fix that. DataOps promises to ensure the most efficient creation of business value from data. That's probably the only thing that people who study DataOps can agree on: its goal. But there is a wide variety of opinions about how to achieve that, what to do, and what it even means, and there is still no converging view on what the solution should be. There are some common themes, though, and one of them is that we can learn from the successes in DevOps and try to adapt those for data, because data is not quite the same. And we will show how we do that today.
If we're going to talk about DataOps, we should mention DevOps. It has a very similar goal. At its simplest, the goal is to ensure the most efficient software development from an idea to a reality, to a software product. Again, how it's done varies, though there are now much better established best practices. Still, depending on who you ask, people have completely different ideas about DevOps. Really, there are a lot of best practices in the DevOps community, and we can borrow, apply, adopt, and most importantly adapt them for the data community.
Before that, let's look at the problem from the perspectives of the different stakeholders involved, and we'll group them into two categories for simplification purposes. On one hand, we will introduce the first hero of our story: the infrastructure and operations team. When I talk about infrastructure, I'm talking about the people who understand how to provision containers and virtual machines, how to set up firewalls and networks, how to provision a Spark or Kafka cluster. They understand the performance implications of that infrastructure, for example that it's better to use small messages with Kafka and big files with HDFS. And the operations people are those who set the best operations and DevOps practices: how to build continuous integration and continuous delivery, how to ensure code is versioned and traceable. Their goal is ultimately to make sure everything works as intended. They need to optimize reliability and availability.
The other hero of our story would be the data practitioners, the people who actually create the end products from data. Those could be data engineers, data scientists, data analysts, analytics engineers, ML engineers. There are a lot of titles.
They have the domain and business knowledge and are responsible for answering analysis requests from different stakeholders, marketing, executives, so that the company can make correct, quick decisions or can create compelling products. They tend to have more domain knowledge: they understand how to build data projects, how to join different data sets and tables together, how to report numbers, and how to create predictive models and recommendation systems. And their focus is on optimizing agility. Nowadays businesses need to move at very high speed, and if the data does not catch up, then the business will be forced not to use data, and will probably fail.
In some ways their goals are fairly conflicting, as they tend to be between product and operations teams, because their priorities conflict. And this is fairly similar to what we observed in development and operations before DevOps, 20 to 30 years ago. Here the data person wants to optimize the time to value of data, while operations would like to optimize the availability of that data. And how do we solve that? There is often no clear separation or clarity between the teams' responsibilities: the operations team has to debug data engineering work, and data engineers need to provision infrastructure.
Well, let's see how we can adopt and adapt the DevOps lessons. One of the particular lessons we need to learn from DevOps is that we need to start treating data as a product, and not just as a side effect along the way from the source to the end, be it a report or another product.
Versatile Data Kit is a framework aimed to help both the data teams and the infrastructure teams, to make sure that everyone knows what needs to be done and everyone is responsible for their own part. It enables easy contribution to create new data projects and separates ownership. It does this by introducing two high-level concepts: automating and abstracting the data journey, and automating and abstracting the DevOps, or DataOps, cycle.
Automating and abstracting the data journey is primarily the responsibility of the Versatile Data Kit SDK, a library for automating data extraction, transformation and loading, with a very versatile plugin framework that allows users to extend it according to their specific requirements. So the infrastructure people who know best, for example, that you cannot send big messages into Kafka, can create very simple plugins that automatically chunk the data before it is even sent.
And the Control Service, which abstracts the DevOps cycle, allows users to create, deploy and manage those data jobs in an automated way in a runtime environment, and allows automatic versioning and deployment. At the same time, it allows the DevOps people in the company, who best know how to build CI/CD, to extend it using their own knowledge and the best practices they want to apply in their own organization.
Well, let's look at an example of automating and abstracting the DevOps journey and the data journey. Here we can see how an infrastructure team can, for example, intercept through plugins every single SQL query being sent to a database, before it even reaches the database, including when the job is being run locally during development and debugging, and apply some kind of optimization or layering. In this concrete example, in the picture, there is a plugin that collects lineage information to enable easier troubleshooting and inspection of jobs, so that one can see where the data comes from. But this is just an example really.
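To make that plugin idea a bit more concrete, here is a minimal sketch of what such a query-intercepting plugin could look like in Python. This is an approximation only: the hookimpl import path, the db_connection_decorate_operation hook, the decoration cursor methods and the vdk_start registration hook are assumptions based on the plugin framework described in the talk, and the exact names should be checked against the Versatile Data Kit documentation.

# query_interceptor_plugin.py - illustrative sketch only; the hook and
# registry calls below are assumptions, not the confirmed VDK plugin API.
import logging

from vdk.api.plugin.hook_markers import hookimpl  # assumed import path

log = logging.getLogger(__name__)


class QueryInterceptorPlugin:
    @hookimpl
    def db_connection_decorate_operation(self, decoration_cursor):
        # Assumed hook: called before each SQL statement a data job sends
        # to the database, including during local development runs.
        query = decoration_cursor.get_managed_operation().get_operation()
        log.info("Job is about to execute: %s", query)
        # Here an infra team could record lineage, enforce naming
        # conventions, or rewrite the statement before it is sent.


@hookimpl
def vdk_start(plugin_registry, command_line_args):
    # Assumed registration hook: make the plugin apply to every data job
    # executed with this SDK installation.
    plugin_registry.load_plugin_with_hooks_impl(
        QueryInterceptorPlugin(), "query-interceptor"
    )

Packaged as a small Python package and installed alongside the SDK, a plugin like this would apply across all data jobs in the organization without any change to the jobs themselves, which is exactly the separation of ownership described above.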
The sky is the limit. The infrastructure team can create any kind of plugin, and plugins can be applied across all jobs. The data teams can also create their own plugins.
And then let's look at the DevOps cycle. Can we do something to automate how the development process moves through the DevOps cycle? What Versatile Data Kit, and the Control Service in particular, does is flatten it. It's important to provide a self-service environment for data engineers to create end-to-end data pipelines. This self-service environment automates a large part of the DevOps cycle. So as far as the data engineers or the data team are concerned, they click just one button, deploy, or run one CLI command, deploy, and building, testing, releasing and deploying can happen automatically.
At the same time, we need to enable either a central data team or, to go with our persona, the operations and infra team, to ensure consistency and correctness of those data jobs and all the compliance, quality and company policies that are in place. And since they are the people with the best knowledge of how to implement these kinds of policies, and especially DevOps best practices, correctly, there is a way for them to enforce this across all jobs.
A quick example again. The DevOps plugins are mostly pre-built Docker images that can be extended. One can, for example, extend the build and test steps by extending the default job builder image. Let's say they add some central tests to ensure quality, or we want to make sure that no job can execute arbitrary files, so we remove all execution privileges. And that's very easy: it's a pretty simple Docker image which can be configured when installing the Versatile Data Kit Control Service.
That's our intro into Versatile Data Kit. If you want to learn more, talk to us about these problems and try solving them together with us, contact us on any of our channels. The easiest one is through GitHub, the vmware/versatile-data-kit repository. Thank you.
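For context on the data jobs that this self-service flow builds, tests and deploys: a job is essentially a directory of SQL and Python step files that the SDK executes in order. A single Python step might look roughly like the sketch below; the run(job_input) entry point and the execute_query and send_object_for_ingestion calls follow the public VDK job API as I understand it, so treat the details as illustrative rather than authoritative.

# 10_ingest_and_transform.py - a minimal sketch of one step in a data job.
def run(job_input):
    # Ingest a small payload into a managed destination table.
    job_input.send_object_for_ingestion(
        payload={"user_id": 1, "event": "signup"},
        destination_table="events_raw",
    )
    # Run a transformation against the configured database. Plugins such
    # as the query interceptor sketched earlier can observe or decorate
    # this statement transparently.
    job_input.execute_query(
        "INSERT INTO events_daily "
        "SELECT event, COUNT(*) FROM events_raw GROUP BY event"
    )

Once a step like this works locally (for example with the vdk run CLI command), the one-click or one-command deployment described in the talk, vdk deploy, is what hands the job over to the Control Service so that building, testing, releasing and scheduling happen automatically.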
...

Antoni Ivanov

Staff @ VMware



