Conf42 DevOps 2023 - Online

DevOps best practices for DataOps Mesh

Abstract

Can a Data Engineer be part of the domain team and implement the entire data lifecycle while fully owning the infrastructure? Passion for automation, collaboration, and elimination of dependencies between teams inspired Agita to dig into the DataOps-Mesh combination and explore the implications of implementing it. She will explain each of these paradigms from a DevOps perspective, talk about best practices and explore how realistic it is to combine them. After explaining DevOps best practices, DataOps best practices, and the Data Mesh paradigm, Agita will share her vision of how data teams can take advantage of them and eliminate dependencies by sharing lessons learned from her personal experience working in various DevOps and Data teams remotely and on-site. Agita is part of the Versatile Data Kit team and is working with the framework that automates the full DataOps lifecycle. She will briefly introduce the framework as part of this talk.

Summary

  • Agita talks about DevOps best practices and how they influence DataOps mesh, that is, the DataOps and data mesh concepts. She provides an overview of each of these paradigms and explores how realistic it is to combine them, illustrated with some hand-drawn graphics she made for the slides.
  • DevOps best practices are mainly focused on automation and collaboration. Agita combines everything she covers into a DataOps mesh concept, then talks briefly about open source projects and invites a small contribution.
  • After three months on her first project, Agita was invited to another project that was essentially infrastructure automation and orchestration with Chef. What struck her most in that project was its strong focus on collaboration. She still believes the same level of collaboration between people is possible if the teams are really connected.
  • DataOps is the data engineering equivalent of DevOps best practices. The idea is to speed up data deployment while improving quality. The DataOps lifecycle is similar to the DevOps lifecycle, but a little different.
  • Data mesh was introduced by Zhamak Dehghani in 2019. It is a type of data platform architecture that leverages domain-oriented, self-serve design. The result should make data-driven projects as fast as possible, with no dependencies between teams. Two things could go wrong or prevent a DataOps mesh from happening.
  • Open source projects like Versatile Data Kit deserve the visibility that is crucial for open source tools to become more widely known and used, and to gain potential contributors. A little support or contribution, like giving a star, can go a long way.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
My name is Agita, and today I will talk about DevOps best practices and how they influence the DataOps mesh, or rather the DataOps and data mesh concepts. I will provide an overview of each of these paradigms and explore how realistic it is to combine them. The talk is very beginner friendly. If you are a DevOps engineer curious about data best practices or DataOps, this talk is for you, and if you are a data engineer curious about DevOps best practices, it can be useful for you as well. A little bit about me: at the moment I work at VMware as part of the Versatile Data Kit team. It's an open source project that automates the DataOps process, which is why I'm keen to talk about it. I'm also going to do a little bit of promotion of Versatile Data Kit, because I'm on a mission to create a community around this project, and I believe it's relevant to the topic and solves some of the problems I will address. My personal connection to the topic is that, although I'm a community manager now, I have a DevOps background. I worked as a DevOps engineer for around five years before transitioning to community management, with a three-year burnout period in between when I didn't work. So I have some context and a personal interest in these concepts. As an ex-DevOps engineer, I'm excited to use my experience to envision how data engineering teams can use their full potential by following DevOps best practices. Learning about DataOps and data mesh as part of the project I'm working on now made me excited about this combination. I'm a bit of an efficiency freak, and I believe it can be a very efficient combination. At the same time, I believe there are challenges that might arise, and I'm going to address those as well. One final thing: I want to say thanks to open source in general, to the existence of open source projects, because after the burnout I mentioned, I realized that I wanted to make the world a little bit better, and for me, coming back to the DevOps or DataOps world was only possible by working on an open source project. You will also see some graphics that I made in these slides. I just discovered that I have a little pen and that my touch screen works with it, so I created several graphics here. They are not professional at all, this is my first time drawing, but I did it for my own pleasure and fun. So let's start. My agenda for today: I will start with an introduction to DevOps best practices, then I will explain DataOps and how the two are correlated. Then I will give a very brief introduction to Versatile Data Kit, because I'm going to explain the DataOps lifecycle with that tool or framework as an example. Then I will explain what data mesh is. After that I'm going to combine everything I just talked about into this DataOps mesh concept, which I basically came up with myself, although I have seen that other people are also talking and writing about this combination, a powerful combination in my opinion.
Then I will go over the problems that I believe might come up for people trying to implement or combine both of these concepts, together with some solutions I came up with while researching this talk. Finally, I will talk a little bit about open source projects and invite you to make a small contribution. For me, DevOps best practices are mainly focused on two things I want to highlight: first automation, and second collaboration. I experienced both of these personally while working on two projects. My first DevOps project was a small one of seven people, where I worked for three months. The second project was quite large, I think several thousand people were on it when I joined as a DevOps engineer, and I stayed there for four years. To explain what I was doing, I will talk about the first project in the context of automation and the second in the context of collaboration. When I joined the first project, this is what happened on my first day: I arrived as a junior DevOps engineer and was handed maybe seven to ten pages of printed A4 paper with instructions that I had to follow every day. My task was to support the migration of the code, the tool the developers were working on, to a newer Spring version. To support this migration I had to change several lines in some files, copy some folders, rename some things, do testing on Linux, do testing on Windows, and so on; deployment tasks that had been done completely manually before I joined. In my first week I managed to go through the list of instructions once, and by the end of the day I was really happy. I turned to my lead and said, hey, hooray, I finished this, I went through the instructions, it's done. And my lead turned back to me and said, you know, today we actually have to repeat this four more times. It was around 6:00 p.m., the end of the working day, but I decided to prove myself, stay at work, and follow the instructions four more times. I think it took me four hours to do the second, third and fourth runs. That frustrated me enough, so to say, to promise myself I would never repeat this. So the next day, when I arrived at the office, I just started writing code. I began with the simpler things, like copying files and renaming folders, and then kept going, automating more and more, including changing the lines in the code. By the end of those three months on the project I had automated absolutely everything: I wrote code for every step of the process written on those pieces of paper. What I learned, and to me it was quite obvious, is that if something can be written down on paper, it can definitely be automated. The team was really surprised, I think for them it wasn't that obvious, but they were very happy to see the whole process automated; by the end of those three months the deployment could run completely automatically, I think around 280 times in two hours or something. So literally I had nothing left to do there, and even though the team wanted to keep me around, there were simply no tasks to give me. This is how I believe DevOps supports automation, and it also ties into the second point on my slide, which is faster, better, cheaper: there is no human error, everything happens much faster than any manual work, and of course it's cheaper to have automation doing things instead of people. At the time I was using CI/CD tools, writing Jenkins jobs and building pipelines that automated this process, basically following the DevOps lifecycle. Orchestration was not part of my job in that project.
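As a purely illustrative aside, a release checklist like the one described here (copy a folder, rename it, change a few lines in a file) maps almost one-to-one onto a small script. The paths, file names and version strings below are hypothetical; this is only a minimal sketch of the "if it can be written on paper, it can be automated" idea, not the actual Jenkins jobs from that project.

    # release_steps.py - hypothetical sketch of scripting a manual release checklist
    import shutil
    from pathlib import Path

    OLD_VERSION = "1.2.3"   # made-up version strings, purely for illustration
    NEW_VERSION = "1.3.0"


    def copy_and_rename(src: Path, dst: Path) -> None:
        """Checklist steps 'copy some folders' and 'rename some things'."""
        shutil.copytree(src, dst, dirs_exist_ok=True)


    def bump_version(config_file: Path) -> None:
        """Checklist step 'change several lines in some files'."""
        text = config_file.read_text()
        config_file.write_text(text.replace(OLD_VERSION, NEW_VERSION))


    if __name__ == "__main__":
        copy_and_rename(Path("build/release-old"), Path("build/release-new"))
        bump_version(Path("build/release-new/app.properties"))

A script like this is the kind of thing a CI/CD job can then run on every build, which is what turns a printed checklist into an automated deployment.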
Now on to the next project. When I finished those three months on the first project, I got invited to another project, which was basically infrastructure automation and orchestration with Chef. What struck me the most in that project was that it was really focused on collaboration. At the time I was living in Latvia and the project was on site in Germany, so every Monday I traveled to Germany and every Thursday I traveled back. At first it didn't make much sense, but with time and experience I understood that it makes a lot of sense to be in the same room with the other people working on the same thing. As I mentioned, the project was huge; there were many developers, apps and things getting built, but all the people working on it, or most of them, were in the same office. Even more than that, the people doing automation, not just DevOps people but everyone doing any type of automation, were in the same room. So I got to meet the people I was working with, and it was a great experience to learn the methodology, the mindset and this collaboration, to grow in a collaborative space where, whenever something breaks, we all get together and solve the problem on the spot, as soon as possible. It increases efficiency to the maximum when I know that if I have a problem, I can go to a particular person, or ask someone who knows someone else who can connect me directly with that person, and we talk face to face. Of course, this was in pre-COVID times and the world is a little different now, but I still believe it is possible to have the same level of collaboration between people if the teams are really connected. This is going to be really relevant when I talk about data mesh, because how you split teams, and whether you put all the relevant people in the same room or the same team, even virtually, makes a big difference. So, DataOps best practices. In a nutshell, DataOps is the data engineering equivalent of DevOps best practices. The idea is to speed up data deployment while improving quality. The DataOps lifecycle is similar to the DevOps lifecycle, and I'm going to dig deeper into it on the next slide. I want to mention that the DataOps lifecycle hasn't really been agreed upon in the community; there are several ways to see it, and I'm going to present my personal, subjective view after researching the topic.
The difference between traditional data engineering and DataOps is that DataOps engineers work with data in an automated way, building workflows, data pipelines and jobs that run automatically. This language of pipelines and jobs is itself taken from the DevOps world. Previously, one data team might have depended on another data team or on an infrastructure team because of the sequence of the data journey or because of infrastructure permissions and accesses; DataOps addresses this by decreasing dependencies, although I would not say it completely eliminates them. I also believe that the people building data pipelines should be enabled to set up the infrastructure and orchestrate the entire data journey. In the perfect scenario, data pipelines and infrastructure are built and maintained by the same people, which is not always the case. Data teams are sometimes still separated, but I believe they shouldn't be, and when I talk about DataOps here I assume that the same team does both the orchestration and the building of the pipelines. Since I'm going to walk through the DataOps lifecycle with a practical example using the Versatile Data Kit tool, I want to explain briefly what it does. Versatile Data Kit is an open source project, as I mentioned, and it is on GitHub; the code is there. It is a framework created to build, run and manage data pipelines with basic Python or SQL knowledge, on any cloud. The emphasis on the word basic is relevant for this talk, because it is going to help with one of the problems I will address a little later. Let me also explain what a data pipeline is: a series of data processing steps, scheduled and executed in a sequence, the same as in DevOps. This pipeline exists in order to ingest, load and transform the data from its source to where data analysts can work with it. In some cases the steps run in sequence; in other cases independent steps might run in parallel. Now I will go through the DataOps lifecycle to explain how it all comes together. This is the DataOps lifecycle as I see it: plan, code, orchestrate, test, deploy, execute, monitor and feedback, and then the cycle repeats. It is similar to the DevOps lifecycle, in my opinion, but a little bit different; I'm going to go through each step and you can see the differences for yourself. The most important part of the lifecycle, I believe, is to plan. As we know, failing to plan is planning to fail. This is the crucial step where the business value, users and requirements are gathered, the tools are selected, and everything that needs to be answered is answered. If the plan is solid, we can proceed to the second step, which is code. Coding in DataOps means writing the code for a pipeline to ingest and transform the data, and testing it locally. In our case, Versatile Data Kit automates this part by providing a software development kit. The code can be written in Python and SQL interchangeably, as I mentioned, and is used to create data jobs with steps that run in a specified sequence.
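To make the code step a bit more concrete, here is a minimal sketch of what a single Python step inside a data job could look like. The run(job_input) entry point and the ingestion call follow the general conventions of the Versatile Data Kit SDK as I understand them, but the step file name, the table names and the transformation itself are made up for illustration, so treat this as a sketch rather than a reference.

    # 20_transform.py - one step of a data job; steps execute in alphanumeric file order
    from vdk.api.job_input import IJobInput


    def run(job_input: IJobInput) -> None:
        # Read rows that an earlier step (for example 10_ingest.sql) loaded into a raw table.
        rows = job_input.execute_query("SELECT id, amount FROM raw_sales")

        # Apply a small transformation and send each result row to a cleaned table.
        for row_id, amount in rows:
            job_input.send_object_for_ingestion(
                payload={"id": row_id, "amount_eur": round(amount * 0.92, 2)},
                destination_table="clean_sales",
            )

Running something like vdk run my-data-job from the command line then executes all of the job's steps locally, which is the local testing loop described next.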
At this stage the database is also selected and configured, and data jobs are executed locally to test the code and make sure the data is ingested and transformed properly. A simple command like vdk run will run the whole pipeline locally and give me the output, so I can check whether the desired outcome is there. There are many other tools that automate this step; some use other languages or even take the coding part out of the equation and provide an interface where data practitioners can do everything by clicking buttons. For some people that is very helpful, and for others it is frustrating, because you cannot debug things as easily. The third step is orchestrate. This is the crucial operations part of the cycle. The step is fairly independent from the code step, but it definitely needs to be in place before the testing part. It can also be the case that an infrastructure team sets up the orchestration and the data team does the coding, but as I said previously, it's better when both are done by the same team, so that it has full ownership over the pipelines. The infrastructure is built here so that the data can go through environments like staging, where it gets tested, and then be promoted further to pre-production or production and used for analysis. Typically the code is pushed to Git and the data is ingested into staging before it gets deployed to production. At this point, scheduling and orchestration of the pipeline are configured in the configuration files. In the case of VDK, orchestration can schedule jobs, manage workflows and coordinate dependencies among tasks. VDK has a control service component that takes care of the infrastructure setup, so alongside automating the code stage with the SDK, it also automates orchestration by having the control service create the infrastructure. The next step is testing. Once a data job runs on staging, the data can be tested, and there are alerts for any user or system errors, which can be tracked end to end. Testing ensures that no existing functionality is impacted, surfaces things that perhaps haven't been considered in the workflow (the what-would-happen-if cases), and any bugs or discrepancies are fixed at this stage. Then comes the deploy stage: after validating the data and resolving issues, the job can be deployed to production. Deployments can be fully automated, so if the data gets to staging and is tested automatically, it can also be deployed to production automatically, or another process can be introduced. In the execute step, the pipeline is now running automatically based on its configuration. Automated data lifecycle processing is in place, which schedules ingestion, transformation, testing and monitoring, and reports can be generated. The VDK control service has functionality for both deployment and execution of the jobs. Monitoring is in place to catch failures as quickly as possible. Usually automation is set up to alert on the data jobs: if a data job fails, an alert is sent to a user-specified email containing the data job name and the type of error. With VDK it is also possible to detect whether it is a user-side error or a system error, such as a configuration error or a platform error, and that distinction is included in the alert message.
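For a flavour of what "configured in the configuration files" might look like, here is a hypothetical config.ini for one data job. The section and key names roughly follow VDK's conventions (team ownership, a cron schedule, and separate notification lists for user versus platform errors), but the exact keys, the schedule and the email addresses are illustrative assumptions rather than a definitive reference.

    ; config.ini - hypothetical deployment configuration for a single data job
    [owner]
    ; the team that owns the job (this becomes relevant for the data mesh part later)
    team = sales-domain-team

    [job]
    ; run the whole pipeline every day at 05:00
    schedule_cron = 0 5 * * *

    [contacts]
    ; route alerts differently depending on whether it is a user error or a platform error
    notified_on_job_failure_user_error = data-owners@example.com
    notified_on_job_failure_platform_error = platform-team@example.com

After the code is pushed to Git, a command along the lines of vdk deploy -n my-data-job -t sales-domain-team -p . -r "initial deploy" (flags shown here as an approximation) hands the job over to the control service, which then schedules and executes it as described above.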
Monitoring then provides this information to help users troubleshoot and fix their pipelines as soon as they get the alert. The final step is feedback. Here, additional requirements might arise: some data might be missing, or something might need to be improved. Once the feedback is collected, planning can be done again and the cycle repeats. Here I also sketched, with my little pen, some of the VDK components I was mentioning on the previous slide, as a visual representation, just because I realized I needed something like this to understand it better. It shows some of the components, not all of them. On one side there is the data part, and on the other the ops part, and the two are combined in the Versatile Data Kit project. Data jobs run Python and SQL, and there is a command line interface where I can run jobs locally and test them. The data jobs follow their steps in the order given by an alphanumerical prefix, so the name of a file also defines the sequence in which it runs. VDK automates ETL or ELT, that is extract, transform and load, by providing plugins and templates and generally automating these parts, so no really in-depth knowledge is necessary to do them. That covers running these jobs locally; the other, ops component is the scheduling and execution of these jobs in a Kubernetes environment using Git. We upload the code to Git and then deploy it, and from there we can set secrets if necessary and monitor the pipeline. Now I jump into data mesh. Data mesh is quite a new concept, much younger than DevOps and younger than DataOps as well, but nevertheless it's extremely popular now; I think people really consider it good practice. It was introduced by Zhamak Dehghani in 2019. It's a type of data platform architecture that leverages a domain-oriented, self-serve design to embrace the ubiquity of data in enterprises. Here the domain is a business area, meaning each business area owns its data, and data mesh fosters data ownership among data owners who are held accountable for delivering their data as products. This is where data as a product comes in. Each domain is responsible for managing its pipelines, and once the data has been served and transformed by a given domain, the domain owners can leverage it for their analytics or operational needs. As the data mesh argument goes, data architectures can be scaled most easily by being broken down into smaller, domain-oriented components. In short, data mesh means that the data owners or domain owners, the people who are directly involved with particular data, are also building and maintaining their data pipelines. Users become the owners, so the data becomes a product and is self-serviced. This alone does not completely eliminate dependencies either, because if we implement data mesh but the orchestration, testing and CI/CD are set up and managed by another team, then the domain owners will still depend on the infrastructure engineers who support the orchestration. This is the reason why I decided to combine these two paradigms.
So here is data mesh on the left, adding governance, self-service, data as a product and domain ownership to the data cycle, and the DataOps part on the other side, providing CI/CD, testing, observability and orchestration. This is not my original image; I found it in an article and then turned it into my own graphics. But this is what also inspired me to think about the combination of DataOps and data mesh, which I strongly believe enables true collaboration and powers up the data engineering process by completely eliminating dependencies. If DataOps means the engineer owns the data pipelines and is enabled to own the infrastructure and orchestration of the data cycle, and data mesh means the domain owners own the data as well as the pipelines, and in this case the infrastructure too, then I believe the result will make data-driven projects as fast as possible, with no dependencies. I believe this is already happening on the DevOps side, because full stack developers or DevOps engineers can sometimes set up the infrastructure, build the pipelines, or simply be in the same room with the developers who are directly building the product. So how does VDK support data mesh? Well, data mesh is an organizational concept; it is implemented by managing the teams and enabling them to work fully with their data. Besides automating the process, Versatile Data Kit introduces the notion of teams: when a data job is created, a prerequisite is to specify which team is going to own it. Besides the team functionality, VDK also has templates and plugins to support these teams, so that, instead of reinventing the wheel each time, teams can share the work they've done with each other and collaborate more efficiently. These are some of the functionalities that can support an efficient DataOps mesh implementation. So what can go wrong? I believe two things can go wrong or possibly prevent a DataOps mesh from happening, and the first one is skills. While researching this, I was thinking about the skills or knowledge required to execute DataOps combined with data mesh successfully. The people in each domain team should have knowledge about the domain; they need data engineering skills, like Python or SQL or whatever else, depending on the tool they use to write their pipelines; and the third thing they need to know is how to set up their infrastructure, basically DevOps skills. As I was creating these slides, my question was whether it's easy or even possible to find a person who has all of these skills, whether it's possible to train them, or whether they would even be willing to learn. But as I presented this talk and this topic to my team, it became obvious that the skills issue is largely addressed by VDK and other tools that automate the data engineering and operations process. As I said in the beginning, basic Python and SQL skills are enough to build, run and manage data pipelines with VDK, and, from my own DevOps experience, we learn every day anyway; we use new tools and new languages all the time.
So any person with understanding and a capacity for learning, I believe, could create a simple pipeline to start with, and more difficult pipelines over time, just by following the documentation; and not just create a pipeline, but also set up the infrastructure by following the documentation. This is just one of many possible ways to implement DataOps and data mesh in real life, but it is one that I see creating this possibility. The second thing I wanted to highlight as a possible issue is trust. Trust is important because domains will now have full ownership over their data and infrastructure. Where it is arguably more reliable to have the infrastructure set up separately from the domain and simply give access to the data, some companies, leads or management might not have enough trust to let their domain owners own the infrastructure as well, because of the risks that come with full power over it, like accidental deletion and other kinds of human error. I believe this can be solved by implementing some rules and functionality, but the question remains somewhat unanswered for me, and I'm curious to see how DataOps and data mesh evolve in the future. With these questions, that sums up my side of DataOps and data mesh. To close this talk, I want to thank you and say that I'm really open to hearing feedback. You can find me by my name, Agita Jaunzeme, on LinkedIn and other social media; I'm also on Twitter. Connect with me, I'm really happy to explore these concepts further and find out whether they work for people. And now I want to dedicate a little moment to open source projects. Open source projects like Versatile Data Kit and others enable us not to pay a lot of money for some functionality; we get free tools that we can use. I believe they deserve the visibility that is crucial for open source tools to become more widely known and used, and to gain potential contributors. A little support or a little contribution, like giving a star, can go a long way. I invite you, and I will be really grateful, to support me, my team and what we are building at the moment by giving us a star: just scan the QR code or google Versatile Data Kit, and in the top right corner on GitHub there is a little star. If you spend a minute doing this, it would support me greatly and help people find the project, use it, contribute, or at least try it out. That concludes my talk. Thank you so much for taking the time to be with me; I deeply appreciate having the opportunity to be at Conf42. I also want to thank the organizers; this conference is run very professionally and I feel really positive about how it is managed. As I said, I welcome feedback, so connect with me. My name is Agita, thank you so much, and see you next time. Bye.

Agita Jaunzeme

Data Engineering, Community Manager @ VMware



