Conf42 Site Reliability Engineering 2022 - Online

What Is Data Reliability Engineering, and Why Is It Crucial to Any Data Organization?


Software practitioners work to make their systems reliable. We hear teams boasting of having four or five 9s of uptime. This is not the case for data services. Data is rarely 99.999% reliable. Systems are often out of date or out of sync. Pipelines and automated jobs fail to run. And, sometimes, the data sources are simply not accurate. All of these situations are examples of Data Downtime, and they lead to misleading results and false reporting.

Data Reliability Engineering is the practice of building resilient data systems. By treating data systems as an engineering problem, we can borrow tools and practices from SRE to build better systems. Together, let's explore how this natural extension of data engineering can make our data systems stronger and more reliable. We will explore three major topics to strengthen any pipeline:

  • Data Downtime: What is data downtime? How does it affect your bottom line? And how can you minimize it?
  • Data Service Level Metrics: Metadata for your data pipeline, and how to report on pipeline transactions in ways that lead to preventative data engineering practices.
  • Data Monitoring: What to look out for, and how to distinguish system failures from data failures.
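To make the "downtime" framing for data concrete, here is a minimal freshness check in Python. This is a sketch under stated assumptions: the function name `is_data_down`, the 24-hour window, and the timestamps are illustrative, not from the talk.

```python
from datetime import datetime, timedelta, timezone

def is_data_down(last_updated: datetime, max_age: timedelta) -> bool:
    """A dataset counts as 'down' when its newest record is older than
    the freshness window agreed with its consumers."""
    return datetime.now(timezone.utc) - last_updated > max_age

# A table refreshed 30 hours ago breaches a 24-hour freshness window;
# one refreshed an hour ago does not.
stale = datetime.now(timezone.utc) - timedelta(hours=30)
fresh = datetime.now(timezone.utc) - timedelta(hours=1)
```

In practice the `last_updated` timestamp would come from pipeline metadata (a job log, a `MAX(updated_at)` query, or similar), and the check would run on a schedule alongside other monitors.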


  • Data reliability engineering is crucial to your data organization. Key topics include determining whether your data is reliable, decreasing your data downtime, utilizing data-level metrics, and creating a platform for data observability.
  • A data reliability engineer helps maintain the reliability of data pipelines. Any service that relies on data to function needs to have some kind of data reliability in place. As data is democratized across transactions, reliability in that data becomes more crucial.


This transcript was autogenerated. To make changes, submit a PR.
Hello, I'm Miriah Peterson, and I'm really excited to speak to you today at this Site Reliability Engineering session of Conf42. I am currently a member of technical staff at Telscale, and we will be talking today about data reliability engineering and why it's crucial to your data organization.

So we'll start with: what is data reliability engineering? Just as Google's published SRE practices have been implemented and re-implemented across various software engineering organizations, we can implement these same practices in data organizations, data engineering, and data services. A data reliability engineer implements these practices with the focus of decreasing downtime and improving the client experience, that is, the experience of data users and data practitioners. The main key topics are finding out whether your data is reliable, decreasing your data downtime, utilizing data-level metrics, and creating a platform for data observability.

Reliable data is a little bit different from a reliable web service. There are many aspects to reliable data: accuracy, freshness, missing data, and duplication, but also the latency at which data is available. Whether you can access data depends on the databases, the data stores, and the data gateways, and you want to maintain a first-class, four- or five-nines experience across all of these aspects of your data.

Additionally, when you're building systems for reliable data, you have to understand that systems are not perfect. That's why we're not aiming for 100%. We want to create error budgets that allow for failure, and by leveraging these budgets, we have a first-class team that understands what a data incident is and can respond quickly, so that we have a cushion for intervention.
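The error-budget idea above is simple arithmetic. As a hedged sketch (the function name and the 30-day window are illustrative assumptions), here is how much downtime a given availability target actually leaves you:

```python
def downtime_budget_minutes(target: float, window_days: int = 30) -> float:
    """Minutes of tolerable downtime in the window for an availability
    target, e.g. 0.999 for 'three nines'."""
    return window_days * 24 * 60 * (1 - target)

# Three nines over 30 days leaves roughly 43 minutes of budget;
# five nines leaves well under a minute.
```

The budget is what the team spends on incidents, maintenance, and risky changes; when it runs low, the cushion for intervention the talk describes has been used up.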
We can respond quickly and efficiently, so that customers don't notice large outages, so that we have minimized our downtime, and so that we have created an experience that says: this data is reliable, and we are willing to create that first-class experience.

Data downtime, like I said, can have various parts to it. Barr Moses, the CEO of Monte Carlo, which is a data observability platform, has long promoted this definition: it's not just your data being unavailable. If your data is erroneous, if it's only part of the picture, if it's inaccurate in any way, that is data downtime. And that's what we're trying, in an iterative way, to minimize, leveraging signals and metrics, our error budgets, and our engineering time to make our data as reliable, as close to the customer's and the practitioner's expectation, as possible.

Just like with service-level metrics, you can put metrics at a data level to capture the stability of your data pipelines, your data stores, your data gateways, or other data services. You can see whether queries on your data stores are taking too long, whether you're taking too long to train a model, or whether your data gateway has unexpected latency or is returning inappropriate errors. You can put all of this information together to create a picture and explain visually, through metrics and other metadata, whether your data is reliable, and through anomalies, create alerts that allow us to investigate. Like I said, we can investigate latency, see which are our most trafficked services and which services are experiencing more or less traffic than we expect, and watch for error messages being thrown when they shouldn't be, or servers returning 400 or 500 errors on API gateways, which may indicate a situation on your gateway, your database, or another data store that you don't expect. These four things are the golden signals from the SRE handbook. Observability is one step beyond those metrics.
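A minimal sketch of capturing golden-signal-style metrics per pipeline operation, assuming nothing beyond the standard library (the class and method names are illustrative; a real system would export these to a metrics backend):

```python
from collections import defaultdict

class PipelineMetrics:
    """Tracks three of the golden signals (traffic, latency, errors) per
    pipeline operation; saturation would come from resource monitors."""

    def __init__(self):
        self.requests = defaultdict(int)    # traffic: calls per operation
        self.latencies = defaultdict(list)  # latency samples, in seconds
        self.errors = defaultdict(int)      # error counts per operation

    def record(self, op: str, seconds: float, ok: bool = True) -> None:
        self.requests[op] += 1
        self.latencies[op].append(seconds)
        if not ok:
            self.errors[op] += 1

    def error_rate(self, op: str) -> float:
        n = self.requests[op]
        return self.errors[op] / n if n else 0.0
```

Anomaly detection on these series (a sudden latency spike on `warehouse_query`, an error rate above threshold) is what turns raw metrics into the alerts the talk describes.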
We take these metrics and create SLAs, SLIs, and SLOs to make sure that our data services are performing to what your data practitioner needs. That way, as things come up and our metrics report something unexpected through those SLOs and SLIs, we can address it: immediately, depending on the criticality of the error, or by filing it as a bug, or by handing it to a support or customer reliability engineer. Using these procedures, we can minimize downtime and decrease time to resolution depending on severity.

For example, say we have a data pipeline that takes CRM data in, trains a model on it, creates a couple of dashboards, deploys that model, and then releases those dashboards to sales staff. The model is used to make predictive leads, and the dashboards are used by sales leaders to motivate and encourage salespeople and to help with marketing. There's a variety of things that can be done with those. So we create an SLA for this CRM pipeline: neither the model nor the dashboards can be any older than 24 hours. We express that as a revenue dashboard with a minimum freshness of one day. We then translate that into an SLO, an objective for the data service or pipelines creating that dashboard. So we know we want this pipeline to extract data from the CRM and complete its transforms, training, and analysis at least once a day. And if something happens, we want it to give us an error within a reasonable time, so that we can manually intervene and our SLA is never broken. If something comes up and the automation fails, this allows us to keep that customer or practitioner expectation without compromise, with some kind of fallback plan, while maintaining the reliability of our systems. So then we have SLIs, right?
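The buffer between the internal objective and the external promise can be sketched directly. This is an illustrative example, not the talk's implementation: the 24-hour SLA matches the CRM example above, while the 20-hour SLO and the function name are assumptions.

```python
from datetime import timedelta

SLA_MAX_AGE = timedelta(hours=24)  # the promise made to dashboard users
SLO_MAX_AGE = timedelta(hours=20)  # tighter internal objective

def freshness_status(data_age: timedelta) -> str:
    """Alert for manual intervention when the internal objective (SLO)
    is missed, leaving a buffer before the user-facing SLA is broken."""
    if data_age > SLA_MAX_AGE:
        return "sla-breached"
    if data_age > SLO_MAX_AGE:
        return "alert-intervene"
    return "ok"
```

Setting the SLO tighter than the SLA is the design choice that creates the window for manual intervention the talk describes: the pager fires at 20 hours, not at the moment the promise is already broken.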
These are the indicators that alert us when we need to get data reliability engineering involved. These can be errors: something happens in the pipeline, maybe our model doesn't complete training, maybe one of the transformations doesn't happen, maybe we can't connect to the CRM, maybe we time out and are not able to write to our data store. That timeout could be on the CRM or on the training. And these SLIs, if they go above your threshold, create alerts, and they let the practitioner know that a manual intervention is needed to maintain the reliability standard we have with our data practitioners or any of the stakeholders at the other end of our observability pipelines.

So, great, data reliability makes sense. It's transposing your normal reliability practices onto data systems. Why do we need it? Great, the CRM stuff works, but what about systems that are even more critical and rely on data? One of my favorite books, Designing Data-Intensive Applications, describes the types of data projects that all need data reliability infused into them: your software data stack, your enterprise data stack, and of course your modern data stack.

Let's start with the software data stack. What is it? It is the data store for any customer-facing application. This could be the database behind a microservice, an in-memory cache, cookies stored in a browser, or any other kind of application information: any data that interfaces directly with your software service. Typically, who manages the reliability of these data stores? This usually falls within the SRE spectrum. Site reliability engineers have tools for this, and they often come from operations backgrounds.
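The failure modes just listed (a model that doesn't finish training, a CRM connection timeout) can feed alerts with a small wrapper around the pipeline steps. A minimal sketch, assuming simple callables for steps and a callable alert sink; the names and the failing step are invented for illustration:

```python
def run_with_alerting(steps, alert):
    """Run pipeline steps in order; on any failure, fire an alert naming
    the step so an engineer can intervene before the SLA is at risk."""
    for name, step in steps:
        try:
            step()
        except Exception as exc:
            alert(f"pipeline step '{name}' failed: {exc}")
            return False
    return True

def _train_model():
    # Hypothetical failing step, standing in for a real training job.
    raise TimeoutError("connection to CRM timed out")

alerts = []
ok = run_with_alerting(
    [("extract_crm", lambda: None), ("train_model", _train_model)],
    alerts.append,
)
```

In a real deployment the `alert` callable would page an on-call rotation or post to an incident channel rather than append to a list.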
They understand how to spin up databases, they have hosted databases, and it doesn't usually escalate to the point where you need specialized knowledge for this kind of reliability work.

The enterprise data stack is quite a bit different from the software data stack. First things first, it is your infrastructure or data platforms for enterprise data services. These are your large-scale distributed databases, the large data warehouses used for reporting and BI, SAP databases, Oracle databases, any kind of large data infrastructure that has grown into what large-scale enterprise services need. Who manages its reliability? Reliability here requires a special set of data skills, and it is typically handled by your DBA or a data reliability engineer: somebody who has that specialized database administration knowledge, can do sharding and query optimization, has managed many flavors of database in SQL or other custom database languages, and would be the first line of defense when a reliability issue surfaces in your metrics.

Now, what is the modern data stack? I mean, what's left? We've talked about software, and we've talked about your large-scale enterprise data. What's left is any kind of pipeline and analytics used for machine learning: the ETL transformations used when you're extracting data for data scientists, basically anything that moves data. Creating data gateways, creating a data mesh, creating custom machine learning pipelines on top of your data warehouse, where you're taking that data another step, extracting it from the warehouse and maintaining ETL pipelines that perform platform-crucial transformations. Oftentimes, information from your analytics data warehouse ends up being used in software applications.
You're seeing machine learning models being shipped to production as part of a software product maintained by back-end engineers, and they need somebody with more data expertise to step in and help maintain the reliability of those pipelines. The same goes for gateways, like I've mentioned, with something like your data mesh, where data gateways protect golden data sets but allow your data to be accessed by a variety of teams of consumers, stakeholders, or other data practitioners. And this is very different: it is not a data warehouse maintenance load, and it is not a site reliability maintenance load. Who would manage these data sets and make sure they stay reliable? This is what your data reliability engineer is trained to do. They have the skill set to understand ingress and egress, to understand ETL transformations in both streaming and batch, and to understand storage varieties from databases to large file stores, in-memory stores, and distributed stores, plus basic data governance and data analysis. They're not going to be your DBA resharding databases, but they'll have a variety of skills that prepare them for handling data in motion as opposed to data at rest. They're going to focus on automating pipelines and services, doing monitoring and observability on these data-in-motion systems, and understanding the modern architectures these systems are built on, so that as they work with data engineers, they can create the contracts and SLAs that practitioners need to maintain a quality of service.

So that is the data reliability engineer, and you can see the importance of the role. I was on a team where data reliability became an essential part of the work and of the orchestration we were doing to provide our data practitioners with the data they needed to perform daily software tasks.
It was very much fulfilling a contract: keeping your data fresh, not letting your data age out, keeping it viable and non-erroneous. And it can quickly go from something trivial to something that is very much a requirement for software services to run successfully. I think it is applicable to all data-centric services: any service that relies on data to function needs to have some kind of data reliability in place as data is democratized and used across the stack. We see this with movements like data mesh, where more and more teams need data, and they need a variety of data sets. It's not just your analytics teams; it is software teams, operations teams, and finance teams. As this data is democratized across transactions, reliability in that data becomes more crucial to every step of the process, so that nobody is getting calls because dashboards are outdated; instead, you can know in advance and have preemptive steps in place, like the SRE practices and handbook prescribe. A data reliability engineer is specialized and different from any other reliability engineer because they work with data stacks and with data engineers, and they understand the data ecosystem, which is very different from a lot of software ecosystems. There are different kinds of optimizations, choices, design, and architecture that need to be made.

I want to thank you all for coming to my talk. I hope you enjoyed it and that you take away a desire to bring some new data reliability practices into your organization. If you have any questions about data reliability, where to get started, what to do, or how to do it, please reach out to me. I'm available on Twitter and on LinkedIn, and I do weekly Twitch streams. I'm happy to talk about it, brainstorm, and go further. I want to thank Conf42 again for allowing me to speak on this topic at this Site Reliability Engineering conference, and I want to thank the sponsors and, most importantly, the attendees.
And I hope to see you guys again next time.

Miriah Peterson

Senior Data Reliability Engineer @ Weave

