Conf42 Machine Learning 2025 - Online

- Premiere: 5 PM GMT

Data Quality and Validation in ML Pipelines


Abstract

In machine learning, data quality isn’t just a nice-to-have—it’s make or break. Bad data can silently derail your models, leading to poor predictions, wasted resources, and lost trust. In this talk, we’ll explore how to bring data validation into your ML pipelines using three powerful open-source tools: Great Expectations, Deequ, and TensorFlow Data Validation. We’ll look at how each tool helps catch issues like missing values, schema drift, and unexpected data distributions before they become bigger problems. You’ll see how they work, where they shine, and how to choose the right one for your workflow—whether you’re building batch pipelines, streaming systems, or end-to-end ML platforms. If you care about building reliable, production-ready ML systems, this session will give you the practical tools to keep your data in check.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. My name is Sunil Kumar Mudusu. I work in AI and data engineering. Today I want to talk about something really important in machine learning: making sure our data is clean and correct. We will look at three useful tools that help with this: Great Expectations, Deequ, and TensorFlow Data Validation.

This slide talks about why data quality matters in ML. Machine learning needs good data to work well. If the data is messy or wrong, the model can make bad decisions. Even small mistakes in data can cause big problems, so we cannot focus only on building the models; we also need to check our data.

Here are some common problems we often see in data: missing values, numbers saved as text, data formats changing over time, very high or strange values, and duplicate rows. All of these can hurt a model's performance.

Data validation means checking whether the data is clean and correct before we train a model. It helps us find problems early, and we can set up these checks to run automatically.

Let's look at the three tools that can help us check our data: Great Expectations, Deequ, and TensorFlow Data Validation. Each tool is good for different situations and teams.

Great Expectations is a Python tool. You write simple rules for what your data should look like. It works with Pandas, SQL, and Spark, and it produces clear, nice-looking reports. It works best for Python users handling small to medium-sized datasets, but it is not ideal for very large or real-time data.

Deequ is built by Amazon and runs on Apache Spark. It is good for checking big data and data that changes a lot, and it can track your data quality over time. It works best for AWS users, big data, or teams using Spark, but it is not ideal for small projects or beginners.

This is my favorite tool: TensorFlow Data Validation. TensorFlow Data Validation is part of TensorFlow. It checks your data before training a model, it handles large datasets well, and it can find problems automatically. It works best for TensorFlow users, but not for people using other ML tools.

This slide compares the tools. Great Expectations is easy to use and works well with Python. Deequ is best for big data or AWS Spark setups, and TensorFlow Data Validation is best for TensorFlow or production pipelines.

So, choosing the right tool: use Great Expectations for flexible rules and clear reports. Use Deequ if your data is big or always changing. And if you are using TensorFlow, then you should definitely use TensorFlow Data Validation; it is a perfect fit.

Here are the key takeaways. Let's remember: good data is just as important as a good model, and bad data can silently hurt your results. We should use tools to check data automatically. Great Expectations, Deequ, and TensorFlow Data Validation all help in different ways.

Here is the conclusion: good data equals a better model. Pick the tool that works best for your team and your setup, and make sure to check your data early and often.

Thank you for this wonderful opportunity. If you have any questions or would like to discuss these tools in more detail, feel free to connect with me on LinkedIn. Thank you again. Have a nice day.
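
To make the Great Expectations part of the talk concrete, here is a minimal sketch of rule-based checks on a pandas DataFrame. It uses the legacy pandas-style Great Expectations API (the exact API differs between versions), and the file name and column names are placeholders, not from the talk.

```python
import great_expectations as ge
import pandas as pd

# Load a dataset (placeholder file name) and wrap it as a Great Expectations dataset
df = ge.from_pandas(pd.read_csv("orders.csv"))

# Declare simple rules ("expectations") for what the data should look like
df.expect_column_values_to_not_be_null("customer_id")
df.expect_column_values_to_be_between("order_amount", min_value=0, max_value=10_000)

# Run all expectations at once and inspect the overall result
results = df.validate()
print(results.success)
```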
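For Deequ, the same idea runs on Spark. Below is a rough sketch using the PyDeequ Python bindings; the Spark session configuration, the S3 path, and the column names are assumptions made only for illustration.

```python
from pyspark.sql import SparkSession
import pydeequ
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

# Start a Spark session with the Deequ jar on the classpath
spark = (SparkSession.builder
         .config("spark.jars.packages", pydeequ.deequ_maven_coord)
         .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
         .getOrCreate())

# Load a large dataset (placeholder path)
df = spark.read.parquet("s3://my-bucket/orders/")

# Define data quality checks: completeness, value range, uniqueness
check = (Check(spark, CheckLevel.Error, "order data checks")
         .isComplete("customer_id")
         .isNonNegative("order_amount")
         .isUnique("order_id"))

# Run the checks on the DataFrame and show the per-check results
result = VerificationSuite(spark).onData(df).addCheck(check).run()
VerificationResult.checkResultsAsDataFrame(spark, result).show()
```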
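And for TensorFlow Data Validation, here is a small sketch of the statistics, schema, and anomaly-detection workflow mentioned in the talk; the CSV file names are placeholders.

```python
import tensorflow_data_validation as tfdv

# Compute summary statistics over the training data and infer a schema from them
train_stats = tfdv.generate_statistics_from_csv("train.csv")
schema = tfdv.infer_schema(train_stats)

# Compute statistics over a new batch of data and compare them against the schema
new_stats = tfdv.generate_statistics_from_csv("new_batch.csv")
anomalies = tfdv.validate_statistics(statistics=new_stats, schema=schema)

# In a notebook, this lists missing columns, type mismatches, and other drift
tfdv.display_anomalies(anomalies)
```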

Sunil Kumar Mudusu

Lead AI Engineer/Data Engineer @ Church Mutual Insurance Company

Sunil Kumar Mudusu's LinkedIn account


