Conf42 Observability 2025 - Online

- premiere 5PM GMT

Data QA Nightmares: A Journey from Broken Dashboards to Observability Maturity

Abstract

Our dashboards looked great until they started lying. In this talk, I’ll share how hidden data issues, silent pipeline failures, and schema drift broke our analytics and how we fixed it by combining observability with automated QA.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. I'm Chaitanya, and I'm here to talk about something that has haunted many data teams: broken dashboards that look fine but silently lie. In this talk, I will walk you through our journey from those chaotic moments to building a mature observability framework for data quality assurance. This isn't just about testing; it's about regaining trust in our data by building an observability-driven QA culture.

Starting with the problem statement: our dashboards looked polished, but under the hood we had hidden data issues like silent pipeline failures and schema drift, which led to misleading KPIs. By the time stakeholders flagged inconsistencies, it was often too late. This created confusion, delayed decisions, and eroded trust in analytics.

Shiny dashboards, shaky truth. We had sleek analytics dashboards that suddenly began giving incorrect insights. What you see isn't always what you get: our data looked fine on the surface but was rotten underneath, leading to bad decisions and lost trust. Then there were the hidden data issues. Undetected schema changes and silent pipeline failures introduced errors into the metrics. These issues were hidden behind the scenes with no immediately visible signs; by the time someone noticed, the damage was already done. And the last point is eroding trust. Stakeholders grew wary of the data. Once a dashboard tells a lie, every report becomes a suspect. Data team morale sank as more time went into firefighting than delivering value. In fact, industry surveys show that data downtime can consume over 30% of a data team's time and erode confidence in analytics.

Now for the real production pain points. Delayed detection and firefighting: because issues weren't caught early, we often operated in reactive mode, discovering errors only when a stakeholder raised a concern. This meant frantic late-night troubleshooting sessions, fix-on-the-fly patches, and data fire drills that eventually burned out the team. Brittle validations at scale: the few data quality rules we had would often break whenever data volume spiked or new fields were added. They weren't scalable at all, and writing exhaustive tests for thousands of data assets was impractical with a traditional approach. The result was blind spots where bad data could slip in unchecked. Silent pipeline failures: in some cases, jobs failed or only partially succeeded without loud errors. Data stopped updating or certain records didn't load, and we didn't find out until days later when someone noticed a bizarre trend. Next comes poor test coverage: our existing data checks were minimal and brittle. We might test a couple of critical tables or use basic QA scripts, but many assumptions went untested; with hundreds of tables, we simply didn't cover everything. And the last is undetected schema drift: small upstream changes like a renamed column or a changed data format broke our data silently. Pipelines kept running, but key fields got misaligned. We had no immediate alerts for schema changes, so reports quietly went wrong.

Moving on to why traditional QA failed. Data pipelines weren't treated like software: traditionally, data QA was treated as a one-time checkbox. We would test queries or ETLs during development and assume that if the pipeline ran without error, the data must be fine. Unlike application QA, there is little continuous testing once the code has moved into production, and that's a grave mistake. Incomplete testing: data is fluid and evolving, but our tests were static.
We checked a few sample outputs or verified logic on day one, but as data changed over the months, those tests didn't adapt. We lacked coverage for new edge cases and evolving data patterns. Not built for scale: maintaining data quality by manually scripting tests for every table was unsustainable. If you have a thousand-plus tables, writing even five simple checks per table means thousands of tests to code and update, which is clearly unrealistic. Traditional QA approaches buckled under the volume and complexity of modern data. QA processes in silos: our QA happened mostly before deployment, separated from monitoring. Once pipelines were live, there was a gap; we had no built-in feedback loop from production data behavior back to QA. The first indicator of trouble was often a user complaint, the dreaded Monday 8:00 AM Slack message about broken data. In practice, a lot of QA relied on manual checks, eyeballing a report or running a SQL snippet to spot-check the results. This doesn't scale or catch subtle issues, because human eyeballs can't watch thousands of metrics in real time. In summary, our old QA approach was too shallow and too slow. It failed to catch the very issues that mattered because it wasn't continuous, automated, or intelligent enough. We needed a new start.

Moving on to the pivot to observability, starting with the light bulb moment. Our turning point was realizing that we had to treat data pipelines like production software. In software engineering, you wouldn't deploy an app without monitoring and logging, so why were we running data systems in the dark? This led us to adopt observability principles for data QA. But what actually is observability? It's our ability to fully understand what's happening inside a system by looking at its outputs, be it logs, metrics, or traces. In a data context, observability means instrumenting pipelines so we can monitor data health end to end. Instead of assuming the data is fine, we actively watch for signs of trouble. In the new approach, we treat data as code: we began adding telemetry everywhere, logging each stage of the ETL, measuring row counts and processing times, and tracing data flows from source to dashboard. Now, when something goes wrong, it isn't silent; we have breadcrumbs to follow. Then come real-time alerts, where observability lets us detect failures and anomalies before users do. For example, we set up automated alerts for when a daily load didn't run, or when a metric deviated beyond its normal range. No more waiting weeks to discover a problem; the system would tell us. As one article put it, without observability you're essentially blind to data failures. Proactive QA via monitoring: this pivot was essentially blending QA with operations. Instead of QA being a one-off gate, it became an ongoing process of monitoring data quality in production. Our data quality checks now run continuously, and any violation triggers the same kind of response a site outage would in DevOps. We moved from firefighting to early detection and prevention.

Now the key principles of observability in data QA. We added pervasive visibility to instrument everything: logs, metrics, and traces. Every pipeline step now emits logs and metrics, which are quantitative measures like row counts, latencies, or error rates. We also implemented tracing to track data flow across services. These three pillars, logs, metrics, and traces, are the foundation of observability.
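To make the "instrument everything" idea concrete, here is a minimal sketch, not our production code, of what per-stage telemetry can look like using only Python's standard library. The stage name, table name, and placeholder logic are hypothetical.

```python
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("pipeline")

def load_stage(table_name: str, rows: list) -> list:
    """A single pipeline stage that emits logs and metrics around its work."""
    start = time.monotonic()
    log.info("stage=load table=%s status=started", table_name)

    # ... the actual load/transform work would happen here ...
    loaded = [r for r in rows if r.get("id") is not None]  # placeholder logic

    duration = time.monotonic() - start
    # Metrics emitted as structured log fields; in practice these would also
    # be shipped to a metrics backend.
    log.info(
        "stage=load table=%s status=finished row_count=%d dropped=%d duration_s=%.3f",
        table_name, len(loaded), len(rows) - len(loaded), duration,
    )
    return loaded

if __name__ == "__main__":
    load_stage("orders", [{"id": 1}, {"id": None}, {"id": 3}])
```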
These pillars let us ask what's happening inside our data pipelines at any given point in time. Fail fast, fail safe: we embraced a philosophy from software testing, fail fast. If the data is bad, it's better to stop the pipeline or flag it immediately than to quietly push faulty data onward. By halting on errors and rerouting to a recovery process, we prevent bad data from polluting reports. In practice, this meant designating certain tests as critical; if they fail, they stop the line and alert, preventing downstream harm. Data lineage and impact analysis: a key principle we adopted is maintaining data lineage visibility. Lineage is a map of how data flows from source to target. When a problem arises in a dataset, lineage tells us which upstream source might have caused it and which downstream reports might be affected. This is critical for rapid root cause analysis and for deciding who needs to be involved. We built lineage tracking so that no data issue remains a mystery; we can trace it across the pipeline. Data quality checks as code: we embedded automated tests directly into our pipelines using tools like dbt or Great Expectations. We define expectations, for example no nulls in critical columns, or transaction amount should be between $1 and $10,000, that run whenever data is updated. If data violates an expectation, the pipeline can fail fast rather than propagate the bad data. QA isn't a separate phase anymore; it's baked into each data transformation. Continuous monitoring of data health: we monitor key data health indicators 24/7. This includes freshness (is the data up to date?), volume (did we get all the records, or are some missing?), distribution (are values within normal ranges?), and schema changes (did someone add or remove a column?). By tracking these, we catch anomalies like sudden drops in row counts or unexpected null spikes immediately. Alerting and incident response: observability isn't just about passive monitoring, it's active alerting. We configured threshold-based and anomaly-based alerts that page the team when something goes off. For instance, if a daily sales table usually has a hundred thousand rows and today has zero, an alert fires; if an ETL job doesn't run by its scheduled time, we get notified. This way, data issues trigger a response just like application downtime would. Our goal is to be the first to know about data issues, not the last.

Moving on to the tools and techniques we used. Great Expectations is our go-to framework for data validation. Great Expectations lets us define expectations about our data, like constraints or quality rules. We implemented Great Expectations checkpoints in our pipelines; for example, after a data load we automatically verify row counts, valid ranges, and null percentages. This catches bad data early rather than at the last stage. Next is dbt, the data build tool. We leverage dbt not only for transforming data but also for its testing capabilities. dbt allows SQL tests on your models; we wrote tests to ensure dimension tables aren't missing keys, or that reference data matches expected values. dbt's integration with version control means our data tests are code-reviewed and versioned just like our transformations. It promotes the philosophy of treating data pipeline code with the same rigor as application code.
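Here is a minimal sketch of the kind of expectation checks described above (no nulls in a critical column, amounts between $1 and $10,000). The table and column names are hypothetical, and the exact Great Expectations API varies by version; this uses the older pandas-dataset style rather than a full checkpoint setup.

```python
import pandas as pd
import great_expectations as ge

# Hypothetical freshly loaded data.
raw = pd.DataFrame({
    "transaction_id": [101, 102, 103],
    "transaction_amount": [25.0, 9999.0, 150.0],
})

batch = ge.from_pandas(raw)

# "No nulls in critical columns"
result_nulls = batch.expect_column_values_to_not_be_null("transaction_id")

# "Transaction amount should be between $1 and $10,000"
result_range = batch.expect_column_values_to_be_between(
    "transaction_amount", min_value=1, max_value=10_000
)

# Fail fast: stop the pipeline if a critical expectation is violated.
# (Attribute access on the validation result; older versions return a dict.)
if not (result_nulls.success and result_range.success):
    raise ValueError("Critical data quality expectation failed; halting pipeline.")
```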
Next is OpenTelemetry. We instrumented our custom data pipeline code with OpenTelemetry for unified tracing and metrics. OpenTelemetry is an open standard for collecting telemetry data. By using it, we could export pipeline runtime metrics, like how long each stage took and how many records were processed, and distributed traces that followed a data record from ingestion to output. This was huge for debugging: if a job was slow or failed, we could trace exactly where it happened. And for monitoring and visualization, we use Prometheus and Grafana. All those metrics and logs had to go somewhere, so we set up Prometheus to scrape and store pipeline metrics like row counts, durations, and error counts, and Grafana dashboards to visualize them in real time. Our Grafana dashboards became our data pipeline control center, showing the health of each data flow at a glance: green if all is good, red for anomalies. For example, we have charts for daily record counts per table, with alert thresholds if they deviate from historical norms.

Our new process can be visualized as a pipeline with built-in quality gates and monitoring at every step. Here is the high-level architecture. Data sources and ingestion: data flows in from various sources, be it databases, APIs, or files. As it's ingested, we generate logs and metrics, for example source record counts. Staging and validation checks: in the staging areas, we run Great Expectations suites on the raw data, for example checking that all expected columns are present and values make sense. If a critical expectation fails, say a primary key column is null, we flag it immediately and hold processing for that data. Moving on to transformation and lineage tracking: as data moves through transformation jobs, be it ETL or ELT, each job is instrumented with tracing via OpenTelemetry. We propagate a trace ID so that all events for a particular pipeline run are linked. At key transformation points, we include assertion tests, via dbt tests or custom scripts, to ensure business rules hold; for example, after joining tables, we test that row counts didn't unexpectedly drop. We also log metadata about schema versions and data statistics at each step. Monitoring and alerting: all pipelines push their metrics to a monitoring system. For instance, after each load, a pipeline sends row count and execution duration metrics to Prometheus. Similarly, any warnings or errors are logged to our central log index with context, for example the pipeline name or the data asset affected. This layer acts as an observability hub, aggregating signals from all the pipelines. BI layer and dashboards: the transformed data lands in our warehouse, for example BigQuery. Even here we leverage built-in auditing features, for example using BigQuery's INFORMATION_SCHEMA to detect whether any schema changes occurred, or the number of bytes processed by each query, and we feed those stats back into our monitoring. When the BI dashboard finally queries this data, we have confidence that it has passed all the previous quality checks. The output dashboards now carry an implicit trust score: if a dashboard is live, it means all the upstream data quality tests have passed and no anomalies were detected in the pipeline. If something was wrong, either the data never made it here, or the dashboard is showing an alert annotation. We also built features to annotate reports if upstream data is stale or under investigation.
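As an illustration of the tracing just described, here is a minimal sketch of instrumenting a pipeline run with OpenTelemetry's Python SDK. The pipeline and stage names, attributes, and placeholder logic are hypothetical, and the console exporter stands in for whatever backend you actually ship traces to.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Minimal setup: print spans to the console instead of a real collector.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
tracer = trace.get_tracer("data-pipeline")

def run_pipeline() -> None:
    # One parent span per pipeline run so all stage spans share a trace ID.
    with tracer.start_as_current_span("daily_orders_pipeline") as run_span:
        with tracer.start_as_current_span("extract") as span:
            rows = [{"id": 1}, {"id": 2}]          # placeholder extract
            span.set_attribute("row_count", len(rows))
        with tracer.start_as_current_span("transform") as span:
            rows = [r for r in rows if r["id"]]    # placeholder transform
            span.set_attribute("row_count", len(rows))
        with tracer.start_as_current_span("load") as span:
            span.set_attribute("row_count", len(rows))
        run_span.set_attribute("status", "success")

if __name__ == "__main__":
    run_pipeline()
```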
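Similarly, here is a rough sketch of how a pipeline might push its row count and duration to Prometheus after each load. It assumes a Pushgateway is reachable at the hypothetical address below; scraping the same metrics from an exporter would work just as well.

```python
import time
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
row_count = Gauge("pipeline_row_count", "Rows loaded in the last run",
                  ["table"], registry=registry)
duration = Gauge("pipeline_duration_seconds", "Duration of the last run",
                 ["table"], registry=registry)

start = time.monotonic()
# ... load the table here ...
rows_loaded = 100_000  # placeholder value

row_count.labels(table="daily_sales").set(rows_loaded)
duration.labels(table="daily_sales").set(time.monotonic() - start)

# Hypothetical Pushgateway address and job name.
push_to_gateway("localhost:9091", job="etl_daily_sales", registry=registry)
```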
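And for the warehouse layer, a minimal sketch of a schema-drift check against BigQuery's INFORMATION_SCHEMA. The project, dataset, table name, and expected schema are hypothetical placeholders, and the client assumes default Google Cloud credentials are available.

```python
from google.cloud import bigquery

# Hypothetical expected schema for the 'orders' table.
EXPECTED = {"order_id": "INT64", "amount": "NUMERIC", "created_at": "TIMESTAMP"}

client = bigquery.Client()
sql = """
    SELECT column_name, data_type
    FROM `my_project.analytics`.INFORMATION_SCHEMA.COLUMNS
    WHERE table_name = 'orders'
"""
actual = {row.column_name: row.data_type for row in client.query(sql).result()}

# Alert on any added, removed, or retyped column before dashboards go stale.
if actual != EXPECTED:
    added = set(actual) - set(EXPECTED)
    removed = set(EXPECTED) - set(actual)
    print(f"Schema drift detected: added={added}, removed={removed}")
```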
Lessons learned and best practices. Trust, but verify: we learned that assuming data is correct is very dangerous. Even if a pipeline doesn't crash, the data can be wrong. Now our motto is trust but verify, always, at every step. Build verification into your process so you are never blindly trusting a perfect-looking dashboard. Observability is a team sport: getting to observability maturity wasn't just a tech implementation, it was a cultural shift. We aligned data engineers, QA analysts, and even business analysts on the idea that data quality is everyone's responsibility. Sharing observability dashboards with the wider team helped everyone rally around data trust. Start with your pain points: we didn't implement everything overnight. We targeted the worst pain points first, for example a critical dashboard that kept breaking or a frequently changing source schema, and put observability there. This yielded quick wins that built momentum. For instance, catching a major schema change in real time prevented a KPI outage, immediately proving the value of this approach. The best practice is to build your observability iteratively, focusing on the areas of highest risk. Automate and standardize: humans can't scale to checking hundreds of tables daily, but machines can. We invested in automation for testing and monitoring, plus templates and frameworks like reusable Great Expectations suites and dbt test macros that helped standardize our QA. This way, every new dataset gets a baseline quality checklist out of the box. Standardization also means clear expectations: everyone knows what passing QA means in measurable terms. Treat data incidents like software incidents: a big lesson was to respond to data problems with the same urgency and process as application outages. That means having on-call rotations, incident tracking, root cause analysis, and preventative follow-ups. For us, this led to systematic fixes; for example, after a data incident we would add a new test or improve monitoring to prevent a repeat. Over time, this proactive stance greatly reduced the firefighting. Observability is not a one-and-done project: we continuously refine our checks and alerts. We review false alarms to adjust thresholds and analyze misses to add new tests. Our data and systems evolve, so our observability must evolve too. One best practice is to hold regular data quality or reliability review meetings to assess what's working and what isn't in our observability setup. Finally, we made it a point to celebrate when our system catches issues. Every prevented bad dashboard or saved hour of troubleshooting is a win for the team and the business. This positive reinforcement kept the team motivated to maintain a high bar for data quality. It also helped in getting buy-in from leadership to continue investing in data observability. Nothing speaks louder than a prevented crisis.

Now the call to action. Assess your data QA today: take a hard look at your current data quality processes. Are you catching issues early, or are you relying on end users to spot the mistakes? Identify one nightmare scenario in your organization, be it a dashboard that broke or a critical pipeline that failed silently. That's your candidate for an observability makeover. Embed and embrace observability: if you haven't already, start instrumenting your data pipelines. Even simple steps can pay off. Enable logging if it's not on, set up basic alerts for data freshness or volume changes (a minimal sketch follows below), and write a few Great Expectations tests for a key table. Treat your data pipelines with the same care as live products. Remember, data is the new software; it deserves the same level of monitoring and QA.
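As a concrete starting point for those basic freshness and volume alerts, here is a minimal sketch using only the standard library. The thresholds, table values, and alert delivery are hypothetical; in practice the timestamp and row count would come from your warehouse or pipeline metadata, and the alert would go to Slack, PagerDuty, or similar.

```python
from datetime import datetime, timedelta, timezone

MAX_STALENESS = timedelta(hours=26)   # daily load plus some slack
MIN_EXPECTED_ROWS = 50_000            # well below the usual volume

def check_table_health(latest_load_at: datetime, row_count: int) -> list:
    """Return a list of alert messages; an empty list means the table looks healthy."""
    alerts = []
    if datetime.now(timezone.utc) - latest_load_at > MAX_STALENESS:
        alerts.append(f"Freshness alert: last load at {latest_load_at.isoformat()}")
    if row_count < MIN_EXPECTED_ROWS:
        alerts.append(f"Volume alert: only {row_count} rows loaded")
    return alerts

# Example: a load that is fresh but suspiciously small triggers a volume alert.
print(check_table_health(datetime.now(timezone.utc), row_count=0))
```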
And leverage the available tools: you don't have to build everything from scratch. Open-source tools and frameworks are at your disposal. Try out Great Expectations to kickstart your data testing. Use your existing APM or cloud monitoring tools to track pipeline jobs. Explore what your data warehouse offers; many have INFORMATION_SCHEMA tables or built-in logging you can tap into. The ecosystem is pretty rich, so pick a pain point and apply a tool to address it. Make it a team effort: encourage a culture where data quality is part of the definition of done. Involve QA engineers in pipeline design and involve data engineers in quality monitoring. Perhaps create a data observability task force or run a pilot project. Share the improvements and insights with stakeholders; when people see more reliable dashboards, they'll become advocates for this approach. Thank you for giving me this opportunity.
...

Chaitanya Reddy Krishnama

@ Moneygram Inc
