Conf42 Machine Learning 2025 - Online

- premiere 5PM GMT

Ethical Data Engineering: Addressing Privacy, Bias, and Fairness in AI Systems

Abstract

In the age of AI, data engineers hold the key to building fair, private, and transparent systems. Discover how to tackle privacy risks and algorithmic bias through cutting-edge techniques like differential privacy. Learn how ethical data practices can shape AI’s future and build societal trust.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. I'm Arun Vivek Supramanian. It's a pleasure to be here at Conf42 to talk about one of the most urgent and timely topics in tech today: ethical data engineering. As AI systems increasingly influence our daily lives, the decisions we make as data engineers have deep societal impacts. My talk today will explore how we can design and operate data systems that respect privacy, eliminate bias, and uphold fairness. We are going to explore real-world case studies and dive into practical techniques, with the goal of leaving you with tools and insights you can bring back to your teams. Think of this as a blueprint for ethical, scalable data systems that actually work in practice.

We are living in an era of exponential data growth: by 2025 we expect to hit roughly 175 zettabytes globally, a fivefold increase since 2018. But more data doesn't just mean better insights; it also means greater risk. Take, for example, Facebook's Cambridge Analytica scandal. Millions of users' personal data was harvested and used to influence elections. That wasn't just a model failure. It was a data engineering failure, a breakdown in governance, consent, and ethics. As engineers, our responsibilities are expanding. It's no longer acceptable to say "I just build the pipeline." We are also responsible for understanding the societal implications of the systems we create.

Nearly half of engineering teams working with generative AI report privacy issues, specifically that models can output sensitive or identifying information. This isn't just a technical problem, it's an ethical and legal one. For instance, a chatbot trained on internal company data once leaked social security numbers when prompted cleverly. We can't rely on black-box models to do the right thing. We must enforce privacy at the data level. This is where techniques like differential privacy, federated learning, and careful data labeling come in.

Bias in AI is not limited to model design. It can creep in at every stage of the data pipeline, from collection to pre-processing to labeling. Let me give you a powerful example. A recruiting algorithm trained on resumes ended up penalizing applications that included the word "women's," as in "women's chess club captain." Why? Because the training data reflected past hiring patterns that favored men. This shows how deeply bias can be baked into data. We must actively design processes that detect and correct such issues before they reach production.

Let's look at two main strategies for privacy-preserving techniques. The first is differential privacy, which offers mathematical guarantees by adding noise to outputs. For example, Apple collects usage statistics from your phone using epsilon values between two and four; this adds enough noise to mask individual data while preserving aggregate trends. Meanwhile, Google's RAPPOR system uses values as low as 0.3 to protect search and usage data. These are real-world, scalable systems. The second is data anonymization: techniques like k-anonymity ensure that any individual's record is indistinguishable from at least k-1 others. Combining these approaches, anonymization plus differential privacy, offers much stronger privacy protection. Both ideas are sketched in code just below.

Now, how do we address algorithmic bias and fight it systematically? First, treat fairness checks as an ongoing process; one-time audits don't work. Second, apply intersectional analysis.
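To make the two privacy techniques above concrete, here is a minimal Python sketch, not Apple's or Google's production system: a Laplace-mechanism count release and a k-anonymity check. The epsilon value, column names, and toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def laplace_count(true_count: int, epsilon: float, rng=None) -> float:
    """Release a count with epsilon-differential privacy via the Laplace mechanism.

    For a counting query the L1 sensitivity is 1, so the noise scale is 1/epsilon.
    Smaller epsilon means more noise and stronger privacy.
    """
    rng = rng or np.random.default_rng()
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

def satisfies_k_anonymity(df: pd.DataFrame, quasi_identifiers: list[str], k: int) -> bool:
    """True if every combination of quasi-identifier values occurs at least k times,
    i.e. each record is indistinguishable from at least k-1 others."""
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())

# Hypothetical usage: a daily-active-users count released with epsilon = 2,
# and a table generalized to (age_band, zip3) checked for 2-anonymity.
noisy_dau = laplace_count(true_count=10_482, epsilon=2.0)
users = pd.DataFrame({
    "age_band": ["30-39", "30-39", "30-39", "40-49", "40-49"],
    "zip3":     ["941",   "941",   "941",   "100",   "100"],
})
print(round(noisy_dau), satisfies_k_anonymity(users, ["age_band", "zip3"], k=2))
```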
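And as a sketch of the intersectional analysis just mentioned, the snippet below compares a disparate-impact ratio computed per single attribute with the same ratio over the race-and-gender intersection. The data and the four-fifths threshold are illustrative assumptions, not figures from the talk.

```python
import pandas as pd

def selection_rates(df: pd.DataFrame, outcome: str, groups: list[str]) -> pd.Series:
    """Positive-outcome rate per subgroup (single attribute or intersection)."""
    return df.groupby(groups)[outcome].mean()

def disparate_impact(rates: pd.Series) -> float:
    """Ratio of the lowest to the highest subgroup rate; the 'four-fifths rule'
    commonly flags values below 0.8."""
    return rates.min() / rates.max()

# Build a hypothetical credit-approval table: approvals out of 4 per subgroup.
rows = []
def add(race, gender, approved, total):
    for i in range(total):
        rows.append({"race": race, "gender": gender, "approved": 1 if i < approved else 0})

add("white", "m", 2, 4)
add("white", "f", 3, 4)
add("black", "m", 3, 4)
add("black", "f", 1, 4)
df = pd.DataFrame(rows)

# Fairness by each attribute alone can look borderline-acceptable (~0.8)...
print(disparate_impact(selection_rates(df, "approved", ["gender"])))
print(disparate_impact(selection_rates(df, "approved", ["race"])))
# ...while the race x gender intersection reveals a much larger gap (~0.33).
print(disparate_impact(selection_rates(df, "approved", ["race", "gender"])))
```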
Consider a concrete case: a credit scoring model might appear fair by gender or race independently, but still disadvantage Black women due to the intersection of race and gender. Only 29% of organizations currently test for this. Third, use automated tools to flag imbalances in feature correlations and model outputs; in healthcare, generic fairness metrics missed clinically relevant biases in 47% of applications. Domain-specific understanding is key.

Now let's look at a case study from e-commerce. A recommendation engine was performing worse for female users: on average, women received suggestions that were 18% less aligned with their preferences than those shown to men. After digging in, the team realized the training data was skewed toward male purchase behavior. They corrected this using stratified sampling, fairness-aware feature engineering, and model retraining (the resampling step is sketched in code at the end of this section). The disparity was reduced significantly, and customer satisfaction scores improved across the board. This illustrates how even seemingly neutral systems like recommendation engines can reflect and reinforce bias if we aren't careful.

Let's go through a case study from healthcare, arguably the most sensitive domain for data ethics. A predictive analytics model designed to forecast hospital readmissions was underperforming for non-white patients. The root cause: under-representation in the training data. The team redesigned their data collection pipeline and implemented targeted outreach and stratified sampling; non-white representation rose from 24.7% to 49%. They then applied federated learning to protect patients' privacy across five institutions (a minimal sketch of the averaging step also appears below), and used synthetic data generated by GANs to supplement rare subgroups. These changes cut the performance disparity by 68%. That's not just fairer, it's clinically better.

Now let's talk about ethical gates and how we implement them. Ethics isn't a separate step; it needs to be baked into every stage of the pipeline. Think of it as four gates, sketched in code at the end of this section. Collection gate: are we collecting data ethically, with consent and appropriate representation? Processing gate: are our transformations introducing bias? Output gate: are the outputs privacy-safe and fair? Monitoring gate: are we continuously evaluating deployed systems for emerging risks? This ethics-by-design model mirrors privacy by design and helps prevent ethical failures before they occur.

Now, documentation and transparency. These are two important pieces of bridging ethics and accountability. When we log data lineage from the original source through every transformation, we create a transparent system. But we must go beyond technical details: what trade-offs were made, what biases were discovered, what limitations remain? For example, if your dataset excludes users under 18 due to consent rules, note that; it's a limitation that could affect generalizability. Tailor documentation to your audience: technical teams want reproducibility, regulators want compliance, end users want clarity.

And let's talk about interdisciplinary collaboration. Ethical systems are built by diverse teams. Engineers bring technical rigor, ethicists translate principles into policies, and domain experts provide contextual nuance. For instance, in an education app, collaboration with teachers revealed that an AI tutor was reinforcing gender stereotypes in math problems. This wouldn't have been caught by engineers alone. End users, especially from vulnerable communities, must also have a seat at the table; they can flag issues others miss because they live with the consequences.
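Returning to the rebalancing step in the e-commerce case above, here is a minimal sketch of stratified resampling. The column name, target shares, and data are hypothetical, and under-represented groups are simply upsampled with replacement.

```python
import pandas as pd

def stratified_resample(df: pd.DataFrame, group_col: str, target_shares: dict,
                        n: int, seed: int = 42) -> pd.DataFrame:
    """Resample so each group makes up its target share of the training set."""
    parts = []
    for group, share in target_shares.items():
        pool = df[df[group_col] == group]
        size = int(round(share * n))
        # Sample with replacement only when the group has fewer rows than needed.
        parts.append(pool.sample(n=size, replace=len(pool) < size, random_state=seed))
    return pd.concat(parts).sample(frac=1.0, random_state=seed).reset_index(drop=True)

# Hypothetical purchase log skewed 80/20 toward male users, rebalanced to 50/50.
purchases = pd.DataFrame({"user_gender": ["m"] * 800 + ["f"] * 200,
                          "item_id": range(1000)})
balanced = stratified_resample(purchases, "user_gender", {"m": 0.5, "f": 0.5}, n=1000)
print(balanced["user_gender"].value_counts(normalize=True))
```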
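The federated learning piece of the healthcare workflow can be reduced to a very small core, sketched here with NumPy. The site weights and patient counts are invented, and a real deployment would add secure aggregation, differential privacy, and an actual local training loop.

```python
import numpy as np

def federated_average(local_weights: list[np.ndarray], sample_counts: list[int]) -> np.ndarray:
    """FedAvg-style aggregation: combine locally trained model weights, weighting each
    site by its number of training examples, so raw patient records never leave the site."""
    total = sum(sample_counts)
    return sum(w * (n / total) for w, n in zip(local_weights, sample_counts))

# Hypothetical: five institutions each train locally and share only weight vectors.
rng = np.random.default_rng(0)
site_weights = [rng.normal(size=8) for _ in range(5)]   # stand-ins for trained parameters
site_counts = [1200, 950, 400, 2100, 760]               # patients per institution
global_weights = federated_average(site_weights, site_counts)
print(global_weights.round(3))
```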
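The four gates described above could be wired into a pipeline as explicit, automated checkpoints. This is a rough sketch under assumed checks and thresholds (a consent flag, a 10% minimum group representation, no raw identifier columns), not a prescribed implementation; only the collection and processing gates are shown, and the output and monitoring gates would follow the same pattern.

```python
from dataclasses import dataclass, field
from typing import Callable
import pandas as pd

@dataclass
class EthicalGate:
    """A named checkpoint that must pass before the pipeline stage may continue."""
    name: str
    checks: list[Callable[[pd.DataFrame], bool]] = field(default_factory=list)

    def run(self, df: pd.DataFrame) -> None:
        for check in self.checks:
            if not check(df):
                raise RuntimeError(f"{self.name} gate failed: {check.__name__}")

# Illustrative checks; real ones would come from your governance policy.
def has_consent_flag(df):
    return bool(df["consent"].all())

def min_group_representation(df):
    return bool((df["group"].value_counts(normalize=True) >= 0.10).all())

def no_raw_identifiers(df):
    return "ssn" not in df.columns

collection_gate = EthicalGate("collection", [has_consent_flag, min_group_representation])
processing_gate = EthicalGate("processing", [no_raw_identifiers])

df = pd.DataFrame({"consent": [True, True, True, True],
                   "group": ["a", "a", "b", "b"]})
collection_gate.run(df)   # raises if consent or representation checks fail
processing_gate.run(df)   # raises if identifiers slipped through transformation
print("gates passed")
```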
Now, how do we automate these ethical checks? Manual checks are not scalable, so we need automated systems that assess beneficence (does the system produce a net positive impact?), non-maleficence (does it avoid harm?), autonomy (are users in control?), justice (are outcomes fair across groups?), and explicability (can we explain decisions?). Tools like IBM's AI Fairness 360 and Microsoft's Fairlearn offer promising capabilities, but they're not silver bullets. We must ensure they are aligned with real-world values and validated in production settings, not just test beds.

Now let's talk about regulations. Regulations like GDPR, HIPAA, and the EU AI Act are forcing companies to treat ethics as a compliance issue. But the best companies go further. They embed governance by design and build pipelines that are flexible enough to adapt to future rules. They maintain audit trails that demonstrate what decisions were made and why. And most importantly, they foster a culture where ethics is part of engineering, not an afterthought imposed by legal teams.

To wrap up: ethical data engineering is about more than avoiding risk. It's about trust. It's about designing systems that are not just technically sound, but socially responsible. With privacy-preserving tools, fairness frameworks, rigorous documentation, and interdisciplinary collaboration, we can build data infrastructure that earns and deserves the trust of users and regulators alike. Thank you for your time. I look forward to your questions and conversations.

Arun Vivek Supramanian

Senior Data Engineer @ Amazon


