Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone.
I'm Aaron Subramanian.
It's a pleasure to be here at Conf42 to talk about one of the most
urgent and timely topics in tech today:
ethical data engineering. As AI systems increasingly influence our daily
lives, the decisions we make as data engineers have deep societal impacts.
My talk today will explore how we can design and operate data
systems that respect privacy, eliminate bias, and uphold fairness.
We are going to explore real world case studies and dive into practical
techniques with the goal of leaving you with tools and insights you
can bring back to your teams.
And think about this as a blueprint for ethical, scalable data systems
that actually work in practice.
We are living in an era of exponential data growth.
By 2025, we expect to hit 175 zettabytes globally.
That's roughly a fivefold increase since 2018.
But more data doesn't just mean better insights, it also means greater risk.
Take for example, Facebook's Cambridge Analytica scandal.
Millions of users' personal data was harvested and used to influence elections.
That wasn't just a model failure.
It was a data engineering failure, a breakdown in
governance, consent, and ethics.
As engineers, our responsibilities are expanding.
It's no longer acceptable to say I just build the pipeline.
We are also responsible for understanding the societal
implications of the systems we create.
Nearly half of engineering teams working with generative AI report
privacy issues, specifically that models can output sensitive or
personally identifying information.
This isn't just a technical problem, it's an ethical and legal one.
For instance, a chatbot trained on internal company data once
leaked social security numbers when prompted cleverly.
We can't rely on black-box models to do the right thing.
We must enforce privacy at the data level.
This is where techniques like differential privacy, federated
learning, and careful data labeling come in.
Bias in AI is not limited to model design.
It can creep in at every stage of the data pipeline, from collection
to pre-processing to labeling.
Let me give you a powerful example.
A recruiting algorithm trained on resumes ended up penalizing
applications that included the word "women's," as in "women's chess club captain."
Why?
Because the training data reflected past hiring patterns that favored men.
This shows how deeply bias can be baked into data.
We must actively design processes that detect and correct such
issues before they reach production.
Let's look at two main strategies for privacy-preserving techniques.
The first is differential privacy, which offers mathematical guarantees
by adding noise to outputs.
For example, Apple collects usage statistics from your phone using
epsilon values between two and four.
This adds enough noise to mask individual data while preserving aggregate trends.
Meanwhile, Google's RAPPOR system uses values as low as 0.3
to protect search and usage data.
These are real world scalable systems.
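To make the mechanism concrete, here is a minimal sketch of the Laplace mechanism in Python. The epsilon value and the usage-count query are assumptions for illustration, not Apple's or Google's actual implementation.

```python
import numpy as np

def dp_count(records, predicate, epsilon=2.0):
    """Differentially private count: true count plus Laplace noise.

    For a counting query the sensitivity is 1 (adding or removing one
    person changes the count by at most 1), so the noise scale is 1/epsilon.
    """
    true_count = sum(1 for r in records if predicate(r))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical example: how many users enabled a feature, with epsilon = 2
usage = [{"feature_on": True}, {"feature_on": False}, {"feature_on": True}]
print(dp_count(usage, lambda u: u["feature_on"], epsilon=2.0))
```

Lower epsilon means more noise and stronger privacy, which is why a setting of 0.3 is much stricter than epsilon values between two and four.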
On the other side, data anonymization techniques like k-anonymity
ensure that any individual's data is indistinguishable from at least
k minus one others.
Combining these approaches, anonymization plus differential privacy, offers
much stronger privacy protections.
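As a rough sketch of what a k-anonymity check looks like in practice (the quasi-identifier columns and sample data here are hypothetical):

```python
import pandas as pd

def is_k_anonymous(df, quasi_identifiers, k=5):
    """True if every combination of quasi-identifier values
    appears at least k times in the dataset."""
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())

df = pd.DataFrame({
    "age_band":  ["30-39", "30-39", "30-39", "40-49", "40-49"],
    "zip3":      ["941",   "941",   "941",   "100",   "100"],
    "diagnosis": ["A",     "B",     "A",     "C",     "C"],  # sensitive attribute
})
print(is_k_anonymous(df, ["age_band", "zip3"], k=2))  # True: each group has >= 2 rows
```

If the check fails, you generalize or suppress quasi-identifiers (coarser age bands, truncated zip codes) until every group reaches size k, then add differential privacy on top for the stronger combined guarantee.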
So now let's talk about how we're going to address algorithmic bias
and how we fight it systematically.
First, treat fairness checks as an ongoing process; a one-time audit won't work.
Second, apply intersectional analysis.
For example, a credit scoring model might appear fair by gender
or race independently, but still disadvantage Black women due to the
intersection of race and gender.
Only 29% of organizations currently test for this.
Third, use automated tools to flag imbalances in feature correlations
and model outputs.
In healthcare, general fairness metrics missed clinically relevant
biases in 47% of applications.
Domain-specific understanding is key.
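Here is a minimal sketch of an intersectional check; the column names, toy data, and 10% gap threshold are made up for illustration.

```python
import pandas as pd

def intersectional_rates(df, outcome_col, group_cols, max_gap=0.1):
    """Positive-outcome rate per intersectional subgroup, plus a flag when
    the gap between best- and worst-treated subgroups exceeds max_gap."""
    rates = df.groupby(group_cols)[outcome_col].mean().sort_values()
    gap = rates.max() - rates.min()
    return rates, gap, gap > max_gap

# Hypothetical credit-scoring decisions
df = pd.DataFrame({
    "gender":   ["F", "F", "M", "M", "F", "M", "F", "M"],
    "race":     ["B", "W", "B", "W", "B", "B", "W", "W"],
    "approved": [0,    1,   1,   1,   0,   1,   1,   1],
})
rates, gap, flagged = intersectional_rates(df, "approved", ["race", "gender"])
print(rates)          # approval rate for each race x gender cell
print(gap, flagged)   # e.g. Black women approved far less often -> flagged
```

The point is that the metric is computed on the combined cells, not on each attribute in isolation, so disparities hidden by single-attribute audits surface.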
Now let's look at an e-commerce case study.
Let's explore a recommendation engine that was performing worse for female users.
On average, women received suggestions that were 18% less aligned with
their preferences than men.
After digging in, they realized the training data was skewed
towards male purchase behavior.
They corrected this using stratified sampling, fairness-aware feature
engineering, and model retraining.
The disparity was reduced significantly, and customer satisfaction
scores improved across the board.
This illustrates how even seemingly neutral systems like
recommendation engines can reflect and reinforce bias if we aren't careful.
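A rough sketch of the stratified resampling step just described; the equal per-group target and column names are assumptions, not the company's actual pipeline.

```python
import pandas as pd

def stratified_resample(df, strata_col, n_per_stratum, seed=42):
    """Draw the same number of rows from each stratum so downstream
    training data is not dominated by the majority group."""
    return (
        df.groupby(strata_col, group_keys=False)
          .apply(lambda g: g.sample(n=n_per_stratum,
                                    replace=len(g) < n_per_stratum,
                                    random_state=seed))
    )

# Purchase history skewed toward male users; rebalance before retraining
purchases = pd.DataFrame({
    "user_gender": ["M"] * 80 + ["F"] * 20,
    "item_id": range(100),
})
balanced = stratified_resample(purchases, "user_gender", n_per_stratum=50)
print(balanced["user_gender"].value_counts())  # 50 M, 50 F (minority oversampled)
```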
Let's go through a case study in healthcare, arguably the most
sensitive domain for data ethics.
A predictive analytics model designed to forecast hospital readmissions
was underperforming for non-white patients.
The root cause?
Under-representation in the training data. The team redesigned their data
collection pipeline, implemented targeted outreach, and used stratified sampling.
Non-white representation rose from 24.7% to 49%.
They then applied federated learning to protect patients'
privacy across five institutions,
and used synthetic data generation with GANs to supplement rare subgroups.
These changes cut performance disparity by 68%.
That's not just fair, it's clinically better.
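To illustrate the federated learning idea in the simplest terms, here is a toy federated-averaging step; the three hospitals, weight vectors, and sample counts are invented for the example, not the team's actual system.

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """FedAvg: each site trains locally and shares only model parameters;
    the server combines them, weighted by local sample counts, so raw
    patient records never leave the institution."""
    total = sum(client_sizes)
    stacked = np.stack(client_weights)                     # (num_clients, num_params)
    coeffs = np.array(client_sizes, dtype=float) / total   # per-client weighting
    return (coeffs[:, None] * stacked).sum(axis=0)

# Toy example: three hospitals, each with a locally trained weight vector
local_models = [np.array([0.2, 1.0]), np.array([0.4, 0.8]), np.array([0.3, 0.9])]
sizes = [1000, 500, 1500]
global_model = federated_average(local_models, sizes)
print(global_model)  # redistributed to each site for the next training round
```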
Now let's talk about ethical gates.
How do we implement them?
Ethics isn't a separate step.
It needs to be baked into every stage of the pipeline.
Think of it as four gates.
Collection gate: are we collecting data ethically, with consent and
appropriate representation?
Processing gate: are our transformations introducing bias?
Output gate: are the outputs privacy-safe and fair?
Monitoring gate: are we continuously evaluating deployed systems for emerging risks?
This ethics-by-design model mirrors privacy by design, and helps prevent
ethical failures before they occur.
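One way to make the four gates concrete is to treat them as hard checks that a pipeline run must pass. Everything below (the gate functions, field names, and thresholds) is a hypothetical sketch of the pattern, not a specific framework.

```python
def collection_gate(batch):
    # Collection gate: consent recorded for every record?
    return all(record.get("consent") for record in batch)

def processing_gate(raw, transformed):
    # Processing gate: did transformations silently drop a large share of rows?
    return len(transformed) >= 0.9 * len(raw)

def output_gate(predictions):
    # Output gate: outputs free of PII before they leave the system?
    return not any(p.get("contains_pii") for p in predictions)

def run_with_gates(batch, transform, model):
    """Run a pipeline step, failing loudly if any ethical gate is violated."""
    if not collection_gate(batch):
        raise RuntimeError("Collection gate failed: missing consent")
    transformed = transform(batch)
    if not processing_gate(batch, transformed):
        raise RuntimeError("Processing gate failed: transformation dropped too much data")
    predictions = model(transformed)
    if not output_gate(predictions):
        raise RuntimeError("Output gate failed: unsafe outputs")
    return predictions  # monitoring gate: keep evaluating after deployment
```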
Now documentation and transparency.
These are two important pieces of bridging ethics and accountability.
When we log data lineage from original source through transformations, we
create a transparent system, but we must go beyond technical details.
What tradeoffs were made, what biases were discovered, what limitations remain.
For example, if your dataset excludes users under 18 due to consent rules,
note that as a limitation that could affect generalizability. Tailor
documentation for your audience.
Technical teams want reproducibility.
Regulators want compliance.
End users want clarity.
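A lightweight sketch of the kind of lineage-plus-context record described here; the field names and example values are assumptions.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DatasetRecord:
    """Lineage plus the human context that raw lineage logs leave out."""
    name: str
    source: str
    transformations: List[str] = field(default_factory=list)
    known_biases: List[str] = field(default_factory=list)
    tradeoffs: List[str] = field(default_factory=list)
    limitations: List[str] = field(default_factory=list)

record = DatasetRecord(
    name="signup_events_v3",
    source="web_signup_service",
    transformations=["dropped rows with null country", "bucketed age into 5-year bands"],
    known_biases=["mobile users under-represented before 2022"],
    tradeoffs=["coarser age bands chosen for k-anonymity over precision"],
    limitations=["excludes users under 18 due to consent rules"],
)
```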
And let's talk about interdisciplinary collaboration.
Ethical systems are built by diverse teams.
Engineers bring technical rigor; ethicists translate principles into policies.
Domain experts provide contextual nuances.
For instance, in an education app, collaboration with teachers revealed
that an AI tutor was reinforcing gender stereotypes in math problems.
This wouldn't have been caught by engineers alone.
End users, especially from vulnerable communities, must
also have a seat at the table.
They can flag issues others miss because they live with the consequences.
And let's talk about how we plan to automate these ethical checks.
Because manual checks are not scalable, we need an automated
system that assesses beneficence: does the system produce a net positive impact?
Non-maleficence: does it avoid harm?
Autonomy: are users in control?
Justice: are outcomes fair across groups?
Explicability: can we explain decisions?
Tools like IBM's AI Fairness 360 and Microsoft's Fairlearn offer
promising capabilities, but they're not silver bullets.
We must ensure they are aligned with real-world values and validated in
production settings, not just test beds.
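As one hedged illustration of automating the justice check, a library like Fairlearn can compute per-group metrics and gate a build on the gap. The toy data and the 0.2 threshold below are made up, and the exact API may differ across library versions.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from fairlearn.metrics import MetricFrame, selection_rate

# Hypothetical labels, predictions, and a sensitive attribute
y_true = pd.Series([1, 0, 1, 1, 0, 1, 0, 1])
y_pred = pd.Series([1, 0, 0, 1, 0, 1, 0, 1])
group  = pd.Series(["A", "A", "A", "A", "B", "B", "B", "B"])

mf = MetricFrame(
    metrics={"accuracy": accuracy_score, "selection_rate": selection_rate},
    y_true=y_true,
    y_pred=y_pred,
    sensitive_features=group,
)
print(mf.by_group)  # metrics broken down per group

# CI-style gate: fail the pipeline if the selection-rate gap is too large
assert mf.difference()["selection_rate"] < 0.2, "Justice check failed"
```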
Now let's talk about regulations.
Regulations like GDPR, HIPAA, and the EU AI Act are forcing companies
to treat ethics as a compliance issue.
But the best companies go further.
They embed governance by design, build pipelines that are flexible
enough to adapt to future rules.
They maintain audit trails that demonstrate what decisions were
made and why, and most importantly, they foster a culture where ethics
is part of engineering, not an afterthought imposed by legal teams.
To wrap up, ethical data engineering is about more than avoiding risk.
It's about trust.
It's about designing systems that are not just technically sound, but
socially responsible. With privacy-preserving tools, fairness frameworks,
rigorous documentation, and interdisciplinary collaboration, we can build
data infrastructure that earns and deserves the trust of users and regulators alike.
Thank you for your time.
I look forward to your questions and conversations.