Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone.
Welcome to this session on Privacy by Design for JavaScript Data Systems.
I'm excited to share how we can build secure and compliant architectures
that respect user privacy while maintaining full system functionality.
Allow me to introduce myself.
I'm a data engineer at Petta.
I specialize in privacy-centric data architectures and real-time data streaming systems.
I spent many years at Meta, building architectures and systems that are privacy-aware,
both in data infrastructure and data processing.
Today in this conference, we're gonna explore the critical intersection
of privacy, compliance, and modern JavaScript data engineering.
This is not just about following regulations.
This is about fundamentally rethinking how we architect our systems.
And as a quick disclaimer, what I'm about to share is in no way a reflection
of how we build systems at Meta.
This is purely out of my experience in building data systems that are
privacy compliant and privacy aware.
Without further ado, let's get started.
Let's talk about why this matters.
Currently, about 140 countries have strong privacy regulations, and the average
cost of a data breach is in the millions: about $5 million in 2024. That's a
pretty high price for any firm to take on.
And the scary part is that 87% of US citizens can be identified using just
three attributes, like zip code, date of birth, and gender.
This is not some theory; this is actually happening right now,
and privacy is no longer optional.
It's a technical and business requirement for any modern JavaScript data system.
So what exactly is privacy by design?
It's a proactive approach that embeds privacy protections and privacy
compliance into the system architecture from the ground up.
And for JavaScript developers, this means integrating privacy controls through
the entire data engineering lifecycle: from the moment data is collected on
the client side, through storage in Node.js backends, all the way to
processing and analytics, privacy must be baked in at every step.
There are four core principles that I wanna discuss.
The first is proactive, not reactive: we anticipate and prevent privacy
issues even before they occur.
The second is privacy as the default: we set maximum privacy controls in the
system architecture automatically, without user intervention.
We do not expect users to come and choose that;
we'd rather enable it from the get-go.
The third: privacy is embedded into the design itself.
As you get started with the design of the product,
that's when you start thinking about privacy.
It's not an afterthought, not something you add on at the tail end,
because we all know that doesn't scale.
And finally, full functionality, because we're all passionate developers,
engineers, and architects, and we don't want to compromise product functionality.
So we'll be talking about how to do this: staying privacy-centric while
keeping full functionality and keeping the product usable.
Let's dive into some concrete techniques.
The first one is k-anonymity.
K-anonymity ensures that every record in your dataset is indistinguishable
from at least k minus 1 other records. Say you choose your k to be 5: then any
individual's data looks identical to four other individuals', and this
prevents hackers or users (when I say users, I mean system users) from
tracing back to that individual and figuring out their demographic traits.
And the other concept is l-diversity, which extends this by ensuring diverse,
sensitive attribute values within each group.
So not only are people grouped together, but each group has varied
characteristics. This is essential especially for healthcare data, or any
sensitive protected data. In a healthcare setting, for example, you wouldn't
want all the patients sharing the same medical condition in one group;
if that group's data is lost, that would defeat the purpose.
So you wanna scatter it all over, so it's not easy to trace back
to a specific individual.
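To make this concrete, here is a minimal sketch of both checks in plain JavaScript. The field names (zipPrefix, birthYear, gender, diagnosis) are illustrative assumptions, not from any real dataset:

```js
// Group records by their quasi-identifier values.
function groupByQuasiIdentifiers(records, quasiIds) {
  const groups = new Map();
  for (const record of records) {
    const key = quasiIds.map((field) => record[field]).join('|');
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(record);
  }
  return groups;
}

// k-anonymity: every group of identical quasi-identifiers has at least k members.
function isKAnonymous(records, quasiIds, k) {
  return [...groupByQuasiIdentifiers(records, quasiIds).values()]
    .every((group) => group.length >= k);
}

// l-diversity: every group also contains at least l distinct sensitive values.
function isLDiverse(records, quasiIds, sensitiveField, l) {
  return [...groupByQuasiIdentifiers(records, quasiIds).values()]
    .every((group) => new Set(group.map((r) => r[sensitiveField])).size >= l);
}

// Usage: generalize first (e.g., zip to a 3-digit prefix, birth date to year),
// then verify before releasing the dataset.
// isKAnonymous(records, ['zipPrefix', 'birthYear', 'gender'], 5);
// isLDiverse(records, ['zipPrefix', 'birthYear', 'gender'], 'diagnosis', 3);
```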
And next, how does differential privacy look in practice?
Differential privacy is where the math gets very interesting.
We add calibrated mathematical noise to query results, ensuring that the
presence or absence of any single individual does not significantly
affect outcomes.
And epsilon is the privacy budget in differential privacy: values between 0.1
and 1.0 give strong privacy guarantees while still maintaining statistical
utility. Go below 0.1 and you might be adding so much noise that the data
loses its utility; go beyond 1.0 and the privacy guarantee weakens. Between
0.1 and 1.0 gives you enough salting of the data while still keeping its
statistical utility.
And the key insight here is that individual records become impossible
to identify while aggregate patterns remain clear.
And finally, for JavaScript implementations, there are libraries like
node-differential-privacy for Node.js, and local differential privacy can be
implemented directly in the browser. This enables privacy-safe analytics
without sacrificing the insights your business needs.
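Rather than lean on any particular package's API, here is a hand-rolled sketch of the Laplace mechanism for a count query; it's a simplified illustration, with epsilon as the privacy budget and sensitivity 1 because one person changes a count by at most 1:

```js
// Sample Laplace noise via inverse-CDF sampling.
function laplaceNoise(scale) {
  const u = Math.random() - 0.5;
  return -scale * Math.sign(u) * Math.log(1 - 2 * Math.abs(u));
}

// Differentially private count: true count plus calibrated noise.
function privateCount(records, predicate, epsilon = 0.5) {
  const trueCount = records.filter(predicate).length;
  const sensitivity = 1; // adding/removing one person shifts a count by at most 1
  return trueCount + laplaceNoise(sensitivity / epsilon);
}

// Usage: a noisy count of users in one zip prefix.
// const n = privateCount(users, (u) => u.zipPrefix === '941');
```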
The next type of privacy that I wanna talk about is
pseudonymization and tokenization.
Pseudonymization and tokenization are your bread and butter for handling PII.
With pseudonymization, you replace identifying fields with pseudonyms while
maintaining referential integrity for analytics.
Think about replacing an SSN with an internal, non-traceable ID that cannot
be associated back to a user if the data gets lost, stolen, or accessed by a
stakeholder who isn't supposed to access it.
And tokenization pushes this further by generating random tokens that map to
sensitive data stored in a separate secure vault, so this is even more secure.
And the beauty is that your analytics pipeline never touches real PII.
So basically, the analytics pipeline runs on those randomized tokens that got
generated, and analysts can still derive meaningful results. But when you
really have to access the underlying values, you can bring that data back
and join it to the tokens.
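Here is a minimal sketch of both techniques using Node's built-in crypto module. The in-memory Map stands in for a separate secure vault service, and PSEUDONYM_KEY is a hypothetical secret held outside the codebase:

```js
const crypto = require('crypto');

// Pseudonymization: a keyed HMAC yields a stable, non-reversible pseudonym,
// so the same SSN always maps to the same ID (referential integrity).
function pseudonymize(value, key = process.env.PSEUDONYM_KEY) {
  return crypto.createHmac('sha256', key).update(value).digest('hex');
}

// Tokenization: a random token with no mathematical link to the data;
// the mapping lives only in the vault.
const VAULT = new Map(); // in production: a separate, locked-down service
function tokenize(value) {
  const token = crypto.randomUUID();
  VAULT.set(token, value);
  return token;
}
function detokenize(token) {
  return VAULT.get(token); // gate this behind strict access controls
}
```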
Now let's talk about some advanced technologies.
For those who are ready to push boundaries, homomorphic encryption lets you
perform computations on encrypted data without ever decrypting it.
Imagine running machine learning models that maintain up to 95% accuracy
while the data remains encrypted throughout.
Secure multiparty computation enables multiple organizations to jointly
compute on their combined data without anyone seeing the others' information.
Think about something to the effect of blockchain.
And these are actually being implemented today in healthcare and finance;
I literally wrote a paper on healthcare analytics through differential
privacy and homomorphic techniques.
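To give a feel for it in JavaScript, here is a hedged sketch of additively homomorphic encryption using the paillier-bigint npm package (my choice of library, not one named in the talk; any Paillier implementation with encrypt, add, and decrypt would do). The sum is computed entirely on ciphertexts:

```js
const { generateRandomKeys } = require('paillier-bigint');

async function main() {
  const { publicKey, privateKey } = await generateRandomKeys(2048);

  // Two parties encrypt their values; no plaintext is ever shared.
  const c1 = publicKey.encrypt(120n);
  const c2 = publicKey.encrypt(80n);

  // The addition happens directly on the encrypted values.
  const encryptedSum = publicKey.addition(c1, c2);

  console.log(privateKey.decrypt(encryptedSum)); // 200n
}

main();
```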
Let's talk about healthcare, because that's definitely an interesting case
study. In one of the case studies I had done, we worked with a healthcare
consortium that needed to enable medical research across multiple
institutions without exposing patient data.
Using JavaScript-based federated learning with differential privacy, patient
records never left hospital servers: ML models were trained locally on
encrypted data, and only differentially private model updates were shared
centrally. The result is a fully HIPAA- and GDPR-compliant dataset while
enabling breakthrough research.
And this was keeping data local and sharing only privacy-preserving insights.
And think about this in terms of the amount of research that can happen, the
amount of change that can be brought about in sensitive areas like cancer
research or other HIPAA-protected categories, where you do want doctors and
hospitals to share their knowledge across organizations while still
protecting data privacy. You wouldn't wanna share this data as it is.
So this is where we are: we've explored a concept, and I've written a paper
on it, of how you implement models locally in a hospital but share the
differential results, the aggregates or the model outputs, with the next
hospital so they can build on top of it.
Let's talk about one more case study, and this one is financial services.
Here we implemented tokenization at the point of payment collection, so card
details are immediately tokenized as they're collected, enabling real-time
fraud detection without exposing actual card numbers.
We basically tokenize that information and send it to the back office for
quick real-time analytics using streaming pipelines.
And at the same time, customer behavioral analytics run on k-anonymous
aggregations, providing valuable insights while protecting user privacy.
Imagine you're using a third-party company to run analytics on your
customers, or you're sharing with credit score companies or anybody like
that: you need to share your data, but you don't want to share it at an
individual level. This is where k-anonymization or k-means clustering come in
handy. If you create cohorts of 25 or 50 people, that'll protect your data
while still giving you great insights into customer behavior and analytics.
So let's talk about the architectural blueprint for implementing this.
The first layer is data collection. We implement consent management, minimize
the collection scope, and immediately apply pseudonymization using browser
APIs. So right as you bring data in, you're very cognizant of what you're
bringing into the system, instead of just bringing in all kinds of data.
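As a sketch of what that client-side step can look like with the SubtleCrypto API (APP_SALT is a hypothetical value shipped with your app config):

```js
// Hash an identifier before it ever leaves the client.
// Note: a salted hash of a low-entropy value is pseudonymization, not
// anonymization; prefer a server-side keyed HMAC where possible.
async function pseudonymizeInBrowser(identifier, salt = APP_SALT) {
  const data = new TextEncoder().encode(salt + identifier);
  const digest = await crypto.subtle.digest('SHA-256', data);
  return [...new Uint8Array(digest)]
    .map((byte) => byte.toString(16).padStart(2, '0'))
    .join('');
}

// Usage: send the pseudonym, never the raw email.
// const id = await pseudonymizeInBrowser(user.email);
```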
The second layer is storage, and here we encrypt all the data sitting at rest
and separate identifying information into different databases: one holds the
truly sensitive PII and sits apart from the non-sensitive data, like
transactional records or whatever it is. And you ensure there are strict
access controls delineating data across both systems.
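A minimal sketch of field-level encryption at rest with Node's crypto module (AES-256-GCM); the 32-byte key is assumed to come from a KMS, never from the codebase:

```js
const crypto = require('crypto');

function encryptField(plaintext, key) {
  const iv = crypto.randomBytes(12); // unique IV per record
  const cipher = crypto.createCipheriv('aes-256-gcm', key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, 'utf8'), cipher.final()]);
  // Store IV + auth tag + ciphertext together; the key stays in the KMS.
  return Buffer.concat([iv, cipher.getAuthTag(), ciphertext]).toString('base64');
}

function decryptField(encoded, key) {
  const buf = Buffer.from(encoded, 'base64');
  const decipher = crypto.createDecipheriv('aes-256-gcm', key, buf.subarray(0, 12));
  decipher.setAuthTag(buf.subarray(12, 28)); // GCM auth tag is 16 bytes
  return Buffer.concat([decipher.update(buf.subarray(28)), decipher.final()]).toString('utf8');
}
```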
Layer three is processing: you apply differential privacy across all queries
and use secure enclaves for sensitive operations. You're not just running
user-level information; you aggregate with k-anonymization, k-means
clustering, or any of the techniques we've discussed so far.
And for the analytics layer, you only generate insights from anonymized
datasets. You go back to your storage layer (layer two) and only extract or
use the de-identified data.
And you maintain comprehensive audit trails. This really comes in handy, one,
if there is an unfortunate breach, or two, when you're getting a third party
to certify your system; it's especially valuable for folks in banking,
insurance, and also healthcare.
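A minimal audit-trail sketch; auditStore.append and db.get are hypothetical stand-ins for an append-only sink (a write-once table or log stream) and your data client:

```js
// Every data access is recorded with who, what, when, and why.
async function auditedRead(auditStore, db, { userId, table, recordId, purpose }) {
  await auditStore.append({
    actor: userId,
    action: 'read',
    resource: `${table}/${recordId}`,
    purpose, // required: no declared purpose, no access
    at: new Date().toISOString(),
  });
  return db.get(table, recordId);
}
```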
Layer five is access. You enforce a role-based access control system where
you ensure there are purpose limitations, and you manage user rights
thoroughly. The most important thing here is the principle of least
privilege: you don't want to give people access to data that they don't
really need access to.
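A minimal sketch of role-based access with purpose limitation; the role names, fields, and purposes are illustrative assumptions:

```js
const POLICY = {
  analyst: { fields: ['zipPrefix', 'ageBand'], purposes: ['analytics'] },
  support: { fields: ['email', 'orderId'], purposes: ['customer-support'] },
};

// Least privilege: deny unless role, field, and purpose all match.
function canAccess(role, field, purpose) {
  const rule = POLICY[role];
  return !!rule && rule.fields.includes(field) && rule.purposes.includes(purpose);
}

// canAccess('analyst', 'email', 'analytics');        // false
// canAccess('support', 'email', 'customer-support'); // true
```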
And then let's talk about the JavaScript tools and patterns for
implementation. You've got powerful tools at your disposal. In Node.js, use
the built-in crypto module for encryption, node-differential-privacy for
implementing DP, and JSON Web Tokens for secure token management (see the
sketch after this list). On the browser side, you have the SubtleCrypto API
for client-side encryption; you can implement local differential privacy
before data leaves the browser and use privacy-preserving analytics SDKs.
Architecturally, adopt privacy-first API design patterns, implement
zero-knowledge authentication where possible, and consider federated data
processing to keep data distributed.
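For the token-management piece, here is a hedged sketch with the jsonwebtoken npm package: short-lived, narrowly scoped tokens, with the signing secret assumed to live in a secret manager:

```js
const jwt = require('jsonwebtoken');

// Keep the payload minimal: an opaque subject and a scope, never PII.
function issueToken(userId, scope, secret = process.env.JWT_SECRET) {
  return jwt.sign({ sub: userId, scope }, secret, { expiresIn: '15m' });
}

function verifyToken(token, secret = process.env.JWT_SECRET) {
  return jwt.verify(token, secret); // throws if expired or tampered with
}
```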
This is definitely a lot of information on one slide, but you'll have access
to this deck, so you can go back and reference it.
And finally, let me wrap this up with common pitfalls to avoid.
First: logging PII in plain text. Your application logs, error messages, and
debug output are often stored indefinitely and are searchable; one stack
trace with user data can undo all your privacy work.
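A minimal sketch of redacting PII before anything reaches the logger; the field list is an illustrative assumption you would extend for your own data:

```js
const PII_FIELDS = new Set(['email', 'ssn', 'phone', 'dateOfBirth']);

// Recursively replace known PII fields before logging.
function redact(value) {
  if (Array.isArray(value)) return value.map(redact);
  if (value === null || typeof value !== 'object') return value;
  return Object.fromEntries(
    Object.entries(value).map(([key, child]) =>
      PII_FIELDS.has(key) ? [key, '[REDACTED]'] : [key, redact(child)]
    )
  );
}

// Usage: logger.error('payment failed', redact(request.body));
```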
Second: weak pseudonymization. Using MD5 hashes or sequential IDs isn't
really anonymization; it's just obfuscation. These patterns are trivially
reversible, and I'm yet to see an organization where people don't confuse
hashing with encrypting. Hashes are traceable; they can be linked back.
And the third thing is the over-collection mindset. Trying to grab every
piece of data from the get-go, without figuring out a plan for how you'll use
it or how you'll protect it, is a very bad precedent to start with. Think
about it very proactively: think about what you want to do with that data
before you even collect it, and be very methodical about doing this.
Fourth: implementing privacy only on the client side. Browser-based controls
are important but never sufficient, and any developer with dev tools can
bypass them. So ensure that you're implementing controls both on the client
side and on the server or backend side.
And fifth: ignoring data retention. Keeping your data forever "just in case"
violates many privacy regulations, including GDPR and California privacy law,
and not only that: in an unfortunate incident, all that data that's been
sitting around becomes much more expensive if an intruder gets access to it.
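A minimal retention sketch: delete records past their window on a schedule. node-cron and the pg-style db.query call are my assumptions; swap in whatever scheduler and database client you actually use:

```js
const cron = require('node-cron');
const db = require('./db'); // hypothetical query client (e.g., a pg Pool)

const RETENTION_DAYS = 90; // set per regulation and business need

// Run every night at 03:00 and purge anything past the retention window.
cron.schedule('0 3 * * *', async () => {
  const cutoff = new Date(Date.now() - RETENTION_DAYS * 24 * 60 * 60 * 1000);
  await db.query('DELETE FROM events WHERE created_at < $1', [cutoff]);
});
```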
And the good thing is, all of these pitfalls can absolutely be prevented with
proper design, using the techniques that we have discussed in this
presentation so far.
And lastly, sorry, one more slide: key takeaways.
Let me leave you with four key takeaways.
The first thing is embed privacy early. Retrofitting privacy into an existing
system is expensive and risky; design it from day one.
Second thing is layer your defenses.
No single technique is strong enough.
Third: balance privacy with utility. Only extract the data you need, and
ensure that you're protecting whatever you extract.
And fourth: stay compliant by design. Privacy-by-design principles naturally
align with all the privacy laws out there, like GDPR, CCPA, or HIPAA, and a
bunch of other regulations. Good architecture makes you compliant naturally,
not as a burden.
And finally, thank you so much for the opportunity to speak with you. I hope
you took something valuable away from this conference and this presentation.
Good luck with your future endeavors.
Thank you.