Conf42 Machine Learning 2024 - Online

GDPR and Beyond: Demystifying Data Governance Challenges

Abstract

Explore critical aspects of data governance, compliance, and development in the evolving data landscape. Learn how robust data governance addresses challenges and ensures compliance with global regulations. Join us for a concise and insightful session.

Summary

  • Antonio and Francesco are data architects at Agile Lab, an Italian consulting firm specializing in large-scale data management. Today we will talk about data privacy and GDPR, why this European regulation is so important and why we should design systems to be compliant with it.
  • Data is the fuel for innovation. Machine learning, artificial intelligence and analytics simply aren't possible without data. Data breaches are very similar to oil spills. Like environmental damage, data breaches can severely impact brand trust and loyalty.
  • From 2004 to 2020, 117 billion records were compromised. The web sector was the hardest hit, accounting for nearly 10 billion records lost. This growing trend underscores the critical need for robust data security measures. The European Union responded with the General Data Protection Regulation (GDPR).
  • The data minimization principle is foundational to responsible data handling and privacy protection. Organizations should only collect and process the personal data that is absolutely necessary. By adhering to this principle, we not only streamline our data management practices, but also enhance security and compliance.
  • Data anonymization is one of the techniques that organizations can use to adhere to strict data privacy regulations. For each strategy we evaluated three main factors: secrecy, privacy and utility. For each factor of each strategy we assigned a rating ranging from poor to best.
  • Homomorphic encryption provides the ability to compute on data while the data is encrypted. Synthetic data generation is critical and very complicated for two main reasons: quality and secrecy. Using real data increases stakeholder confidence in the testing process and the reliability of the development lifecycle.
  • Anonymizing data with encryption reduces the risk of exposing personal information. Using anonymized data helps organizations to comply with GDPR, CCPA and so on. Working with encrypted data, especially in the case of the AES algorithm, can complicate the development phase and also testing activities.
  • We leverage the medallion architecture for our storage layer. Data at each stage gets richer, increasing its intrinsic value. The encryption process becomes a mandatory step in the data lifecycle. This simplifies the data movement and the orchestration process between environments.
  • Crypto shredding is the practice of deleting data by deleting or overwriting the encryption keys. This requires that the data has been encrypted. This approach is very useful when you have multiple copies of data. Thank you everybody, and let's get in touch for any questions.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
I am Antonio, and today Francesco and I are going to present "GDPR and Beyond: Demystifying Data Governance Challenges". Francesco and I are data architects at Agile Lab, an Italian consulting firm specializing in large-scale data management. Agile Lab is an effective and dynamic company structured around a holacracy-inspired model with multiple business units. Through these business units, we are lucky enough to have several Fortune 500 companies as our customers. OK, let's take a look at what we have on the agenda. Today we will talk about data privacy and GDPR, why this European regulation is so important and why we should design systems to be compliant with it. We will have an overview of different techniques that we can leverage to be compliant and secure, like anonymization and encryption. Then we will compare these techniques, focusing our attention on the pros and cons of each one. Finally, we will present a viable data sharing strategy for a real use case. Data is the new oil, am I right? Data is the fuel for innovation. Machine learning, artificial intelligence and analytics simply aren't possible without data. Just as oil powers engines, data fuels algorithms, enabling machines to learn and improve over time; this is the case for machine learning. Also, data is the lifeblood of AI, driving smart ecosystems that can mimic human intelligence. Finally, data analytics extracts valuable insights, just as refining oil produces useful products. Yeah, data is the new oil, and it is the new oil also in its bad parts. For example, data breaches are very similar to oil spills. They cause extensive damage and leak sensitive information, which erodes the trust of our customers. We can also have privacy violations, which are very similar to the pollution that can harm ecosystems. Privacy violations disrupt the digital environment and harm individuals.
Then we have regulatory fines, which are very comparable to environmental fines: when your data is not compliant with data protection regulations, you will get very large fines. Finally, you will suffer reputational damage because, like environmental damage, data breaches can severely impact brand trust and loyalty. So let's reflect on the reality of data breaches over the past 20 years. This visualization showcases the top 50 biggest data breaches. From 2004 to 2020, 117 billion records were compromised. As we can see, the severity of breaches has been escalating over the years, particularly from 2016 onwards, with the web sector being the hardest hit, accounting for nearly 10 billion records lost. Significant breaches span various sectors including finance, government and tech, highlighting the widespread vulnerability. Notable breaches include Yahoo in 2013, losing 3 billion records, and Facebook in 2019, with 530 million records exposed. This growing trend underscores the critical need for robust data security measures. These breaches not only compromise personal information, but also erode public trust and pose severe financial risks. As we move forward, it is imperative to prioritize data protection and adopt stringent security protocols to safeguard these digital assets. That's why the European Union came up with GDPR. GDPR stands for General Data Protection Regulation, and it is a regulation that requires businesses around the world to protect the personal data and privacy of European Union citizens. In force since 25 May 2018, GDPR puts in place certain restrictions on the collection, use and retention of personal data. Personal data is defined as information relating to an identified or identifiable natural person. This includes data such as name, email and phone number, in addition to data that may be less obvious, like IP addresses, GPS location, phone ID and more.
GDPR is based on some key principles, and we will briefly run through them. First of all is lawfulness: personal data must be processed legally, adhering to established laws such as GDPR. Then we have fairness: data processing should be fair, protecting vital interests, performing tasks carried out in the public interest, or pursuing the legitimate interests of the data controller or a third party. Then we have transparency, because organizations must be open about their data processing activities. They should provide clear, accessible and understandable information to individuals about how their data is being used, who is collecting it and why. And this is why you have the cookie banner on every European website now. Then we have purpose limitation: personal data must be collected for specified, explicit and legitimate purposes and not further processed in a manner that is incompatible with those purposes. Then you have the data minimization principle: organizations should collect only the personal data that is necessary to achieve the specified purpose, and we will have more on this later. Then we have accuracy: personal data must be accurate and kept up to date, and inaccurate data should be corrected or deleted. Storage limitation means that personal data should not be kept longer than necessary for the purpose for which it was collected. An example of storage limitation is the right to be forgotten: if I ask to be forgotten by some company, they should delete all data about me. Then we have integrity and confidentiality: organizations must ensure the security of personal data, protecting it against unauthorized or unlawful processing and against accidental loss, destruction or damage, like data breaches. And then you have accountability: data controllers are responsible for complying with GDPR principles and must be able to demonstrate their compliance.
In order to do that, GDPR creates some requirements around the regulation itself. It requires that companies carry out data protection impact assessments, which are tools used to identify and mitigate risks associated with data processing activities. They must follow data breach notification rules, so they should report data breaches in a timely manner to both authorities and individuals. Then they need to appoint a data protection officer, someone inside the company who is responsible for overseeing the data protection strategy and compliance with GDPR. Obviously, they need to implement data protection by design and by default, so every data initiative in the company should comply with this regulation without any need to retrofit it afterwards. And then they have some record-keeping obligations: companies need to maintain a detailed record of their data processing activities, for accountability and compliance purposes. Obviously, GDPR had huge implications for data governance. But what is data governance? Data governance is the process of managing the availability, usability, integrity and security of the data in enterprise systems, based on internal standards and policies that also control data usage. Effective data governance ensures that data is consistent and trustworthy and doesn't get misused. GDPR had some implications for the internal data governance strategies of companies and enterprises. To be compliant, they had to enhance data security and privacy controls. They needed to improve data quality and accuracy, because of the principles we've seen before. They needed to increase accountability and transparency in how data was used. And obviously, GDPR put pressure on them and created the necessity for regular audits and assessments around data. Today, we will focus mostly on the data minimization principle, which in our opinion is one of the most important ones in GDPR.
The data minimization principle is foundational to responsible data handling and privacy protection. Under GDPR, the data minimization principle mandates that organizations should only collect and process the personal data that is absolutely necessary for their specified purposes. Imagine you're building a house. You wouldn't order extra bricks that you'll never use, as it would be wasteful and clutter your space. Similarly, in data processing, we should avoid collecting excess data. By adhering to this principle, we not only streamline our data management practices, but also enhance security and compliance. Collecting minimal data reduces the risk of breaches and misuse, ensuring we respect our customers' privacy and build their trust. This also simplifies data management and can lead to more efficient processes. So let's commit to collecting only what we need, protecting privacy and fostering a culture of data responsibility. So, does this meme look familiar? We know that people working on AI, machine learning and analytics need real data to do their job, but this clashes with GDPR 99% of the time. I will now leave the stage to Francesco, who will show you how we can build a compliant data sharing strategy and still allow data practitioners to be effective. Here we go, Francesco. This meme will look familiar. This is quite a common scenario, since most ML engineers and data scientists need to prototype their models every day. In the majority of cases they start by using development data, but the risk is that when moving to production, the performance of the model is low, so they fall back on using sampled data from production. This increases the risk of sensitive data leakage and exposure in the minor environments. In the next slides, let's see how we can address this issue and which techniques could help our case. So the first thing we are going to talk about is anonymization.
So data anonymization is one of the techniques that organizations can use in order to adhere to strict data privacy regulations that require the security of personally identifiable information such as health reports, contact information and financial details. It differs from pseudonymization since it is not a reversible operation. Pseudonymization simply reduces the correlation of a data set with the original identity of a data subject and is therefore a useful but not absolute security measure. And let's take a look now at the most common techniques under the umbrella of anonymization. The first one that we are going to talk about is generalization. Generalization usually changes the scale of the data set attributes, or their order of magnitude. As you may see, we have a simple table with several columns like name, age, birth date, state and disease. In example one, a field that contains a number, like age, can be generalized by expressing it as an interval. As you may see, Mark's age has been put within an interval between 20 and 30. In example two, a field that contains dates, like 1993-10-19, can be generalized by using only the year, 1993. And this is the very first method. The other one is randomization. Randomization involves changing attributes in a data set so that they are less precise while maintaining their overall distribution. Under the umbrella of randomization, we have techniques such as noise addition and shuffling. Noise addition injects some modifications into the data set in order to make it less accurate, for example increasing or decreasing the age of a person, as we can see in the example on the slide, while shuffling simply swaps the ages of Mark and John. And this is the second method. The third method is suppression, another useful technique and the most used in the anonymization space.
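As a rough illustration of the generalization and randomization techniques described above, here is a minimal Python sketch. The records, column names and helper functions are hypothetical and are not taken from the slides; only the ideas (age intervals, year-only dates, noise addition, shuffling) come from the talk.

```python
import random

# Hypothetical records; the values are illustrative only.
records = [
    {"name": "Mark", "age": 27, "birth_date": "1993-10-19"},
    {"name": "John", "age": 41, "birth_date": "1982-03-02"},
    {"name": "Anna", "age": 35, "birth_date": "1988-07-25"},
]

def generalize_age(age, width=10):
    """Generalization: replace an exact age with an interval such as '20-30'."""
    low = (age // width) * width
    return f"{low}-{low + width}"

def generalize_date(iso_date):
    """Generalization: keep only the year of an ISO date string."""
    return iso_date[:4]

def add_noise(age, rng, max_shift=3):
    """Noise addition: shift the age by a small random amount."""
    return age + rng.randint(-max_shift, max_shift)

def shuffle_column(rows, column, rng):
    """Shuffling: permute one column across rows, keeping its distribution."""
    values = [row[column] for row in rows]
    rng.shuffle(values)
    return [{**row, column: value} for row, value in zip(rows, values)]

rng = random.Random(42)  # seeded only to make the sketch repeatable

generalized = [
    {**r, "age": generalize_age(r["age"]),
     "birth_date": generalize_date(r["birth_date"])}
    for r in records
]
noisy = [{**r, "age": add_noise(r["age"], rng)} for r in records]
shuffled = shuffle_column(records, "age", rng)
```

In this sketch the generalized age becomes a string, so consumers that expect an integer column would need to be adapted.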
In my opinion, suppression is the process of removing an attribute's value entirely from a data set, while redaction removes part of the attribute value from a data set. With such techniques you can have multiple issues. Warning number one: if the data is collected for the purpose of determining at which age individuals are most likely to develop a specific illness or condition, suppressing the age data would make the data itself useless. Warning number two: the data type will change from integer to string, and this will break the contract for all the data consumers of that data asset. In this slide, as you can see, we have a brief comparison of the methods explained previously. For each strategy we evaluated three main factors: secrecy, privacy and utility. And for each factor of each strategy we assigned a rating ranging from poor to best. As you may see, every method has its weaknesses, so we have no evidence of a superior technique that could address all the factors around data, in this case secrecy, privacy and utility. What we can say is that it depends a lot on the use case. But now let's take a look at the encryption methods, in order to understand if they could help in the context of compliance. The first method we are going to talk about is format-preserving encryption. Format-preserving encryption, or FPE, is a symmetric encryption algorithm which preserves the format of the information while it is being encrypted. FPE is weaker than the Advanced Encryption Standard (AES). Format-preserving encryption can preserve the length of the data as well as its format. FPE is a NIST standard, and there are three different modes of operation: FF1, FF2 and FF3. FPE works very well with existing applications as well as new applications. If an application needs data of a certain length or format, then FPE is the way to go. In order to operate with this algorithm, you should use a key and a tweak.
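To make the format-preserving idea concrete, here is a toy Python sketch: a small Feistel network over decimal digit strings, keyed with HMAC-SHA256 and a tweak. It is only an illustration of the concept under stated assumptions. It is not FF1 or FF3, it is not secure, and the function names, key and tweak are hypothetical.

```python
import hashlib
import hmac

def _round_value(key, tweak, round_no, half, modulus):
    """Pseudo-random round function built from HMAC-SHA256 (toy choice)."""
    message = tweak + bytes([round_no]) + half.encode()
    digest = hmac.new(key, message, hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % modulus

def toy_fpe_encrypt(digits, key, tweak=b"", rounds=10):
    """Encrypt a string of decimal digits into a digit string of equal length."""
    assert digits.isdigit() and len(digits) >= 2 and rounds % 2 == 0
    u = len(digits) // 2
    v = len(digits) - u
    a, b = digits[:u], digits[u:]
    for i in range(rounds):
        m = u if i % 2 == 0 else v  # width of the half being replaced
        c = (int(a) + _round_value(key, tweak, i, b, 10 ** m)) % (10 ** m)
        a, b = b, str(c).zfill(m)
    return a + b

def toy_fpe_decrypt(digits, key, tweak=b"", rounds=10):
    """Invert toy_fpe_encrypt by running the rounds backwards."""
    assert digits.isdigit() and len(digits) >= 2 and rounds % 2 == 0
    u = len(digits) // 2
    v = len(digits) - u
    a, b = digits[:u], digits[u:]
    for i in reversed(range(rounds)):
        m = u if i % 2 == 0 else v
        previous_b = a
        previous_a = (int(b) - _round_value(key, tweak, i, previous_b, 10 ** m)) % (10 ** m)
        a, b = str(previous_a).zfill(m), previous_b
    return a + b

key = b"demo-key-not-for-production"   # hypothetical key
tweak = b"customers.card_number"       # hypothetical tweak
card = "4111111111111111"
token = toy_fpe_encrypt(card, key, tweak)
```

The output is still a 16-digit string, and the same input with the same key and tweak always yields the same output, which is the property that keeps joins across encrypted data sets working.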
One implementation is provided by Bouncy Castle, and others are available on Google Cloud as well as in the [unclear] toolkit. Now that we have seen the first encryption method, let's take a look at another one: homomorphic encryption. Homomorphic encryption provides the ability to compute on data while the data is encrypted. It sounds like magic, doesn't it? There are three different modes in this case. Partially homomorphic encryption allows a certain mathematical function to be used, for example addition or multiplication. With somewhat homomorphic encryption, some functions can be performed only a fixed number of times or up to a certain level of complexity. And in the end we also have fully homomorphic encryption, which allows all mathematical functions to be performed an unlimited number of times, up to any level of complexity, without requiring the decryption of the data. So suppose you want to store some sensitive data in the cloud. In the picture, on the left you have the traditional approach. In the example you encrypt the files before moving them to the cloud, for example with a standard algorithm like AES. Then if you want to perform some transformation on these files, you have to decrypt, apply the transformation and then encrypt again. This puts data at risk and also introduces complex overhead. On the right side, instead, you use homomorphic encryption. Once you are in the cloud, you can do computation on the ciphertext. When you decrypt, you will obtain the same result as applying the function to the plaintext data. Unfortunately, it requires significant computational overhead to perform the intensive calculations, making this kind of strategy very slow and very resource intensive. In addition to performance concerns, implementation of this specific algorithm can be very challenging, with highly complex techniques. Is that all? No, we also have other strategies and methods to bring to the table. One of these is tokenization.
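The partially homomorphic case described above (addition on encrypted values) can be sketched with a toy version of the Paillier cryptosystem. The talk does not name a specific scheme, so Paillier is an assumption chosen for illustration, and the tiny primes below make this sketch insecure by design.

```python
import math
import secrets

# Tiny demo primes (assumption for readability); real keys use very large primes.
P, Q = 1009, 1013
N = P * Q
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(P - 1, Q - 1)
MU = pow(LAM, -1, N)  # modular inverse; valid for the generator g = N + 1

def encrypt(m):
    """Encrypt an integer 0 <= m < N with the public key N."""
    while True:
        r = secrets.randbelow(N - 1) + 1
        if math.gcd(r, N) == 1:
            break
    # With g = N + 1, g**m mod N**2 equals 1 + m * N.
    return ((1 + m * N) * pow(r, N, N2)) % N2

def decrypt(c):
    """Decrypt with the private values LAM and MU."""
    x = pow(c, LAM, N2)
    return ((x - 1) // N * MU) % N

def add_encrypted(c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return (c1 * c2) % N2

c_total = add_encrypted(encrypt(20), encrypt(22))
total = decrypt(c_total)  # the sum, computed without decrypting the inputs
```

Whoever holds only the ciphertexts can compute `c_total`, while only the holder of the private values can read the result.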
Tokenization involves substituting sensitive data, like a credit card number, with non-sensitive tokens, which are stored securely in a separate database called a token vault. Synthetic data is artificial data generated with the purpose of preserving privacy, testing systems or creating training data for machine learning algorithms. Synthetic data generation is critical and very complicated for two main reasons: quality and secrecy. Synthetic data that can be reverse engineered to identify real data would not be useful in a privacy context. Faker is a Python package that generates fake data for you. There is also Mockaroo, another tool in the mock data ecosystem, which allows you to quickly and easily download large amounts of randomly generated test data based on the specification that you define. That is all for this part. Now let's summarize what we have learned in the previous slides. The first thing that you could do in order to share data for machine learning and analytics purposes in minor environments is to use sampled data coming from the production environment. The first pro is realism: sampled production data provides a realistic representation of actual data, helping developers and testers to identify issues that may not be evident with synthetic or mock data. You have improved testing: using real data allows for more comprehensive and accurate testing of functionality, data integrity, performance and scalability. Then you have stakeholder confidence: using real data increases stakeholder confidence in the testing process and the reliability of the development lifecycle. Among the cons, you have a lot of issues with privacy and compliance. Even sampled data can contain sensitive information, raising privacy concerns and potential non-compliance with data protection regulations such as GDPR and others like HIPAA.
Security risk: using production data in minor environments like development or QA increases the risk of data breaches and unauthorized access. Data freshness is another issue: sampled data might become outdated quickly, leading to scenarios where dev environments are not completely aligned with the current production environment. Let's take a look at synthetic data. Privacy and security are for sure pros: synthetic data can be generated without any real-world personal data, significantly reducing privacy concerns and the risk of data leakage. Availability: synthetic data can be created on demand. Efficiency: generating synthetic data can be more cost effective than collecting and labeling a large volume of real-world data. On the contrary, we have a lack of realism: synthetic data may not capture all the complexities and nuances of real-world data. We can have problems with overfitting. Also, there will be validation challenges: validating the accuracy and reliability of synthetic data can be very challenging, as it requires ensuring that the synthetic data closely mimics real-world data distributions. There is also a concern about complexity: creating high-quality synthetic data requires sophisticated techniques and domain knowledge, making the initial setup complex and resource intensive. Encrypted data: let's talk about it, focusing our attention on standard algorithms. Privacy protection: anonymizing data with encryption reduces the risk of exposing personal information. Regulatory compliance: using anonymized data helps organizations to comply with GDPR, CCPA and so on. We can easily enable a data sharing mechanism, and we can share this data with a bit more freedom within a department or organization, or also with external partners.
We also have risk mitigation, because we reduce the potential for data breaches, as the data no longer contains personally identifiable information. But on the contrary, we have some complexity: working with encrypted data, especially in the case of the AES algorithm, can complicate the development phase and also testing activities, because data will lose any kind of meaning and will only keep its distribution. Key management challenges: effective key management is crucial and can be complex, especially in non-production environments where multiple teams and individuals may need access to encryption keys. Limited testing accuracy: testing with encrypted data may not reflect true application behavior if the decryption process introduces delays or errors that wouldn't occur in production. Anonymized data: as we have seen before, the anonymization process is complex. Properly anonymizing data can be very challenging, requiring sophisticated techniques and ongoing management to ensure data remains anonymous. There is also the problem of re-identification of the data subject: there is a risk that anonymized data can be re-identified, especially if combined with other data sets, for example through linkage attacks and so forth. In some cases we lose utility, as we have seen for suppression. But now that we have summarized all the possible methods and techniques, at least the most important ones within the compliance context, let's take a look at the next slide. We are going to present a possible strategy for sharing data in a quite secure way in minor environments. The practice I'm going to show you combines some of the methods that we have seen before in the context of a data lake. So before moving forward, let's have a little bit of context. We are in the cloud and we have a data lake. In this specific case, we leverage the medallion architecture for our storage layer.
Most of you already know what a medallion architecture is. It is also known as a multi-hop architecture. Data at each stage gets richer, increasing its intrinsic value. At each stage, PII can be present, and usually machine learning engineers and data scientists operate at the silver and gold layers. In this slide, let's see what we can do to share data with minor environments and enable safe data consumption. This is a recipe for a cloud-based scenario, for example AWS, but it can be easily replicated with other cloud vendors. So you have a production account on the top and, for simplification purposes, a non-production account on the bottom. In each account we have the usual medallion architecture that we have seen before. Step one requires that data teams are in charge of prioritizing their jobs and anonymizing data. The encryption process becomes a mandatory step in the data lifecycle, made of data ingestion, data normalization and delivery. In step two, we open a read-only cross-account policy from production to non-production. The minor environment never writes to prod; it is only enabled for reading operations. The encryption key is never shared with minor environments. This way, the user personas that work in the lower environments are enabled to do their job. Data engineers can prototype new silver data sets reading from the bronze encrypted layer. Analysts can model new schemas and generate new anonymized reports. Data scientists can prototype their models on quite realistic data sets, since only sensitive columns will be encrypted. Let's take a look at the benefits of this kind of practice and strategy. The analyst says: format-preserving encryption guarantees referential integrity and no schema changes across different data sets, and allows business logic to be reused. Direct joins still work after the encryption. Data engineers are allowed to read only the encrypted layers, leveraging an ad hoc IAM policy.
This is going to simplify the data movement and the orchestration process between environments. Machine learning engineers can prototype and train their models on quasi-real data in a safe layer. DevOps practice is still in place, since deployments of new artifacts and models can follow the standard CI/CD flow. SecOps says: the minimization principle is respected in minor environments, since most of the time you have encrypted information. Now let's take a look at an add-on feature and talk about the right to be forgotten under the umbrella of GDPR. In this slide we are going to present the crypto-shredding technique. Crypto shredding is the practice of deleting data by deleting or overwriting the encryption keys. This requires that the data has been encrypted. Deleting the key will automatically, logically delete the record and all its existing copies, since the encrypted information is not reversible anymore. This approach is very useful when you have multiple copies of data, for example [unclear], or multiple layers of data like in the medallion architecture. If you are in the early stage of creating your data lake and building the foundation, you can combine crypto shredding and format-preserving encryption in order to enable a very interesting scenario that covers both the secure data sharing practice that I've explained before and the deletion problem across multiple layers and environments. Needless to say, all these techniques and strategies will function only with a strong data governance practice in place: knowing, in other words, where PII is stored and its lineage is fundamental. But this is another story. Thank you everybody, and let's get in touch for any questions.
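The crypto-shredding technique described in this closing part of the talk can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: one key per data subject, a hypothetical in-memory `KeyStore`, and a toy HMAC-based stream cipher standing in for a real cipher such as AES. None of these names come from the talk.

```python
import hashlib
import hmac
import secrets

def _keystream(key, nonce, length):
    """Toy keystream from HMAC-SHA256 in counter mode (stand-in for a real cipher)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        block = hmac.new(key, nonce + counter.to_bytes(8, "big"), hashlib.sha256)
        out += block.digest()
        counter += 1
    return bytes(out[:length])

class KeyStore:
    """Hypothetical key store holding one encryption key per data subject."""

    def __init__(self):
        self._keys = {}

    def key_for(self, subject_id):
        if subject_id not in self._keys:
            self._keys[subject_id] = secrets.token_bytes(32)
        return self._keys[subject_id]

    def get(self, subject_id):
        return self._keys.get(subject_id)

    def shred(self, subject_id):
        """Deleting the key logically deletes every copy of the ciphertext."""
        self._keys.pop(subject_id, None)

def encrypt_field(store, subject_id, plaintext):
    key = store.key_for(subject_id)
    nonce = secrets.token_bytes(16)
    data = plaintext.encode("utf-8")
    stream = _keystream(key, nonce, len(data))
    return nonce + bytes(x ^ y for x, y in zip(data, stream))

def decrypt_field(store, subject_id, blob):
    key = store.get(subject_id)
    if key is None:
        return None  # key was shredded: the value can no longer be recovered
    nonce, body = blob[:16], blob[16:]
    stream = _keystream(key, nonce, len(body))
    return bytes(x ^ y for x, y in zip(body, stream)).decode("utf-8")

store = KeyStore()
bronze = encrypt_field(store, "subject-42", "mark@example.com")
silver = bronze  # a copy of the same ciphertext in another layer
before = decrypt_field(store, "subject-42", silver)
store.shred("subject-42")  # right to be forgotten: delete only the key
after = decrypt_field(store, "subject-42", silver)
```

After `shred`, neither the bronze nor the silver copy can be decrypted, even though no ciphertext was touched.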

Antonio Murgia

Data Architect @ Agile Lab


Francesco Valentini

Business Unit Leader, Data Architect @ Agile Lab



