Synthetic data and data protection

2 November 2023

Synthetic data is artificially generated data, in contrast to real data, which is gathered from the real world. To be labelled as synthetic data, a non-real data set should preserve the characteristics and properties of the real data for a specific use case. Synthetic data can be used in the development, testing and validation of machine learning services when real data is not available in the amounts needed, or does not exist at all. It can also be used to enable data sharing by a company within the framework of Data Spaces without leaking trade secrets. Finally, it can serve as a privacy technique when it is used to create non-personal data sets with the same utility as the personal ones.

Image credit: Reto Scheiwiller - Pixabay.

Large amounts of data are currently needed to develop, test and validate machine learning and other data science-based systems. In several cases, the amount of data needed to carry out such processes is not available, because the data does not exist in the quantities required, or because it must depict situations that have not yet happened in the real world. In other cases, specific test data is needed for the verification and validation of systems: data that depicts anomalous situations, extreme scenarios, low-probability or unrecorded circumstances, or even attacks with manipulated data.

Closely related to the previous situation, synthetic data could support the data-driven economy by allowing access to information from public and private entities, in what are known as Data Spaces. Of course, organizations will be reluctant to disclose data that could leak trade secrets, weaknesses of the entity or intellectual property when they do not get sufficient guarantees about the purposes (and the real limits) of the processing of such data, or about the risk of impact on their interests. Synthetic data generation is one of several techniques that can address such issues.

Synthetic data has received more attention in recent years, as it helps in the development, testing and validation of natural language understanding systems, vision algorithms for self-driving vehicles, and fraud detection models for financial institutions.

Synthetic data is not random data. Whether a dataset is synthesized from real data or created from scratch, it should reproduce the characteristics and structure of real data, allowing similar conclusions to be drawn in specific use cases. It is therefore artificially generated data that has utility for at least one specific purpose. The most basic form of synthetic data, at the border of this definition, could be dummy files that merely resemble the real data format. A synthetic data set that does not reach a minimum threshold of utility for a specific purpose cannot properly be considered synthetic data in the context of that purpose.
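The difference between synthetic data and random data can be illustrated with a minimal sketch. It assumes a single numeric variable that is reasonably described by a normal distribution; the stand-in "real" values, the helper names and the sample sizes are hypothetical and chosen only for illustration.

```python
import random
import statistics


def synthesize(real_values, n, rng):
    """Draw n synthetic values from a normal model fitted to the real values."""
    mu = statistics.mean(real_values)
    sigma = statistics.stdev(real_values)
    return [rng.gauss(mu, sigma) for _ in range(n)]


def summary(values):
    """Return (mean, standard deviation) as a simple utility summary."""
    return statistics.mean(values), statistics.stdev(values)


rng = random.Random(0)

# Stand-in for real measurements (hypothetical: centred on 50, spread of 5).
real = [rng.gauss(50.0, 5.0) for _ in range(2000)]

# Synthetic data: generated from a model of the real data.
synthetic = synthesize(real, 2000, rng)

# Random data: same format (a list of floats) but no relation to the real data.
random_data = [rng.uniform(0.0, 1000.0) for _ in range(2000)]

real_mean, real_sd = summary(real)
syn_mean, syn_sd = summary(synthetic)
rnd_mean, rnd_sd = summary(random_data)
```

In this sketch the synthetic values reproduce the mean and spread of the real ones, while the random values only share the file format. Whether matching two summary statistics is enough utility depends entirely on the use case.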

Synthetic data could replace real data in some specific use cases. Each use case will have different quality requirements, and different requirements regarding the nuances of the process and the final purpose. For example, to validate a face recognition system, one might need to generate a dataset of synthetic faces to check the limits of that specific system. However, such a data set could lack the data quality needed to check other kinds of systems, or to develop new face recognition systems.

Synthetic data, like many other techniques, can also work as a Privacy-Enhancing Technology (PET), as it enables a data protection by design approach in use cases that require processing personal data. In such cases, synthetic data generation makes it possible to minimize or avoid processing personal data while achieving the defined objectives, with conclusions as good as those obtained from the original personal data. In the GDPR framework, synthetic data should not contain identifiable information even when it is generated from real personal data. Because synthetic data retains only the statistical properties or distribution of the real personal data for a specific purpose, it can be used to prevent personal data from being processed.

Creating synthetic data involves a generation or modelling process (“synthesis”) that must both preserve analytical value in a specific use case and comply with data protection regulations, the latter being translated into a set of privacy requirements. The preservation of analytical value refers to the utility of the method or model: how useful the data set is for the purpose or use case of the data.

The creation of synthetic data from real personal data would itself be a processing activity under the GDPR. It is therefore necessary to consider the provisions of the GDPR, in particular the principle of accountability, and to assess the possible risk of re-identification from the resulting synthetic data set.
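One simple signal that could feed into such a risk assessment is how close each synthetic record lies to its nearest real record. The sketch below is only a heuristic under stated assumptions: records are numeric tuples already scaled to comparable ranges, and the function names, sample records and threshold are hypothetical. It is not a substitute for a full anonymity assessment.

```python
import math


def distance_to_closest_record(synthetic_row, real_rows):
    """Euclidean distance from one synthetic record to its nearest real record."""
    return min(math.dist(synthetic_row, real_row) for real_row in real_rows)


def risk_report(synthetic_rows, real_rows, threshold):
    """Count synthetic records that copy, or nearly copy, a real record."""
    distances = [distance_to_closest_record(s, real_rows) for s in synthetic_rows]
    return {
        "exact_copies": sum(1 for d in distances if d == 0.0),
        "within_threshold": sum(1 for d in distances if d <= threshold),
        "min_distance": min(distances),
    }


# Hypothetical records, each attribute already scaled to the range [0, 1].
real_rows = [(0.34, 0.52), (0.45, 0.61), (0.29, 0.48)]
synthetic_rows = [(0.34, 0.52), (0.40, 0.57), (0.29, 0.49)]

report = risk_report(synthetic_rows, real_rows, threshold=0.02)
```

Here the first synthetic record is an exact copy of a real one and the third is a near copy, so both would deserve attention before the data set is released. A low count alone does not show that a data set is anonymous.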

This synthesis can be performed using different techniques, such as sequential modelling, simulation, decision trees or deep learning algorithms. The latter often use Generative Adversarial Networks (GANs), in which two competing neural networks train each other iteratively: the generator network tries to learn the underlying structure of the original data and generates synthetic data points with the same statistical distribution, while the discriminator network attempts to classify the data it receives as original or synthetic.
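A GAN is too large to show here, but sequential modelling can be sketched in a few lines. The example below assumes two categorical variables: the first is sampled from its observed frequencies and the second is sampled conditionally on the first, so the dependency between them is kept. The variable values, function names and sample size are hypothetical.

```python
import random
from collections import Counter, defaultdict


def fit_sequential(rows):
    """Estimate counts for P(first) and P(second | first) from (first, second) pairs."""
    marginal = Counter(first for first, _ in rows)
    conditional = defaultdict(Counter)
    for first, second in rows:
        conditional[first][second] += 1
    return marginal, conditional


def draw(counter, rng):
    """Draw one value with probability proportional to its count."""
    values = list(counter.keys())
    weights = list(counter.values())
    return rng.choices(values, weights=weights, k=1)[0]


def sample_sequential(marginal, conditional, n, rng):
    """Generate n synthetic pairs, one variable at a time."""
    synthetic = []
    for _ in range(n):
        first = draw(marginal, rng)
        second = draw(conditional[first], rng)
        synthetic.append((first, second))
    return synthetic


# Hypothetical real data: sales channel depends strongly on region.
real_rows = (
    [("north", "retail")] * 60 + [("north", "online")] * 40
    + [("south", "retail")] * 10 + [("south", "online")] * 90
)

rng = random.Random(1)
marginal, conditional = fit_sequential(real_rows)
synthetic_rows = sample_sequential(marginal, conditional, 5000, rng)

south = [channel for region, channel in synthetic_rows if region == "south"]
south_online_share = south.count("online") / len(south)
```

In the real data 90% of the records from the south are online sales, and the synthetic sample reproduces roughly that share. Sampling the two columns independently would lose this relationship.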

Depending on the purpose for which the synthetic data is to be used, one may synthesize all the variables of the original data set (fully synthetic data) or only some of them, for example the most sensitive ones (partially synthetic data). In the latter case, the disclosure risk for personal data is higher, as the data set contains original data along with synthetic data.
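The distinction between partially and fully synthetic data can be sketched as follows. The example assumes a deliberately naive synthesizer that resamples each selected column independently from its observed values, which ignores correlations between columns; the column names and records are hypothetical.

```python
import random


def synthesize_columns(rows, columns, rng):
    """Return a copy of rows in which only the listed columns are resampled.

    Each listed column is drawn independently from its own observed values.
    This is a naive stand-in for a real synthesizer.
    """
    pools = {column: [row[column] for row in rows] for column in columns}
    result = []
    for row in rows:
        new_row = dict(row)  # copy, so the original records are not modified
        for column in columns:
            new_row[column] = rng.choice(pools[column])
        result.append(new_row)
    return result


# Hypothetical records: two quasi-identifiers and one sensitive attribute.
real_rows = [
    {"age_band": "30-39", "postcode_area": "280", "diagnosis": "asthma"},
    {"age_band": "40-49", "postcode_area": "080", "diagnosis": "diabetes"},
    {"age_band": "30-39", "postcode_area": "410", "diagnosis": "asthma"},
    {"age_band": "60-69", "postcode_area": "280", "diagnosis": "hypertension"},
]

rng = random.Random(7)

# Partially synthetic: only the most sensitive variable is replaced.
partially_synthetic = synthesize_columns(real_rows, ["diagnosis"], rng)

# Fully synthetic: every variable is replaced.
fully_synthetic = synthesize_columns(
    real_rows, ["age_band", "postcode_area", "diagnosis"], rng
)
```

In the partially synthetic output every record still carries its original age band and postcode area, which is why the disclosure risk is higher in that case.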

Regardless of the technique chosen, an anonymity assessment should be carried out to ensure that the resulting synthetic data set does not contain identified or identifiable personal information. To prevent disclosure of personal information, other privacy-preserving techniques, such as differential privacy, can be applied in addition to synthetic data generation.
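One way differential privacy could be combined with synthesis is to add noise to the statistics from which synthetic records are drawn. The sketch below applies the Laplace mechanism to a histogram of one categorical variable and then samples synthetic values from the noisy counts. It assumes that the list of categories is fixed in advance rather than derived from the data, and that neighbouring data sets differ by adding or removing one record, so each count has sensitivity 1. The function names, the data and the value of epsilon are hypothetical.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Laplace(0, scale) noise, as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_histogram(values, categories, epsilon, rng):
    """Noisy count per category using the Laplace mechanism (sensitivity 1)."""
    counts = Counter(values)
    noisy = {}
    for category in categories:
        noisy_count = counts[category] + laplace_noise(1.0 / epsilon, rng)
        # Clamping to zero is post-processing and does not weaken the guarantee.
        noisy[category] = max(0.0, noisy_count)
    return noisy


def sample_from_histogram(noisy_counts, n, rng):
    """Draw n synthetic values proportionally to the noisy counts.

    Assumes at least one noisy count is positive.
    """
    categories = list(noisy_counts)
    weights = [noisy_counts[c] for c in categories]
    return rng.choices(categories, weights=weights, k=n)


categories = ["A", "B", "C"]  # fixed in advance, not taken from the data
real_values = ["A"] * 700 + ["B"] * 250 + ["C"] * 50

rng = random.Random(3)
noisy_counts = dp_histogram(real_values, categories, epsilon=1.0, rng=rng)
synthetic_values = sample_from_histogram(noisy_counts, 1000, rng)
```

A smaller epsilon adds more noise and gives stronger protection at the cost of utility. The right trade-off depends on the use case and on the outcome of the anonymity assessment.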

In this way, synthetic data represents a powerful tool for the data protection by design approach, since personal data is not exposed, and it can be used in multiple applications. For instance, synthetic data can help overcome data scarcity, improve data quality (e.g., by mitigating bias in the original data), and enhance data diversity. It can be used by statistical offices to release useful data to the public without compromising the privacy of respondents, or by the education and healthcare communities to develop analytical skills and discover patterns or insights while protecting the identity and privacy of individuals.

Synthetic data is a dual-purpose technology that makes it possible to address issues arising from both the data-driven economy and privacy requirements. However, synthetic data is not always the right choice, and its suitability must be assessed on a case-by-case basis. In some cases, data sets may be too complex for their structure (e.g., correlations, heavy tails, etc.) to be understood correctly for a specific case, or it can be difficult to mimic real data outliers. Wrongly generated synthetic data can also lead to misleading conclusions during the development, testing and validation phases. Finally, the assessment of the risk of re-identification could give a negative result. In such cases, alternative or complementary PETs should be used.

This post is related to other material released by the AEPD’s Innovation and Technology Division, such as:


Related posts

Evaluating human intervention in automated decisions

AI System: just one algorithm or multiple algorithms?

Artificial Intelligence: accuracy principle in the processing activity