Anonymization III: The risk of re-identification

Anonymization is a processing operation that requires the application of the accountability principle. This means that the controller must ensure, through a formal analysis, that the anonymized data set cannot be re-identified. However, it must be assumed that a residual probability of re-identification may remain. It is therefore necessary to analyze the impact that re-identification could have on individuals, to establish whether additional measures should be implemented to reduce the risk to data subjects, to assess the necessity and proportionality of the anonymization processing, and to conclude whether the anonymization process offers sufficient guarantees to protect fundamental rights.

Photo by Chris Yang on Unsplash.

Anonymization is a processing of personal data that generates, from a set of personal data, a new set of anonymous information. Like any processing, it must comply with the principles of the GDPR, including accountability. This implies that the controller must take appropriate measures to carry out the anonymization processing, incorporating the necessary safeguards and, in particular, considering the risk to data subjects that the anonymization process may be reversed.

Anonymization is not a trivial process. The controller must employ the right professionals, with knowledge of the state of the art in anonymization techniques and experience in re-identification attacks. After an accountable and suitable anonymization of the data set, it must be verified, through analysis and practical tests, that the data set cannot be re-identified. For this purpose, worst-case conditions must be considered, such as re-identification attempts by internal or external actors with access to auxiliary data, including data only available through illegal means, court orders or intelligence agencies. It should be taken into account that such attackers have adequate resources, and the controller should extrapolate the likely evolution of known techniques. If, under these conditions, the whole data set or just a part of it can be re-identified, there is no question of a risk of re-identification: that data set is simply not anonymous.
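One simple practical test of this kind is checking k-anonymity: how many records share the same combination of quasi-identifiers (attributes an attacker could link to auxiliary data). A minimal sketch, with illustrative field names and records that are not from any real dataset:

```python
from collections import Counter

def min_k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group of records sharing the
    same combination of quasi-identifier values (the k in k-anonymity).
    A small k means some records are nearly unique and therefore
    easier to re-identify by linking with auxiliary data."""
    groups = Counter(
        tuple(r[q] for q in quasi_identifiers) for r in records
    )
    return min(groups.values())

# Hypothetical released records: postal code, birth year, sex.
released = [
    {"zip": "28001", "birth": 1980, "sex": "F"},
    {"zip": "28001", "birth": 1980, "sex": "F"},
    {"zip": "28002", "birth": 1975, "sex": "M"},
]
k = min_k_anonymity(released, ["zip", "birth", "sex"])
# k == 1: the third record is unique on these attributes, so a
# linkage attack could single that person out.
```

A real assessment would go further (l-diversity, attacker models, auxiliary sources), but even this simple check flags records that are unique in the release.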

However, no human activity reaches perfection, and there will always be a residual probability of re-identification that the controller must assume. This residual probability means accepting that total and absolute infallibility does not exist. In any case, the controller must do what is described in the previous paragraph: apply accountability with appropriate measures to ensure compliance, taking into account the nature, context, scope and purposes of the processing and the risks to rights and freedoms, and review and update those measures.

For example, when anonymized data on New York taxi trips were published, including origins, destinations, times and payments, no passenger information was included, so the data were assumed to be anonymous. In addition, the taxis' IDs were masked with a hash. The latter was a mistake, and a well-known one: recovering the information masked by the hash for the 173 million trips took less than an hour. Moreover, regarding taxi passengers, there was an additional source of information that had not been taken into account: photos, findable through Google, of celebrities taking a taxi. In those photos the taxi's identifier was visible and, once the hash was broken, it was possible to determine the destinations, and the payments made, of numerous personalities. This is a clear example of re-identification, of the materialization of a residual probability of re-identification, and of how important experience is in the proper implementation of anonymization. Applying a simplified vision of anonymization, relying on automatisms, without formal analysis and without a validation phase for the final result, means failing to comply with the obligations of accountability.
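Why was the hash broken so quickly? An unsalted hash over a small, structured keyspace is not masking at all: an attacker can enumerate every possible identifier, hash each one, and build a reverse lookup table. A sketch of the idea, using an invented four-character ID format purely for illustration (the real medallion formats differed):

```python
import hashlib
import string
from itertools import product

def build_lookup(candidates):
    """Precompute hash -> identifier for every possible ID.
    With an unsalted hash and a small keyspace, this inverts
    the 'masking' in a single pass."""
    return {hashlib.md5(c.encode()).hexdigest(): c for c in candidates}

# Hypothetical ID format: digit, letter, digit, digit (e.g. "5D20").
# That is only 10 * 26 * 10 * 10 = 26,000 possible values.
candidates = [
    f"{d1}{l}{d2}{d3}"
    for d1, l, d2, d3 in product(string.digits, string.ascii_uppercase,
                                 string.digits, string.digits)
]
table = build_lookup(candidates)

masked = hashlib.md5(b"5D20").hexdigest()  # the "anonymized" value as published
recovered = table[masked]                  # -> "5D20": the original ID
```

Enumerating a few tens of thousands of candidates takes milliseconds; even millions take minutes, which is why hashing structured identifiers provides pseudonymization at best, not anonymization.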

Personal data can remain relevant for as long as the data subject lives; consider, for example, the importance of children's health data. Assuming that over such a long period there is a residual probability of re-identification, it is necessary to assess the impact that, in the event of re-identification, data subjects could suffer in their rights and freedoms. This analysis must take into account not only whether special categories of data could be disclosed, but also all the consequences for fundamental rights of disclosing certain personal information. For example, for certain survivors of gender-based violence, revealing their home address or geolocation patterns can pose a very high risk to their lives.

If there is a significant impact on the rights of individuals, and given the residual probability of re-identification that must be assumed, measures will have to be taken to reduce the risk to data subjects.

The first class of measures are those that reduce the impact of re-identification itself. An example of impact reduction is the removal of the most sensitive records or attributes from the dataset. Returning to the example of survivors of gender-based violence described above, if it is known that the original dataset contains records relating to people whose re-identification would have a greater impact, the possible impact could be reduced by eliminating those records. Likewise, if attributes are identified whose disclosure upon re-identification could have a greater impact, specific minimization techniques (generalize, blur, reduce the collection frequency, delete, ...) could be applied to them.
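Such minimization can be sketched as a per-record transformation: generalize quasi-identifying values into coarser ranges and delete the highest-impact attributes outright. The field names below are illustrative assumptions, not taken from any real dataset:

```python
def generalize_record(record):
    """Reduce re-identification impact on a single record:
    generalize exact values into coarser categories and delete
    a high-impact attribute entirely."""
    out = dict(record)
    out["zip"] = record["zip"][:3] + "**"  # generalize postal code
    decade = (record["age"] // 10) * 10
    out["age"] = f"{decade}-{decade + 9}"  # age band instead of exact age
    out.pop("home_address", None)          # delete the high-impact attribute
    return out

sample = {"zip": "28001", "age": 34, "home_address": "C/ Mayor 1"}
print(generalize_record(sample))
# -> {'zip': '280**', 'age': '30-39'}
```

Each generalization trades utility for protection, which is why the text below notes that a dataset may end up failing its utility or quality requirements.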

The second class of measures are those that further reduce the residual probability of re-identification that must be assumed to exist by default. Examples include legal safeguards beyond the obligations of the GDPR, such as contractually limiting the scope of dissemination of the anonymous data (e.g. to a specific group of researchers) or establishing storage requirements and limitations, guarantees commonly used to reduce other classes of risk in non-personal data.

If a risk to rights and freedoms cannot be sufficiently mitigated, whether because the impact cannot be reduced, because additional guarantees to lower the residual probability are not effective, or because the data set distributed under those constraints no longer meets the utility or quality requirements, it will be necessary to consider whether anonymization is the right way to proceed.

In short, an anonymization processing must generate a data set that is evaluated as anonymous, through a process of proven quality that achieves reasonable evidence of the impossibility of re-identification. The controller has to assess the impact of a re-identification on the fundamental rights of data subjects. In turn, assuming that there is always a residual probability of re-identification, the controller should assess the re-identification risk faced by data subjects and, if necessary, apply additional measures to reduce it. If an anonymization processing cannot generate a data set with the necessary quality requirements, such processing will not comply with the requirement of necessity to which all processing legitimized under Articles 6(1)(b) to 6(1)(f) of the GDPR is subject. If, on the other hand, the risk of re-identification does not meet proportionality criteria, then alternatives to anonymization will have to be considered.

More material on anonymization is available on the Innovation and Technology microsite of the AEPD website, such as: