Ten quick tips for protecting health data using de-identification and perturbation of structured datasets
Tshikala Eddie Lulamba, Themba Mutemaringa and Nicki Tiffin
PLOS Computational Biology, 2025, vol. 21, issue 9, 1-16
Abstract:
Structured patient data generated within the health data ecosystem are shared both internally for operational use and also externally for research and public health benefit. Protecting individual privacy and health data confidentiality in these contexts relies on data de-identification and anonymisation, although there are no universally accepted standards for these processes and the techniques involved can be technically complex. We present practical recommendations grounded in the principle of data minimisation: avoiding unnecessary granularity and identifying variables that could lead to re-identification when combined with other datasets. We provide practical guidance for anonymising and perturbing structured health data in ways that support compliance with data protection laws, describing technical and operational methods for reducing re-identification risk that include rounding numerical values, replacing precise values with ranges, adding jitter to numeric fields, aggregating data, management of date values and separating sensitive fields from identifying data to prevent linkage leading to re-identification. While some methods require advanced technical knowledge, we focus here on accessible strategies that can be implemented without specialist expertise, recognising the importance of the legal and governance frameworks in which anonymisation occurs. These guidelines support researchers, data managers and institutions in sharing health data responsibly, maintaining data utility while upholding privacy and promoting ethical and legal data stewardship for data-driven health research.
Author summary:
Healthcare systems and health research programmes collect large amounts of patient data that are often shared both within organisations and across institutional boundaries. Health data are highly sensitive, and it is essential to ensure that individuals cannot be identified or recognised through the use of their health information. Data de-identification and anonymisation are the most common approaches for protecting individuals’ privacy and confidentiality in these settings, but there are no universal standards for these processes and they can be technically complex to apply. Here we describe practical, accessible technical and operational security measures that can be used to de-identify and anonymise structured health data in ways that comply with data protection laws. These practical guidelines can support data analysts and researchers working with sensitive health data, including those without prior experience in data anonymisation, to implement effective privacy-preserving techniques, including perturbation, for large, structured health-related datasets.
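The perturbation methods named in the abstract (rounding, ranges, jitter, aggregation, date handling and separating sensitive fields from identifiers) can be illustrated with a minimal Python sketch. It is not taken from the article: every function name, field name and parameter value below (rounding base, range width, noise bound, date offset) is a hypothetical choice for illustration, and suitable settings would depend on the dataset and the applicable governance framework.

```python
import random
import secrets
from collections import Counter
from datetime import date, timedelta


def round_to(value, base=5):
    """Round a numeric value to the nearest multiple of `base` (illustrative base)."""
    return base * round(value / base)


def to_range(value, width=10):
    """Replace a precise value with a range label such as '30-39'."""
    lower = (int(value) // width) * width
    return f"{lower}-{lower + width - 1}"


def jitter(value, max_noise, rng):
    """Add uniform random noise drawn from [-max_noise, +max_noise]."""
    return value + rng.uniform(-max_noise, max_noise)


def shift_date(d, offset_days):
    """Shift a date by a fixed offset; using one offset per person keeps
    the intervals between that person's own dates unchanged."""
    return d + timedelta(days=offset_days)


def aggregate_counts(records, key):
    """Report counts per category instead of individual-level rows."""
    return dict(Counter(r[key] for r in records))


def split_identifiers(records, id_fields):
    """Separate identifying fields from the remaining fields, linked only
    by a random key that carries no information about the person."""
    identifiers, clinical = [], []
    for r in records:
        key = secrets.token_hex(8)  # random linkage key, assumed stored separately
        identifiers.append({"key": key, **{f: r[f] for f in id_fields}})
        clinical.append(
            {"key": key, **{f: v for f, v in r.items() if f not in id_fields}}
        )
    return identifiers, clinical


if __name__ == "__main__":
    # Hypothetical record, invented for this example only.
    rng = random.Random()
    record = {"name": "A. Person", "age": 34, "weight_kg": 67.3,
              "visit": date(2020, 1, 15), "clinic": "North"}
    print(to_range(record["age"]))            # age band instead of exact age
    print(round_to(record["weight_kg"]))      # weight rounded to nearest 5
    print(jitter(record["weight_kg"], 2.0, rng))
    print(shift_date(record["visit"], -10))
    ids, clin = split_identifiers([record], ["name"])
    print(clin[0])
```

In this sketch the identifier table and the clinical table would be stored and access-controlled separately, so that the clinical table alone does not carry direct identifiers. The range and rounding helpers reduce granularity deterministically, whereas jitter is random, so repeated releases of jittered values from the same source record may need separate consideration.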
Date: 2025
Downloads: (external link)
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1013507 (text/html)
https://journals.plos.org/ploscompbiol/article/fil ... 13507&type=printable (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:plo:pcbi00:1013507
DOI: 10.1371/journal.pcbi.1013507