Impacts of Data Synthesis: A Metric for Quantifiable Data Standards and Performances
Gunjan Chandra,
Pekka Siirtola,
Satu Tamminen,
Mikael J. Knip,
Riitta Veijola and
Juha Röning
Additional contact information
Gunjan Chandra: Biomimetics and Intelligent Systems Group, Faculty of Information Technology and Electrical Engineering, University of Oulu, Pentti Kaiteran katu 1, 90570 Oulu, Finland
Pekka Siirtola: Biomimetics and Intelligent Systems Group, Faculty of Information Technology and Electrical Engineering, University of Oulu, Pentti Kaiteran katu 1, 90570 Oulu, Finland
Satu Tamminen: Biomimetics and Intelligent Systems Group, Faculty of Information Technology and Electrical Engineering, University of Oulu, Pentti Kaiteran katu 1, 90570 Oulu, Finland
Mikael J. Knip: Pediatric Research Center, Children’s Hospital, University of Helsinki and Helsinki University Hospital, Yliopistonkatu 4, 00100 Helsinki, Finland
Riitta Veijola: Department of Paediatrics, University of Oulu, Oulu University Hospital, Kajaanintie 50, 90220 Oulu, Finland
Juha Röning: Biomimetics and Intelligent Systems Group, Faculty of Information Technology and Electrical Engineering, University of Oulu, Pentti Kaiteran katu 1, 90570 Oulu, Finland
Data, 2022, vol. 7, issue 12, 1-26
Abstract:
Clinical data analysis could lead to breakthroughs. However, clinical data contain sensitive information about participants that could be misused for unethical purposes such as blackmail, identity theft, mass surveillance, or social engineering. Data anonymization is a standard step during data collection, before sharing, to reduce the risk of disclosure. However, conventional anonymization techniques are not foolproof, and they also limit the opportunity for personalized evaluations. Much research has been done on synthetic data generation using generative adversarial networks and many other machine learning methods; however, these methods are either not free to use or limited in capacity. This study evaluates the performance of an emerging tool named synthpop, an R package that produces synthetic data as an alternative approach to data anonymization. The paper establishes data standards derived from the original data set, based on utility and quality of information, and measures how far the synthetic data set deviates from them in order to evaluate the data synthesis process. Methods for assessing the utility of a synthetic data set fall broadly into two approaches: general utility and specific utility. General utility assesses whether the synthetic data resemble the original data set overall in their statistical properties and multivariate relationships. Specific utility, in contrast, assesses how closely a fitted model’s performance on the synthetic data matches its performance on the original data. The quality of information is assessed by comparing entropy (in bits) and mutual information with the response variables between the original and synthetic data sets.
The study reveals that the synthetic data passed all utility tests, with statistically non-significant differences from the original data, and preserved not only the utility but also the complexity of the original data set according to the data standard established in this study. Therefore, synthpop fulfills these requirements and opens up a wide range of opportunities for the research community, including easier data sharing and information protection.
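The information-quality comparison described in the abstract (entropy in bits and mutual information with a response variable, measured on both the original and the synthetic data) can be illustrated with a minimal Python sketch. This is not the authors' implementation, which is built around the R package synthpop; the function names `entropy_bits` and `mutual_information_bits` and the toy data are hypothetical and chosen only for illustration, and the sketch assumes discrete (categorical) variables.

```python
import math
from collections import Counter


def entropy_bits(values):
    """Shannon entropy of a discrete sample, in bits."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mutual_information_bits(x, y):
    """Mutual information I(X; Y) in bits, from empirical joint frequencies."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
    return sum(
        (c / n) * math.log2(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


# Toy, made-up data (NOT from the paper): one categorical feature
# and a binary response, for an "original" and a "synthetic" data set.
orig_feature = ["a", "a", "b", "b", "a", "b", "a", "b"]
orig_response = [1, 1, 0, 0, 1, 0, 1, 0]
syn_feature = ["a", "b", "b", "a", "a", "b", "a", "b"]
syn_response = [1, 0, 0, 1, 1, 0, 0, 1]

# Smaller gaps mean the synthetic data kept more of the original information.
entropy_gap = abs(entropy_bits(orig_feature) - entropy_bits(syn_feature))
mi_gap = abs(
    mutual_information_bits(orig_feature, orig_response)
    - mutual_information_bits(syn_feature, syn_response)
)
print(f"entropy gap: {entropy_gap:.3f} bits, MI gap: {mi_gap:.3f} bits")
```

In this toy case the marginal distribution of the feature is identical in both data sets, so the entropy gap is zero, while the weaker feature-response link in the synthetic sample shows up as a non-zero mutual information gap. Continuous variables would need to be binned first, which is a further assumption not covered here.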
Keywords: synthpop; data sharing; data anonymization; machine learning; mutual information; data quality
JEL-codes: C8 C80 C81 C82 C83
Date: 2022
Downloads:
https://www.mdpi.com/2306-5729/7/12/178/pdf (application/pdf)
https://www.mdpi.com/2306-5729/7/12/178/ (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:gam:jdataj:v:7:y:2022:i:12:p:178-:d:1000188
Data is currently edited by Ms. Cecilia Yang
Bibliographic data for series maintained by MDPI Indexing Manager.