An empirical evaluation of easily implemented, nonparametric methods for generating synthetic datasets
Joerg Drechsler () and
Jerome P. Reiter
Computational Statistics & Data Analysis, 2011, vol. 55, issue 12, 3232-3243
When intense redaction is needed to protect the confidentiality of data subjects' identities and sensitive attributes, statistical agencies can use synthetic data approaches. To create synthetic data, the agency replaces identifying or sensitive values with draws from statistical models estimated from the confidential data. Many agencies are reluctant to implement this idea because (i) the quality of the generated data depends strongly on the quality of the underlying models, and (ii) developing effective synthesis models can be a labor-intensive and difficult task. Recently, there have been suggestions that agencies use nonparametric methods from the machine learning literature to generate synthetic data. These methods can estimate non-linear relationships that might otherwise be missed and can be run with minimal tuning, thus considerably reducing burdens on the agency. Four synthesizers based on machine learning algorithms-classification and regression trees, bagging, random forests, and support vector machines-are evaluated in terms of their potential to preserve analytical validity while reducing disclosure risks. The evaluation is based on a repeated sampling simulation with a subset of the 2002 Uganda census public use sample data. The simulation suggests that synthesizers based on regression trees can result in synthetic datasets that provide reliable estimates and low disclosure risks, and that these synthesizers can be implemented easily by statistical agencies.
Keywords: Census; Confidentiality; Disclosure; Imputation; Microdata; Synthetic (search for similar items in EconPapers)
References: View references in EconPapers View complete reference list from CitEc
Citations: View citations in EconPapers (4) Track citations by RSS feed
Downloads: (external link)
Full text for ScienceDirect subscribers only.
This item may be available elsewhere in EconPapers: Search for items with the same title.
Export reference: BibTeX
RIS (EndNote, ProCite, RefMan)
Persistent link: https://EconPapers.repec.org/RePEc:eee:csdana:v:55:y:2011:i:12:p:3232-3243
Access Statistics for this article
Computational Statistics & Data Analysis is currently edited by S.P. Azen
More articles in Computational Statistics & Data Analysis from Elsevier
Bibliographic data for series maintained by Haili He ().