Synthesizing Tabular Microdata with Gaussian Copulas: The rsdv R Package
Kailas Venkitasubramanian
No t6cne_v1, SocArXiv from Center for Open Science
Abstract:
Administrative records, survey microdata, and clinical research data carry identifying detail that data-governance procedures restrict from open release. Synthetic data, meaning artificial rows that preserve the distributional structure of a real dataset without releasing any single observation, is now widely used as a substitute when the underlying records cannot be shared. The R ecosystem has had good options for parts of this workflow, but has lacked a native implementation of the copula-based joint-modelling design that the Python Synthetic Data Vault (SDV) library popularised. The `rsdv` package fills that gap: it fits a Gaussian copula jointly over numerical, categorical, and boolean columns, supports conditional sampling and declarative row-level constraints, and ships three evaluation reports (quality, structural validity, and privacy) modelled on SDMetrics' two-property hierarchy. I describe the package's design, benchmark it against the most widely used R alternatives (`synthpop` and `arf`) on the UCI Adult Income data, report quality scores as a function of training-set size, and quantify attribute-disclosure risk across four threat models. `synthpop` produces the highest marginal fidelity on this dataset; `arf` is fast and competitive; `rsdv` is comparable to `synthpop` on privacy and is the only one of the three to bring the SDV-style integrated reporting pipeline to R. The package is on CRAN with 200+ tests and three vignettes, and a reproducible replication archive accompanies this paper.
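To make the Gaussian-copula idea in the abstract concrete, the following is a minimal sketch in Python using only the standard library. It is not the `rsdv` implementation or its API: the function names (`to_normal_scores`, `fit_and_sample`), the toy columns `col_a` and `col_b`, and the restriction to two numeric columns are all assumptions made for illustration. The sketch maps each column to normal scores through its ranks, estimates the correlation of those scores, draws correlated normal pairs, and maps them back through each column's empirical quantiles.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for cdf / inverse cdf


def to_normal_scores(values):
    """Map a numeric column to standard-normal scores via its ranks."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        # rank / (n + 1) keeps the probability strictly inside (0, 1)
        scores[idx] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / (var_a * var_b) ** 0.5


def empirical_quantile(sorted_values, u):
    """Nearest-rank lookup of the observed value at probability u."""
    idx = min(int(u * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[idx]


def fit_and_sample(x, y, n_samples, seed=0):
    """Fit a two-column Gaussian copula and draw synthetic (x, y) rows."""
    rho = pearson(to_normal_scores(x), to_normal_scores(y))
    rho = max(-1.0, min(1.0, rho))  # guard against floating-point overshoot
    rng = random.Random(seed)
    xs, ys = sorted(x), sorted(y)
    rows = []
    for _ in range(n_samples):
        # Correlated standard-normal pair with correlation rho
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + max(0.0, 1.0 - rho * rho) ** 0.5 * rng.gauss(0.0, 1.0)
        # Back-transform each coordinate through its own marginal
        rows.append((empirical_quantile(xs, STD_NORMAL.cdf(z1)),
                     empirical_quantile(ys, STD_NORMAL.cdf(z2))))
    return rho, rows


# Hypothetical toy data: two positively related numeric columns
_rng = random.Random(42)
col_a = [_rng.gauss(40.0, 10.0) for _ in range(200)]
col_b = [1000.0 * a + _rng.gauss(0.0, 5000.0) for a in col_a]

rho, synthetic = fit_and_sample(col_a, col_b, n_samples=500)
```

The nearest-rank back-transform reuses observed values, which keeps the sketch short but would be a poor choice where disclosure risk matters; a production tool would typically fit smooth marginals and handle categorical and boolean columns, ties, and more than two columns through a full correlation matrix.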
Date: 2026-06-09
Downloads:
https://osf.io/download/6a2753badd88c0c834cdd34c/
Persistent link: https://EconPapers.repec.org/RePEc:osf:socarx:t6cne_v1
DOI: 10.31219/osf.io/t6cne_v1