Privacy Protection Practice for Data Mining with Multiple Data Sources: An Example with Data Clustering

O’Shaughnessy, Pauline; Lin, Yan-Xia

Pauline O’Shaughnessy and Yan-Xia Lin
Additional contact information
Pauline O’Shaughnessy: School of Mathematics and Applied Statistics, University of Wollongong, Wollongong, NSW 2522, Australia
Yan-Xia Lin: School of Mathematics and Applied Statistics, University of Wollongong, Wollongong, NSW 2522, Australia

Mathematics, 2022, vol. 10, issue 24, 1-13

Abstract: In the age of data, data mining provides feasible tools with which to handle large datasets consisting of data from multiple sources. However, there is limited research on retrieving statistical information from data that are confidential and cannot be shared directly. In this paper, we address this problem and propose a framework for performing data analysis using data from multiple sources without revealing the true values, thereby protecting privacy. The proposed framework includes three steps. First, data custodians individually mask their data before publishing; then, the collection of masked data is used to reconstruct the density function of the original dataset, from which resampled values are generated; last, existing data mining techniques are applied directly to the resampled data. An essential component of this framework is the reconstruction of the original density function from noise-masked data using the moment-based density estimation method. Simulation studies show that the proposed framework performs well; analysis results from the resampled data are comparable to those from the original data when the density of the original data is estimated well. The proposed framework is demonstrated in data clustering analysis using a real-life Australian soybean dataset. Results from the k-means algorithm with two and three fitted clusters are presented to show that cluster analysis using resampled data can closely replicate that of the original data.
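
The first step of the framework, and the ingredient that a moment-based density estimate builds on, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each value is masked by independent multiplicative noise C drawn from a known Uniform(0.5, 1.5) distribution, so that the raw moments of the original data X can be estimated from the masked data Y = X * C through E[X^k] = E[Y^k] / E[C^k]. The function names, the noise distribution, and the synthetic gamma-distributed data are all hypothetical choices made for illustration; density reconstruction, resampling, and clustering are not shown.

```python
import random


def mask(values, rng, low=0.5, high=1.5):
    """Multiply each value by independent noise C ~ Uniform(low, high)."""
    return [x * rng.uniform(low, high) for x in values]


def uniform_raw_moment(k, low=0.5, high=1.5):
    """Closed-form E[C^k] for C ~ Uniform(low, high)."""
    return (high ** (k + 1) - low ** (k + 1)) / ((k + 1) * (high - low))


def recovered_moments(masked, max_order, low=0.5, high=1.5):
    """Estimate E[X^k] for k = 1..max_order using only the masked values."""
    n = len(masked)
    estimates = []
    for k in range(1, max_order + 1):
        masked_moment = sum(y ** k for y in masked) / n
        # Independence of X and C gives E[Y^k] = E[X^k] * E[C^k].
        estimates.append(masked_moment / uniform_raw_moment(k, low, high))
    return estimates


rng = random.Random(0)
# Synthetic stand-in for one custodian's confidential data (assumption).
original = [rng.gammavariate(4.0, 0.5) for _ in range(50_000)]
masked = mask(original, rng)

estimated = recovered_moments(masked, max_order=3)
# Sample moments of the original data, used here only to check the sketch.
true = [sum(x ** k for x in original) / len(original) for k in (1, 2, 3)]
```

In this sketch only `masked` and the noise distribution would be published; `true` is computed solely to compare against `estimated`. Higher-order moments are estimated with larger variance, which is one reason the quality of the density estimate depends on sample size.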

Keywords: data masking; multiplicative noise; data mining; sample size calculation
JEL-codes: C
Date: 2022

Downloads:
https://www.mdpi.com/2227-7390/10/24/4744/pdf (application/pdf)
https://www.mdpi.com/2227-7390/10/24/4744/ (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:10:y:2022:i:24:p:4744-:d:1002989

Mathematics is currently edited by Ms. Emma He

Bibliographic data for series maintained by MDPI Indexing Manager.