De-Anonymizing Users across Rating Datasets via Record Linkage and Quasi-Identifier Attacks

Torres, Nicolás; Olivares, Patricio

De-Anonymizing Users across Rating Datasets via Record Linkage and Quasi-Identifier Attacks

Nicolás Torres () and Patricio Olivares
Additional contact information
Nicolás Torres: Departamento de Electrónica, Universidad Técnica Federico Santa María, Santiago 8940897, Chile
Patricio Olivares: Departamento de Electrónica, Universidad Técnica Federico Santa María, Santiago 8940897, Chile

Data, 2024, vol. 9, issue 6, 1-27

Abstract: The widespread availability of pseudonymized user datasets has enabled personalized recommendation systems. However, recent studies have shown that users can be de-anonymized by exploiting the uniqueness of their data patterns, raising significant privacy concerns. This paper presents a novel approach that tackles the challenging task of linking user identities across multiple rating datasets from diverse domains, such as movies, books, and music, by leveraging the consistency of users’ rating patterns as high-dimensional quasi-identifiers. The proposed method combines probabilistic record linkage techniques with quasi-identifier attacks, employing the Fellegi–Sunter model to compute the likelihood of two records referring to the same user based on the similarity of their rating vectors. Through extensive experiments on three publicly available rating datasets, we demonstrate the effectiveness of the proposed approach in achieving high precision and recall in cross-dataset de-anonymization tasks, outperforming existing techniques, with F1-scores ranging from 0.72 to 0.79 for pairwise de-anonymization tasks. The novelty of this research lies in the unique integration of record linkage techniques with quasi-identifier attacks, enabling the effective exploitation of the uniqueness of rating patterns as high-dimensional quasi-identifiers to link user identities across diverse datasets, addressing a limitation of existing methodologies. We thoroughly investigate the impact of various factors, including similarity metrics, dataset combinations, data sparsity, and user demographics, on the de-anonymization performance. This work highlights the potential privacy risks associated with the release of anonymized user data across diverse contexts and underscores the critical need for stronger anonymization techniques and tailored privacy-preserving mechanisms for rating datasets and recommender systems.

Keywords: de-anonymization; record linkage; quasi-identifiers; user privacy; recommender systems (search for similar items in EconPapers)
JEL-codes: C8 C80 C81 C82 C83 (search for similar items in EconPapers)
Date: 2024
References: View complete reference list from CitEc
Citations:

Downloads: (external link)
https://www.mdpi.com/2306-5729/9/6/75/pdf (application/pdf)
https://www.mdpi.com/2306-5729/9/6/75/ (text/html)

Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.

Export reference: BibTeX RIS (EndNote, ProCite, RefMan) HTML/Text

Persistent link: https://EconPapers.repec.org/RePEc:gam:jdataj:v:9:y:2024:i:6:p:75-:d:1402709

Access Statistics for this article

Data is currently edited by Ms. Becky Zhang

More articles in Data from MDPI
Bibliographic data for series maintained by MDPI Indexing Manager ().