Automatic Detection of Sensitive Data Using Transformer-Based Classifiers
Michael Petrolini,
Stefano Cagnoni and
Monica Mordonini
Additional contact information
Michael Petrolini: Department of Engineering and Architecture, University of Parma, Parco Area delle Scienze 181a, 43124 Parma, Italy
Stefano Cagnoni: Department of Engineering and Architecture, University of Parma, Parco Area delle Scienze 181a, 43124 Parma, Italy
Monica Mordonini: Department of Engineering and Architecture, University of Parma, Parco Area delle Scienze 181a, 43124 Parma, Italy
Future Internet, 2022, vol. 14, issue 8, 1-15
Abstract:
The General Data Protection Regulation (GDPR) has allowed EU citizens and residents to have more control over their personal data, simplifying the regulatory environment affecting international business and unifying and homogenising privacy legislation within the EU. This regulation affects all companies that process data of European residents regardless of the place in which they are processed and their registered office, providing for a strict discipline of data protection. These companies must comply with the GDPR and be aware of the content of the data they manage; this is especially important if they are holding sensitive data, that is, any information regarding racial or ethnic origin, political opinions, religious or philosophical beliefs, trade union membership, data relating to the sexual life or sexual orientation of the person, as well as data on physical and mental health. These classes of data are hardly structured, and most frequently they appear within a document such as an email message, a review or a post. It is extremely difficult to know if a company is in possession of sensitive data at the risk of not protecting them properly. The goal of the study described in this paper is to use Machine Learning, in particular the Transformer deep-learning model, to develop classifiers capable of detecting documents that are likely to include sensitive data. Additionally, we want the classifiers to recognize the particular type of sensitive topic with which they deal, in order for a company to have a better knowledge of the data they own. We expect to make the model described in this paper available as a web service, customized to private data of possible customers, or even in a free-to-use version based on the freely available data set we have built to train the classifiers.
Keywords: GDPR; sensitive data; personal data; natural language processing; BERT; transformers
JEL-codes: O3
Date: 2022
Downloads:
https://www.mdpi.com/1999-5903/14/8/228/pdf (application/pdf)
https://www.mdpi.com/1999-5903/14/8/228/ (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:gam:jftint:v:14:y:2022:i:8:p:228-:d:872831
Future Internet is currently edited by Ms. Grace You
Bibliographic data for series maintained by MDPI Indexing Manager.