A Systematic Comparison Between Open- and Closed-Source Large Language Models in the Context of Generating GDPR-Compliant Data Categories for Processing Activity Records

von Schwerin, Magdalena; Reichert, Manfred

A Systematic Comparison Between Open- and Closed-Source Large Language Models in the Context of Generating GDPR-Compliant Data Categories for Processing Activity Records

Magdalena von Schwerin () and Manfred Reichert
Additional contact information
Magdalena von Schwerin: Institute of Databases and Information Systems, Ulm University, 89081 Ulm, Germany
Manfred Reichert: Institute of Databases and Information Systems, Ulm University, 89081 Ulm, Germany

Future Internet, 2024, vol. 16, issue 12, 1-24

Abstract: This study investigates the capabilities of open-source Large Language Models (LLMs) in automating GDPR compliance documentation, specifically in generating data categories—types of personal data (e.g., names, email addresses)—for processing activity records, a document required by the General Data Protection Regulation (GDPR). By comparing four state-of-the-art open-source models with the closed-source GPT-4, we evaluate their performance using benchmarks tailored to GDPR tasks: a multiple-choice benchmark testing contextual knowledge (evaluated by accuracy and F1 score) and a generation benchmark evaluating structured data generation. In addition, we conduct four experiments using context-augmenting techniques such as few-shot prompting and Retrieval-Augmented Generation (RAG). We evaluate these on performance metrics such as latency, structure, grammar, validity, and contextual understanding. Our results show that open-source models, particularly Qwen2-7B, achieve performance comparable to GPT-4, demonstrating their potential as cost-effective and privacy-preserving alternatives. Context-augmenting techniques show mixed results, with RAG improving performance for known categories but struggling with categories not contained in the knowledge base. Open-source models excel at structured legal tasks, although challenges remain in handling ambiguous legal language and unstructured scenarios. These findings underscore the viability of open-source models for GDPR compliance, while highlighting the need for fine-tuning and improved context augmentation to address complex use cases.

Keywords: large language model; GDPR documentation; natural language processing (search for similar items in EconPapers)
JEL-codes: O3 (search for similar items in EconPapers)
Date: 2024
References: View complete reference list from CitEc
Citations:

Downloads: (external link)
https://www.mdpi.com/1999-5903/16/12/459/pdf (application/pdf)
https://www.mdpi.com/1999-5903/16/12/459/ (text/html)

Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.

Export reference: BibTeX RIS (EndNote, ProCite, RefMan) HTML/Text

Persistent link: https://EconPapers.repec.org/RePEc:gam:jftint:v:16:y:2024:i:12:p:459-:d:1537503

Access Statistics for this article

Future Internet is currently edited by Ms. Grace You

More articles in Future Internet from MDPI
Bibliographic data for series maintained by MDPI Indexing Manager ().