From scraping to ethical sharing: Initial considerations for Virtuous Innovative Approaches and Data Use Collaboration in AI Training (VIADUCT)

Constantin, Jean; Dietrich, Yann; Langé, Marie; Monthubert, Bertrand

Jean Constantin, Yann Dietrich, Marie Langé and Bertrand Monthubert
Additional contact information
Jean Constantin: Inria Siège - Inria - Institut National de Recherche en Informatique et en Automatique
Yann Dietrich: Atos
Bertrand Monthubert: IMT - Institut de Mathématiques de Toulouse UMR5219 - INSA Toulouse - Institut National des Sciences Appliquées - Toulouse - INSA - Institut National des Sciences Appliquées - Comue de Toulouse - Communauté d'universités et établissements de Toulouse - UT2J - Université Toulouse - Jean Jaurès - Comue de Toulouse - Communauté d'universités et établissements de Toulouse - INUC - Institut national universitaire Champollion - Comue de Toulouse - Communauté d'universités et établissements de Toulouse - CNRS - Centre National de la Recherche Scientifique - TSE-R - Toulouse School of Economics - UT Capitole - Université Toulouse Capitole - Comue de Toulouse - Communauté d'universités et établissements de Toulouse - EHESS - École des hautes études en sciences sociales - CNRS - Centre National de la Recherche Scientifique - INRAE - Institut National de Recherche pour l’Agriculture, l’Alimentation et l’Environnement - EPE UT - Université de Toulouse - Comue de Toulouse - Communauté d'universités et établissements de Toulouse, Equipe BIOETHICS (CERPOP) - CERPOP - Centre d'Epidémiologie et de Recherche en santé des POPulations - INSERM - Institut National de la Santé et de la Recherche Médicale - EPE UT - Université de Toulouse - Comue de Toulouse - Communauté d'universités et établissements de Toulouse

Working Papers from HAL

Abstract: The rapid development of artificial intelligence (AI) relies on access to vast volumes of data throughout its lifecycle. The sourcing of this data has relied on legally and ethically contentious practices, particularly the indiscriminate scraping of publicly available and often copyrighted content. Popular datasets like CommonCrawl and LAION-5B contain copyrighted works and personal data used without explicit permission or compensation for data holders. This approach has triggered a global backlash, with over 50 lawsuits filed against AI developers and increasing technical barriers against scraper robots. Leaders in the AI industry now warn of "peak data", the point at which public human-generated content will be exhausted. This scarcity conflicts with AI's ever-growing appetite for high-quality expert data to support increasingly advanced applications. Data for AI is not uniform but spans multiple domains and governance regimes, which can evolve or overlap depending on context and jurisdiction. Each of these regimes (copyrighted content, personal data, trade secrets, government data, and open data) is constrained by distinct legal and technical restrictions. Copyrighted materials require permission from holders, yet enforcement of opt-out decisions remains inconsistent. Personal data is protected under the GDPR, which demands anonymisation and clear legal grounds for processing, while trade secret datasets are shielded by confidentiality agreements. Government data, though mandated to be open, often remains inaccessible due to sensitivity or infrastructure limitations. Open data, while legally permissive, suffers from fragmentation and underinvestment. These disparities create a fragmented landscape where data sharing is hindered by transaction costs, confidentiality requirements, and misaligned incentives.
Efforts to address these challenges have produced partial solutions. Opt-out mechanisms like ai.txt and TDMRep allow data holders to declare preferences but lack standardisation. Privacy-preserving techniques enable secure data processing but at high computational cost. Licensing agreements can bring legal clarity but are hindered by contractual complexity. Data attribution models, designed to compensate data holders, remain impractical at scale. No single solution suffices, highlighting the need for context-specific approaches that balance innovation with data holders' interests. Fostering ethical data sharing is not trivial and requires addressing multiple technical, economic and legal obstacles. The VIADUCT initiative proposes an experimental approach, engaging with data holders and AI developers to characterize constraints and explore innovative data sharing approaches.

Keywords: Data & AI; Data governance; Business model; Data sharing
Date: 2025-12
Note: View the original document on HAL open archive server: https://hal.science/hal-05581044v1

Published in Global Partnership on Artificial Intelligence. 2025

Downloads: (external link)
https://hal.science/hal-05581044v1/document (application/pdf)

Persistent link: https://EconPapers.repec.org/RePEc:hal:wpaper:hal-05581044

Bibliographic data for series maintained by CCSD.