AI models collapse when trained on recursively generated data
Ilia Shumailov,
Zakhar Shumaylov,
Yiren Zhao,
Nicolas Papernot,
Ross Anderson and
Yarin Gal
Affiliations:
Ilia Shumailov: University of Oxford
Zakhar Shumaylov: University of Cambridge
Yiren Zhao: Imperial College London
Nicolas Papernot: University of Toronto
Ross Anderson: University of Cambridge
Yarin Gal: University of Oxford
Nature, 2024, vol. 631, issue 8022, 755-759
Abstract:
Stable diffusion revolutionized image creation from descriptive text. GPT-2 (ref. 1), GPT-3(.5) (ref. 2) and GPT-4 (ref. 3) demonstrated high performance across a variety of language tasks. ChatGPT introduced such language models to the public. It is now clear that generative artificial intelligence (AI) such as large language models (LLMs) is here to stay and will substantially change the ecosystem of online text and images. Here we consider what may happen to GPT-{n} once LLMs contribute much of the text found online. We find that indiscriminate use of model-generated content in training causes irreversible defects in the resulting models, in which tails of the original content distribution disappear. We refer to this effect as ‘model collapse’ and show that it can occur in LLMs as well as in variational autoencoders (VAEs) and Gaussian mixture models (GMMs). We build theoretical intuition behind the phenomenon and portray its ubiquity among all learned generative models. We demonstrate that it must be taken seriously if we are to sustain the benefits of training from large-scale data scraped from the web. Indeed, as LLM-generated content spreads through data crawled from the Internet, data collected about genuine human interactions with systems will become increasingly valuable.
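The collapse mechanism described in the abstract can be illustrated with a minimal toy experiment (not the paper's own code, and using a single Gaussian rather than the paper's GMM/VAE/LLM settings): repeatedly fit a Gaussian to samples drawn from the previous generation's fit. Finite-sample estimation error compounds across generations, and the fitted variance shrinks toward zero, so the distribution's tails disappear.

```python
import random
import statistics

def fit_gaussian(samples):
    # Maximum-likelihood estimates of mean and standard deviation
    # (pstdev divides by n, matching the MLE).
    return statistics.fmean(samples), statistics.pstdev(samples)

random.seed(0)
mu, sigma = 0.0, 1.0   # generation-0 "real data" distribution
n = 20                 # small sample size exaggerates the effect

for generation in range(200):
    # Each generation trains only on data sampled from the previous model.
    data = [random.gauss(mu, sigma) for _ in range(n)]
    mu, sigma = fit_gaussian(data)

# sigma has drifted far below its original value of 1.0:
# the estimated distribution has lost the tails of the original.
print(f"sigma after 200 generations: {sigma:.6f}")
```

With larger samples the shrinkage per generation is smaller but does not vanish: the log-variance performs a downward-drifting random walk, so collapse is delayed rather than prevented, which is why the paper calls the defect irreversible under indiscriminate recursive training.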
Date: 2024
Full text (HTML): https://www.nature.com/articles/s41586-024-07566-y (access restricted)
Persistent link: https://EconPapers.repec.org/RePEc:nat:nature:v:631:y:2024:i:8022:d:10.1038_s41586-024-07566-y
DOI: 10.1038/s41586-024-07566-y
Nature is currently edited by Magdalena Skipper
Bibliographic data for series maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.