Structured information extraction from scientific text with large language models
John Dagdelen,
Alexander Dunn,
Sanghoon Lee,
Nicholas Walker,
Andrew S. Rosen,
Gerbrand Ceder,
Kristin A. Persson and
Anubhav Jain
Additional contact information
John Dagdelen: Lawrence Berkeley National Laboratory
Alexander Dunn: Lawrence Berkeley National Laboratory
Sanghoon Lee: Lawrence Berkeley National Laboratory
Nicholas Walker: Lawrence Berkeley National Laboratory
Andrew S. Rosen: Lawrence Berkeley National Laboratory
Gerbrand Ceder: Lawrence Berkeley National Laboratory
Kristin A. Persson: Lawrence Berkeley National Laboratory
Anubhav Jain: Lawrence Berkeley National Laboratory
Nature Communications, 2024, vol. 15, issue 1, 1-14
Abstract:
Extracting structured knowledge from scientific text remains a challenging task for machine learning models. Here, we present a simple approach to joint named entity recognition and relation extraction and demonstrate how pretrained large language models (GPT-3, Llama-2) can be fine-tuned to extract useful records of complex scientific knowledge. We test three representative tasks in materials chemistry: linking dopants and host materials, cataloging metal-organic frameworks, and general composition/phase/morphology/application information extraction. Records are extracted from single sentences or entire paragraphs, and the output can be returned as simple English sentences or a more structured format such as a list of JSON objects. This approach represents a simple, accessible, and highly flexible route to obtaining large databases of structured specialized scientific knowledge extracted from research papers.
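To make the "list of JSON objects" output format concrete, the following is a minimal sketch of how a downstream script might parse such a completion for the dopant and host linking task. The field names `host` and `dopants`, the `DopingRecord` class, the `parse_completion` helper, and the example completion string are all hypothetical illustrations; the abstract does not specify the authors' actual schema, and no model is called here.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class DopingRecord:
    """Hypothetical record linking one host material to its dopants."""
    host: str
    dopants: tuple


def parse_completion(completion: str) -> list:
    """Parse a completion that is expected to be a JSON list of objects.

    Returns an empty list if the text is not valid JSON or not a list,
    and skips items that do not match the assumed schema.
    """
    try:
        raw = json.loads(completion)
    except json.JSONDecodeError:
        return []
    if not isinstance(raw, list):
        return []

    records = []
    for item in raw:
        if not isinstance(item, dict):
            continue
        host = item.get("host")
        dopants = item.get("dopants", [])
        if isinstance(host, str) and isinstance(dopants, list):
            records.append(
                DopingRecord(host=host, dopants=tuple(str(d) for d in dopants))
            )
    return records


# Invented example of what a fine-tuned model might return (assumption).
example_completion = (
    '[{"host": "ZnO", "dopants": ["Al", "Ga"]},'
    ' {"host": "TiO2", "dopants": ["Nb"]}]'
)

records = parse_completion(example_completion)
for record in records:
    print(record.host, "<-", ", ".join(record.dopants))
```

Validating the completion before use matters in a setting like this because a generative model is not guaranteed to emit well-formed JSON; here malformed output simply yields no records rather than raising an error.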
Date: 2024
Downloads: https://www.nature.com/articles/s41467-024-45563-x (abstract, text/html)
Persistent link: https://EconPapers.repec.org/RePEc:nat:natcom:v:15:y:2024:i:1:d:10.1038_s41467-024-45563-x
Ordering information: This journal article can be ordered from
https://www.nature.com/ncomms/
DOI: 10.1038/s41467-024-45563-x
Nature Communications is currently edited by Nathalie Le Bot, Enda Bergin and Fiona Gillespie
Bibliographic data for series maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.