Unsupervised word embeddings capture latent knowledge from materials science literature

Vahe Tshitoyan, John Dagdelen, Leigh Weston, Alexander Dunn, Ziqin Rong, Olga Kononova, Kristin A. Persson, Gerbrand Ceder and Anubhav Jain
Additional contact information
Vahe Tshitoyan: Lawrence Berkeley National Laboratory
John Dagdelen: Lawrence Berkeley National Laboratory
Leigh Weston: Lawrence Berkeley National Laboratory
Alexander Dunn: Lawrence Berkeley National Laboratory
Ziqin Rong: Lawrence Berkeley National Laboratory
Olga Kononova: University of California
Kristin A. Persson: Lawrence Berkeley National Laboratory
Gerbrand Ceder: Lawrence Berkeley National Laboratory
Anubhav Jain: Lawrence Berkeley National Laboratory

Nature, 2019, vol. 571, issue 7763, 95-98

Abstract: The overwhelming majority of scientific knowledge is published as text, which is difficult to analyse by either traditional statistical analysis or modern machine learning methods. By contrast, the main source of machine-interpretable data for the materials research community has come from structured property databases[1,2], which encompass only a small fraction of the knowledge present in the research literature. Beyond property values, publications contain valuable knowledge regarding the connections and relationships between data items as interpreted by the authors. To improve the identification and use of this knowledge, several studies have focused on the retrieval of information from scientific literature using supervised natural language processing[3–10], which requires large hand-labelled datasets for training. Here we show that materials science knowledge present in the published literature can be efficiently encoded as information-dense word embeddings[11–13] (vector representations of words) without human labelling or supervision. Without any explicit insertion of chemical knowledge, these embeddings capture complex materials science concepts such as the underlying structure of the periodic table and structure–property relationships in materials. Furthermore, we demonstrate that an unsupervised method can recommend materials for functional applications several years before their discovery. This suggests that latent knowledge regarding future discoveries is to a large extent embedded in past publications. Our findings highlight the possibility of extracting knowledge and relationships from the massive body of scientific literature in a collective manner, and point towards a generalized approach to the mining of scientific literature.
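The abstract describes recommending candidate materials for an application by similarity in an embedding space. A minimal sketch of that ranking step, assuming hand-made 3-d vectors as stand-ins for the real trained embeddings (the material names, dimensions, and values here are illustrative, not the authors' model):

```python
import numpy as np

# Toy stand-in embeddings: each word (application keyword or material formula)
# maps to a dense vector. Real embeddings would come from unsupervised training
# on the literature; these 3-d vectors are hand-made for illustration only.
embeddings = {
    "thermoelectric": np.array([0.9, 0.1, 0.0]),
    "Bi2Te3":         np.array([0.8, 0.2, 0.1]),  # well-known thermoelectric
    "SnSe":           np.array([0.7, 0.3, 0.0]),  # plausible candidate
    "NaCl":           np.array([0.0, 0.1, 0.9]),  # unrelated material
}

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def rank_candidates(query, candidates):
    """Rank candidate words by cosine similarity to the query word's vector."""
    q = embeddings[query]
    return sorted(candidates, key=lambda w: cosine(embeddings[w], q), reverse=True)

# Materials closest to the application keyword come first.
print(rank_candidates("thermoelectric", ["NaCl", "SnSe", "Bi2Te3"]))
# → ['Bi2Te3', 'SnSe', 'NaCl']
```

In the paper's setting the same cosine-similarity ranking, applied to embeddings trained on historical abstracts, surfaced materials years before they were studied for the target application.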

Date: 2019
Citations: View citations in EconPapers (29)

Downloads: (external link)
https://www.nature.com/articles/s41586-019-1335-8 Abstract (text/html)
Access to the full text of the articles in this series is restricted.


Persistent link: https://EconPapers.repec.org/RePEc:nat:nature:v:571:y:2019:i:7763:d:10.1038_s41586-019-1335-8

Ordering information: This journal article can be ordered from
https://www.nature.com/

DOI: 10.1038/s41586-019-1335-8

Nature is currently edited by Magdalena Skipper

More articles in Nature from Nature
Bibliographic data for series maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.

Page updated 2025-03-19
Handle: RePEc:nat:nature:v:571:y:2019:i:7763:d:10.1038_s41586-019-1335-8