EconPapers    
Economics at your fingertips  
 

Exploiting redundancy in large materials datasets for efficient machine learning with less data

Kangming Li, Daniel Persaud, Kamal Choudhary, Brian DeCost, Michael Greenwood and Jason Hattrick-Simpers ()
Additional contact information
Kangming Li: University of Toronto
Daniel Persaud: University of Toronto
Kamal Choudhary: National Institute of Standards and Technology
Brian DeCost: National Institute of Standards and Technology
Michael Greenwood: Natural Resources Canada
Jason Hattrick-Simpers: University of Toronto

Nature Communications, 2023, vol. 14, issue 1, 1-10

Abstract: Abstract Extensive efforts to gather materials data have largely overlooked potential data redundancy. In this study, we present evidence of a significant degree of redundancy across multiple large datasets for various material properties, by revealing that up to 95% of data can be safely removed from machine learning training with little impact on in-distribution prediction performance. The redundant data is related to over-represented material types and does not mitigate the severe performance degradation on out-of-distribution samples. In addition, we show that uncertainty-based active learning algorithms can construct much smaller but equally informative datasets. We discuss the effectiveness of informative data in improving prediction performance and robustness and provide insights into efficient data acquisition and machine learning training. This work challenges the “bigger is better” mentality and calls for attention to the information richness of materials data rather than a narrow emphasis on data volume.

Date: 2023
References: View references in EconPapers View complete reference list from CitEc
Citations:

Downloads: (external link)
https://www.nature.com/articles/s41467-023-42992-y Abstract (text/html)

Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.

Export reference: BibTeX RIS (EndNote, ProCite, RefMan) HTML/Text

Persistent link: https://EconPapers.repec.org/RePEc:nat:natcom:v:14:y:2023:i:1:d:10.1038_s41467-023-42992-y

Ordering information: This journal article can be ordered from
https://www.nature.com/ncomms/

DOI: 10.1038/s41467-023-42992-y

Access Statistics for this article

Nature Communications is currently edited by Nathalie Le Bot, Enda Bergin and Fiona Gillespie

More articles in Nature Communications from Nature
Bibliographic data for series maintained by Sonal Shukla () and Springer Nature Abstracting and Indexing ().

 
Page updated 2025-03-19
Handle: RePEc:nat:natcom:v:14:y:2023:i:1:d:10.1038_s41467-023-42992-y