Maximizing the reusability of gene expression data by predicting missing metadata

Pei-Yau Lung, Dongrui Zhong, Xiaodong Pang, Yan Li and Jinfeng Zhang

PLOS Computational Biology, 2020, vol. 16, issue 11, 1-18

Abstract: Reusability is part of the FAIR data principles, which aim to make data Findable, Accessible, Interoperable, and Reusable. One of the current efforts to increase the reusability of public genomics data has focused on including quality metadata with the data. When necessary metadata are missing, most researchers will consider the data useless. In this study, we developed a framework to predict the missing metadata of gene expression datasets to maximize their reusability. We found that when using predicted data to conduct other analyses, it is not optimal to use all the predicted data. Instead, one should use only the subset of data that can be predicted accurately. We proposed a new metric called Proportion of Cases Accurately Predicted (PCAP), which is optimized in our specifically-designed machine learning pipeline. The new approach performed better than pipelines using commonly used metrics such as F1-score in terms of maximizing the reusability of data with missing values. We also found that different variables might need to be predicted using different machine learning methods and/or different data processing protocols. Using differential gene expression analysis as an example, we showed that when missing variables are accurately predicted, the corresponding gene expression data can be reliably used in downstream analyses.

Author summary: Large volumes of gene expression data are available in public databases such as Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA). They can be reanalyzed to solve previously infeasible biological problems. However, reanalysis studies using public genomics data have been hindered by the lack of metadata necessary for the analyses. This can be addressed by predicting the metadata from the gene expression data, so that the data can then be used in the desired reanalysis together with the predicted metadata. This represents a new approach to increase the reusability of public gene expression data.
Our study attempts to systematically investigate how this approach should be carried out. We found that one should not use all the gene expression data with predicted metadata for downstream analyses. While using all the gene expression data maximizes the sample size, expression profiles whose metadata are poorly predicted may affect the quality of the downstream analysis. One needs to strike a balance between the amount of data included in the downstream analysis and the accuracy of the predicted metadata. To address this problem, we designed a new metric called Proportion of Cases Accurately Predicted (PCAP), which is optimized in our specifically-designed machine learning pipeline. Using differential gene expression analysis as an example, we showed that when missing variables are accurately predicted, the corresponding gene expression data can be reliably used in downstream analyses.
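
The trade-off between the amount of data kept and the accuracy of the predicted metadata can be illustrated with a small sketch. The abstract does not give a formal definition of PCAP, so the code below is only one plausible reading, not the authors' implementation: it assumes each prediction comes with a confidence score, and it reports the largest proportion of cases that can be retained (keeping the most confident first) while the retained subset still meets a target accuracy. The function name `pcap_sketch`, the `target_accuracy` parameter, and the example values are all hypothetical.

```python
from typing import Optional, Sequence, Tuple


def pcap_sketch(
    confidences: Sequence[float],
    correct: Sequence[bool],
    target_accuracy: float = 0.95,
) -> Tuple[float, Optional[float]]:
    """Illustrative PCAP-style score (an assumption, not the paper's definition).

    Returns the largest proportion of cases that can be kept, taking the most
    confident predictions first, such that accuracy on the kept cases is at
    least `target_accuracy`, together with the confidence cutoff that yields
    that subset (None if no subset qualifies).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0, None

    # Rank cases from most to least confident.
    order = sorted(range(n), key=lambda i: confidences[i], reverse=True)

    best_k = 0
    hits = 0
    for k, idx in enumerate(order, start=1):
        hits += 1 if correct[idx] else 0
        # Only cut between distinct confidence values, so that the subset
        # corresponds to a real threshold even when scores are tied.
        at_boundary = k == n or confidences[order[k]] < confidences[idx]
        if at_boundary and hits / k >= target_accuracy:
            best_k = k

    cutoff = confidences[order[best_k - 1]] if best_k else None
    return best_k / n, cutoff


# Hypothetical validation-set predictions for one metadata variable.
example_confidences = [0.99, 0.95, 0.90, 0.80, 0.60, 0.55]
example_correct = [True, True, True, False, True, False]

proportion, cutoff = pcap_sketch(example_confidences, example_correct, 1.0)
print(proportion, cutoff)
```

In this reading, a cutoff chosen on labeled validation data would then be applied to samples with missing metadata, and only predictions above the cutoff would be passed to a downstream analysis such as differential expression.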

Date: 2020

Downloads: (external link)
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007450 (text/html)
https://journals.plos.org/ploscompbiol/article/fil ... 07450&type=printable (application/pdf)

Persistent link: https://EconPapers.repec.org/RePEc:plo:pcbi00:1007450

DOI: 10.1371/journal.pcbi.1007450

More articles in PLOS Computational Biology from Public Library of Science

 
Page updated 2025-03-19
Handle: RePEc:plo:pcbi00:1007450