Semi-supervised Bayesian integration of multiple spatial proteomics datasets

Stephen Coleman, Lisa Breckels, Ross F Waller, Kathryn S Lilley, Chris Wallace, Oliver M Crook and Paul D W Kirk

PLOS Computational Biology, 2025, vol. 21, issue 12, 1-26

Abstract: The subcellular localisation of proteins is a key determinant of their function. High-throughput analyses of these localisations can be performed using mass spectrometry-based spatial proteomics, which enables us to examine the localisation and relocalisation of proteins. Furthermore, complementary data sources can provide additional sources of functional or localisation information. Examples include protein annotations and other high-throughput ‘omic assays. Integrating these modalities can provide new insights as well as additional confidence in results, but existing approaches for integrative analyses of spatial proteomics datasets, such as concatenation-based methods and transfer learning approaches like KNN-TL, are limited in the types of data they can integrate and do not quantify uncertainty in their predictions. Here we propose a semi-supervised Bayesian approach (wherein model parameters are inferred from both labeled marker proteins and unlabeled data while quantifying prediction uncertainty) to integrate spatial proteomics datasets with other data sources, to improve the inference of protein subcellular localisation. We demonstrate that our approach outperforms other transfer-learning methods and has greater flexibility in the data it can model, including categorical annotations (e.g., Gene Ontology terms), continuous measurements (e.g., protein abundance), and temporal profiles (e.g., time-series expression data). To demonstrate the flexibility of our approach, we apply our method to integrate spatial proteomics data generated for the parasite Toxoplasma gondii with time-series gene expression data generated over its cell cycle. Our findings suggest that proteins linked to invasion organelles are associated with expression programs that peak at the end of the first cell cycle. Furthermore, this integrative analysis divides the dense granule proteins into heterogeneous populations suggestive of potentially different functions.
Our method is disseminated via the mdir R package available on the lead author’s GitHub.

Author summary: Proteins are located in subcellular environments to ensure that they are near their interaction partners and occur in the correct biochemical environment to function. Where a protein is located can be determined from a number of data sources. To integrate diverse datasets, we develop an integrative Bayesian model that combines the information from several datasets in a principled manner. We learn how similar the datasets are as part of the modelling process and demonstrate the benefits of integrating mass spectrometry-based spatial proteomics data with time-course gene expression datasets.

Date: 2025

Downloads: (external link)
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1013799 (text/html)
https://journals.plos.org/ploscompbiol/article/fil ... 13799&type=printable (application/pdf)

Persistent link: https://EconPapers.repec.org/RePEc:plo:pcbi00:1013799

DOI: 10.1371/journal.pcbi.1013799

More articles in PLOS Computational Biology from Public Library of Science

 
Page updated 2025-12-21
Handle: RePEc:plo:pcbi00:1013799