Coherent cross-modal generation of synthetic biomedical data to advance multimodal precision medicine
Raffaele Marchesi,
Nicolò Lazzaro,
Walter Endrizzi,
Gianluca Leonardi,
Matteo Pozzi,
Flavio Ragni,
Stefano Bovo,
Monica Moroni,
Venet Osmani and
Giuseppe Jurman
PLOS Computational Biology, 2026, vol. 22, issue 4, 1-23
Abstract:
Integration of multimodal, multi-omics data is critical for advancing precision medicine, yet its application is frequently limited by incomplete datasets where one or more modalities are missing. To address this challenge, we developed a generative framework capable of synthesizing any missing modality from an arbitrary subset of available modalities. We introduce Coherent Denoising, a novel ensemble-based generative diffusion method that aggregates predictions from multiple specialized, single-condition models and enforces consensus during the sampling process. We compare this approach against a multi-condition generative model that uses a flexible masking strategy to handle arbitrary subsets of inputs. The results show that our architectures successfully generate high-fidelity data that preserve the complex biological signals required for downstream tasks. We demonstrate that the generated synthetic data can be used to maintain the performance of predictive models on incomplete patient profiles and can leverage counterfactual analysis to guide the prioritization of diagnostic tests. We validated the framework’s efficacy on a large-scale multimodal, multi-omics cohort of over 10,000 samples from The Cancer Genome Atlas (TCGA), spanning 20 tumor types and using data modalities such as copy-number alterations (CNA), transcriptomics (RNA-Seq), proteomics (RPPA), and histopathology (WSI). This work establishes a robust and flexible generative framework to address sparsity in multimodal datasets, providing a key step toward improving precision oncology.
Author summary:
To make precision medicine a reality, doctors need to understand a patient’s status from many angles, using different data types like genetic information (omics) and tissue slide images (histopathology). The problem is that most patient records are incomplete, with one or more of these data types missing, which can limit the effectiveness of powerful predictive tools.
We have built a generative AI system designed to learn the complex biological patterns that connect all these different data types. By looking at the patient data that is available, our system can then generate a realistic, synthetic version of any missing piece. We developed a novel method called Coherent Denoising to do this, which is flexible and helps protect patient privacy. We validated this approach on a large dataset of over 10,000 cancer patient profiles. We show that our AI-generated data is high-fidelity and can successfully complete these sparse patient profiles, allowing AI models for crucial tasks like cancer staging and survival prediction to work at their best even with incomplete patient data. We also demonstrate how this tool can be used to evaluate the potential impact of new tests, helping to prioritize which expensive diagnostic tests would be most beneficial for a patient.
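The abstract describes Coherent Denoising as aggregating predictions from several single-condition diffusion models and enforcing consensus during sampling, but it does not state the aggregation rule. The following is a minimal sketch of that general idea, not the authors' implementation. It assumes the consensus step is a plain average of the per-model noise estimates, uses a deterministic DDIM-style update with a linear noise schedule, and replaces trained models with a hypothetical toy denoiser (`make_toy_denoiser`). All function names and parameters here are illustrative assumptions.

```python
import numpy as np


def make_toy_denoiser(target):
    """Hypothetical stand-in for a trained single-condition model.

    Returns a function that predicts the noise contained in x_t,
    assuming the clean sample implied by its condition is `target`.
    """
    def predict_noise(x_t, alpha_bar_t):
        return (x_t - np.sqrt(alpha_bar_t) * target) / np.sqrt(1.0 - alpha_bar_t)
    return predict_noise


def consensus_sample(denoisers, dim, n_steps=50, seed=0):
    """Reverse diffusion in which every step uses an ensemble noise estimate.

    Assumption: consensus is the unweighted mean of the noise predictions;
    the paper may use a different rule.
    """
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, n_steps)   # linear noise schedule
    alpha_bars = np.cumprod(1.0 - betas)       # cumulative signal fraction

    x = rng.standard_normal(dim)               # start from pure noise
    for t in range(n_steps - 1, -1, -1):
        # Each single-condition model proposes its own noise estimate.
        eps_all = np.stack([f(x, alpha_bars[t]) for f in denoisers])
        eps = eps_all.mean(axis=0)             # assumed consensus rule

        # Estimate the clean sample implied by the consensus noise.
        x0_hat = (x - np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alpha_bars[t])

        if t > 0:
            # Deterministic DDIM-style step to the previous noise level.
            x = (np.sqrt(alpha_bars[t - 1]) * x0_hat
                 + np.sqrt(1.0 - alpha_bars[t - 1]) * eps)
        else:
            x = x0_hat
    return x


if __name__ == "__main__":
    # Two toy "modalities" that disagree about the missing vector.
    from_rna = make_toy_denoiser(np.array([1.0, 2.0, 3.0]))
    from_cna = make_toy_denoiser(np.array([3.0, 2.0, 1.0]))
    print(consensus_sample([from_rna, from_cna], dim=3))
```

In this toy setting the sample converges to approximately the mean of the two targets, which shows how averaging at every denoising step pulls the trajectory toward a point all conditioning models agree on. Real single-condition models would be neural networks conditioned on an available modality, and the averaging could be replaced by a weighted or learned combination.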
Date: 2026
Downloads: (external link)
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1013455 (text/html)
https://journals.plos.org/ploscompbiol/article/fil ... 13455&type=printable (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:plo:pcbi00:1013455
DOI: 10.1371/journal.pcbi.1013455