Protein language models trained on multiple sequence alignments learn phylogenetic relationships

Umberto Lupo, Damiano Sgarbossa and Anne-Florence Bitbol
Additional contact information
Umberto Lupo: Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL)
Damiano Sgarbossa: Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL)
Anne-Florence Bitbol: Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL)

Nature Communications, 2022, vol. 13, issue 1, 1-11

Abstract: Self-supervised neural language models with attention have recently been applied to biological sequence data, advancing structure, function and mutational effect prediction. Some protein language models, including MSA Transformer and AlphaFold’s EvoFormer, take multiple sequence alignments (MSAs) of evolutionarily related proteins as inputs. Simple combinations of MSA Transformer’s row attentions have led to state-of-the-art unsupervised structural contact prediction. We demonstrate that similarly simple, and universal, combinations of MSA Transformer’s column attentions strongly correlate with Hamming distances between sequences in MSAs. Therefore, MSA-based language models encode detailed phylogenetic relationships. We further show that these models can separate coevolutionary signals encoding functional and structural constraints from phylogenetic correlations reflecting historical contingency. To assess this, we generate synthetic MSAs, either without or with phylogeny, from Potts models trained on natural MSAs. We find that unsupervised contact prediction is substantially more resilient to phylogenetic noise when using MSA Transformer versus inferred Potts models.
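
For readers unfamiliar with the quantity the abstract compares column attentions against, the following is a minimal sketch of pairwise Hamming distances between the rows of an MSA. It is an illustration only, not the authors' code: the function names, the toy alignment, the normalization by alignment length, and the treatment of the gap character `-` as an ordinary symbol are all assumptions, since the abstract does not specify these details.

```python
from itertools import combinations


def hamming_distance(a: str, b: str) -> float:
    """Fraction of aligned positions at which two sequences differ.

    Assumption: gaps ('-') are compared like any other symbol, and the
    count is normalized by the alignment length.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    if not a:
        return 0.0
    return sum(x != y for x, y in zip(a, b)) / len(a)


def pairwise_hamming(msa: list[str]) -> list[list[float]]:
    """Symmetric matrix of Hamming distances between all rows of an MSA."""
    n = len(msa)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = hamming_distance(msa[i], msa[j])
        dist[i][j] = dist[j][i] = d
    return dist


# Hypothetical toy alignment with three sequences of length six.
toy_msa = ["MKV-LA", "MKVELA", "MRV-LG"]
distances = pairwise_hamming(toy_msa)
```

A matrix of this kind, computed on a real alignment, is what would be compared entry by entry with a combination of column-attention values; how those attention values are extracted and combined is described in the article itself and is not reproduced here.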

Date: 2022

Downloads: (external link)
https://www.nature.com/articles/s41467-022-34032-y Abstract (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:nat:natcom:v:13:y:2022:i:1:d:10.1038_s41467-022-34032-y

Ordering information: This journal article can be ordered from
https://www.nature.com/ncomms/

DOI: 10.1038/s41467-022-34032-y

Nature Communications is currently edited by Nathalie Le Bot, Enda Bergin and Fiona Gillespie

Bibliographic data for series maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.

 
Page updated 2025-03-19
Handle: RePEc:nat:natcom:v:13:y:2022:i:1:d:10.1038_s41467-022-34032-y