From Principal Component to Direct Coupling Analysis of Coevolution in Proteins: Low-Eigenvalue Modes are Needed for Structure Prediction

Cocco, Simona; Monasson, Remi; Weigt, Martin

From Principal Component to Direct Coupling Analysis of Coevolution in Proteins: Low-Eigenvalue Modes are Needed for Structure Prediction

Simona Cocco, Remi Monasson and Martin Weigt

PLOS Computational Biology, 2013, vol. 9, issue 8, 1-17

Abstract: Various approaches have explored the covariation of residues in multiple-sequence alignments of homologous proteins to extract functional and structural information. Among those are principal component analysis (PCA), which identifies the most correlated groups of residues, and direct coupling analysis (DCA), a global inference method based on the maximum entropy principle, which aims at predicting residue-residue contacts. In this paper, inspired by the statistical physics of disordered systems, we introduce the Hopfield-Potts model to naturally interpolate between these two approaches. The Hopfield-Potts model allows us to identify relevant ‘patterns’ of residues from the knowledge of the eigenmodes and eigenvalues of the residue-residue correlation matrix. We show how the computation of such statistical patterns makes it possible to accurately predict residue-residue contacts with a much smaller number of parameters than DCA. This dimensional reduction allows us to avoid overfitting and to extract contact information from multiple-sequence alignments of reduced size. In addition, we show that low-eigenvalue correlation modes, discarded by PCA, are important to recover structural information: the corresponding patterns are highly localized, that is, they are concentrated in few sites, which we find to be in close contact in the three-dimensional protein fold.Author Summary: Extracting functional and structural information about protein families from the covariation of residues in multiple sequence alignments is an important challenge in computational biology. Here we propose a statistical-physics inspired framework to analyze those covariations, which naturally unifies existing methods in the literature. Our approach allows us to identify statistically relevant ‘patterns’ of residues, specific to a protein family. We show that many patterns correspond to a small number of sites on the protein sequence, in close contact on the 3D fold. Hence, those patterns allow us to make accurate predictions about the contact map from sequence data only. Further more, we show that the dimensional reduction, which is achieved by considering only the statistically most significant patterns, avoids overfitting in small sequence alignments, and improves our capacity of extracting residue contacts in this case.

Date: 2013
References: View references in EconPapers View complete reference list from CitEc
Citations: View citations in EconPapers (5)

Downloads: (external link)
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003176 (text/html)
https://journals.plos.org/ploscompbiol/article?id= ... 03176&type=printable (application/pdf)

Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.

Export reference: BibTeX RIS (EndNote, ProCite, RefMan) HTML/Text

Persistent link: https://EconPapers.repec.org/RePEc:plo:pcbi00:1003176

DOI: 10.1371/journal.pcbi.1003176

Access Statistics for this article

More articles in PLOS Computational Biology from Public Library of Science
Bibliographic data for series maintained by ploscompbiol ().