PreMLS: The undersampling technique based on ClusterCentroids to predict multiple lysine sites
Yun Zuo,
Xingze Fang,
Jiayong Wan,
Wenying He,
Xiangrong Liu,
Xiangxiang Zeng and
Zhaohong Deng
PLOS Computational Biology, 2024, vol. 20, issue 10, 1-21
Abstract:
After translation, a protein can undergo a specific modification process in which small chemical moieties are covalently attached to lysine residues. This substantially alters the protein’s fundamental physicochemical properties, and with them its 3D structure and activity, enabling it to modulate key physiological processes. This modulation includes inhibiting cancer cell growth, delaying ovarian aging, regulating metabolic diseases, and ameliorating depression. Consequently, the identification and comprehension of post-translational lysine modifications hold substantial value in biological research and drug development. Post-translational modifications (PTMs) at lysine (K) sites are among the most common protein modifications. However, research on K-PTMs has largely centered on identifying individual modification types, with a relative scarcity of data-balancing techniques. In this study, a classification system was developed for the prediction of multiple concurrent modifications at a single lysine residue. First, a well-established multi-label position-specific triad amino acid propensity algorithm was used for feature encoding. Next, a novel ClusterCentroids undersampling algorithm based on MiniBatchKmeans was introduced to eliminate redundant or similar majority-class samples, thereby mitigating class imbalance. Finally, a convolutional neural network architecture was constructed specifically for the analysis of biological sequences to predict multiple lysine modification sites. The resulting model, PreMLS, was evaluated through five-fold cross-validation and independent testing and was found to significantly outperform existing models such as iMul-kSite and predML-Site. The results presented here aid in prioritizing potential lysine modification sites, facilitating subsequent biological assays and advancing pharmaceutical research.
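The undersampling step can be illustrated with a minimal sketch. It is not the authors' implementation: it swaps in plain k-means written with the Python standard library instead of MiniBatchKmeans, it works on a toy single-label dataset, and the function names `kmeans` and `cluster_centroid_undersample`, together with all parameters, are hypothetical. The idea shown is that majority-class samples are clustered and replaced by their cluster centroids, so that groups of similar samples collapse into one representative each.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on lists of floats; returns k centroids.

    Stand-in for MiniBatchKmeans (assumption made for this sketch).
    """
    rng = random.Random(seed)
    # Initialise centroids from k distinct samples.
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign the point to its nearest centroid (squared Euclidean).
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def cluster_centroid_undersample(X, y, majority_label, n_keep, seed=0):
    """Replace the majority class with n_keep k-means centroids."""
    majority = [x for x, label in zip(X, y) if label == majority_label]
    minority = [(x, label) for x, label in zip(X, y) if label != majority_label]
    centroids = kmeans(majority, n_keep, seed=seed)
    X_res = [x for x, _ in minority] + centroids
    y_res = [label for _, label in minority] + [majority_label] * n_keep
    return X_res, y_res


# Toy data (made up for illustration): 90 majority and 10 minority samples.
rng = random.Random(1)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(90)]
X += [[rng.gauss(5, 1), rng.gauss(5, 1)] for _ in range(10)]
y = [0] * 90 + [1] * 10

X_res, y_res = cluster_centroid_undersample(X, y, majority_label=0, n_keep=10)
```

The retained majority samples here are synthetic centroids rather than original data points, which is what distinguishes this family of methods from random undersampling.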
To enhance accessibility, an open-access predictive script has been crafted for the multi-label predictive model developed in this study.
Author summary:
Proteins undergo a variety of post-translational modifications (PTMs) after synthesis, such as lysine modifications, which significantly influence their structure and function. These modifications of lysine are known to regulate physiological processes, including the inhibition of cancer cell growth, the delay of aging, the regulation of metabolic diseases, and the improvement of depressive disorders. Abnormal modifications are closely associated with the occurrence and progression of a multitude of diseases. Therefore, the identification and comprehension of these modifications are of paramount importance for biological research and drug development. A multitude of studies have focused on a single type of lysine modification, with prediction methods for multiple lysine modification sites being relatively scarce. In this research, a novel multi-label prediction model named PreMLS has been developed for the simultaneous identification of four lysine modifications: methylation, acetylation, crotonylation, and succinylation. The imbalance issue in the dataset was addressed utilizing the ClusterCentroids undersampling algorithm, following which a predictive model, PreMLS, was constructed using a CNN to forecast multiple lysine modification sites. Compared to existing models, this new approach has significantly enhanced the accuracy and reliability of the predictions.
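To make the multi-label framing concrete, the sketch below shows one common way to represent the four modifications as a binary label vector per lysine site. The ordering of `MODS` and the helper names `encode_labels` and `decode_labels` are assumptions made for illustration; the record does not say how the authors encode their targets.

```python
# Label order is an assumption for this sketch, not taken from the paper.
MODS = ("methylation", "acetylation", "crotonylation", "succinylation")


def encode_labels(observed):
    """Map a set of observed modifications to a 4-element 0/1 vector."""
    unknown = set(observed) - set(MODS)
    if unknown:
        raise ValueError(f"unknown modification(s): {sorted(unknown)}")
    return [1 if mod in observed else 0 for mod in MODS]


def decode_labels(vector):
    """Inverse of encode_labels: recover modification names from a vector."""
    return [mod for mod, flag in zip(MODS, vector) if flag]


# A hypothetical site carrying two modifications at once.
site = encode_labels({"acetylation", "succinylation"})
names = decode_labels(site)
```

With this representation a single site can carry any combination of the four labels, which is what separates multi-label prediction from training one binary classifier per modification type.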
Date: 2024
Downloads: (external link)
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1012544 (text/html)
https://journals.plos.org/ploscompbiol/article/fil ... 12544&type=printable (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:plo:pcbi00:1012544
DOI: 10.1371/journal.pcbi.1012544