Poly(A)-DG: A deep-learning-based domain generalization method to identify cross-species Poly(A) signal without prior knowledge from target species
Yumin Zheng,
Haohan Wang,
Yang Zhang,
Xin Gao,
Eric P Xing and
Min Xu
PLOS Computational Biology, 2020, vol. 16, issue 11, 1-21
Abstract:
In eukaryotes, polyadenylation (poly(A)) is an essential process during mRNA maturation. Identifying the cis-determinants of the poly(A) signal (PAS) on the DNA sequence is key to understanding the mechanism of translation regulation and mRNA metabolism. Although machine learning methods have been widely used to identify PAS computationally, their need for large amounts of annotated data hinders their application to species that lack experimental PAS data. Cross-species PAS identification, which makes it possible to predict PAS in species absent from the training data, is therefore a promising direction. In this work, we propose a novel deep learning method named Poly(A)-DG for cross-species PAS identification. Poly(A)-DG consists of a Convolutional Neural Network-Multilayer Perceptron (CNN-MLP) network and a domain generalization technique. It learns PAS patterns from the training species and identifies PAS in target species without re-training. To test our method, we use four species, build cross-species training sets from two of them, and evaluate performance on the remaining ones. Moreover, we test our method under insufficient-data and imbalanced-data conditions and demonstrate that Poly(A)-DG not only outperforms state-of-the-art methods but also maintains relatively high accuracy when trained on a smaller or imbalanced training set.
Author summary: The key to understanding the mechanism of translation regulation and mRNA metabolism is to identify the cis-determinants of PAS on the DNA sequence. PAS leads to correct identification of poly(A) sites, which play an essential role in understanding human diseases. While many researchers have employed deep learning methods to improve the performance of PAS identification, an underlying problem is that PAS data collection is expensive and time-consuming, which makes it difficult to apply deep learning models to PAS identification across a broad range of species. We use domain generalization methods, inspired by their success in computer vision, to overcome the shortage of annotated PAS data. Empirical results suggest that our proposed model, Poly(A)-DG, can extract species-invariant features from multiple training species and be applied directly to the target species without fine-tuning. Furthermore, Poly(A)-DG is a promising practical tool for PAS identification because its performance remains stable on insufficient or species-imbalanced training data. We share the implementation of our proposed model on GitHub (https://github.com/Szym29/PolyADG).
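The evaluation protocol described in the abstract (train on two of four species, evaluate on the remaining ones) can be illustrated with a minimal sketch. The species names below are placeholders, since the abstract does not list them, and the function name `cross_species_splits` is hypothetical. Whether the authors evaluated every possible pairing is also an assumption here, not something the abstract states.

```python
from itertools import combinations


def cross_species_splits(species, n_train=2):
    """Yield (train, targets) tuples: train on n_train species, evaluate on the rest.

    Hypothetical helper for illustration; not taken from the Poly(A)-DG code base.
    """
    for train in combinations(species, n_train):
        # Every species not used for training becomes an unseen target species.
        targets = tuple(s for s in species if s not in train)
        yield train, targets


# Placeholder names: the abstract does not say which four species were used.
species = ["species_A", "species_B", "species_C", "species_D"]
splits = list(cross_species_splits(species))

for train, targets in splits:
    print("train on:", train, "| evaluate on:", targets)
```

With four species and two training species per split, this enumerates six train/target combinations, and no target species ever appears in its own training set.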
Date: 2020
Downloads: (external link)
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008297 (text/html)
https://journals.plos.org/ploscompbiol/article/fil ... 08297&type=printable (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:plo:pcbi00:1008297
DOI: 10.1371/journal.pcbi.1008297