FKSUDDAPre: A drug–disease association prediction framework based on F-TEST feature selection and AMDKSU resampling with interpretability analysis

Zuo, Yun; Zhang, Chenyi; Hua, Ge; Ning, Qiao; Liu, Xiangrong; Zeng, Xiangxiang; Deng, Zhaohong

PLOS Computational Biology, 2026, vol. 22, issue 2, 1-29

Abstract: In drug discovery and therapeutic research, the prediction of drug-disease associations (DDAs) holds significant scientific and clinical value. Drug molecules exert their effects by precisely identifying disease-related biological targets, systematically modulating the entire pharmacological process from absorption, distribution, and metabolism to final efficacy. Accurate prediction of drug-disease associations not only facilitates an in-depth understanding of the molecular mechanisms of drug action but also provides critical theoretical foundations for drug repositioning and personalized medicine. While traditional prediction methods based on in vitro experiments and clinical statistics yield reliable results, they suffer from inherent drawbacks such as long development cycles, substantial resource consumption, and low throughput. In contrast, emerging machine learning techniques offer a promising solution to these bottlenecks, enabling the intelligent and efficient discovery of potential drug–disease association networks and significantly improving drug development efficiency. However, existing machine learning methods still face significant challenges in practical applications: the complexity of feature construction raises the threshold for data processing; data sparsity constrains the depth of information mining; and the pervasive issue of sample imbalance poses a severe challenge to the model’s predictive accuracy and generalization performance. In this study, we developed an efficient and accurate framework for drug-disease association prediction named FKSUDDAPre. The model employs a multi-modal feature fusion strategy: on the one hand, it leverages an ensemble of Mol2vec and K-BERT to deeply capture the semantic features of drug molecular fingerprints; on the other hand, it integrates Medical Subject Headings (MeSH) with DeepWalk to effectively reduce the dimensionality of disease features while preserving their relational structure.
To address the class imbalance problem, we designed a resampling algorithm called AMDKSU, which combines clustering with an improved distance metric strategy, significantly enhancing the discriminative power of the sample set. For data processing, an F-test was employed for feature importance ranking, effectively reducing data dimensionality and improving model generalization. For the predictive architecture, FKSUDDAPre uses a novel ensemble framework composed of XGBoost, Decision Tree, Random Forest, and HyperFast. By employing a dynamic weight allocation strategy, this ensemble effectively harnesses the complementary strengths of these models to achieve significantly enhanced predictive performance. Rigorous validation demonstrated the system’s outstanding performance across multiple evaluation metrics, with an average AUC of 0.9725, improving the AUC by approximately 3.88% compared to the best-performing baseline model. In the prediction of Alzheimer’s disease and Parkinson’s disease, 80% and 60% of the top 10 candidate drugs recommended by FKSUDDAPre, respectively, have been confirmed in the literature, demonstrating the model’s good practical application potential. Furthermore, we conducted a LIME-based feature importance analysis on the model’s predictions, visualizing the correlations between features and the target variable to demonstrate the model’s interpretability. A cross-platform, user-friendly visualization tool has also been developed using the PyQt5 framework.

Author summary: Drug repurposing offers a cost-effective alternative to traditional drug discovery, yet accurately predicting which existing drugs can treat specific diseases remains computationally challenging. In this study, we present FKSUDDAPre, a novel framework designed to identify potential drug-disease associations with high precision.
Our approach is driven by three key innovations: first, the integration of pre-trained Large Language Models (specifically K-BERT) to capture deep semantic features of drug molecules; second, the development of the AMDKSU resampling algorithm, which effectively solves the critical issue of data imbalance to enhance model robustness; and third, the incorporation of HyperFast, a cutting-edge hypernetwork architecture, to boost classification performance. By combining these advanced components with a dynamic weighting strategy, FKSUDDAPre significantly outperforms existing baselines, achieving an average AUC of 0.9725. The framework’s practical utility was validated through case studies on Alzheimer’s and Parkinson’s diseases, where it successfully identified numerous literature-confirmed drug candidates. Furthermore, we prioritize transparency and usability by incorporating LIME-based interpretability analysis and providing a user-friendly visualization tool, making FKSUDDAPre a powerful resource for accelerating biomedical research.
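
The abstract mentions F-test feature ranking and a weighted ensemble without giving details, so the following is only a minimal sketch of those two ideas, not the authors' implementation. It assumes the F-test is the standard one-way ANOVA F-statistic computed per feature, and it assumes (purely for illustration) that ensemble weights are proportional to each model's validation score; the paper's actual dynamic weight allocation may differ. All function names and the toy data are hypothetical.

```python
from statistics import fmean


def anova_f_scores(X, y):
    """One-way ANOVA F-statistic of every feature column against the class labels."""
    classes = sorted(set(y))
    n, k = len(y), len(classes)
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        grand_mean = fmean(col)
        groups = [[v for v, label in zip(col, y) if label == c] for c in classes]
        # Between-class and within-class mean squares.
        between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups) / (k - 1)
        within = sum((v - fmean(g)) ** 2 for g in groups for v in g) / (n - k)
        if within == 0:
            scores.append(float("inf") if between > 0 else 0.0)
        else:
            scores.append(between / within)
    return scores


def top_k_features(X, y, k):
    """Indices of the k features with the largest F-statistic, best first."""
    scores = anova_f_scores(X, y)
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


def weighted_vote(probabilities, validation_scores):
    """Combine per-model positive-class probabilities.

    Assumption: weights are validation scores normalized to sum to 1.
    """
    total = sum(validation_scores)
    weights = [s / total for s in validation_scores]
    return sum(w * p for w, p in zip(weights, probabilities))


# Hypothetical toy data: feature 0 separates the classes, feature 1 does not.
X = [
    [1.0, 5.0, 2.0],
    [1.2, 3.0, 2.5],
    [0.9, 4.0, 1.5],
    [3.0, 5.0, 2.2],
    [3.1, 3.0, 2.8],
    [2.9, 4.0, 1.9],
]
y = [0, 0, 0, 1, 1, 1]

selected = top_k_features(X, y, 2)
combined = weighted_vote([0.8, 0.3], [0.9, 0.6])
print(selected, round(combined, 3))
```

On this toy data the ranking keeps features 0 and 2 and drops feature 1, whose class means are identical. In practice a library routine such as scikit-learn's `f_classif` computes the same per-feature statistic on real feature matrices.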

Date: 2026

Downloads: (external link)
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1013947 (text/html)
https://journals.plos.org/ploscompbiol/article/fil ... 13947&type=printable (application/pdf)

Persistent link: https://EconPapers.repec.org/RePEc:plo:pcbi00:1013947

DOI: 10.1371/journal.pcbi.1013947
