Development and validation of a machine learning model for on-site prediction of coronary heart disease in high-risk adults using clinical data

Mo, Liwen; Lin, Hua; Li, Chengxuan; Yu, Lifei; Lu, Decheng

PLOS ONE, 2025, vol. 20, issue 11, 1-14

Abstract:

Background: The risk of coronary heart disease (CHD) over a specified period of years can be assessed using scores calculated by models such as the pooled cohort equations (PCEs) and the Framingham Risk Score. However, few studies have addressed quantitative on-site estimation of CHD risk, with score calculation serving as an auxiliary diagnostic aid. Researchers have introduced new technologies, such as machine learning, as effective CHD risk prediction models, but these models still need to be validated on real clinical data before their use is promoted in clinical settings.

Objective: This study aims to predict CHD risk in a high-risk population using only clinical data consisting of vital traits, lab measurements, diagnoses, medical device testing and medications. The prediction model can serve as an on-site quantitative indicator of CHD risk for potential patients before diagnosis by coronary arteriography.

Methods: This work is a retrospective study of a hospital-based cohort (The Second Affiliated Hospital of Guangxi Medical University) comprising 20,821 patients with CHD and 9,796 controls from 2017 to 2024. A two-layer machine learning model (TLML) was developed on the prediction results of a random forest and a gradient boosting decision tree to combine the merits of both models. The models were trained and validated with the clinical data in the cohort.

Results: The TLML achieved good accuracy (0.79, 95% CI 0.79–0.80), sensitivity (0.79, 95% CI 0.79–0.80) and specificity (0.79, 95% CI 0.79–0.79) for on-site CHD prediction. Compared with the PCEs (accuracy = 0.59, sensitivity = 0.58 and specificity = 0.60), the TLML shows markedly better on-site CHD prediction performance. Predictor importance analysis shows that age, diabetes, antihypertensive medications, total bilirubin, hypertension, obstructive sleep apnea-hypopnea syndrome, red cell count, hemoglobin, cystatin C, retinol-binding protein, gender and low-density lipoprotein cholesterol level are the most important variables for on-site CHD prediction. Previous studies have reported some relationship between each of these features and CHD. To increase practicality, a reduced-complexity model using only 20 predictors is also presented; it achieves a prediction accuracy of 0.73.

Conclusions: The machine learning models presented in this study have the potential to become auxiliary on-site diagnostic tools for CHD because of their capability for accurate prediction and the easy availability of all the required prediction variables.
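
The accuracy, sensitivity and specificity quoted in the abstract are proportions derived from a confusion matrix, each with a 95% confidence interval. The following is a minimal sketch of that arithmetic, not the authors' code. The abstract does not say how the intervals were obtained, so the Wilson score interval used here is an assumption, and the function names and the counts in the example are hypothetical and not taken from the study.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half, centre + half


def classification_metrics(tp, fn, tn, fp):
    """Return {metric: (point estimate, (ci_low, ci_high))} from confusion counts."""
    n = tp + fn + tn + fp
    return {
        # Share of all cases classified correctly.
        "accuracy": ((tp + tn) / n, wilson_interval(tp + tn, n)),
        # Share of true CHD cases that the model flags.
        "sensitivity": (tp / (tp + fn), wilson_interval(tp, tp + fn)),
        # Share of controls that the model clears.
        "specificity": (tn / (tn + fp), wilson_interval(tn, tn + fp)),
    }


# Hypothetical counts chosen for illustration only; NOT the study's data.
metrics = classification_metrics(tp=790, fn=210, tn=395, fp=105)
for name, (value, (low, high)) in metrics.items():
    print(f"{name}: {value:.2f} (95% CI {low:.2f}-{high:.2f})")
```

With counts on the scale of the study cohort the intervals become much narrower, which is consistent with the tight ranges reported in the Results.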

Date: 2025

Downloads: (external link)
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0334881 (text/html)
https://journals.plos.org/plosone/article/file?id= ... 34881&type=printable (application/pdf)

Persistent link: https://EconPapers.repec.org/RePEc:plo:pone00:0334881

DOI: 10.1371/journal.pone.0334881

More articles in PLOS ONE from Public Library of Science