Classification of Biodegradable Substances Using Balanced Random Trees and Boosted C5.0 Decision Trees
Alaa M. Elsayad,
Ahmed M. Nassef,
Mujahed Al-Dhaifallah and
Khaled A. Elsayad
Additional contact information
Alaa M. Elsayad: Department of Electrical Engineering, College of Engineering, Prince Sattam Bin Abdulaziz University, P.O. Box 54, Wadi Aldawaser 11991, Saudi Arabia
Ahmed M. Nassef: Department of Electrical Engineering, College of Engineering, Prince Sattam Bin Abdulaziz University, P.O. Box 54, Wadi Aldawaser 11991, Saudi Arabia
Mujahed Al-Dhaifallah: Systems Engineering Department, King Fahd University of Petroleum & Minerals, Dhahran 31261, Saudi Arabia
Khaled A. Elsayad: Pharmacy Department, Cairo University Hospitals, Cairo University, Cairo 11662, Egypt
IJERPH, 2020, vol. 17, issue 24, 1-20
Abstract:
Substances that do not degrade over time are harmful to the environment and dangerous to living organisms, so predicting the biodegradability of a substance without costly experiments is valuable. Quantitative structure–activity relationship (QSAR) models have recently offered effective solutions to this problem. However, molecular descriptor datasets often suffer from unbalanced class distributions, which adversely affect the efficiency and generalization of the derived models. Accordingly, this study evaluates balanced random trees (RTs) and boosted C5.0 decision trees (DTs) for building QSAR models that classify the ready biodegradability of substances, and assesses how well they handle unbalanced data. The balanced RTs algorithm builds individual trees from balanced bootstrap samples, while the boosted C5.0 DT is trained with cost-sensitive learning. We employed the two-dimensional molecular descriptor dataset that is publicly available through the University of California, Irvine (UCI) machine learning repository. The molecular descriptors were ranked according to their contributions to the balanced RTs classification process. The performance of the proposed models was compared with previously reported results. The experimental results showed that the proposed models outperform the reported classification results of the support vector machine (SVM), K-nearest neighbors (KNN), and discrimination analysis (DA). Classification performance was analyzed in terms of accuracy, sensitivity, specificity, precision, false positive rate, false negative rate, F1 score, receiver operating characteristic (ROC) curve, and area under the ROC curve (AUROC).
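As an illustration of two ideas named in the abstract, the following is a minimal sketch, not the authors' implementation. The helper names `balanced_bootstrap` and `binary_metrics` are hypothetical, and drawing each class down to the size of the smallest class is an assumption about how the balanced samples are formed. The metric formulas are the standard definitions computed from confusion-matrix counts.

```python
import random
from collections import defaultdict


def balanced_bootstrap(labels, rng):
    """Return indices of a bootstrap sample with equal counts per class.

    Assumption: every class is resampled with replacement to the size
    of the smallest class; the paper may balance its samples differently.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    per_class = min(len(indices) for indices in by_class.values())
    sample = []
    for indices in by_class.values():
        # Sampling with replacement within one class.
        sample.extend(rng.choices(indices, k=per_class))
    return sample


def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary classification measures from counts."""
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "false_positive_rate": 1 - specificity,
        "false_negative_rate": 1 - sensitivity,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }
```

Each tree in a balanced ensemble would be fitted on the rows selected by one call to `balanced_bootstrap`, so the minority class is not swamped by the majority class during training.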
Keywords: QSAR; biodegradable substances; machine learning; random trees; C5.0 decision tree; support vector machine; K-nearest neighbors; discrimination analysis
JEL-codes: I I1 I3 Q Q5
Date: 2020
Downloads:
https://www.mdpi.com/1660-4601/17/24/9322/pdf (application/pdf)
https://www.mdpi.com/1660-4601/17/24/9322/ (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:gam:jijerp:v:17:y:2020:i:24:p:9322-:d:461317
IJERPH is currently edited by Ms. Jenna Liu