Evaluation of Feature Selection Techniques for Breast Cancer Risk Prediction
Nahúm Cueto López,
María Teresa García-Ordás,
Facundo Vitelli-Storelli,
Pablo Fernández-Navarro,
Camilo Palazuelos and
Rocío Alaiz-Rodríguez
Additional contact information
Nahúm Cueto López: Department of Electrical, Systems and Automatic Engineering, Universidad de León, Campus de Vegazana s/n, 24071 León, Spain
María Teresa García-Ordás: Department of Electrical, Systems and Automatic Engineering, Universidad de León, Campus de Vegazana s/n, 24071 León, Spain
Facundo Vitelli-Storelli: Centro de Investigación Biomédica en Red (CIBER), Grupo Investigación Interacciones Gen-Ambiente y Salud (GIIGAS), Instituto de Biomedicina (IBIOMED), Universidad de León, 24071 León, Spain
Pablo Fernández-Navarro: Cancer and Environmental Epidemiology Unit, National Center for Epidemiology, Carlos III Institute of Health, 28903 Madrid, Spain
Camilo Palazuelos: Department of Mathematics, Statistics, and Computing, University of Cantabria-IDIVAL, 39005 Santander, Spain
Rocío Alaiz-Rodríguez: Department of Electrical, Systems and Automatic Engineering, Universidad de León, Campus de Vegazana s/n, 24071 León, Spain
IJERPH, 2021, vol. 18, issue 20, 1-28
Abstract:
Breast cancer is a major public health problem. This study evaluates several feature ranking techniques, combined with several machine learning classifiers, to identify factors relevant to the probability of developing breast cancer and to improve the performance of breast cancer risk prediction models in a healthy population. The dataset, with 919 cases and 946 controls, comes from the MCC-Spain study and includes only environmental and genetic features. Our aim is to determine which factors in the risk prediction model matter most for breast cancer prediction. Because conclusions drawn from a set of selected features are only meaningful if the selection is reproducible, quantifying the stability of feature selection methods is essential before trying to gain insight into the data. This paper assesses several feature selection algorithms in terms of the performance they yield for a set of predictive models. Their robustness is also quantified, analyzing both the similarity between the rankings produced by different methods and the stability of each method. The ranking provided by the SVM-RFE approach leads to the best performance in terms of the area under the ROC curve (AUC). The top 47 features ranked by this approach, fed to a Logistic Regression classifier, achieve AUC = 0.616, an improvement of 5.8% over the full feature set. Furthermore, the SVM-RFE ranking technique turned out to be highly stable (as did Random Forest), whereas Relief and the wrapper approaches are quite unstable. The study shows that stability and model performance should be studied together: Random Forest and SVM-RFE were the most stable algorithms, but SVM-RFE outperforms Random Forest in terms of model performance.
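As an illustration of what quantifying the stability of a ranking method can look like, the following is a minimal sketch, not the measure used in the paper (the abstract does not name one). It assumes stability is scored as the mean pairwise Jaccard index between the top-k feature subsets obtained from rankings on different resamplings of the data. The function names, the feature names, and the example rankings are all hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def top_k_stability(rankings, k):
    """Mean pairwise Jaccard index of the top-k subsets of several rankings.

    Each ranking is a list of feature names ordered from most to least
    relevant. Returns 1.0 when every ranking selects the same top-k set.
    Assumption: this is one common stability score, chosen for illustration.
    """
    if len(rankings) < 2:
        raise ValueError("need at least two rankings to compare")
    subsets = [set(r[:k]) for r in rankings]
    pairs = list(combinations(subsets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical rankings from three resamplings of the same dataset.
stable_runs = [
    ["age", "bmi", "snp1", "alcohol", "snp2"],
    ["bmi", "age", "snp1", "snp2", "alcohol"],
    ["age", "snp1", "bmi", "alcohol", "snp2"],
]
unstable_runs = [
    ["age", "bmi", "snp1", "alcohol", "snp2"],
    ["snp2", "alcohol", "age", "bmi", "snp1"],
    ["snp1", "snp2", "alcohol", "age", "bmi"],
]

stable_score = top_k_stability(stable_runs, k=3)
unstable_score = top_k_stability(unstable_runs, k=3)
print(stable_score, unstable_score)
```

A method whose top-k subset barely changes across resamplings scores close to 1, while one that reshuffles its top features scores much lower. The same function could also compare rankings from two different methods, which corresponds to the between-method similarity mentioned in the abstract.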
Keywords: breast cancer; risk prediction model; feature selection; stability
JEL-codes: I I1 I3 Q Q5
Date: 2021
Downloads:
https://www.mdpi.com/1660-4601/18/20/10670/pdf (application/pdf)
https://www.mdpi.com/1660-4601/18/20/10670/ (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:gam:jijerp:v:18:y:2021:i:20:p:10670-:d:654096
IJERPH is currently edited by Ms. Jenna Liu
Bibliographic data for series maintained by MDPI Indexing Manager.