Variable Selection in Data Analysis: A Synthetic Data Toolkit

Mitra, Rohan; Ali, Eyad; Varam, Dara; Sulieman, Hana; Kamalov, Firuz

Variable Selection in Data Analysis: A Synthetic Data Toolkit

Rohan Mitra (), Eyad Ali, Dara Varam, Hana Sulieman () and Firuz Kamalov
Additional contact information
Rohan Mitra: Department of Computer Science and Engineering, College of Engineering, American University of Sharjah, Sharjah P.O. Box 26666, United Arab Emirates
Eyad Ali: Department of Computer Science and Engineering, College of Engineering, American University of Sharjah, Sharjah P.O. Box 26666, United Arab Emirates
Dara Varam: Department of Computer Science and Engineering, College of Engineering, American University of Sharjah, Sharjah P.O. Box 26666, United Arab Emirates
Hana Sulieman: Department of Computer Science and Engineering, College of Engineering, American University of Sharjah, Sharjah P.O. Box 26666, United Arab Emirates
Firuz Kamalov: Department of Electrical Engineering, Canadian University of Dubai, Dubai P.O. Box 117781, United Arab Emirates

Mathematics, 2024, vol. 12, issue 4, 1-29

Abstract: Variable (feature) selection plays an important role in data analysis and mathematical modeling. This paper aims to address the significant lack of formal evaluation benchmarks for feature selection algorithms (FSAs). To evaluate FSAs effectively, controlled environments are required, and the use of synthetic datasets offers significant advantages. We introduce a set of ten synthetically generated datasets with known relevance, redundancy, and irrelevance of features, derived from various mathematical, logical, and geometric sources. Additionally, eight FSAs are evaluated on these datasets based on their relevance and novelty. The paper first introduces the datasets and then provides a comprehensive experimental analysis of the performance of the selected FSAs on these datasets including testing the FSAs’ resilience on two types of induced data noise. The analysis has guided the grouping of the generated datasets into four groups of data complexity. Lastly, we provide public access to the generated datasets to facilitate bench-marking of new feature selection algorithms in the field via our Github repository. The contributions of this paper aim to foster the development of novel feature selection algorithms and advance their study.

Keywords: variable selection; data analysis; synthetic datasets; synthetic data generation; feature selection algorithms (search for similar items in EconPapers)
JEL-codes: C (search for similar items in EconPapers)
Date: 2024
References: View complete reference list from CitEc
Citations: View citations in EconPapers (1)

Downloads: (external link)
https://www.mdpi.com/2227-7390/12/4/570/pdf (application/pdf)
https://www.mdpi.com/2227-7390/12/4/570/ (text/html)

Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.

Export reference: BibTeX RIS (EndNote, ProCite, RefMan) HTML/Text

Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:12:y:2024:i:4:p:570-:d:1338492

Access Statistics for this article

Mathematics is currently edited by Ms. Emma He

More articles in Mathematics from MDPI
Bibliographic data for series maintained by MDPI Indexing Manager ().