
Synergistic effects between data corpora properties and machine learning performance in data pipelines

Roberto Bertolini and Stephen J. Finch

International Journal of Data Mining, Modelling and Management, 2022, vol. 14, issue 3, 217-233

Abstract: To analyse data, a computationally feasible pipeline must be developed for data modelling. Corpora properties affect the performance variability of machine learning (ML) techniques in pipelines; however, this has not been thoroughly investigated using simulation methodologies. A Monte Carlo study is used to compare differences in the area under the curve (AUC) metric for large-n-small-p corpora, examining: 1) the choice of ML algorithm; 2) the size of the training database; 3) measurement error; 4) class imbalance magnitude; and 5) missing data pattern. Our simulations are consistent with established results on the conditions under which these algorithms and corpora properties perform best, while providing insights into their synergistic effects. Measurement error negatively impacted pipeline performance across all corpora factors and ML algorithms. A larger training corpus ameliorated the decrease in predictive efficacy resulting from measurement error, class imbalance magnitudes, and missing data patterns. We discuss the implications of these findings for designing pipelines to enhance prediction performance.
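
The kind of Monte Carlo comparison the abstract describes can be illustrated with a minimal sketch. This is not the authors' simulation design: the data-generating model, the least-squares linear scorer standing in for an ML algorithm, and every name and parameter value (`simulate`, `mean_auc`, `noise_sd`, `intercept`, and so on) are assumptions chosen for illustration. It covers only three of the five factors (training size, measurement error, and class imbalance via the intercept) and omits missing data patterns.

```python
import numpy as np


def auc(scores, labels):
    """Rank-based AUC (Mann-Whitney U); assumes continuous scores without ties."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def simulate(rng, n, p, noise_sd, intercept):
    """Draw one corpus: labels depend on the true features, but only
    error-contaminated features are observed (hypothetical design)."""
    x_true = rng.normal(size=(n, p))
    logits = intercept + x_true @ np.ones(p)
    y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(int)
    # Additive Gaussian measurement error; noise_sd = 0 means no error.
    x_obs = x_true + rng.normal(scale=noise_sd, size=(n, p))
    return x_obs, y


def mean_auc(n_train, noise_sd, intercept=-2.0, p=5, n_test=2000, reps=50, seed=0):
    """Average test AUC over Monte Carlo replicates for one factor setting.
    A more negative intercept makes the positive class rarer."""
    rng = np.random.default_rng(seed)
    results = []
    for _ in range(reps):
        x_tr, y_tr = simulate(rng, n_train, p, noise_sd, intercept)
        x_te, y_te = simulate(rng, n_test, p, noise_sd, intercept)
        # Least-squares linear scorer as a simple stand-in for an ML algorithm.
        design = np.column_stack([np.ones(n_train), x_tr])
        coef, *_ = np.linalg.lstsq(design, y_tr, rcond=None)
        results.append(auc(x_te @ coef[1:], y_te))
    return float(np.mean(results))


if __name__ == "__main__":
    # Cross two factors: training size and measurement-error magnitude.
    for n_train in (30, 200, 2000):
        for noise_sd in (0.0, 1.0, 2.0):
            print(n_train, noise_sd, round(mean_auc(n_train, noise_sd), 3))
```

Crossing the factor levels in a grid, as in the final loop, is what lets interaction (synergistic) effects show up: the question is whether the AUC loss from measurement error differs between small and large training sets, not just whether each factor matters on its own.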

Keywords: data pipeline; interaction/synergistic effects; Monte Carlo simulation; machine learning; binary classification; area under the curve; AUC.
Date: 2022
Downloads: (external link)
http://www.inderscience.com/link.php?id=125261 (text/html)
Access to full text is restricted to subscribers.

Persistent link: https://EconPapers.repec.org/RePEc:ids:ijdmmm:v:14:y:2022:i:3:p:217-233

More articles in International Journal of Data Mining, Modelling and Management from Inderscience Enterprises Ltd
Bibliographic data for series maintained by Sarah Parker.

 
Page updated 2025-03-19
Handle: RePEc:ids:ijdmmm:v:14:y:2022:i:3:p:217-233