On the (Mis)Use of Machine Learning with Panel Data
Augusto Cerqua, Marco Letta and Gabriele Pinto
Papers from arXiv.org
Abstract:
We provide the first systematic assessment of data leakage issues in the use of machine learning on panel data. Our organizing framework clarifies why neglecting the cross-sectional and longitudinal structure of these data leads to hard-to-detect data leakage, inflated out-of-sample performance, and an inadvertent overestimation of the real-world usefulness and applicability of machine learning models. We then offer empirical guidelines for practitioners to ensure the correct implementation of supervised machine learning in panel data environments. An empirical application, using data from over 3,000 U.S. counties spanning 2000-2019 and focused on income prediction, illustrates the practical relevance of these points across nearly 500 models for both classification and regression tasks.
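To illustrate the kind of leakage the abstract describes, here is a minimal sketch on a toy panel. The abstract does not spell out the paper's guidelines, so this only assumes that a temporal hold-out is one way to respect the longitudinal structure. The helper names (`make_panel`, `random_split`, `temporal_split`, `has_temporal_leakage`), the panel size and the 2014 cutoff are all hypothetical, chosen for illustration rather than taken from the paper.

```python
import random

def make_panel(n_units=5, years=range(2000, 2020)):
    """Hypothetical toy panel: one (unit, year) row per unit and year."""
    return [(u, y) for u in range(n_units) for y in years]

def random_split(rows, test_frac=0.25, seed=0):
    """Row-level shuffle that ignores the panel structure."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]

def temporal_split(rows, cutoff_year):
    """Train on years up to and including the cutoff, test on later years."""
    train = [r for r in rows if r[1] <= cutoff_year]
    test = [r for r in rows if r[1] > cutoff_year]
    return train, test

def has_temporal_leakage(train, test):
    """True if any training row is dated at or after the earliest test row."""
    first_test_year = min(year for _, year in test)
    return any(year >= first_test_year for _, year in train)

panel = make_panel()
rand_train, rand_test = random_split(panel)
time_train, time_test = temporal_split(panel, cutoff_year=2014)

print("random split leaks future data:", has_temporal_leakage(rand_train, rand_test))
print("temporal split leaks future data:", has_temporal_leakage(time_train, time_test))
```

With a row-level shuffle, the training set mixes in observations dated after the earliest test observation, so a model can effectively learn from the future. The temporal split rules this out by construction, though it addresses only the longitudinal dimension; leakage across cross-sectional units would need a separate check.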
Date: 2024-11, Revised 2025-05
New Economics Papers: this item is included in nep-big, nep-cmp, nep-ecm and nep-pke
Downloads: http://arxiv.org/pdf/2411.09218 Latest version (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:arx:papers:2411.09218