Does cross-validation work in telling rankings apart?

Sziklai, Balázs R.; Baranyi, Máté; Héberger, Károly

Balázs R. Sziklai, Máté Baranyi and Károly Héberger
Additional contact information
Balázs R. Sziklai: HUN-REN Centre for Economic and Regional Studies
Máté Baranyi: Budapest University of Technology and Economics
Károly Héberger: HUN-REN Research Centre for Natural Sciences, Institute of Excellence, Hungarian Academy of Sciences

Central European Journal of Operations Research, 2025, vol. 33, issue 4, No 14, 1503-1528

Abstract: Although cross-validation (CV) is a standard technique in machine learning and data science, its efficacy remains largely unexplored in ranking environments. When evaluating the significance of differences, cross-validation is typically coupled with a statistical test, such as the Dietterich, Alpaydin, or Wilcoxon test. In this paper, we evaluate the power and false positive error rate of the Dietterich, Alpaydin, and Wilcoxon tests combined with cross-validation, each run with 5 to 10 folds, for a total of 18 variants. Our testing setup uses a ranking framework similar to the Sum of Ranking Differences (SRD) statistical procedure: we assume the existence of a reference ranking, and distances are measured in the $L_1$-norm. We test the methods under artificial scenarios as well as on real data borrowed from sports and chemistry. The choice of the optimal CV test method depends on preferences regarding the minimization of type I and type II errors, the size of the input, and anticipated patterns in the data. Among the investigated input sizes, the Wilcoxon method with eight folds proved to be the most effective, although its performance in type I situations is subpar. While the Dietterich and Alpaydin methods excel in type I situations, they perform poorly in type II scenarios. The inadequate performance of these tests raises questions about their efficacy outside of ranking environments too.
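The setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`ranks`, `srd`, `fold_srds`, `wilcoxon_exact`), the way folds are drawn (each fold leaves out roughly one k-th of the objects), and the example data are all assumptions made for illustration. The sketch computes an SRD-style $L_1$ distance between a method's ranking and a reference ranking, recomputes it once per fold, and compares two methods with an exact two-sided Wilcoxon signed-rank test on the paired fold-wise values.

```python
import random
from itertools import product


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            result[order[pos]] = avg
        i = j + 1
    return result


def srd(scores, reference):
    """L1 distance between the rankings induced by `scores` and `reference`."""
    return sum(abs(a - b) for a, b in zip(ranks(scores), ranks(reference)))


def fold_srds(scores, reference, k, seed=0):
    """Recompute the SRD k times, each time leaving out one fold of objects.

    Assumption: folds are a random partition of the objects; the same seed
    yields the same folds, so values from two methods are paired.
    """
    idx = list(range(len(reference)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    out = []
    for fold in folds:
        left_out = set(fold)
        keep = [i for i in idx if i not in left_out]
        out.append(srd([scores[i] for i in keep], [reference[i] for i in keep]))
    return out


def wilcoxon_exact(x, y):
    """Two-sided exact Wilcoxon signed-rank p-value.

    Zero differences are dropped; all 2**n sign patterns are enumerated,
    which is cheap for the 5 to 10 paired values produced by the folds.
    """
    d = [a - b for a, b in zip(x, y) if a != b]
    if not d:
        return 1.0
    r = ranks([abs(v) for v in d])
    observed = sum(ri for ri, v in zip(r, d) if v > 0)
    center = sum(r) / 2
    extreme = 0
    for signs in product((0, 1), repeat=len(d)):
        w = sum(ri for ri, s in zip(r, signs) if s)
        if abs(w - center) >= abs(observed - center) - 1e-12:
            extreme += 1
    return extreme / 2 ** len(d)


# Hypothetical data: 24 objects, a reference score, and two noisy methods.
reference = list(range(1, 25))
rng = random.Random(1)
method_a = [v + rng.gauss(0, 1) for v in reference]
method_b = [v + rng.gauss(0, 6) for v in reference]

srd_a = fold_srds(method_a, reference, k=8)
srd_b = fold_srds(method_b, reference, k=8)
print("p-value:", wilcoxon_exact(srd_a, srd_b))
```

A small p-value here would suggest that the two methods differ in how far their rankings lie from the reference. The Dietterich and Alpaydin tests named in the abstract are not sketched.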

Keywords: k-fold cross-validation; Rankings; Sum of ranking differences; Wilcoxon test; Alpaydin test; Leave-many-out; Multi-criteria decision-making
Date: 2025

Downloads: (external link)
http://link.springer.com/10.1007/s10100-024-00932-1 Abstract (text/html)
Access to the full text of the articles in this series is restricted.

Persistent link: https://EconPapers.repec.org/RePEc:spr:cejnor:v:33:y:2025:i:4:d:10.1007_s10100-024-00932-1

Ordering information: This journal article can be ordered from
http://www.springer. ... search/journal/10100

DOI: 10.1007/s10100-024-00932-1

Central European Journal of Operations Research is currently edited by Ulrike Leopold-Wildburger

More articles in Central European Journal of Operations Research from Springer, Slovak Society for Operations Research, Hungarian Operational Research Society, Czech Society for Operations Research, Österr. Gesellschaft für Operations Research (ÖGOR), Slovenian Society Informatika - Section for Operational Research, Croatian Operational Research Society