
Regularized target encoding outperforms traditional methods in supervised machine learning with high cardinality features

Florian Pargent, Florian Pfisterer, Janek Thomas and Bernd Bischl
Additional contact information
Florian Pargent: Psychological Methods and Assessment, LMU Munich
Florian Pfisterer: Statistical Learning and Data Science, LMU Munich
Janek Thomas: Statistical Learning and Data Science, LMU Munich
Bernd Bischl: Statistical Learning and Data Science, LMU Munich

Computational Statistics, 2022, vol. 37, issue 5, No 21, 2692 pages

Abstract: Since most machine learning (ML) algorithms are designed for numerical inputs, efficiently encoding categorical variables is a crucial aspect of data analysis. A common problem is high-cardinality features, i.e. unordered categorical predictor variables with a large number of levels. We study techniques that yield numeric representations of categorical variables, which can then be used in subsequent ML applications. We focus on the impact of these techniques on a subsequent algorithm's predictive performance and, where possible, derive best practices on when to use which technique. We conducted a large-scale benchmark experiment comparing different encoding strategies together with five ML algorithms (lasso, random forest, gradient boosting, k-nearest neighbors, support vector machine) on datasets from regression, binary-classification, and multiclass-classification settings. In our study, regularized versions of target encoding (i.e. using target predictions based on the feature levels in the training set as a new numerical feature) consistently provided the best results. Traditionally widespread encodings that make unreasonable assumptions to map levels to integers (e.g. integer encoding) or to reduce the number of levels (possibly based on target information, e.g. leaf encoding) before creating binary indicator variables (one-hot or dummy encoding) were less effective in comparison.
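The core idea behind regularized target encoding can be sketched in a few lines: each level of a categorical feature is replaced by the mean of the target over rows carrying that level, computed out-of-fold (so a row never contributes to its own encoding) and shrunk toward the global mean so that rare levels are not over-fitted. The sketch below is a minimal illustration of this general technique, not the authors' exact implementation; the function name, the `smoothing` parameter, and the simple modulo fold assignment are assumptions made for the example.

```python
from collections import defaultdict

def target_encode_oof(categories, targets, n_folds=5, smoothing=10.0):
    """Out-of-fold target encoding with mean shrinkage.

    Each row's encoding is computed from the *other* folds only
    (to avoid target leakage) and shrunk toward the out-of-fold
    global mean (to regularize rare levels).
    """
    n = len(categories)
    encoded = [0.0] * n
    for fold in range(n_folds):
        # simple deterministic fold assignment by row index
        in_fold = [i for i in range(n) if i % n_folds == fold]
        out_fold = [i for i in range(n) if i % n_folds != fold]
        sums, counts = defaultdict(float), defaultdict(int)
        for i in out_fold:
            sums[categories[i]] += targets[i]
            counts[categories[i]] += 1
        oof_mean = sum(targets[i] for i in out_fold) / len(out_fold)
        for i in in_fold:
            c, s = counts[categories[i]], sums[categories[i]]
            # shrink the level mean toward the global mean;
            # unseen levels (c == 0) fall back to the global mean
            encoded[i] = (s + smoothing * oof_mean) / (c + smoothing)
    return encoded
```

Larger `smoothing` pulls all levels harder toward the global mean; in the limit, every level is encoded identically, which is the regularization trade-off the benchmark evaluates.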

Keywords: Supervised machine learning; Benchmark; High-cardinality categorical features; Target encoding; Dummy encoding; Generalized linear mixed models
Date: 2022
Citations: (2)

Downloads:
http://link.springer.com/10.1007/s00180-022-01207-6 (abstract, text/html)
Access to the full text of the articles in this series is restricted.



Persistent link: https://EconPapers.repec.org/RePEc:spr:compst:v:37:y:2022:i:5:d:10.1007_s00180-022-01207-6

Ordering information: This journal article can be ordered from
http://www.springer.com/statistics/journal/180/PS2

DOI: 10.1007/s00180-022-01207-6


Computational Statistics is currently edited by Wataru Sakamoto, Ricardo Cao and Jürgen Symanzik

More articles in Computational Statistics from Springer
Bibliographic data for series maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.

 
Page updated 2025-03-20
Handle: RePEc:spr:compst:v:37:y:2022:i:5:d:10.1007_s00180-022-01207-6