Impact of Data Collection on ML Models: Analyzing Differences of Biases Between Low-Versus High-Skilled Annotators
Johannes Schneider,
Daniel Eisenhardt,
Christian Utama and
Christian Meske
Additional contact information
Johannes Schneider: University of Liechtenstein
Daniel Eisenhardt: Ruhr-University Bochum
Christian Utama: Freie Universität Berlin
Christian Meske: Ruhr-University Bochum
A chapter in Solutions and Technologies for Responsible Digitalization, 2025, pp 65-80 from Springer
Abstract:
Labeled data is crucial for the success of machine learning-based artificial intelligence. However, companies often face a choice between collecting few annotations from high- or low-skilled annotators, who may exhibit different biases. This study investigates differences in biases between datasets labeled by these annotator groups and their impact on machine learning models. To this end, we created datasets annotated by high- and low-skilled annotators, measured the contained biases through entropy, and trained different machine learning models to examine bias inheritance effects. Our findings on text sentiment annotations show that both groups exhibit a considerable amount of bias in their annotations, although there is a significant difference in the error types commonly encountered. Models trained on biased annotations produce significantly different predictions, indicating bias propagation, and tend to make more extreme errors than humans. As partial mitigation, we propose and show the efficiency of a hybrid approach where data is labeled by low-skilled and high-skilled workers.
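As a rough illustration of how entropy can quantify disagreement in annotations, the sketch below computes the Shannon entropy of the labels several annotators assign to a single text. The abstract does not state the chapter's exact entropy formulation, so this is an assumption rather than the authors' method; the function name `label_entropy` and the example labels are hypothetical.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy (in bits) of the empirical label distribution.

    Assumption: entropy is taken over the labels given to one item.
    0 means all annotators agree; higher values mean more disagreement.
    """
    counts = Counter(labels)
    total = len(labels)
    # Sum -p * log2(p) over every label that actually occurs.
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical sentiment labels from four annotators for one text.
unanimous = ["positive", "positive", "positive", "positive"]
split = ["positive", "negative", "positive", "negative"]

print(label_entropy(unanimous))  # full agreement
print(label_entropy(split))      # even two-way split
```

Comparing such per-item values between a low-skilled and a high-skilled annotator pool would be one way to contrast the two groups, under the assumptions stated above.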
Keywords: Annotators; Machine learning models; Bias; Labeling
Date: 2025
Persistent link: https://EconPapers.repec.org/RePEc:spr:lnichp:978-3-031-80122-8_5
Ordering information: This item can be ordered from
http://www.springer.com/9783031801228
DOI: 10.1007/978-3-031-80122-8_5
More chapters in Lecture Notes in Information Systems and Organization from Springer
Bibliographic data for series maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.