Active learning in automated text classification: a case study exploring bias in predicted model performance metrics
Arun Varghese,
Tao Hong,
Chelsea Hunter,
George Agyeman-Badu and
Michelle Cawley
Additional contact information
Arun Varghese: ICF
Tao Hong: ICF
Chelsea Hunter: ICF
George Agyeman-Badu: ICF
Michelle Cawley: University of North Carolina
Environment Systems and Decisions, 2019, vol. 39, issue 3, 269-280
Abstract:
Machine learning has emerged as a cost-effective innovation to support systematic literature reviews in human health risk assessments and other contexts. Supervised machine learning approaches rely on a training dataset, a relatively small set of documents with human-annotated labels indicating their topic, to build models that automatically classify a larger set of unclassified documents. "Active" machine learning has been proposed as an approach that limits the cost of creating a training dataset by interactively and sequentially focusing training on only the most informative documents. We simulate active learning using a dataset of approximately 7000 abstracts from the scientific literature related to the chemical arsenic. The dataset was previously annotated by subject matter experts for relevance to two topics related to toxicology and risk assessment. We examine the performance of alternative sampling approaches to sequentially expanding the training dataset, specifically uncertainty-based sampling and probability-based sampling. We find that while such active learning methods can reduce training dataset size compared to random sampling, predictions of model performance in active learning are likely to suffer from statistical bias that negates the method's potential benefits. We discuss approaches for compensating for the bias resulting from skewed sampling, and the extent to which such compensation is possible. We propose a useful role for active learning in contexts in which the accuracy of model performance metrics is not critical and/or where it is beneficial to rapidly create a class-balanced training dataset.
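The uncertainty-based sampling strategy described in the abstract can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: in pool-based active learning with a binary classifier, the documents whose predicted relevance probability lies closest to 0.5 are the ones the model is least certain about, and these are prioritized for human annotation in the next round. The function names below are assumptions introduced for illustration.

```python
# Illustrative sketch of uncertainty-based sampling for active learning.
# NOT the authors' code; scorer output and helper names are hypothetical.

def uncertainty(prob):
    """Distance of a predicted relevance probability from 0.5:
    smaller values mean the classifier is less certain."""
    return abs(prob - 0.5)

def select_most_uncertain(pool_probs, k=1):
    """Return indices of the k pool documents the model is least
    certain about; these would be sent to annotators next."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: uncertainty(pool_probs[i]))
    return ranked[:k]

# Example: predicted relevance probabilities for 5 unlabeled abstracts.
probs = [0.95, 0.48, 0.10, 0.60, 0.99]
print(select_most_uncertain(probs, k=2))  # → [1, 3], the two nearest p = 0.5
```

Probability-based sampling, the alternative examined in the paper, would instead rank the pool by the predicted probability itself (e.g., highest predicted relevance first); both strategies skew the training set away from a random sample, which is the source of the bias in performance metrics the abstract describes.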
Keywords: Literature review; Systematic review; Automated document classification; Machine learning; Active learning; Natural language processing
Date: 2019
Citations: View citations in EconPapers (3)
Downloads:
http://link.springer.com/10.1007/s10669-019-09717-3 Abstract (text/html)
Access to the full text of the articles in this series is restricted.
Persistent link: https://EconPapers.repec.org/RePEc:spr:envsyd:v:39:y:2019:i:3:d:10.1007_s10669-019-09717-3
Ordering information: This journal article can be ordered from
https://www.springer.com/journal/10669
DOI: 10.1007/s10669-019-09717-3
Bibliographic data for series maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.