Selecting More Informative Training Sets with Fewer Observations
Aaron Kaufman
Political Analysis, 2024, vol. 32, issue 1, 133-139
Abstract:
A standard text-as-data workflow in the social sciences involves identifying a set of documents to be labeled, selecting a random sample of them to label using research assistants, training a supervised learner to label the remaining documents, and validating that model’s performance using standard accuracy metrics. The most resource-intensive component of this is the hand-labeling: carefully reading documents, training research assistants, and paying human coders to label documents in duplicate or more. We show that hand-coding an algorithmically selected rather than a simple-random sample can improve model performance above baseline by as much as 50%, or reduce hand-coding costs by up to two-thirds, in applications predicting (1) U.S. executive-order significance and (2) financial sentiment on social media. We accompany this manuscript with open-source software to implement these tools, which we hope can make supervised learning cheaper and more accessible to researchers.
Date: 2024
References: Add references at CitEc
Citations:
Downloads: (external link)
https://www.cambridge.org/core/product/identifier/ ... type/journal_article link to article abstract page (text/html)
Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.
Export reference: BibTeX
RIS (EndNote, ProCite, RefMan)
HTML/Text
Persistent link: https://EconPapers.repec.org/RePEc:cup:polals:v:32:y:2024:i:1:p:133-139_9
Access Statistics for this article
More articles in Political Analysis from Cambridge University Press Cambridge University Press, UPH, Shaftesbury Road, Cambridge CB2 8BS UK.
Bibliographic data for series maintained by Kirk Stebbing ().