Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It

Denny, Matthew J.; Spirling, Arthur

Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It

Matthew J. Denny and Arthur Spirling

Political Analysis, 2018, vol. 26, issue 2, 168-189

Abstract: Despite the popularity of unsupervised techniques for political science text-as-data research, the importance and implications of preprocessing decisions in this domain have received scant systematic attention. Yet, as we show, such decisions have profound effects on the results of real models for real data. We argue that substantive theory is typically too vague to be of use for feature selection, and that the supervised literature is not necessarily a helpful source of advice. To aid researchers working in unsupervised settings, we introduce a statistical procedure and software that examines the sensitivity of findings under alternate preprocessing regimes. This approach complements a researcher’s substantive understanding of a problem by providing a characterization of the variability changes in preprocessing choices may induce when analyzing a particular dataset. In making scholars aware of the degree to which their results are likely to be sensitive to their preprocessing decisions, it aids replication efforts.

Date: 2018
References: Add references at CitEc
Citations: View citations in EconPapers (43)

Downloads: (external link)
https://www.cambridge.org/core/product/identifier/ ... type/journal_article link to article abstract page (text/html)

Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.

Export reference: BibTeX RIS (EndNote, ProCite, RefMan) HTML/Text

Persistent link: https://EconPapers.repec.org/RePEc:cup:polals:v:26:y:2018:i:02:p:168-189_00

Access Statistics for this article

More articles in Political Analysis from Cambridge University Press Cambridge University Press, UPH, Shaftesbury Road, Cambridge CB2 8BS UK.
Bibliographic data for series maintained by Kirk Stebbing ().