Text mining with n-gram variables
Matthias Schonlau,
Nick Guenther and
Ilia Sucholutsky
Additional contact information
Matthias Schonlau: University of Waterloo
Nick Guenther: University of Waterloo
Ilia Sucholutsky: University of Waterloo
Stata Journal, 2017, vol. 17, issue 4, 866-881
Abstract:
Text mining is the process of turning free text into numerical variables and then analyzing them with statistical techniques. We introduce the command ngram, which implements the most common approach to text mining, the “bag of words”. An n-gram is a contiguous sequence of words in a text. Broadly speaking, ngram creates hundreds or thousands of variables, each recording how often the corresponding n-gram occurs in a given text. This is more useful than it sounds. We illustrate ngram with the categorization of text answers from two open-ended questions. Copyright 2016 by StataCorp LP.
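The counting idea described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of the ngram command: the function names `ngrams` and `ngram_counts` are hypothetical, text is lowercased and split on whitespace, and no stemming, stopword removal, or frequency threshold is applied.

```python
from collections import Counter


def ngrams(text, n):
    """Return the contiguous word n-grams of a text (hypothetical helper)."""
    # Assumption: lowercase the text and split on whitespace.
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_counts(texts, max_n=2):
    """Build one count variable per n-gram, for n = 1..max_n."""
    per_text = []
    for text in texts:
        counts = Counter()
        for n in range(1, max_n + 1):
            counts.update(ngrams(text, n))
        per_text.append(counts)
    # One column per distinct n-gram seen in any text, in sorted order.
    vocabulary = sorted(set().union(*per_text))
    # A Counter returns 0 for n-grams that do not occur in a text.
    rows = [[counts[gram] for gram in vocabulary] for counts in per_text]
    return vocabulary, rows


texts = ["the cat sat", "the cat saw the cat"]
vocabulary, rows = ngram_counts(texts, max_n=2)
for text, row in zip(texts, rows):
    print(text, dict(zip(vocabulary, row)))
```

Each row holds one text and each column one n-gram, so even two short texts already yield eight variables. With real survey answers the number of columns grows quickly, which is why the abstract speaks of hundreds or thousands of variables.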
Keywords: ngram; bag of words; sets of words; unigram; gram; statistical learning; machine learning
Date: 2017
Note: To access the software from within Stata, type net describe http://www.stata-journal.com/software/sj17-4/st0502/
Citations: 7 (EconPapers)
Downloads:
http://www.stata-journal.com/article.html?article=st0502 (link to article purchase)
Persistent link: https://EconPapers.repec.org/RePEc:tsj:stataj:v:17:y:2017:i:4:p:866-881
Ordering information: This journal article can be ordered from
http://www.stata-journal.com/subscription.html
Stata Journal is currently edited by Nicholas J. Cox and Stephen P. Jenkins
More articles in Stata Journal from StataCorp LLC
Bibliographic data for series maintained by Christopher F. Baum and Lisa Gilmore.