Text mining with n-gram variables

Matthias Schonlau, Nick Guenther and Ilia Sucholutsky
Additional contact information
Matthias Schonlau: University of Waterloo
Nick Guenther: University of Waterloo
Ilia Sucholutsky: University of Waterloo

Stata Journal, 2017, vol. 17, issue 4, 866-881

Abstract: Text mining is the process of turning free text into numerical variables and then analyzing them with statistical techniques. We introduce the command ngram, which implements the most common approach to text mining, the “bag of words”. An n-gram is a contiguous sequence of words in a text. Broadly speaking, ngram creates hundreds or thousands of variables, each recording how often the corresponding n-gram occurs in a given text. This is more useful than it sounds. We illustrate ngram with the categorization of text answers from two open-ended questions. Copyright 2016 by StataCorp LP.
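The bag-of-words idea described in the abstract can be illustrated with a minimal Python sketch. It is not the ngram command's implementation: the function names ngrams and bag_of_ngrams, the lowercase and whitespace tokenization, and the two example texts are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(text, n):
    """Return the contiguous word n-grams of `text` as space-joined strings."""
    # Assumption: lowercase the text and split on whitespace; real tools
    # typically also handle punctuation, stemming, and stopwords.
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def bag_of_ngrams(texts, max_n=2):
    """Build one count column per n-gram (n = 1..max_n) across all texts."""
    # One Counter per text, holding how often each n-gram occurs in it.
    counts = [
        Counter(g for n in range(1, max_n + 1) for g in ngrams(t, n))
        for t in texts
    ]
    # The vocabulary is every n-gram seen in any text; each becomes a variable.
    vocabulary = sorted(set().union(*counts))
    # Each row records the n-gram counts for one text (0 if absent).
    rows = [[c[g] for g in vocabulary] for c in counts]
    return vocabulary, rows


answers = ["the cat sat", "the cat saw the dog"]  # made-up example texts
vocabulary, rows = bag_of_ngrams(answers)
for name, column in zip(vocabulary, zip(*rows)):
    print(f"{name!r}: {column}")
```

Even two short texts yield ten variables here, which suggests why real collections of open-ended answers produce hundreds or thousands of them.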

Keywords: ngram; bag of words; sets of words; unigram; gram; statistical learning; machine learning
Date: 2017
Note: to access software from within Stata, net describe http://www.stata-journal.com/software/sj17-4/st0502/

Downloads: (external link)
http://www.stata-journal.com/article.html?article=st0502 (link to article purchase)

Persistent link: https://EconPapers.repec.org/RePEc:tsj:stataj:v:17:y:2017:i:4:p:866-881

Ordering information: This journal article can be ordered from
http://www.stata-journal.com/subscription.html

Stata Journal is currently edited by Nicholas J. Cox and Stephen P. Jenkins

More articles in Stata Journal from StataCorp LLC
Bibliographic data for series maintained by Christopher F. Baum and Lisa Gilmore.

 
Handle: RePEc:tsj:stataj:v:17:y:2017:i:4:p:866-881