The textcat Package for n-Gram Based Text Categorization in R

Kurt Hornik, Patrick Mair, Johannes Rauch, Wilhelm Geiger, Christian Buchta and Ingo Feinerer

Journal of Statistical Software, 2013, vol. 52, issue 6

Abstract: Identifying the language used will typically be the first step in most natural language processing tasks. Among the wide variety of language identification methods discussed in the literature, the ones employing the Cavnar and Trenkle (1994) approach to text categorization based on character n-gram frequencies have been particularly successful. This paper presents the R extension package textcat for n-gram based text categorization, which implements both the Cavnar and Trenkle approach and a reduced n-gram approach designed to remove redundancies of the original approach. A multi-lingual corpus obtained from the Wikipedia pages available on a selection of topics is used to illustrate the functionality of the package and the performance of the provided language identification methods.
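
Illustration (not part of the abstract): the character n-gram idea mentioned above can be sketched in a few lines. The Python code below is a simplified sketch of rank-based n-gram profiling with an "out-of-place" distance in the spirit of Cavnar and Trenkle (1994). It is not the textcat package's implementation or API. The function names, the underscore padding, the fixed penalty for missing n-grams, and the toy training sentences are all assumptions made for this example.

```python
from collections import Counter


def ngram_profile(text, max_n=5, top_k=300):
    """Rank the character n-grams (n = 1..max_n) of a text by frequency.

    Returns the top_k n-grams, most frequent first. Tokens are padded with
    "_" to mark word boundaries (a simplification chosen for this sketch).
    """
    counts = Counter()
    for token in text.lower().split():
        padded = f"_{token}_"
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count; break ties alphabetically so the
    # ranking is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, cat_profile, missing_penalty=300):
    """Sum of rank differences between two profiles.

    An n-gram absent from the category profile costs missing_penalty
    (an assumed constant, here equal to the default profile size).
    """
    rank = {gram: r for r, gram in enumerate(cat_profile)}
    return sum(
        abs(r - rank[gram]) if gram in rank else missing_penalty
        for r, gram in enumerate(doc_profile)
    )


def identify(text, category_profiles):
    """Return the category whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(
        category_profiles,
        key=lambda name: out_of_place(doc, category_profiles[name]),
    )


# Toy training texts (hypothetical; real profiles need far more data).
samples = {
    "english": "the dog and the cat sit on the mat and sleep all day long",
    "german": "der hund und die katze sitzen auf der matte und schlafen den ganzen tag",
}
profiles = {name: ngram_profile(text) for name, text in samples.items()}

print(identify("the cat and the dog", profiles))
```

With profiles this small the result is only indicative. In practice, profiles are built from much larger training texts per language.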

Date: 2013-02-07

Downloads: (external link)
https://www.jstatsoft.org/index.php/jss/article/view/v052i06/v52i06.pdf
https://www.jstatsoft.org/index.php/jss/article/do ... textcat_1.0-0.tar.gz
https://www.jstatsoft.org/index.php/jss/article/do ... ile/v052i06/v52i06.R
https://www.jstatsoft.org/index.php/jss/article/do ... 052i06/simTCWiki.rda

Persistent link: https://EconPapers.repec.org/RePEc:jss:jstsof:v:052:i06

DOI: 10.18637/jss.v052.i06

Journal of Statistical Software is currently edited by Bettina Grün, Edzer Pebesma and Achim Zeileis

More articles in Journal of Statistical Software from Foundation for Open Access Statistics
Bibliographic data for series maintained by Christopher F. Baum.

 
Page updated 2025-03-19
Handle: RePEc:jss:jstsof:v:052:i06