
A tm Plug-In for Distributed Text Mining in R

Stefan Theußl, Ingo Feinerer and Kurt Hornik

Journal of Statistical Software, 2012, vol. 51, issue 5

Abstract: R has gained explicit text mining support with the tm package, which enables statisticians to answer many interesting research questions via statistical analysis or modeling of (text) corpora. However, we typically face two challenges when analyzing large corpora: (1) the amount of data that can be processed on a single machine is usually limited by the available main memory (i.e., RAM), and (2) the more data there is to analyze, the greater the need for efficient procedures for calculating valuable results. Fortunately, adequate programming models like MapReduce facilitate parallelization of text mining tasks and allow for processing data sets beyond what would fit into memory by using a distributed file system possibly spanning several machines, e.g., in a cluster of workstations. In this paper we present a plug-in package for tm called tm.plugin.dc, which implements a distributed corpus class that can take advantage of the Hadoop MapReduce library for large-scale text mining tasks. We show on the basis of an application in culturomics that we can efficiently handle data sets of significant size.
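To make the MapReduce idea in the abstract concrete, the following is a minimal single-process sketch in Python of a term-frequency count written as a map step and a reduce step. It is not the tm.plugin.dc API and does not use Hadoop; the function names (`map_document`, `reduce_counts`, `term_frequencies`) and the toy corpus are assumptions made purely for illustration.

```python
from collections import Counter
from functools import reduce


def map_document(text):
    """Map step (hypothetical helper): count the terms of a single document."""
    return Counter(text.lower().split())


def reduce_counts(left, right):
    """Reduce step (hypothetical helper): merge two partial term counts."""
    return left + right


def term_frequencies(chunks):
    """Combine per-document counts lazily, one document at a time.

    Each chunk stands in for a block of a distributed file system.
    Because the partial counts come from a generator, the full set of
    per-document tables is never held in memory at once.
    """
    partials = (map_document(doc) for chunk in chunks for doc in chunk)
    return reduce(reduce_counts, partials, Counter())


# Toy corpus split into two chunks (illustrative data only).
chunks = [
    ["text mining in R", "distributed text mining"],
    ["mining large corpora"],
]
totals = term_frequencies(chunks)
```

In a real MapReduce deployment the map calls would run in parallel on the machines that store each chunk, and the framework would route the partial counts to the reducers; this sketch keeps only the shape of the computation.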

Date: 2012-11-13

Downloads: (external link)
https://www.jstatsoft.org/index.php/jss/article/view/v051i05/v51i05.pdf
https://www.jstatsoft.org/index.php/jss/article/do ... ugin.dc_0.2-4.tar.gz
https://www.jstatsoft.org/index.php/jss/article/do ... ile/v051i05/v51i05.R
https://www.jstatsoft.org/index.php/jss/article/do ... /v51i05-data.tar.bz2


Persistent link: https://EconPapers.repec.org/RePEc:jss:jstsof:v:051:i05

DOI: 10.18637/jss.v051.i05


Journal of Statistical Software is currently edited by Bettina Grün, Edzer Pebesma and Achim Zeileis

More articles in Journal of Statistical Software from Foundation for Open Access Statistics
Bibliographic data for series maintained by Christopher F. Baum.

 
Handle: RePEc:jss:jstsof:v:051:i05