Statistical keyword detection in literary corpora
J. P. Herrera and P. A. Pury
The European Physical Journal B: Condensed Matter and Complex Systems, 2008, vol. 63, issue 1, 135-146
Abstract:
Understanding the complexity of human language requires an appropriate analysis of the statistical distribution of words in texts. We consider the information retrieval problem of detecting and ranking the relevant words of a text by means of statistical information referring to the spatial use of the words. Shannon's entropy of information is used as a tool for automatic keyword extraction. By using The Origin of Species by Charles Darwin as a representative text sample, we show the performance of our detector and compare it with other proposals in the literature. The randomly shuffled text receives special attention as a tool for calibrating the ranking indices. Copyright EDP Sciences/Società Italiana di Fisica/Springer-Verlag 2008
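The general idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' detector: the equal-size partitioning, the normalization by log(parts), and the score (shuffled-text entropy minus observed entropy) are assumptions chosen for illustration, and the names `tokenize`, `word_entropies`, and `rank_keywords` are hypothetical. A word concentrated in a few regions of the text has low entropy across the partition, while the same word in a shuffled copy of the text is spread almost evenly, so the gap between the two serves as a rough relevance score.

```python
import math
import random
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_entropies(tokens, parts):
    """Normalized Shannon entropy of each word's occurrences across `parts` slices."""
    if parts < 2 or not tokens:
        raise ValueError("need at least two parts and a non-empty token list")
    size = math.ceil(len(tokens) / parts)  # tokens per slice (assumed equal-size slices)
    per_word = defaultdict(Counter)
    for i, tok in enumerate(tokens):
        per_word[tok][i // size] += 1
    entropies = {}
    for word, counts in per_word.items():
        total = sum(counts.values())
        h = -sum((c / total) * math.log(c / total) for c in counts.values())
        entropies[word] = h / math.log(parts)  # 0: single slice, 1: evenly spread
    return entropies


def rank_keywords(tokens, parts=8, shuffles=20, min_count=5, seed=0):
    """Rank words by how much lower their entropy is than in shuffled copies.

    Assumption for illustration: the calibration baseline is the mean entropy
    over `shuffles` random permutations of the same tokens.
    """
    rng = random.Random(seed)
    freq = Counter(tokens)
    actual = word_entropies(tokens, parts)
    baseline = defaultdict(float)
    shuffled = list(tokens)
    for _ in range(shuffles):
        rng.shuffle(shuffled)
        for word, h in word_entropies(shuffled, parts).items():
            baseline[word] += h / shuffles
    scores = {w: baseline[w] - actual[w] for w in freq if freq[w] >= min_count}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Calling `rank_keywords(tokenize(text))` returns (word, score) pairs with the most clustered words first. Subtracting the shuffled baseline is one simple way to keep rare words, which have low entropy by chance alone, from dominating the ranking.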
Keywords: 89.70.+c Information theory and communication theory; 05.45.Tp Time series analysis; 89.75.-k Complex systems
Date: 2008
Citations: 7 (in EconPapers)
Downloads:
http://hdl.handle.net/10.1140/epjb/e2008-00206-x (text/html)
Access to full text is restricted to subscribers.
Persistent link: https://EconPapers.repec.org/RePEc:spr:eurphb:v:63:y:2008:i:1:p:135-146
Ordering information: This journal article can be ordered from
http://www.springer.com/economics/journal/10051
DOI: 10.1140/epjb/e2008-00206-x
The European Physical Journal B: Condensed Matter and Complex Systems is currently edited by P. Hänggi and Angel Rubio