Simple and efficient classification scheme based on specific vocabulary
Jacques Savoy () and
Olena Zubaryeva ()
Computational Management Science, 2012, vol. 9, issue 3, 415 pages
Abstract:
Assuming a binomial distribution for word occurrence, we propose computing a standardized Z score to define the specific vocabulary of a subset compared to that of the entire corpus. This approach is applied to weight terms (character n-gram, word, stem, lemma or sequence of them) which characterize a document. We then show how these Z score values can be used to derive a simple and efficient categorization scheme. To evaluate this proposition and demonstrate its effectiveness, we develop two experiments. First, the system must categorize speeches given by B. Obama as being either electoral or presidential speech. In a second experiment, sentences are extracted from these speeches and then categorized under the headings electoral or presidential. Based on these evaluations, the proposed classification scheme tends to perform better than a support vector machine model for both experiments, on the one hand, and on the other, shows a better performance level than a Naïve Bayes classifier on the first test and a slightly lower performance on the second (10-fold cross validation). Copyright Springer-Verlag 2012
Keywords: Statistics in lexical analysis; Corpus linguistics; Text categorization; Machine learning; Natural language processing (NLP) (search for similar items in EconPapers)
Date: 2012
References: View references in EconPapers View complete reference list from CitEc
Citations:
Downloads: (external link)
http://hdl.handle.net/10.1007/s10287-012-0149-z (text/html)
Access to full text is restricted to subscribers.
Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.
Export reference: BibTeX
RIS (EndNote, ProCite, RefMan)
HTML/Text
Persistent link: https://EconPapers.repec.org/RePEc:spr:comgts:v:9:y:2012:i:3:p:401-415
Ordering information: This journal article can be ordered from
http://www.springer. ... ch/journal/10287/PS2
DOI: 10.1007/s10287-012-0149-z
Access Statistics for this article
Computational Management Science is currently edited by Ruediger Schultz
More articles in Computational Management Science from Springer
Bibliographic data for series maintained by Sonal Shukla () and Springer Nature Abstracting and Indexing ().