Selecting text entries using a few positive samples and similarity ranking
Jan Žižka, Arnošt Svoboda and František Dařena
Additional contact information
Jan Žižka: Department of Informatics, Mendel University in Brno, Zemědělská 1, 613 00 Brno, Czech Republic
Arnošt Svoboda: Department of Applied Mathematics and Computer Science, Faculty of Economics and Administration, Masaryk University, Lipová 41a, 602 00 Brno, Czech Republic
Acta Universitatis Agriculturae et Silviculturae Mendelianae Brunensis, 2011, vol. 59, issue 4, 399-408
Abstract:
This research was inspired by the procedure used by human bibliographic searchers: given only a few 'positive' (relevant, interesting) textual examples from a single category, quickly and simply find, in an available collection of unlabeled documents, those most similar to the examples and therefore most likely to belong to the relevant topic defined by the user. The problem of separating relevant from irrelevant unlabeled text documents is solved here by using a small subset of relevant patterns that are labeled manually in advance. Unlabeled text items are compared with these labeled patterns and then ranked according to their degree of similarity to them. The most similar (relevant) items appear at the top of the ranking; entries further from the top are progressively less similar. The authors emphasize that this simple method, aimed at processing large volumes of text entries, gives results whose accuracy is suited to initial filtering: users avoid the demanding task of labeling the many training examples that a chosen classifier would require, and they still obtain the relevant items quickly. The ranked output can then be used in subsequent text-item processing, where the number of irrelevant items is already lower than at the beginning. Although this relatively simple automatic search is not error-free, because documents overlap, it can help in processing very large volumes of unstructured textual data in particular.
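The ranking step described in the abstract can be illustrated with a minimal sketch. The abstract does not state which text representation or similarity measure the authors used, so the bag-of-words vectors, the cosine similarity, and the choice to score each document by its best match against the positive samples are all assumptions made for illustration. The function names and the toy texts are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens (assumed representation)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity of two sparse term-count vectors (assumed measure)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(positive_samples, unlabeled):
    """Return (score, document) pairs, most similar first.

    Each unlabeled document is scored by its highest similarity to any
    of the manually labeled positive samples (an illustrative choice).
    """
    patterns = [bag_of_words(p) for p in positive_samples]
    scored = []
    for doc in unlabeled:
        vec = bag_of_words(doc)
        score = max((cosine(vec, p) for p in patterns), default=0.0)
        scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


# Hypothetical toy data: one positive sample, three unlabeled entries.
positives = ["the hotel room was clean and the staff was friendly"]
unlabeled = [
    "friendly staff and a clean room",
    "stock prices fell sharply today",
    "the hotel breakfast was good",
]

for score, doc in rank_by_similarity(positives, unlabeled):
    print(f"{score:.3f}  {doc}")
```

In this sketch the entry that shares no words with the positive sample receives a score of zero and falls to the bottom of the ranking. Only the few positive samples need manual labels; everything else is ordered automatically.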
Keywords: unlabeled text documents; one-class categorization; text similarity; ranking by similarity; pattern recognition; machine learning; natural language processing; non-semantic documents
Date: 2011
Downloads:
http://acta.mendelu.cz/doi/10.11118/actaun201159040399.html (text/html)
http://acta.mendelu.cz/doi/10.11118/actaun201159040399.pdf (application/pdf)
free of charge
Persistent link: https://EconPapers.repec.org/RePEc:mup:actaun:actaun_2011059040399
DOI: 10.11118/actaun201159040399
Acta Universitatis Agriculturae et Silviculturae Mendelianae Brunensis is currently edited by Markéta Havlásková
Publisher: Mendel University Press
Bibliographic data for series maintained by Ivo Andrle.