Approaches to samples selection for machine learning based classification of textual data
František Dařena and Jan Zizka
Additional contact information
Jan Zizka: Department of Informatics, Faculty of Business and Economics, Mendel University in Brno
No 2011-11, MENDELU Working Papers in Business and Economics from Mendel University in Brno, Faculty of Business and Economics
Abstract:
The paper focuses on retrieving relevant documents written in a natural language when several candidate examples are available; these examples serve as the basis for automatically selecting only those items that are similar to the predefined patterns. The presented approach addresses problems related to processing user-created content in natural language, including poor control over the topic and structure of the content and, often, high computational complexity. Three methods of selecting the best samples from a large set of candidate samples are presented: random selection, manual selection, and a new approach called automatic biased sample selection. Measures based on Euclidean distance and cosine similarity are used for classification. The experiments are carried out with real-world data consisting of customer reviews downloaded from amazon.com, converted to different representations based on the bag-of-words procedure. The experiments with the presented approach provided satisfactory results, suggesting that it can serve as an alternative to manual selection and evaluation of textual samples.
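As a rough illustration of the measures named in the abstract, the following minimal sketch builds bag-of-words term-frequency vectors and compares a document against a handful of selected samples using cosine similarity and Euclidean distance. It is not the authors' implementation: the whitespace tokenizer, the use of a sample centroid, the `is_relevant` helper, and the threshold value of 0.3 are all assumptions made for the example.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Term-frequency vector from naive whitespace tokenization (an assumption)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Euclidean distance over the union of terms; missing terms count as 0."""
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in a.keys() | b.keys()))


def centroid(vectors):
    """Average term frequencies of the selected samples (hypothetical helper)."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return Counter({term: count / len(vectors) for term, count in total.items()})


def is_relevant(document, samples, threshold=0.3):
    """Label a document relevant if it is close enough to the sample centroid.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    center = centroid([bag_of_words(s) for s in samples])
    return cosine_similarity(bag_of_words(document), center) >= threshold


# Toy samples standing in for selected customer reviews.
samples = ["great battery life", "battery lasts long"]
```

With these toy samples, `is_relevant("the battery life is great", samples)` evaluates to `True`, while a review that shares no terms with the samples, such as `"slow shipping and rude seller"`, evaluates to `False`.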
Keywords: text classification; textual patterns; machine learning; natural language processing; text similarity
JEL-codes: C38 C89
Pages: 19
Date: 2011-07
Downloads:
http://ftp.mendelu.cz/RePEc/men/wpaper/11_2011.pdf Full text (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:men:wpaper:11_2011
Bibliographic data for series maintained by Luděk Kouba.