Approaches to samples selection for machine learning based classification of textual data

František Dařena and Jan Zizka
Additional contact information
Jan Zizka: Department of Informatics, Faculty of Business and Economics, Mendel University in Brno

No 2011-11, MENDELU Working Papers in Business and Economics from Mendel University in Brno, Faculty of Business and Economics

Abstract: The paper focuses on retrieving relevant documents written in a natural language, given several candidate examples that serve as the basis for automatically selecting only those items that are similar to these predefined patterns. The presented approach has to cope with problems typical of processing user-created content in natural language, including poor control over the topic and structure of the content and, often, high computational complexity. Three methods of selecting the best samples from a large set of candidate samples are presented: random selection, manual selection, and a new approach called automatic biased sample selection. Measures based on Euclidean distance and cosine similarity are used for classification. The experiments are carried out with real-world data consisting of customer reviews downloaded from amazon.com and converted to different representations based on the bag-of-words procedure. The experiments yielded satisfactory results, suggesting that the presented approach may offer an alternative to manual selection and evaluation of textual samples.
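
To make the similarity measures named in the abstract concrete, here is a minimal Python sketch of bag-of-words vectors compared by cosine similarity and Euclidean distance. It is an illustration under stated assumptions, not the authors' implementation: the whitespace tokenizer, the function names, the rule "relevant if the best cosine similarity to any selected sample reaches a threshold", and the threshold value 0.3 are all hypothetical, and the example sentences are invented rather than taken from the paper's data.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Term-frequency vector: counts of lowercased whitespace tokens (assumed tokenizer)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Euclidean distance over the union of terms; absent terms count as 0."""
    terms = a.keys() | b.keys()
    return math.sqrt(sum((a.get(t, 0) - b.get(t, 0)) ** 2 for t in terms))


def is_relevant(document, samples, threshold=0.3):
    """Hypothetical decision rule: relevant if the best cosine match reaches the threshold."""
    doc_vec = bag_of_words(document)
    best = max(cosine_similarity(doc_vec, bag_of_words(s)) for s in samples)
    return best >= threshold


# Invented sample texts standing in for the selected example reviews.
samples = ["great battery life and fast charging", "battery lasts long"]
print(is_relevant("the battery life is great", samples))
print(is_relevant("the plot of this novel is dull", samples))
```

In this sketch, the choice of samples directly shapes which documents pass the threshold, which is why the way samples are selected (randomly, manually, or automatically) matters for the classification outcome.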

Keywords: text classification; textual patterns; machine learning; natural language processing; text similarity
JEL-codes: C38 C89
Pages: 19
Date: 2011-07

Downloads: (external link)
http://ftp.mendelu.cz/RePEc/men/wpaper/11_2011.pdf Full text (application/pdf)


Persistent link: https://EconPapers.repec.org/RePEc:men:wpaper:11_2011

Bibliographic data for series maintained by Luděk Kouba.

 
Page updated 2025-03-19
Handle: RePEc:men:wpaper:11_2011