TEXT CATEGORIZATION USING ONLY FRAGMENTS OF DOCUMENTS
Istvan Pilaszy and Tadeusz Dobrowiecki
GAZDÁLKODÁS: Scientific Journal on Agricultural Economics, 2007, vol. 51, issue Special Edition 19, 8
Abstract:
In this paper we present a series of experiments that examine how individual parts of a document contribute to the performance of a classifier. We evaluated text classifiers on two very different text corpora. We conclude that some parts of the text matter more than others for text classification performance, and that giving higher weights to the more important parts can increase the performance of the classifier. Which parts are more or less important depends on the nature of the documents in the corpus. Some tasks remain to be done:
− More text corpora should be investigated.
− In section 6.4 we optimized the number of features to keep independently of the section; it could instead be optimized for each section.
− Documents could be split into parts of 50 words, to examine what happens when the parts are of equal size not only within a document but also across documents.
− When splitting documents into k equal parts, the classifiers obtained for different values of k could be combined.
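The splitting and weighting ideas mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the whitespace-tokenized input, and the weighted bag-of-words representation are all assumptions made for illustration only.

```python
from collections import Counter


def split_into_k_parts(words, k):
    """Split a token list into k contiguous parts of near-equal length."""
    n = len(words)
    # Integer boundaries keep the parts contiguous and cover every token once.
    bounds = [i * n // k for i in range(k + 1)]
    return [words[bounds[i]:bounds[i + 1]] for i in range(k)]


def split_into_fixed_chunks(words, size=50):
    """Split a token list into chunks of `size` tokens (the last may be shorter)."""
    return [words[i:i + size] for i in range(0, len(words), size)]


def weighted_term_counts(parts, weights):
    """Build term counts where each occurrence counts with its part's weight.

    `weights` is a hypothetical per-part importance vector; how to choose it
    is exactly the question the paper studies and is not decided here.
    """
    if len(parts) != len(weights):
        raise ValueError("need exactly one weight per part")
    counts = Counter()
    for part, weight in zip(parts, weights):
        for token in part:
            counts[token] += weight
    return counts


if __name__ == "__main__":
    tokens = "a b c d e f g h i j".split()
    parts = split_into_k_parts(tokens, 3)
    print([len(p) for p in parts])
    print(weighted_term_counts(parts, [2.0, 1.0, 1.0]))
```

The resulting weighted counts could be fed to any bag-of-words classifier; splitting by a fixed chunk size instead of by k makes the parts comparable across documents of different lengths.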
Keywords: Research and Development/Tech Change/Emerging Technologies
Date: 2007
Downloads: (external link)
https://ageconsearch.umn.edu/record/58927/files/Pi ... 07_19ksz_214_211.pdf (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:ags:gazdal:58927
DOI: 10.22004/ag.econ.58927
More articles in GAZDÁLKODÁS: Scientific Journal on Agricultural Economics from Karoly Robert University College.