Algorithm based on modified angle‐based outlier factor for open‐set classification of text documents

Tomasz Walkowiak, Szymon Datko and Henryk Maciejewski

Applied Stochastic Models in Business and Industry, 2018, vol. 34, issue 5, 718-729

Abstract: This paper presents a new method for open‐set classification of text documents by subject area. Standard (closed‐set) approaches to text classification train classifiers on annotated text corpora that represent a fixed number of subject areas. Such classifiers assign a new, unannotated document to one of the trained classes, even if the document is not related to any of them. We propose a two‐step procedure for open‐set classification. We first use a closed‐set classifier to assign a new document to one of the known classes. Then, we evaluate the (dis)similarity between the document and the chosen class using a novel outlierness criterion named interquartile ranged angle‐based outlierness factors, which we find effective in high‐dimensional data. On this basis, we can avoid spurious assignment of documents to unrelated subject classes. We demonstrate the feasibility of this procedure on the task of subject classification of a collection of Wikipedia documents. Compared with the standard closed‐set approach, our open‐set classifier achieves significantly better precision with only a small decrease in recall on the tested classes.
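
The two‐step procedure described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the closed‐set step is a plain nearest‐centroid rule, the outlierness score is a simplified, unweighted angle‐based outlier factor (not the interquartile ranged variant the authors propose), the rejection threshold is a leave‐one‐out quantile, and the names `abof`, `OpenSetClassifier` and `quantile` are hypothetical.

```python
import numpy as np


def abof(point, reference):
    """Simplified (unweighted) angle-based outlier factor.

    Returns the variance, over all pairs (a, b) of reference rows, of
    <pa, pb> / (|pa|^2 * |pb|^2). Small values suggest that `point`
    lies outside the reference set. This is a stand-in, NOT the paper's
    interquartile ranged criterion.
    """
    diffs = reference - point
    sq = np.einsum("ij,ij->i", diffs, diffs)   # squared distances |pa|^2
    keep = sq > 1e-12                          # drop rows identical to `point`
    diffs, sq = diffs[keep], sq[keep]
    if len(diffs) < 2:
        return 0.0
    scaled = (diffs @ diffs.T) / np.outer(sq, sq)
    upper = np.triu_indices(len(diffs), k=1)   # each unordered pair once
    return float(np.var(scaled[upper]))


class OpenSetClassifier:
    """Closed-set assignment followed by an outlierness check (a sketch)."""

    def __init__(self, quantile=0.05):
        # Assumed rule: reject if the score falls below this quantile of
        # the leave-one-out scores of the chosen class.
        self.quantile = quantile

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.members_ = {c: X[y == c] for c in self.classes_}
        self.centroids_ = np.stack(
            [self.members_[c].mean(axis=0) for c in self.classes_]
        )
        self.thresholds_ = {}
        for c in self.classes_:
            m = self.members_[c]
            scores = [abof(m[i], np.delete(m, i, axis=0)) for i in range(len(m))]
            self.thresholds_[c] = np.quantile(scores, self.quantile)
        return self

    def predict_one(self, x):
        # Step 1: closed-set decision (nearest centroid, an assumption).
        dists = np.linalg.norm(self.centroids_ - x, axis=1)
        c = self.classes_[np.argmin(dists)]
        # Step 2: reject the assignment if x looks like an outlier of class c.
        if abof(x, self.members_[c]) < self.thresholds_[c]:
            return None
        return c


# Synthetic stand-in for document feature vectors: two known classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (40, 5)), rng.normal(10.0, 1.0, (40, 5))])
y = np.repeat([0, 1], 40)
clf = OpenSetClassifier().fit(X, y)
print(clf.predict_one(np.zeros(5)))        # point at the centre of class 0
print(clf.predict_one(np.full(5, 100.0)))  # point far from both classes
```

In this sketch a return value of `None` stands for "unrelated to any known class". The quantile trades precision against recall: a higher value rejects more documents, including some that genuinely belong to the chosen class.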

Date: 2018

Downloads: (external link)
https://doi.org/10.1002/asmb.2388

Persistent link: https://EconPapers.repec.org/RePEc:wly:apsmbi:v:34:y:2018:i:5:p:718-729

More articles in Applied Stochastic Models in Business and Industry from John Wiley & Sons
Bibliographic data for series maintained by Wiley Content Delivery.

 
Page updated 2025-03-20
Handle: RePEc:wly:apsmbi:v:34:y:2018:i:5:p:718-729