Algorithm based on modified angle‐based outlier factor for open‐set classification of text documents
Tomasz Walkowiak, Szymon Datko and Henryk Maciejewski
Applied Stochastic Models in Business and Industry, 2018, vol. 34, issue 5, 718-729
Abstract:
This paper presents a new method for open‐set classification of text documents by subject area. Standard (closed‐set) approaches to text classification involve training classifiers on annotated text corpora that represent a fixed number of subject areas. Such classifiers assign a new document with unknown annotation to one of the trained classes, even if the document is not related to any of them. We propose a two‐step procedure for open‐set classification. We first use a closed‐set classifier to assign a new document to one of the known classes. Then, we evaluate the (dis)similarity between the document and the chosen class using a novel criterion of outlierness named interquartile ranged angle‐based outlierness factors, which we find effective in high‐dimensional data. Based on this criterion, we can avoid spurious assignment of documents to unrelated subject classes. We demonstrate the feasibility of this procedure in the task of subject classification of a collection of Wikipedia documents. Compared with the standard closed‐set approach, our open‐set classifier achieves significantly better precision, with only a small decrease in recall for the tested classes.
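The two‐step procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the definition of their outlierness factor, so everything below is an assumption made for illustration. The closed‐set step is a nearest‐centroid classifier standing in for whatever classifier is actually used. The outlierness step takes the interquartile range of the cosines of the angles between vectors that point from the document to pairs of class samples (unweighted, unlike the classic angle‐based outlier factor). The rejection threshold is the 5th percentile of leave‐one‐out scores of the training samples. The function names `angle_spread`, `fit_open_set` and `classify_open_set` are hypothetical.

```python
import numpy as np


def angle_spread(point, reference):
    """IQR of the cosines of angles at `point` between pairs of reference samples.

    A small spread means the reference samples all lie in roughly the same
    direction as seen from `point`, which suggests that `point` is an outlier.
    Needs at least two reference samples that differ from `point`.
    """
    diffs = reference - point
    norms = np.linalg.norm(diffs, axis=1)
    keep = norms > 0  # drop samples identical to the point
    units = diffs[keep] / norms[keep, None]
    cosines = units @ units.T
    upper = np.triu_indices(len(units), k=1)  # each unordered pair once
    q1, q3 = np.percentile(cosines[upper], [25, 75])
    return q3 - q1


def fit_open_set(X, y, quantile=5.0):
    """Store samples, centroid and a rejection threshold for every class."""
    model = {}
    for label in np.unique(y):
        Xc = X[y == label]
        # Leave-one-out scores of the training samples (assumed calibration rule).
        scores = [angle_spread(Xc[i], np.delete(Xc, i, axis=0))
                  for i in range(len(Xc))]
        model[label] = (Xc, Xc.mean(axis=0), np.percentile(scores, quantile))
    return model


def classify_open_set(model, doc):
    """Return the predicted class label, or None if the document is rejected."""
    # Step 1: closed-set decision (nearest centroid as a stand-in classifier).
    label = min(model, key=lambda c: np.linalg.norm(doc - model[c][1]))
    samples, _, threshold = model[label]
    # Step 2: reject the document if it looks like an outlier of that class.
    if angle_spread(doc, samples) < threshold:
        return None
    return label


# Synthetic stand-in for document feature vectors: two classes in 20 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (40, 20)), rng.normal(5.0, 1.0, (40, 20))])
y = np.repeat([0, 1], 40)
model = fit_open_set(X, y)

print(classify_open_set(model, X[y == 1].mean(axis=0)))  # document at the centre of class 1
print(classify_open_set(model, np.full(20, 100.0)))      # document far from both classes
```

A closed‐set classifier alone would assign the second document to one of the two classes. The second step is what allows the sketch to decline the assignment, which is the behaviour the abstract credits for the gain in precision.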
Date: 2018
DOI: https://doi.org/10.1002/asmb.2388
Persistent link: https://EconPapers.repec.org/RePEc:wly:apsmbi:v:34:y:2018:i:5:p:718-729
Publisher: John Wiley & Sons