Unsupervised document classification integrating web scraping, one-class SVM and LDA topic modelling
Anton Thielmann,
Christoph Weisser,
Astrid Krenz and
Benjamin Säfken
Journal of Applied Statistics, 2023, vol. 50, issue 3, 574-591
Abstract:
Unsupervised document classification for imbalanced data sets poses a major challenge. To obtain accurate classification results, training data sets are often created manually by humans which requires expert knowledge, time and money. Depending on the imbalance of the data set, this approach also either requires human labelling of all of the data or it fails to adequately recognize underrepresented categories. We propose an integration of web scraping, one-class Support Vector Machines (SVM) and Latent Dirichlet Allocation (LDA) topic modelling as a multi-step classification rule that circumvents manual labelling. Unsupervised one-class document classification with the integration of out-of-domain training data is achieved and >80% of the target data is correctly classified. The proposed method thus even outperforms common machine learning classifiers and is validated on multiple data sets.
Date: 2023
References: Add references at CitEc
Citations:
Downloads: (external link)
http://hdl.handle.net/10.1080/02664763.2021.1919063 (text/html)
Access to full text is restricted to subscribers.
Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.
Export reference: BibTeX
RIS (EndNote, ProCite, RefMan)
HTML/Text
Persistent link: https://EconPapers.repec.org/RePEc:taf:japsta:v:50:y:2023:i:3:p:574-591
Ordering information: This journal article can be ordered from
http://www.tandfonline.com/pricing/journal/CJAS20
DOI: 10.1080/02664763.2021.1919063
Access Statistics for this article
Journal of Applied Statistics is currently edited by Robert Aykroyd
More articles in Journal of Applied Statistics from Taylor & Francis Journals
Bibliographic data for series maintained by Chris Longhurst ().