DBSCAN Algorithm for Document Clustering

Creţulescu Radu G., Morariu Daniel I., Breazu Macarie and Volovici Daniel
Additional contact information
Creţulescu Radu G.: „Lucian Blaga” University of Sibiu, Engineering Faculty, Computer Science and Electrical and Electronics Engineering Department
Morariu Daniel I.: „Lucian Blaga” University of Sibiu, Engineering Faculty, Computer Science and Electrical and Electronics Engineering Department
Breazu Macarie: „Lucian Blaga” University of Sibiu, Engineering Faculty, Computer Science and Electrical and Electronics Engineering Department
Volovici Daniel: „Lucian Blaga” University of Sibiu, Engineering Faculty, Computer Science and Electrical and Electronics Engineering Department

International Journal of Advanced Statistics and IT&C for Economics and Life Sciences, 2019, vol. 9, issue 1, 58-66

Abstract: Document clustering is the problem of automatically grouping similar documents into categories based on a similarity metric. Almost all available data, usually found on the web, are unclassified, so we need powerful clustering algorithms that can work with such data. All common search engines return a list of pages relevant to the user query, and this list needs to be generated as quickly and as correctly as possible. In this paper we present a clustering algorithm called DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and its limitations for clustering documents (or web pages). Documents are represented using the “bag-of-words” representation (word occurrence frequency), a representation on which many algorithms usually fail. We use Information Gain as the feature selection method and evaluate the DBSCAN algorithm by its capacity to integrate all the samples from the dataset into clusters.
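
The evaluation idea in the abstract (how many samples DBSCAN manages to place in a cluster rather than label as noise) can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine distance, the toy documents, the `eps` and `min_pts` values, and all function names are assumptions chosen for illustration, and the Information Gain feature selection step is omitted because it requires class labels.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Word occurrence frequencies for one document."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse frequency vectors (assumed metric)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - dot / norm if norm else 1.0


def dbscan(points, eps, min_pts, dist):
    """Return one label per point: a cluster id >= 0, or -1 for noise."""
    noise = -1
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbors(i):
        # The point itself is counted as part of its eps-neighborhood.
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = noise  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is a core point, keep expanding
                queue.extend(reachable)
    return labels


def coverage(labels):
    """Fraction of samples that ended up inside some cluster."""
    return sum(1 for label in labels if label != -1) / len(labels)


# Toy corpus, invented for this example.
docs = [
    "stock market prices rise",
    "stock market prices fall",
    "market prices rise again",
    "football match final score",
    "football match final result",
    "recipe for apple pie",
]
vectors = [bag_of_words(d) for d in docs]
labels = dbscan(vectors, eps=0.5, min_pts=2, dist=cosine_distance)
print(labels, coverage(labels))
```

On this toy corpus the last document shares no words with the others, so it stays unclustered and lowers the coverage. That is the kind of limitation the abstract refers to for sparse bag-of-words vectors.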

Keywords: Document Classification; Information Gain; Naive Bayes; Weka framework
Date: 2019

Downloads: (external link)
https://doi.org/10.2478/ijasitels-2019-0007 (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:vrs:ijsiel:v:9:y:2019:i:1:p:58-66:n:7

DOI: 10.2478/ijasitels-2019-0007

International Journal of Advanced Statistics and IT&C for Economics and Life Sciences is currently edited by Daniel Volovici

More articles in International Journal of Advanced Statistics and IT&C for Economics and Life Sciences from Sciendo
Bibliographic data for series maintained by Peter Golla.

 
Page updated 2025-03-20
Handle: RePEc:vrs:ijsiel:v:9:y:2019:i:1:p:58-66:n:7