Robustness Analysis of a Website Categorization Procedure based on Machine Learning
Renato Bruni and Gianpiero Bianchi
Additional contact information
Renato Bruni: Department of Computer, Control and Management Engineering Antonio Ruberti (DIAG), University of Rome La Sapienza, Rome, Italy
Gianpiero Bianchi: Direzione centrale per la metodologia e disegno dei processi statistici (DCME), Italian National Institute of Statistics Istat, Rome, Italy
No 2018-04, DIAG Technical Reports from Department of Computer, Control and Management Engineering, Universita' degli Studi di Roma "La Sapienza"
Abstract:
Website categorization has recently emerged as a very important task in several contexts. A huge amount of information is freely available through websites, and it could be used to carry out statistical surveys, saving the cost of the surveys, or to validate already surveyed data. However, the information of interest for the specific categorization has to be mined from that huge amount. This turns out to be a difficult task in practice. This work describes techniques that can be used to convert website categorization into a supervised classification problem. To do so, each data record should summarize the content of an entire website. We generate this kind of record by using web scraping and optical character recognition, followed by a number of automated feature engineering steps. Once such records have been produced, we apply state-of-the-art classification techniques to them to categorize the websites according to the aspect of interest. We use Support Vector Machines, Random Forest and Logistic classifiers. Since in many practical cases the labels available for the training set may be noisy, we analyze the robustness of our procedure with respect to the presence of misclassified training records. We present results on real-world data for the problem of detecting websites that provide e-commerce facilities.
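The robustness analysis mentioned in the abstract concerns training records whose labels are partly wrong. As a rough illustration only, the sketch below shows one way such label noise could be simulated for a binary labeling (e-commerce versus no e-commerce) before retraining a classifier. The function name `flip_labels`, the 0/1 encoding, and the choice of uniform random flipping are assumptions made for this example and are not taken from the report.

```python
import random


def flip_labels(labels, noise_rate, seed=0):
    """Return a copy of binary labels (0/1) with a given fraction flipped.

    Hypothetical helper: flips round(noise_rate * n) labels chosen
    uniformly at random, leaving the input list unchanged.
    """
    if not 0.0 <= noise_rate <= 1.0:
        raise ValueError("noise_rate must be in [0, 1]")
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    n_flip = round(noise_rate * len(labels))
    flip_idx = rng.sample(range(len(labels)), n_flip)
    noisy = list(labels)
    for i in flip_idx:
        noisy[i] = 1 - noisy[i]  # 0 becomes 1, 1 becomes 0
    return noisy


# Toy training labels: 1 = offers e-commerce, 0 = does not (made-up data).
clean = [1, 0] * 50
for rate in (0.0, 0.1, 0.2):
    noisy = flip_labels(clean, rate, seed=42)
    changed = sum(a != b for a, b in zip(clean, noisy))
    print(f"noise rate {rate:.1f}: {changed} of {len(clean)} labels flipped")
```

In a robustness study of this kind, a classifier would typically be trained on the noisy labels at each rate and evaluated against the clean test labels, so that the drop in accuracy can be compared across noise levels.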
Keywords: Classification; Machine Learning; Feature Engineering; Text
Date: 2018
New Economics Papers: this item is included in nep-big and nep-cmp
Downloads: (external link)
http://wwwold.dis.uniroma1.it/~bibdis/RePEc/aeg/report/2018-04.pdf First version, 2018 (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:aeg:report:2018-04
Bibliographic data for series maintained by Antonietta Angelica Zucconi.