
Improving blog spam filters via machine learning

Weiwen Yang and Linchi Kwok

International Journal of Data Analysis Techniques and Strategies, 2017, vol. 9, issue 2, 99-121

Abstract: As an important platform for electronic commerce, blogs can greatly influence internet users' purchasing decisions. Spam, however, can substantially reduce blogs' positive impact on electronic commerce. This paper introduces SK, an alternative algorithm that combines supervised learning (SVM) and unsupervised learning (K-means++) to detect blog spam. If either method classifies a blog as spam, the blog is assigned to the spam category. Feature selection includes term frequency, inverse document frequency, binary representation, stop words, outgoing links, advertiser content, and burst with keywords. The accuracy of each model was tested and compared in experiments with 3,000 blog pages from the University of Maryland and 3,560 internet blogs. The findings suggest that combining the SVM algorithm with K-means++ clustering can increase spam-filtering accuracy by about 7% compared with using just one of these methods. The strengths and weaknesses of various spam-filtering methods are discussed, providing considerations for businesses choosing a spam filter.
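
The decision rule and two of the listed features can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact TF-IDF variant or the model settings, so the weighting formula below is one common choice, and the function names `tf_idf` and `combine_or` are hypothetical. The SVM and K-means++ outputs are assumed to be already available as lists of booleans, one per blog.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: relative term frequency times log(N / document frequency).
    """
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in counts.items()
        })
    return weights


def combine_or(svm_flags, cluster_flags):
    """Assign a blog to the spam category if either detector flags it."""
    return [s or c for s, c in zip(svm_flags, cluster_flags)]


# Hypothetical toy data: two tokenized blog posts.
docs = [["buy", "cheap", "pills"], ["travel", "blog", "pills"]]
weights = tf_idf(docs)

# Hypothetical per-blog outputs from an SVM and from a K-means++ clustering step.
svm_flags = [True, False, False]
cluster_flags = [False, False, True]
spam = combine_or(svm_flags, cluster_flags)
```

A term that appears in every document, such as "pills" in the toy data, receives a weight of zero under this variant. The OR rule flags more blogs than either detector alone, which may raise recall at the cost of additional false positives.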

Keywords: spam filter; support vector machine; SVM; K-means++; machine learning; neural network
Date: 2017

Downloads: (external link)
http://www.inderscience.com/link.php?id=85901 (text/html)
Access to full text is restricted to subscribers.

Persistent link: https://EconPapers.repec.org/RePEc:ids:injdan:v:9:y:2017:i:2:p:99-121

More articles in International Journal of Data Analysis Techniques and Strategies from Inderscience Enterprises Ltd

 
Handle: RePEc:ids:injdan:v:9:y:2017:i:2:p:99-121