CCrFS: Combine Correlation Features Selection for Detecting Phishing Websites Using Machine Learning

Moedjahedy, Jimmy; Setyanto, Arief; Alarfaj, Fawaz Khaled; Alreshoodi, Mohammed

Additional contact information
Jimmy Moedjahedy: Computer Science Department, Universitas Klabat, Minahasa Utara 95371, Indonesia
Arief Setyanto: Magister of Informatics Engineering, Universitas AMIKOM Yogyakarta, Yogyakarta 55281, Indonesia
Fawaz Khaled Alarfaj: Department of Computer and Information Sciences, Imam Mohammad Ibn Saud Islamic University, Riyadh 11564, Saudi Arabia
Mohammed Alreshoodi: Unit of Scientific Research, Applied College, Qassim University, Buraydah 52362, Saudi Arabia

Future Internet, 2022, vol. 14, issue 8, 1-18

Abstract: Internet users are continually exposed to phishing, a form of cybercrime in the 21st century. The objective of phishing is to obtain sensitive information by deceiving a target and to use that information for financial gain. The information may include login details, passwords, dates of birth, credit card numbers, bank account numbers, and family-related information. To acquire these details, attackers use emails, adverts, text messages, or website pop-ups to direct users to false websites, where they are asked to enter the information. Examining a website’s URL is one way to avoid this type of deception, but identifying the features of a phishing URL takes specialized knowledge and investigation. Machine learning uses existing data to teach machines to distinguish between legitimate and phishing website URLs. In this work, we propose a method that combines correlation and recursive feature elimination to determine which URL characteristics are useful for identifying phishing websites, gradually decreasing the number of features while maintaining accuracy. We use two datasets that contain 48 and 87 features, respectively. The first scenario combines power predictive score correlation and recursive feature elimination; the second combines maximal information coefficient correlation and recursive feature elimination; the third combines Spearman correlation and recursive feature elimination. Combining the findings of the proposed methodologies, all three scenarios achieve a high level of accuracy even with the smallest feature subset. With 10 features, the accuracy is 97.06% for dataset 1 and 95.88% for dataset 2.
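
The third scenario in the abstract (Spearman correlation combined with recursive feature elimination) can be illustrated with a minimal sketch. This is not the authors' procedure: the abstract does not say how the two selectors are combined, so the sketch assumes a simple two-stage pipeline (a correlation filter followed by RFE). The function name `select_features`, the thresholds `n_corr` and `n_final`, the random forest estimator, and the synthetic 48-feature data are all hypothetical choices made for illustration; only the scikit-learn and SciPy calls are real.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split


def select_features(X, y, n_corr=20, n_final=10, seed=0):
    """Return sorted column indices kept by a Spearman filter followed by RFE.

    Hypothetical helper: the two-stage combination is an assumption,
    not the method described in the paper.
    """
    # Stage 1: rank features by |Spearman rho| against the class label.
    rho = []
    for j in range(X.shape[1]):
        r, _ = spearmanr(X[:, j], y)
        rho.append(abs(r))
    rho = np.nan_to_num(np.array(rho))  # constant columns yield NaN; treat as 0
    corr_idx = np.argsort(rho)[::-1][:n_corr]

    # Stage 2: recursive feature elimination on the surviving columns.
    estimator = RandomForestClassifier(n_estimators=50, random_state=seed)
    rfe = RFE(estimator, n_features_to_select=n_final, step=1)
    rfe.fit(X[:, corr_idx], y)
    return np.sort(corr_idx[rfe.support_])


# Synthetic stand-in for a 48-feature URL dataset (not the paper's data).
X, y = make_classification(
    n_samples=600, n_features=48, n_informative=8, n_redundant=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

selected = select_features(X_train, y_train)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train[:, selected], y_train)
accuracy = clf.score(X_test[:, selected], y_test)
print("selected feature indices:", selected.tolist())
print("held-out accuracy with", len(selected), "features:", round(accuracy, 3))
```

The selection is fitted on the training split only, so the held-out accuracy is not inflated by features chosen with knowledge of the test labels. Accuracy on synthetic data says nothing about the figures reported in the abstract.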

Keywords: feature selection; phishing detection; machine learning; correlation; feature elimination
JEL-codes: O3
Date: 2022

Downloads:
https://www.mdpi.com/1999-5903/14/8/229/pdf (application/pdf)
https://www.mdpi.com/1999-5903/14/8/229/ (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:gam:jftint:v:14:y:2022:i:8:p:229-:d:873104

Future Internet is currently edited by Ms. Grace You