An efficient focused crawler using LSTM-CNN based deep learning

Gourav Kumar Shrivastava, Rajesh Kumar Pateriya and Praveen Kaushik
Additional contact information
Gourav Kumar Shrivastava: MANIT
Rajesh Kumar Pateriya: MANIT
Praveen Kaushik: MANIT

International Journal of System Assurance Engineering and Management, 2023, vol. 14, issue 1, No 34, 407 pages

Abstract: A Focused Crawler searches the internet for topic-specific web pages. Its effectiveness depends on the multidimensional nature of web pages. The main task of any Focused Crawler is to collect web pages relevant to predefined topics and to ignore irrelevant ones. Traditional Best-First Focused Crawlers (FC) are based on the Vector Space Model (VSM) with Term Frequency-Inverse Document Frequency (TF-IDF) weighting, which achieves only a limited success rate in web page classification. The major practical challenge for a Focused Crawler is to classify web pages correctly for a given topic, because the data in web pages is unstructured. The main objective of this work is to design an improved focused crawling approach using web page classification. This work proposes a text classification model that combines Long Short Term Memory (LSTM) and a Convolutional Neural Network (CNN) with word embeddings to increase the accuracy of web page classification. The LSTM-CNN based text classification model is then used to guide the Focused Crawler in classifying web pages. The proposed LSTM-CNN text classification model is validated on different datasets, and the results are compared with traditional supervised machine learning algorithms and with deep neural network (DNN) based approaches such as CNN, RNN and RCNN. According to the experimental results, the proposed text classification model performs 8–12 percent better than typical supervised machine learning algorithms and 4–6 percent better than CNN, RNN and RCNN. The improved focused crawling approach with the LSTM-CNN based text classification model also achieves a higher harvest rate and target recall than the Breadth-First Crawler, Best-First Crawler, CNN Crawler and DNN Crawler.
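To make the crawler terminology in the abstract concrete, here is a minimal Python sketch of a best-first focused crawl guided by a pluggable relevance scorer, together with the harvest rate (the fraction of crawled pages judged relevant). It is not the authors' implementation: the in-memory link graph `TOY_WEB`, the keyword scorer `keyword_relevance`, the `threshold` value and all function names are assumptions made for illustration. In the approach the abstract describes, the scorer's role would be played by the trained LSTM-CNN classifier.

```python
import heapq
from typing import Callable, Dict, List, Tuple

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
# A real crawler would fetch and parse pages over HTTP instead.
TOY_WEB: Dict[str, Tuple[str, List[str]]] = {
    "seed": ("deep learning crawler overview", ["a", "b"]),
    "a": ("deep learning text classification", ["c"]),
    "b": ("cooking recipes and travel", ["d"]),
    "c": ("lstm cnn deep learning model", []),
    "d": ("celebrity gossip", []),
}

TOPIC_TERMS = {"deep", "learning", "lstm", "cnn", "crawler", "classification"}


def keyword_relevance(text: str) -> float:
    """Stand-in scorer: fraction of words that are topic terms.

    Assumption: any callable returning a score in [0, 1] fits here,
    e.g. the probability output of a trained text classifier.
    """
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in TOPIC_TERMS for w in words) / len(words)


def best_first_crawl(
    seed: str,
    web: Dict[str, Tuple[str, List[str]]],
    score: Callable[[str], float],
    max_pages: int,
    threshold: float = 0.5,
) -> Tuple[List[str], List[str], float]:
    """Crawl at most max_pages pages, expanding the most promising link first.

    A link's priority is the relevance score of the page it was found on.
    Returns (visited URLs, relevant URLs, harvest rate).
    """
    frontier = [(-1.0, seed)]  # heapq is a min-heap, so priorities are negated
    seen = {seed}
    visited: List[str] = []
    relevant: List[str] = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = web[url]
        page_score = score(text)
        visited.append(url)
        if page_score >= threshold:
            relevant.append(url)
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-page_score, link))
    harvest_rate = len(relevant) / len(visited) if visited else 0.0
    return visited, relevant, harvest_rate


if __name__ == "__main__":
    visited, relevant, rate = best_first_crawl("seed", TOY_WEB, keyword_relevance, max_pages=4)
    print(visited, relevant, rate)
```

With a budget of four pages, the crawl in this toy graph follows the link found on the on-topic page before the one found on the off-topic page, so the off-topic leaf is never fetched. Swapping `keyword_relevance` for a stronger classifier changes only the `score` argument, which is the sense in which a classification model "guides" the crawler.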

Keywords: Deep learning; Focused crawling; Machine learning
Date: 2023

Downloads: (external link)
http://link.springer.com/10.1007/s13198-022-01808-w Abstract (text/html)
Access to the full text of the articles in this series is restricted.

Persistent link: https://EconPapers.repec.org/RePEc:spr:ijsaem:v:14:y:2023:i:1:d:10.1007_s13198-022-01808-w

Ordering information: This journal article can be ordered from
http://www.springer.com/engineering/journal/13198

DOI: 10.1007/s13198-022-01808-w

International Journal of System Assurance Engineering and Management is currently edited by P.K. Kapur, A.K. Verma and U. Kumar

More articles in International Journal of System Assurance Engineering and Management from Springer, The Society for Reliability, Engineering Quality and Operations Management (SREQOM),India, and Division of Operation and Maintenance, Lulea University of Technology, Sweden
Bibliographic data for series maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.

 
Page updated 2025-03-20
Handle: RePEc:spr:ijsaem:v:14:y:2023:i:1:d:10.1007_s13198-022-01808-w