EconPapers    
Economics at your fingertips  
 

Distributed heterogeneous ensemble learning on Apache Spark for ligand-based virtual screening

Karima Sid and Mohamed Batouche

International Journal of Data Mining, Modelling and Management, 2021, vol. 13, issue 1/2, 160-191

Abstract: Virtual screening is one of the most common computer-aided drug design techniques that apply computational tools and methods on large libraries of molecules to extract the drugs. Ensemble learning is a recent paradigm launched to improve machine learning results in terms of predictive performance and robustness. It has been successfully applied in ligand-based virtual screening (LBVS) approaches. Applying ensemble learning on huge molecular libraries is computationally expensive. Hence, the distribution and parallelisation of the task have become a significant step by using sophisticated frameworks such as Apache Spark. In this paper, we propose a new approach HEnsL_DLBVS, for heterogeneous ensemble learning, distributed on Spark to improve the large-scale LBVS results. To handle the problem of imbalanced big training datasets, we propose a novel hybrid technique. We generate new training datasets to evaluate the approach. Experimental results confirm the effectiveness of our approach with satisfactory accuracy and its superiority over homogeneous models.

Keywords: virtual screening; big data; computer-aided drug design; CADD; Apache Spark; machine learning; drug discovery; ensemble learning; imbalanced datasets; Spark MLlib; ligand-based virtual screening; LBVS. (search for similar items in EconPapers)
Date: 2021
References: Add references at CitEc
Citations:

Downloads: (external link)
http://www.inderscience.com/link.php?id=112920 (text/html)
Access to full text is restricted to subscribers.

Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.

Export reference: BibTeX RIS (EndNote, ProCite, RefMan) HTML/Text

Persistent link: https://EconPapers.repec.org/RePEc:ids:ijdmmm:v:13:y:2021:i:1/2:p:160-191

Access Statistics for this article

More articles in International Journal of Data Mining, Modelling and Management from Inderscience Enterprises Ltd
Bibliographic data for series maintained by Sarah Parker ().

 
Page updated 2025-03-19
Handle: RePEc:ids:ijdmmm:v:13:y:2021:i:1/2:p:160-191