Effective Online Knowledge Distillation via Attention-Based Model Ensembling
Diana-Laura Borza,
Adrian Sergiu Darabant,
Tudor Alexandru Ileni and
Alexandru-Ion Marinescu
Additional contact information
Diana-Laura Borza: Computer Science Department, Babes Bolyai University, 400084 Cluj-Napoca, Romania
Adrian Sergiu Darabant: Computer Science Department, Babes Bolyai University, 400084 Cluj-Napoca, Romania
Tudor Alexandru Ileni: Computer Science Department, Babes Bolyai University, 400084 Cluj-Napoca, Romania
Alexandru-Ion Marinescu: Computer Science Department, Babes Bolyai University, 400084 Cluj-Napoca, Romania
Mathematics, 2022, vol. 10, issue 22, 1-15
Abstract:
Large-scale deep learning models have achieved impressive results on a variety of tasks; however, their deployment on edge or mobile devices remains a challenge due to limited memory and computational capability. Knowledge distillation is an effective model compression technique that can boost the performance of a lightweight student network by transferring knowledge from a more complex model or an ensemble of models. Due to its reduced size, the lightweight model is better suited for deployment on edge devices. In this paper, we introduce an online knowledge distillation framework that relies on an original attention mechanism to effectively combine the predictions of a cohort of lightweight (student) networks into a powerful ensemble, which is then used as the distillation signal. The proposed aggregation strategy uses the predictions of the individual students, together with the ground truth, to determine the weights with which these predictions are ensembled. This mechanism is used only during training; at test or inference time, a single lightweight student is extracted and used. Extensive experiments on several image classification benchmarks, both training models from scratch (on CIFAR-10, CIFAR-100, and Tiny ImageNet) and using transfer learning (on Oxford Pets and Oxford Flowers), show that the proposed framework consistently improves the accuracy of the knowledge-distilled students. Moreover, in the case of the ResNet architecture, the knowledge-distilled model achieves higher accuracy than a deeper, individually trained ResNet model.
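To make the aggregation strategy concrete, the following is a minimal PyTorch sketch of attention-based online distillation, assuming that a per-student attention score is produced by a small scoring network over the concatenated student logits and one-hot ground truth, and that each student is distilled from the resulting ensemble with a temperature-scaled KL term. The class name AttentionAggregator, the scorer architecture, and the hyperparameters T and alpha are illustrative assumptions, not the authors' exact design.

```python
# Minimal sketch (not the authors' exact code) of attention-based online
# knowledge distillation: a small attention head combines the predictions of
# a cohort of students into an ensemble, which then serves as the soft
# distillation target for each student.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionAggregator(nn.Module):
    """Scores each student from its logits and the one-hot label,
    then forms a weighted ensemble of the student predictions."""

    def __init__(self, num_classes: int, hidden: int = 64):
        super().__init__()
        # Assumed scorer: an MLP over [student logits ; one-hot ground truth].
        self.scorer = nn.Sequential(
            nn.Linear(2 * num_classes, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, student_logits, labels):
        # student_logits: list of (B, C) tensors, one per student.
        one_hot = F.one_hot(labels, student_logits[0].size(1)).float()
        scores = torch.cat(
            [self.scorer(torch.cat([lg, one_hot], dim=1)) for lg in student_logits],
            dim=1,
        )                                    # (B, num_students)
        weights = F.softmax(scores, dim=1)   # per-sample attention weights
        stacked = torch.stack(student_logits, dim=1)          # (B, S, C)
        ensemble = (weights.unsqueeze(-1) * stacked).sum(1)   # (B, C)
        return ensemble, weights


def online_kd_loss(student_logits, labels, aggregator, T: float = 3.0, alpha: float = 1.0):
    """Cross-entropy for every student plus KL distillation from the ensemble."""
    ensemble, _ = aggregator(student_logits, labels)
    soft_target = F.softmax(ensemble.detach() / T, dim=1)
    loss = F.cross_entropy(ensemble, labels)          # supervise the ensemble/aggregator
    for lg in student_logits:
        loss = loss + F.cross_entropy(lg, labels)     # hard-label loss per student
        loss = loss + alpha * T * T * F.kl_div(       # distill from the ensemble
            F.log_softmax(lg / T, dim=1), soft_target, reduction="batchmean"
        )
    return loss
```

Consistent with the abstract, the aggregator and all but one student would be discarded after training; only the single extracted student runs at inference time.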
Keywords: online knowledge distillation; ensemble learning; attention aggregation; deep learning
JEL-codes: C
Date: 2022
Downloads: (external link)
https://www.mdpi.com/2227-7390/10/22/4285/pdf (application/pdf)
https://www.mdpi.com/2227-7390/10/22/4285/ (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:10:y:2022:i:22:p:4285-:d:974201