TinyML model compression: A comparative study of pruning and quantization on selected standard and custom neural networks
Muhammad Yasir Shabir,
Gianluca Torta and
Ferruccio Damiani
Additional contact information
Muhammad Yasir Shabir: University of Turin
Gianluca Torta: University of Turin
Ferruccio Damiani: University of Turin
Telecommunication Systems: Modelling, Analysis, Design and Management, 2025, vol. 88, issue 4, No 18, 21 pages
Abstract:
In Machine Learning (ML), deploying complex Neural Network (NN) models on memory-constrained Internet of Things (IoT) devices presents a significant challenge. Tiny Machine Learning (TinyML) focuses on optimizing NN models for such environments, where computational and storage resources are limited. A major aspect of this optimization is reducing model size without substantially compromising accuracy. We conducted a systematic literature review to identify pruning and quantization techniques suitable for optimizing NN models. In addition, this study investigates the effectiveness of pruning and 8-bit integer (INT8) quantization in optimizing NN models for deployment on memory-constrained devices. The study evaluates widely used NN architectures such as ResNet50/101, VGG16, and MobileNet, alongside a custom-designed model, using the CIFAR-100, CIFAR-10, MNIST, and Fashion-MNIST datasets. The results show that combining pruning with INT8 quantization reduced the size of MobileNet by 77.01% and of the custom model by 94.38%. Notably, the custom model achieved improved accuracy, while MobileNet retained competitive accuracy with minimal loss on CIFAR-100. The main contribution of this work lies in systematically analyzing and comparing pruning, INT8 quantization, and hybrid optimization methods across multiple architectures and datasets, with performance evaluated in terms of recall, latency, and memory requirements before and after optimization. Together, pruning and INT8 quantization reduced model size and inference time while preserving accuracy, highlighting practical approaches for efficient TinyML deployment in real-world IoT applications.
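As a hypothetical illustration of the two techniques the abstract combines (this is not code from the paper), the sketch below applies magnitude-based pruning followed by symmetric per-tensor INT8 quantization to a small weight vector in plain Python; the function names and the example weights are assumptions for demonstration only.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with smallest magnitude."""
    k = int(len(weights) * sparsity)
    # Indices of the k smallest-magnitude weights.
    smallest = sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k]
    pruned = list(weights)
    for i in smallest:
        pruned[i] = 0.0
    return pruned

def quantize_int8(weights):
    """Symmetric per-tensor INT8: q = clamp(round(w / scale)), scale = max|w| / 127."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # guard against all-zero weights
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

# Example: prune half the weights, then quantize the survivors.
w = [0.8, -0.05, 0.3, -0.9, 0.02, 0.6]
pruned = prune_by_magnitude(w, 0.5)   # -> [0.8, 0.0, 0.0, -0.9, 0.0, 0.6]
q, scale = quantize_int8(pruned)      # q = [113, 0, 0, -127, 0, 85]
```

In practice, toolchains such as the TensorFlow Model Optimization Toolkit (for pruning) and TensorFlow Lite (for post-training INT8 quantization) implement these steps at the level of whole models; the sketch only shows the underlying per-weight arithmetic that yields the size reductions the paper reports.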
Keywords: TinyML; Neural Network Optimization; Pruning; Quantization; IoT; Edge Computing
Date: 2025
Downloads: http://link.springer.com/10.1007/s11235-025-01363-2 (abstract, text/html)
Access to the full text of the articles in this series is restricted.
Persistent link: https://EconPapers.repec.org/RePEc:spr:telsys:v:88:y:2025:i:4:d:10.1007_s11235-025-01363-2
Ordering information: This journal article can be ordered from
http://www.springer.com/journal/11235
DOI: 10.1007/s11235-025-01363-2
Telecommunication Systems: Modelling, Analysis, Design and Management is currently edited by Muhammad Khan
More articles in Telecommunication Systems: Modelling, Analysis, Design and Management from Springer
Bibliographic data for series maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.