4.6-Bit Quantization for Fast and Accurate Neural Network Inference on CPUs
Anton Trusov,
Elena Limonova,
Dmitry Nikolaev and
Vladimir V. Arlazarov
Additional contact information
Anton Trusov: Department of Mathematical Software for Computer Science, Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences, 119333 Moscow, Russia
Elena Limonova: Department of Mathematical Software for Computer Science, Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences, 119333 Moscow, Russia
Dmitry Nikolaev: Smart Engines Service LLC, 117312 Moscow, Russia
Vladimir V. Arlazarov: Department of Mathematical Software for Computer Science, Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences, 119333 Moscow, Russia
Mathematics, 2024, vol. 12, issue 5, 1-22
Abstract:
Quantization is a widespread method for reducing the inference time of neural networks on mobile Central Processing Units (CPUs). Eight-bit quantized networks achieve quality comparable to full-precision models and map well onto the hardware, with one-byte coefficients and thirty-two-bit dot-product accumulators. Lower-precision quantization usually suffers from noticeable quality loss and requires specialized computational algorithms to outperform eight-bit quantization. In this paper, we propose a novel 4.6-bit quantization scheme that allows for more efficient use of CPU resources. This scheme has more quantization bins than four-bit quantization and is therefore more accurate, while preserving the computational efficiency of the latter (it runs only 4% slower). Our multiplication uses a combination of 16- and 32-bit accumulators and avoids the multiplication depth limitation of the previous 4-bit multiplication algorithm. Experiments with different convolutional neural networks on the CIFAR-10 and ImageNet datasets show that 4.6-bit quantized networks are 1.5–1.6 times faster than eight-bit networks on an ARMv8 CPU. In terms of quality, the results of the 4.6-bit quantized networks are close to the mean of four-bit and eight-bit networks of the same architecture. Therefore, 4.6-bit quantization may serve as an intermediate solution between fast but inaccurate low-bit quantization and accurate but relatively slow eight-bit quantization.
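The mixed 16-/32-bit accumulation mentioned in the abstract can be illustrated with a short sketch. The C++ fragment below is a minimal illustration under assumptions of our own (symmetric per-tensor quantization, a hypothetical level count of 11, and a hypothetical flush period), not the authors' exact 4.6-bit scheme or their SIMD kernel: low-bit values are multiplied into 16-bit partial sums, which are flushed into a 32-bit accumulator often enough that the 16-bit sum cannot overflow.

```cpp
// Hypothetical illustration (not the paper's exact algorithm): quantize to a
// small integer grid, then accumulate products in 16-bit partial sums that are
// periodically flushed into a 32-bit accumulator.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

// Quantize a float vector to signed integers in [-max_level, max_level].
// The scale comes from the largest absolute value (a common symmetric
// per-tensor scheme; the paper's calibration may differ).
std::vector<int8_t> quantize(const std::vector<float>& x, int max_level, float& scale) {
    float max_abs = 1e-8f;
    for (float v : x) max_abs = std::max(max_abs, std::fabs(v));
    scale = max_abs / static_cast<float>(max_level);
    std::vector<int8_t> q(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        int v = static_cast<int>(std::lround(x[i] / scale));
        q[i] = static_cast<int8_t>(std::clamp(v, -max_level, max_level));
    }
    return q;
}

// Dot product with a 16-bit partial accumulator flushed into a 32-bit
// accumulator every `flush_period` products. With |a|, |b| <= max_level,
// each product is at most max_level^2, so choosing
// flush_period * max_level^2 < 2^15 keeps the 16-bit sum from overflowing.
int32_t dot_mixed_accum(const std::vector<int8_t>& a, const std::vector<int8_t>& b,
                        size_t flush_period) {
    int32_t acc32 = 0;
    int16_t acc16 = 0;
    for (size_t i = 0; i < a.size(); ++i) {
        acc16 += static_cast<int16_t>(a[i]) * static_cast<int16_t>(b[i]);
        if ((i + 1) % flush_period == 0) { acc32 += acc16; acc16 = 0; }
    }
    return acc32 + acc16;
}

int main() {
    std::vector<float> x = {0.4f, -1.2f, 0.9f, 2.1f};
    std::vector<float> w = {-0.3f, 0.8f, 1.1f, -0.5f};
    float sx, sw;
    // max_level = 11 gives 23 signed levels (about 4.5 bits); the value is
    // illustrative only, not taken from the paper.
    std::vector<int8_t> qx = quantize(x, 11, sx);
    std::vector<int8_t> qw = quantize(w, 11, sw);
    int32_t d = dot_mixed_accum(qx, qw, 2);
    std::printf("approx dot = %f\n", d * sx * sw);
    return 0;
}
```

The flush period is the knob that trades 16-bit throughput against overflow safety; in a real SIMD kernel it would be chosen from the quantization range, whereas the per-tensor scale and the level count above are placeholders.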
Keywords: neural network quantization; deep learning; efficient computing; SIMD
JEL-codes: C
Date: 2024
Downloads:
https://www.mdpi.com/2227-7390/12/5/651/pdf (application/pdf)
https://www.mdpi.com/2227-7390/12/5/651/ (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:12:y:2024:i:5:p:651-:d:1344481