EconPapers

Rethinking I/O Caching for Large Language Model Inference on Resource-Constrained Mobile Platforms

Heejin Kim, Jeongha Lee and Hyokyung Bahn
Additional contact information
Heejin Kim: Department of Computer Engineering, Ewha University, Seoul 03760, Republic of Korea
Jeongha Lee: Department of Computer Engineering, Ewha University, Seoul 03760, Republic of Korea
Hyokyung Bahn: Department of Computer Engineering, Ewha University, Seoul 03760, Republic of Korea

Mathematics, 2025, vol. 13, issue 22, 1-17

Abstract: Large language models (LLMs) have traditionally relegated inference to remote servers, leaving mobile devices as thin clients. Recently, advances in mobile GPUs and NPUs have made on-device inference increasingly feasible, particularly for privacy-sensitive and personalized applications. However, executing LLMs directly on resource-constrained devices exposes severe I/O bottlenecks, as repeated accesses to large weight files can overwhelm limited memory and storage bandwidth. Prior studies have focused on internal mechanisms such as KV caching, while the role of the host OS buffer cache remains underexplored. This paper closes that gap with file-level trace analysis of real-world mobile LLM applications, and identifies three characteristic access patterns: (1) one-time sequential scans during initialization, (2) persistent hot sets (e.g., tokenizers, metadata, indices), and (3) recurring loop accesses to model weight files. Guided by these observations, we propose LLM-aware buffer cache strategies and derive cache-sizing guidelines that relate loop size, hot-set coverage, and storage bandwidth. We further compare smartwatch-class and smartphone-class platforms to clarify feasible model sizes and practical hardware prerequisites for local inference. Our results provide system-level guidance for I/O subsystem design that enables practical on-device LLM inference in future mobile and IoT devices.
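The recurring loop pattern the abstract identifies is the classic case where LRU replacement thrashes: when a loop over weight blocks is larger than the cache, LRU evicts each block just before it is reused. The minimal sketch below (a standard illustration, not the paper's implementation; the policies, block counts, and cache size are illustrative assumptions) simulates LRU versus MRU eviction on such a looping trace to show why loop-aware replacement matters.

```python
# Sketch: LRU vs. MRU eviction on a recurring loop access pattern,
# the access pattern identified for model weight files. With a loop
# larger than the cache, LRU evicts every block just before its next
# reuse (near-zero hits), while MRU pins a stable subset of the loop.

from collections import OrderedDict

def simulate(policy, n_blocks, cache_size, n_loops):
    """Replay n_loops sequential passes over blocks 0..n_blocks-1
    and return the overall cache hit rate."""
    cache = OrderedDict()  # keys are cached blocks; order tracks recency
    hits = accesses = 0
    for _ in range(n_loops):
        for block in range(n_blocks):
            accesses += 1
            if block in cache:
                hits += 1
                cache.move_to_end(block)          # mark most recently used
            else:
                if len(cache) >= cache_size:
                    if policy == "lru":
                        cache.popitem(last=False)  # evict least recently used
                    else:                          # "mru"
                        cache.popitem(last=True)   # evict most recently used
                cache[block] = None
    return hits / accesses

# Loop of 100 weight blocks, cache holds only 60 of them, 10 passes:
print(f"LRU hit rate: {simulate('lru', 100, 60, 10):.2f}")  # 0.00 (thrash)
print(f"MRU hit rate: {simulate('mru', 100, 60, 10):.2f}")  # 0.54
```

This is the intuition behind sizing guidelines that relate loop size to cache capacity: below the loop size, the replacement policy (not raw capacity) determines whether any reuse is captured at all.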

Keywords: large language model; I/O caching; mobile platform; buffer cache; inference
JEL-codes: C
Date: 2025

Downloads: (external link)
https://www.mdpi.com/2227-7390/13/22/3689/pdf (application/pdf)
https://www.mdpi.com/2227-7390/13/22/3689/ (text/html)



Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:13:y:2025:i:22:p:3689-:d:1796645


Mathematics is currently edited by Ms. Emma He

More articles in Mathematics from MDPI
Bibliographic data for series maintained by MDPI Indexing Manager.

 
Page updated 2025-11-25
Handle: RePEc:gam:jmathe:v:13:y:2025:i:22:p:3689-:d:1796645