Efficient GPT-4V level multimodal large language model for deployment on edge devices
Yuan Yao,
Tianyu Yu,
Ao Zhang,
Chongyi Wang,
Junbo Cui,
Hongji Zhu,
Tianchi Cai,
Chi Chen,
Haoyu Li,
Weilin Zhao,
Zhihui He,
Qianyu Chen,
Ronghua Zhou,
Zhensheng Zou,
Haoye Zhang,
Shengding Hu,
Zhi Zheng,
Jie Zhou,
Jie Cai,
Xu Han,
Guoyang Zeng,
Dahai Li,
Zhiyuan Liu and
Maosong Sun
Additional contact information
Yuan Yao: Tsinghua University
Tianyu Yu: Tsinghua University
Ao Zhang: National University of Singapore
Chongyi Wang: ModelBest Inc.
Junbo Cui: ModelBest Inc.
Hongji Zhu: ModelBest Inc.
Tianchi Cai: ModelBest Inc.
Chi Chen: Tsinghua University
Haoyu Li: Tsinghua University
Weilin Zhao: Tsinghua University
Zhihui He: Tsinghua University
Qianyu Chen: The Chinese University of Hong Kong
Ronghua Zhou: ModelBest Inc.
Zhensheng Zou: ModelBest Inc.
Haoye Zhang: Tsinghua University
Shengding Hu: Tsinghua University
Zhi Zheng: ModelBest Inc.
Jie Zhou: ModelBest Inc.
Jie Cai: ModelBest Inc.
Xu Han: Tsinghua University
Guoyang Zeng: ModelBest Inc.
Dahai Li: ModelBest Inc.
Zhiyuan Liu: Tsinghua University
Maosong Sun: Tsinghua University
Nature Communications, 2025, vol. 16, issue 1, 1-14
Abstract:
Multimodal large language models have revolutionized AI research and industry, paving the way toward the next milestone. However, their large sizes and high computational costs restrict deployment to cloud servers, limiting use in mobile, offline, energy-sensitive, or privacy-critical scenarios. We present MiniCPM-V, efficient models for edge devices that integrate advancements in architecture, training, and data. The 8B model outperforms GPT-4V, Gemini Pro, and Claude 3 across 11 public benchmarks, processes high-resolution images at any aspect ratio, achieves robust optical character recognition, exhibits low hallucination rates, and supports over 30 languages while running efficiently on mobile phones. This progress reflects a broader trend: the sizes of high-performing models are rapidly decreasing alongside growing edge computation capacity, enabling advanced multimodal models to operate locally on consumer hardware. Such developments unlock applications across diverse real-world scenarios, from enhanced mobile AI to privacy-preserving solutions, marking a critical step toward democratizing powerful multimodal intelligence.
Date: 2025
Downloads:
https://www.nature.com/articles/s41467-025-61040-5 Abstract (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:nat:natcom:v:16:y:2025:i:1:d:10.1038_s41467-025-61040-5
Ordering information: This journal article can be ordered from
https://www.nature.com/ncomms/
DOI: 10.1038/s41467-025-61040-5
Nature Communications is currently edited by Nathalie Le Bot, Enda Bergin and Fiona Gillespie