E-CLIP: An Enhanced CLIP-Based Visual Language Model for Fruit Detection and Recognition
Yi Zhang,
Yang Shao,
Chen Tang,
Zhenqing Liu,
Zhengda Li,
Ruifang Zhai,
Hui Peng and
Peng Song
Additional contact information
Yi Zhang: College of Informatics, Huazhong Agricultural University, No. 1 Shizi Mountain Street, Hongshan District, Wuhan 430070, China
Yang Shao: College of Plant Science and Technology, Huazhong Agricultural University, No. 1 Shizi Mountain Street, Hongshan District, Wuhan 430070, China
Chen Tang: College of Plant Science and Technology, Huazhong Agricultural University, No. 1 Shizi Mountain Street, Hongshan District, Wuhan 430070, China
Zhenqing Liu: College of Plant Science and Technology, Huazhong Agricultural University, No. 1 Shizi Mountain Street, Hongshan District, Wuhan 430070, China
Zhengda Li: College of Plant Science and Technology, Huazhong Agricultural University, No. 1 Shizi Mountain Street, Hongshan District, Wuhan 430070, China
Ruifang Zhai: College of Informatics, Huazhong Agricultural University, No. 1 Shizi Mountain Street, Hongshan District, Wuhan 430070, China
Hui Peng: College of Informatics, Huazhong Agricultural University, No. 1 Shizi Mountain Street, Hongshan District, Wuhan 430070, China
Peng Song: College of Plant Science and Technology, Huazhong Agricultural University, No. 1 Shizi Mountain Street, Hongshan District, Wuhan 430070, China
Agriculture, 2025, vol. 15, issue 11, 1-32
Abstract:
With the progress of agricultural modernization, intelligent fruit harvesting is gaining importance. Although fruit detection and recognition are essential for robotic harvesting, existing methods suffer from limited generalizability: they struggle to adapt to complex environments and to handle new fruit varieties. This limitation stems from their reliance on unimodal visual data, which creates a semantic gap between image features and contextual understanding. To address these issues, this study proposes a multi-modal fruit detection and recognition framework based on visual language models (VLMs). By integrating multi-modal information, the proposed model improves robustness and generalization across diverse environmental conditions and fruit types. The framework accepts natural language instructions as input, facilitating effective human–machine interaction. Through its core module, Enhanced Contrastive Language–Image Pre-Training (E-CLIP), which employs both image–image and image–text contrastive learning, the framework achieves robust recognition of various fruit types and their maturity levels. Experimental results demonstrate strong performance: the model achieves an F1 score of 0.752 and an mAP@0.5 of 0.791, remains robust under occlusion and varying illumination, and attains a zero-shot mAP@0.5 of 0.626 on unseen fruits. The system runs at an inference speed of 54.82 FPS, effectively balancing speed and accuracy. These results offer new insights and methods for the practical application of smart agriculture.
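The abstract describes E-CLIP as combining image–text and image–image contrastive learning, but this listing contains no implementation details. Below is a minimal PyTorch sketch of what such a dual contrastive objective could look like; all names (clip_style_losses, img_emb, txt_emb, aug_emb, tau) are illustrative assumptions, not the authors' E-CLIP code.

```python
# Sketch of a dual contrastive objective of the kind the abstract describes.
# Assumption: encoders produce (N, D) embeddings for a batch of N fruit
# images, their paired text prompts, and augmented views of the same images.
import torch
import torch.nn.functional as F

def clip_style_losses(img_emb, txt_emb, aug_emb, tau=0.07):
    """Combine image-text and image-image InfoNCE losses.

    img_emb: (N, D) embeddings of fruit images
    txt_emb: (N, D) embeddings of their paired text prompts
    aug_emb: (N, D) embeddings of augmented views of the same images
    tau:     temperature scaling the similarity logits
    """
    # L2-normalize so dot products are cosine similarities
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    aug = F.normalize(aug_emb, dim=-1)

    # Matched pairs sit on the diagonal of the similarity matrix
    targets = torch.arange(img.size(0), device=img.device)

    # Image-text contrast, symmetric over both retrieval directions
    logits_it = img @ txt.t() / tau
    loss_it = (F.cross_entropy(logits_it, targets) +
               F.cross_entropy(logits_it.t(), targets)) / 2

    # Image-image contrast: each image against its own augmented view
    logits_ii = img @ aug.t() / tau
    loss_ii = F.cross_entropy(logits_ii, targets)

    return loss_it + loss_ii
```

The same embedding space then supports the zero-shot recognition the abstract reports: encode natural-language prompts such as "a photo of a ripe strawberry" and "a photo of an unripe strawberry", and assign each detected fruit to the prompt with the highest cosine similarity, with no retraining for unseen fruit types.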
Keywords: visual language models; contrastive learning; smart agriculture
JEL-codes: Q1 Q10 Q11 Q12 Q13 Q14 Q15 Q16 Q17 Q18
Date: 2025
Downloads:
https://www.mdpi.com/2077-0472/15/11/1173/pdf (application/pdf)
https://www.mdpi.com/2077-0472/15/11/1173/ (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:gam:jagris:v:15:y:2025:i:11:p:1173-:d:1667641