Bayesian Prototypical Pruning for Transformers in Human–Robot Collaboration

Bohua Peng and Bin Chen
Additional contact information
Bohua Peng: School of Automation, Northwestern Polytechnical University, Xi’an 710072, China
Bin Chen: School of Electrical and Electronic Engineering, The University of Sheffield, Sheffield S1 4DP, UK

Mathematics, 2025, vol. 13, issue 9, 1-20

Abstract: Action representations are essential for developing mutual cognition toward efficient human–AI collaboration, particularly in human–robot collaborative (HRC) workspaces. Enabling robots to understand human intentions with video Transformers has therefore become an emerging research direction. Despite their remarkable success in capturing long-range dependencies, Transformers are overparameterized, and local redundancy across video frames inflates their inference latency. Token pruning has recently emerged as a computationally efficient solution that selectively removes input tokens with minimal impact on task performance. However, existing sparse coding methods typically rely on an exhaustive threshold search, making hyperparameter tuning costly. In this paper, Bayesian Prototypical Pruning (ProtoPrune), a novel end-to-end Bayesian framework, is proposed for token pruning in video understanding. To improve robustness, ProtoPrune leverages prototypical contrastive learning for fine-grained action representations, bringing sub-action-level supervision to the video token pruning task. With variational dropout, the method bypasses exhaustive threshold search. Experiments show that the proposed method achieves a pruning rate of 37.2% while retaining 92.9% of task performance with the Uniformer and ActionCLIP backbones, significantly improving computational efficiency. A convergence analysis establishes the stability of the method. The proposed efficient video understanding method offers a theoretically grounded and hardware-friendly solution for deploying video Transformers in real-world HRC environments.
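The abstract names two concrete ingredients: a variational-dropout mechanism that learns per-token keep probabilities (avoiding a grid search over pruning thresholds) and a prototypical contrastive loss that supplies sub-action supervision. The paper's own code is not reproduced here; below is a minimal PyTorch sketch of how these two pieces are commonly implemented. All names (VariationalTokenGate, proto_contrastive_loss) are illustrative assumptions, not the authors' API.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VariationalTokenGate(nn.Module):
        """Sketch of a variational-dropout token gate (hypothetical, not the
        authors' code): a learned noise scale replaces a hand-tuned
        pruning threshold."""
        def __init__(self, dim: int):
            super().__init__()
            self.score = nn.Linear(dim, 1)                   # per-token saliency
            self.log_sigma2 = nn.Parameter(torch.zeros(1))   # learned noise variance

        def forward(self, tokens: torch.Tensor) -> torch.Tensor:
            # tokens: (B, N, D) -> keep probabilities: (B, N)
            mu = self.score(tokens).squeeze(-1)
            if self.training:
                # Gaussian reparameterization, as in variational dropout
                eps = torch.randn_like(mu)
                logits = mu + eps * torch.exp(0.5 * self.log_sigma2)
            else:
                logits = mu
            return torch.sigmoid(logits)                     # soft keep mask

    def proto_contrastive_loss(feats, prototypes, labels, tau: float = 0.07):
        """Standard prototypical contrastive (InfoNCE-style) loss: pull each
        feature toward its assigned sub-action prototype, push it away
        from the others. labels: (B,) long tensor of prototype indices."""
        feats = F.normalize(feats, dim=-1)        # (B, D)
        protos = F.normalize(prototypes, dim=-1)  # (K, D)
        logits = feats @ protos.t() / tau         # (B, K) cosine similarities
        return F.cross_entropy(logits, labels)

    # Example usage: soft-mask video tokens during training.
    gate = VariationalTokenGate(dim=768)
    tokens = torch.randn(2, 196, 768)             # (batch, tokens, channels)
    keep = gate(tokens)                           # (2, 196) keep probabilities
    pruned = tokens * keep.unsqueeze(-1)          # hard top-k pruning at inference

During training the mask stays soft so gradients flow through the gate; at inference one would keep only the top-scoring tokens, which is where the reported 37.2% pruning rate would come from.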

Keywords: spatial–temporal modeling; sparse coding; human–robot collaboration; action recognition; inference optimization (search for similar items in EconPapers)
JEL-codes: C (search for similar items in EconPapers)
Date: 2025
References: View complete reference list from CitEc

Downloads: (external link)
https://www.mdpi.com/2227-7390/13/9/1411/pdf (application/pdf)
https://www.mdpi.com/2227-7390/13/9/1411/ (text/html)

Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.


Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:13:y:2025:i:9:p:1411-:d:1642501


Mathematics is currently edited by Ms. Emma He

More articles in Mathematics from MDPI
Bibliographic data for series maintained by MDPI Indexing Manager.

 
Handle: RePEc:gam:jmathe:v:13:y:2025:i:9:p:1411-:d:1642501