Abstract
Transferring knowledge from large-scale, independently pretrained image and text models to video understanding requires addressing several challenges, including preserving the models' generalization capabilities, integrating them into multimodal architectures, and fine-tuning with temporal dynamics. This study evaluates the effectiveness of parameter-efficient fine-tuning (PEFT) techniques in transferring pretrained knowledge from two independent models for video action recognition within a simple, streamlined multimodal fusion pipeline. Specifically, we adapt CLIP as the text branch and DINOv2 as the image branch, keeping both backbones frozen to preserve their pretrained robustness, while introducing lightweight, task-specific modules to adapt and fuse the branches with temporal dynamics. A simple fusion transformer combines the image and text branches, enabling their efficient integration with minimal training cost. We systematically evaluate the framework on widely recognized midscale video benchmark datasets, comparing prompt-based and adapter-based PEFT techniques across different data regimes. Our results demonstrate that this combination achieves competitive performance, highlight the transferability and scalability of independent pretrained models for a targeted task, and provide practical insights for adapting large models using midscale, task-specific video datasets. In particular, adaptations of the DINOv2 image encoder and CLIP text encoder improve recognition accuracy over the frozen baseline by an average absolute gain of up to 3.47% across K5–KAll. Moreover, the proposed DoRA-adapted DINOv2 combined with an adapter-based CLIP text encoder achieves performance competitive with the state of the art on UCF101, HMDB51, and DIVING48, consistently outperforming prior methods in few-shot scenarios and reaching up to 82.0% accuracy with K2 training examples.
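The abstract names DoRA as the adaptation applied to the frozen DINOv2 branch but does not describe its mechanics. As a rough illustration only, the sketch below follows the generally published DoRA formulation, in which a frozen weight W0 is re-expressed as a trainable per-column magnitude m times a normalized direction, and the direction receives a low-rank update BA. The toy matrices, shapes, and function names are hypothetical and are not taken from the paper's implementation.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def column_norms(w):
    """L2 norm of each column of w."""
    return [math.sqrt(sum(row[j] ** 2 for row in w)) for j in range(len(w[0]))]

def dora_weight(w0, b, a, m):
    """Merged weight: m_j * (W0 + B A)[:, j] / ||(W0 + B A)[:, j]||."""
    delta = matmul(b, a)
    v = [[w0[i][j] + delta[i][j] for j in range(len(w0[0]))]
         for i in range(len(w0))]
    norms = column_norms(v)
    return [[m[j] * v[i][j] / norms[j] for j in range(len(v[0]))]
            for i in range(len(v))]

# Hypothetical frozen 3x2 weight; only b, a, and m would be trained.
w0 = [[3.0, 0.0], [4.0, 1.0], [0.0, 2.0]]
a = [[0.5, -0.5]]                 # rank-1 down-projection (1x2)
b_init = [[0.0], [0.0], [0.0]]    # zero-initialized up-projection (3x1)
m_init = column_norms(w0)         # magnitude starts at the column norms of W0

# With B = 0 the adapted weight reproduces the frozen weight.
merged_at_init = dora_weight(w0, b_init, a, m_init)

# After an (illustrative) update to B, the direction changes while each
# column's norm stays pinned to the magnitude vector m.
b_trained = [[0.2], [-0.1], [0.3]]
merged_trained = dora_weight(w0, b_trained, a, m_init)
```

Under these assumptions the backbone weight itself is never modified: at initialization the adapted layer behaves exactly like the pretrained one, which is consistent with the abstract's goal of preserving pretrained robustness while training only lightweight modules.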
| Original language | English |
|---|---|
| Pages (from-to) | 437-452 |
| Journal | Turkish Journal of Electrical Engineering and Computer Sciences |
| Volume | 34 |
| Issue number | 3 |
| Publication status | Published - 15 May 2026 |
| MoE publication type | A1 Journal article-refereed |
Keywords
- Action recognition
- multimodal
- parameter-efficient fine-tuning
- adaptation
Fingerprint
Dive into the research topics of 'Adapting independent large-scale pretrained models for human action recognition'. Together they form a unique fingerprint.