Adapting independent large-scale pretrained models for human action recognition

Research output: Contribution to journal › Article › Scientific › peer-review

Abstract

Transferring knowledge from large-scale, independently pretrained image and text models to video understanding requires addressing several challenges, including maintaining the models' generalization capabilities, integrating them into multimodal architectures, and fine-tuning with temporal dynamics. This study evaluates the effectiveness of parameter-efficient fine-tuning (PEFT) techniques in transferring pretrained knowledge from two independent models for video action recognition within a simple, streamlined multimodal fusion pipeline. Specifically, we adapt CLIP as the text branch and DINOv2 as the image branch, keeping both backbones frozen to preserve their pretrained robustness, while introducing lightweight, task-specific modules to adapt and fuse the branches with temporal dynamics. A simple fusion transformer combines the image and text branches, enabling their efficient integration with minimal training cost. We systematically evaluate the framework on widely recognized midscale video benchmark datasets, comparing prompt-based and adapter-based PEFT techniques across different data regimes. Our results demonstrate that this combination achieves competitive performance, highlights the transferability and scalability of independent pretrained models for a targeted task, and provides practical insights for adapting large models using midscale, task-specific video datasets. In particular, adaptations of the DINOv2 image encoder and CLIP text encoder improve recognition accuracy over the frozen baseline by an average absolute gain of up to 3.47% across K5 to KAll. Moreover, the proposed DoRA DINOv2 combined with an adapter-based CLIP text encoder achieves performance competitive with the state of the art on UCF101, HMDB51, and DIVING48, consistently outperforming prior methods in few-shot scenarios and reaching up to 82.0% accuracy with K2 training examples.
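The abstract names DoRA (weight-decomposed low-rank adaptation) as the adapter applied to the frozen DINOv2 encoder but does not spell out the mechanism. The sketch below is an illustration of the general DoRA reparameterization, not the authors' implementation: a frozen weight W0 is adapted as W' = m · (W0 + BA) / ||W0 + BA||, with the norm taken per column, and only the magnitude vector m and the low-rank factors B and A are trained. The layer sizes, rank, initialization scales, and all function names are assumptions chosen for illustration, and plain Python lists stand in for a deep learning framework.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def column_norms(w):
    """L2 norm of every column of w."""
    return [math.sqrt(sum(row[j] ** 2 for row in w)) for j in range(len(w[0]))]


def dora_weight(w0, lora_b, lora_a, magnitude):
    """Return magnitude * (W0 + B A) / ||W0 + B A||, normalised per column."""
    delta = matmul(lora_b, lora_a)  # low-rank update, shape of w0
    v = [[w0[i][j] + delta[i][j] for j in range(len(w0[0]))]
         for i in range(len(w0))]
    norms = column_norms(v)
    return [[magnitude[j] * v[i][j] / norms[j] for j in range(len(norms))]
            for i in range(len(v))]


random.seed(0)
d_out, d_in, rank = 64, 64, 4  # hypothetical toy sizes, not from the paper

# Frozen pretrained weight (random stand-in for one backbone projection).
w0 = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]

# Trainable parts: B starts at zero so the update B A is zero at initialisation,
# and the magnitude vector starts at the column norms of the frozen weight.
lora_b = [[0.0] * rank for _ in range(d_out)]
lora_a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
magnitude = column_norms(w0)

adapted = dora_weight(w0, lora_b, lora_a, magnitude)
max_diff = max(abs(adapted[i][j] - w0[i][j])
               for i in range(d_out) for j in range(d_in))

trainable = d_out * rank + rank * d_in + d_in  # B, A and the magnitude vector
frozen = d_out * d_in
print(f"max |W' - W0| at init: {max_diff:.2e}")
print(f"trainable parameters: {trainable}, frozen parameters: {frozen}")
```

With this initialization the adapted weight coincides with the frozen weight before any training step, which is one reason such adapters can preserve pretrained behavior while adding only a small number of trainable parameters per layer.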
Original language: English
Pages (from-to): 437-452
Journal: Turkish Journal of Electrical Engineering and Computer Sciences
Volume: 34
Issue number: 3
DOIs
Publication status: Published - 15 May 2026
MoE publication type: A1 Journal article-refereed

Keywords

  • Action recognition
  • multimodal
  • parameter-efficient fine-tuning
  • adaptation
