Size-Modulated Deformable Attention in Spatio-Temporal Video Grounding Pipelines

Hans Tiwari*, Selen Pehlivan, Jorma Laaksonen

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference article in proceedings › Scientific › peer-review

Abstract

The integration of attention mechanisms into computer vision tasks, inspired by the success of Transformers in natural language processing, has revolutionized various applications such as object detection and visual grounding. In this paper, we focus on spatio-temporal video grounding (STVG), a computer vision task that aims to jointly extract spatial and temporal regions from videos based on textual descriptions. Leveraging recent advancements in attention-based Transformer architectures, particularly in object detectors, and building upon a recent baseline model, we integrate two enhancements into its attention modules: Width-Height Modulation and Deformable Attention units. These enhancements aim to improve the accuracy and efficiency of STVG techniques on two datasets, HC-STVG and VidSTG, by addressing challenges related to feature inconsistencies and prediction reliability across video frames. As a result, our study contributes to advancing the baseline models in spatio-temporal video grounding, bridging the gap between the computer vision and natural language processing domains.
Original language: English
Title of host publication: Pattern Recognition - 27th International Conference, ICPR 2024, Proceedings
Editors: Apostolos Antonacopoulos, Subhasis Chaudhuri, Rama Chellappa, Cheng-Lin Liu, Saumik Bhattacharya, Umapada Pal
Publisher: Springer
Pages: 308-324
ISBN (Electronic): 978-3-031-78456-9
ISBN (Print): 978-3-031-78455-2
Publication status: Published - 2025
MoE publication type: A4 Article in a conference publication
Event: The International Conference on Pattern Recognition - Kolkata, India
Duration: 1 Dec 2024 to 5 Dec 2024
https://icpr2024.org/

Publication series

Series: Lecture Notes in Computer Science
Number: 15318
ISSN: 0302-9743

Conference

Conference: The International Conference on Pattern Recognition
Abbreviated title: ICPR
Country/Territory: India
Period: 1/12/24 to 5/12/24

Keywords

  • Video Grounding
  • Spatio-Temporal Video Grounding
  • Attention Unit
  • Transformers
