Text-to-Multimodal Retrieval with Bimodal Input Fusion in Shared Cross-Modal Transformer

Abstract
The rapid proliferation of multimedia content has necessitated the development of effective multimodal video retrieval systems. Multimodal video retrieval is a non-trivial task that involves retrieving relevant information across different modalities, such as text, audio, and video. This work aims to improve multimodal retrieval by guiding the creation of a shared embedding space with task-specific contrastive loss functions. A central aspect of our work is a model that learns retrieval cues for the textual query from multiple modalities, both separately and jointly, within a hierarchical architecture that can be flexibly extended and fine-tuned for any number of modalities. To this end, the loss functions and the architectural design of the model are developed with a strong focus on increasing the mutual information between the textual and cross-modal representations. The proposed approach is quantitatively evaluated on the MSR-VTT and YouCook2 text-to-video retrieval benchmarks. The results show that the approach not only holds its own against state-of-the-art methods but also outperforms them in a number of scenarios, with notable relative improvements over the baseline in the R@1, R@5, and R@10 metrics.
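The paper itself is not reproduced here, but as a rough illustration of the kind of objective the abstract describes, the sketch below implements a standard symmetric InfoNCE contrastive loss (a common way to maximize a lower bound on the mutual information between paired representations in a shared embedding space), together with a Recall@K helper matching the R@1/R@5/R@10 metrics mentioned above. The function names, the temperature value, and the use of PyTorch are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(text_emb, video_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired text/video embeddings.

    Matched pairs sit on the diagonal of the similarity matrix; all other
    entries in the same row/column serve as in-batch negatives. This is a
    generic sketch, not the paper's exact loss.
    """
    text_emb = F.normalize(text_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    logits = text_emb @ video_emb.t() / temperature  # (B, B) scaled cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_t2v = F.cross_entropy(logits, targets)      # text -> video direction
    loss_v2t = F.cross_entropy(logits.t(), targets)  # video -> text direction
    return (loss_t2v + loss_v2t) / 2

def recall_at_k(text_emb, video_emb, k=1):
    """Fraction of text queries whose matching video ranks in the top k."""
    text_emb = F.normalize(text_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    sims = text_emb @ video_emb.t()                  # (B, B) query-by-candidate scores
    topk = sims.topk(k, dim=-1).indices              # top-k candidate indices per query
    targets = torch.arange(sims.size(0), device=sims.device).unsqueeze(-1)
    return (topk == targets).any(dim=-1).float().mean().item()
```

In a retrieval setting, the same two functions are typically reused at evaluation time: embeddings for the full test set are computed once, and `recall_at_k` is called with k = 1, 5, and 10 to produce the R@1, R@5, and R@10 figures.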
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) |
| Publisher | ELRA Language Resources Association |
| Pages | 15823–15834 |
| ISBN (Electronic) | 978-2-493814-10-4 |
| Publication status | Published - 2024 |
| MoE publication type | A4 Article in a conference publication |
| Event | 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Torino, Italy Duration: 20 May 2024 → 25 May 2024 |
Conference
| Conference | 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 |
|---|---|
| Country/Territory | Italy |
| City | Torino |
| Period | 20/05/24 → 25/05/24 |
Funding
This research has been funded by the Research Council of Finland under project #345791, "Understanding speech and scene with ears and eyes" (USSEE).