Prompt-based Weakly-supervised Vision-language Pre-training

Abstract
Weakly-supervised Vision-Language Pre-training (W-VLP) explores methods leveraging weak cross-modal supervision, typically relying on object tags generated from images by a pre-trained object detector (OD). However, training such an OD necessitates dense cross-modal information, including images paired with numerous object-level annotations. To alleviate that requirement, this paper addresses W-VLP in two stages: (1) creating data with weaker cross-modal supervision and (2) pre-training a vision-language (VL) model with the created data. The data creation process involves collecting knowledge from large language models (LLMs) to describe images. Given an image's category label, descriptions generated by an LLM are used as the language counterpart. This knowledge supplements what an OD can provide, such as spatial relationships among objects most likely appearing in a scene. To mitigate the noise in the LLM-generated descriptions, which destabilizes training and may lead to overfitting, we incorporate knowledge distillation and external retrieval-augmented knowledge during pre-training. Furthermore, we present an effective VL model pre-trained with the created data. Empirically, despite its weaker cross-modal supervision, our pre-trained VL model notably outperforms other W-VLP works in image and text retrieval tasks, e.g., surpassing VLMixer by a relative 17.7% on MSCOCO and RELIT by a relative 11.25% on Flickr30K in Recall@1 on text-to-image retrieval. It also shows superior performance on other VL downstream tasks, making a significant stride towards matching the performance of strongly supervised VLP models. The results reveal the effectiveness of the proposed W-VLP methodology.
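The data-creation stage described above — eliciting image descriptions from an LLM given only a category label — can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the prompt templates, function names, and the `generate` callable are all assumptions.

```python
# Hypothetical sketch of the data-creation step: given an image's category
# label, build prompts asking an LLM to describe a scene containing that
# category, then pair each label with the generated descriptions as weak
# image-text supervision. Templates and names are illustrative assumptions.

PROMPT_TEMPLATES = [
    "Describe an image that contains a {label}.",
    "What objects typically appear alongside a {label}, and how are they arranged?",
]


def build_prompts(label: str) -> list[str]:
    """Instantiate every prompt template with the given category label."""
    return [t.format(label=label) for t in PROMPT_TEMPLATES]


def create_weak_pairs(labels, generate):
    """Pair each category label with LLM-generated descriptions.

    `generate` is a callable (prompt -> description), e.g. a wrapper around
    an LLM API. Returns (label, description) pairs usable as weakly
    supervised image-text data, where the label stands in for the image.
    """
    pairs = []
    for label in labels:
        for prompt in build_prompts(label):
            pairs.append((label, generate(prompt)))
    return pairs
```

In this sketch, each category label yields one weak image-text pair per template; the paper additionally filters noise via knowledge distillation and retrieval-augmented knowledge during pre-training.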
| Original language | English |
|---|---|
| Pages (from-to) | 8-15 |
| Number of pages | 8 |
| Journal | Pattern Recognition Letters |
| Volume | 197 |
| DOIs | |
| Publication status | Published - Nov 2025 |
| MoE publication type | A1 Journal article-refereed |
Funding
This work was supported by the Academy of Finland (USSEE project, No. 345791) and the National Natural Science Foundation of China (No. 62476188). Computational resources were provided by the LUMI supercomputer, owned by the EuroHPC Joint Undertaking and hosted by CSC and the LUMI consortium.
Keywords
- Deep learning
- Vision and language pre-training
- Weakly-supervised