Learning to Describe Implicit Changes: Noise-robust Pre-training for Image Difference Captioning

  • Zixin Guo
  • Jiayang Sun
  • Tzu Jui Julius Wang
  • Abduljalil Radman
  • Selen Pehlivan
  • Min Cao
  • Jorma Laaksonen

Research output: Chapter in Book/Report/Conference proceeding › Conference article in proceedings › Scientific › peer-review

Abstract

Image Difference Captioning (IDC) methods have advanced in highlighting subtle differences between similar images, but their performance is often constrained by limited training data. Using Large Multimodal Models (LMMs) to describe changes in image pairs mitigates data limits but adds noise. These change descriptions are often coarse summaries, obscuring fine details and hindering noise detection. In this work, we improve IDC with a noise-robust approach at both data and model levels. We use LMMs with structured prompts to generate fine-grained change descriptions during data curation. We propose a Noise-Aware Modeling and Captioning (NAMC) model with three modules: Noise Identification and Masking (NIM) to reduce noisy correspondences, Masked Image Reconstruction (MIR) to correct over-masking errors, and Fine-grained Description Generation (FDG) to produce coherent change descriptions. Experiments on four IDC benchmarks show that NAMC, pre-trained on our large-scale data, outperforms streamlined architectures and achieves competitive performance with LLM-finetuned methods, offering better inference efficiency.
Original language: English
Title of host publication: Findings of the Association for Computational Linguistics: EMNLP 2025
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Place of Publication: Kerrville
Publisher: Association for Computational Linguistics (ACL)
Pages: 10125-10145
ISBN (Print): 979-8-89176-335-7
Publication status: Published - 2025
MoE publication type: A4 Article in a conference publication
Event: 2025 Conference on Empirical Methods in Natural Language Processing - Suzhou, China
Duration: 4 Nov 2025 to 9 Nov 2025

Conference

Conference: 2025 Conference on Empirical Methods in Natural Language Processing
Country/Territory: China
City: Suzhou
Period: 4/11/25 to 9/11/25
