Abstract
Remote sensing image change captioning (RSICC) aims to generate natural language descriptions of the changes between bi-temporal images. Existing RSICC datasets focus primarily on building changes and lack descriptions of the dynamic processes in complex coastal cities. Furthermore, current algorithms often rely on transformers to compare difference features, and transformers lack a local spatial inductive bias. To address these issues, we created the Macao Land Cover Change (MLCC) dataset, annotated with standardized directional terms, and we propose the DINOv3 Guided Difference Feature Fusion Change Captioning algorithm (DINO-DFFCC). DINO-DFFCC uses a frozen DINOv3 as the feature encoder to obtain robust semantic features. A Bi-temporal Difference Feature Adaptor (BDFA) aligns the semantic features from DINOv3 with the coarse-grained difference maps extracted by convolution. A Re-parameterized Convolution Difference Feature Fusion module (RCDFF) iteratively fuses semantic and difference information, capturing multi-scale spatial context. Experimental results show that DINO-DFFCC outperforms state-of-the-art methods on the MLCC dataset, achieving a BLEU-4 of 0.4547 and a CIDEr of 1.5125. The dataset and code are available at https://github.com/juncyan/dffcc.git.
| Original language | English |
|---|---|
| Journal | International Journal of Remote Sensing |
| DOIs | |
| Publication status | Accepted/In press - 2026 |
Keywords
- DINOv3
- Remote sensing
- change captioning
- land cover
- re-parameterized convolution
Fingerprint
Dive into the research topics of 'DINOv3 Guided difference feature fusion for remote sensing image change captioning: a case study on Macao land cover'. Together they form a unique fingerprint.