Cross-Modal Attention Guided Enhanced Fusion Network for RGB-T Tracking

  • Jun Liu
  • Wei Ke
  • Shuai Wang
  • Da Yang
  • Hao Sheng

Research output: Contribution to journal › Article › peer-review

Abstract

RGB-T tracking combines the RGB and thermal infrared modalities, aiming to exploit the useful information in each modality for more robust object localization. Most existing tracking methods based on convolutional neural networks (CNNs) and Transformers emphasize integrating multi-modal features through cross-modal attention, but overlook that the complementary information learned by cross-modal attention can itself be exploited to enhance the modal features. In this paper, we propose a novel hierarchical progressive fusion network based on cross-modal attention guided enhancement for RGB-T tracking. Specifically, the complementary information generated by cross-modal attention implicitly reflects the regions of interest on which the modalities agree, and it is used to enhance the modal features in a targeted manner. In addition, a modal feature refinement module and a fusion module, both based on dynamic routing, are designed to suppress noise in the enhanced multi-modal features and integrate them adaptively. Extensive experiments on GTOT, RGBT234, LasHeR and VTUAV show that our method performs competitively with recent state-of-the-art methods.
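
As a rough illustration of the general idea of cross-modal attention used for feature enhancement, the NumPy sketch below lets each modality attend to the other and adds the attended (complementary) features back to the original ones. It is not the network described in the abstract: the function names, the token and channel sizes, the use of identity projections instead of learned query/key/value weights, and the plain residual addition are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_modal_attention(query_feats, context_feats):
    # Hypothetical helper: single-head scaled dot-product attention in which
    # tokens of one modality (queries) attend to tokens of the other modality
    # (keys and values). Identity projections are an assumption; a real model
    # would use learned linear layers.
    d = query_feats.shape[-1]
    scores = query_feats @ context_feats.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ context_feats, weights


def enhance(rgb, tir):
    # Hypothetical enhancement step: add the complementary features gathered
    # from the other modality back to each modality as a residual.
    comp_for_rgb, _ = cross_modal_attention(rgb, tir)
    comp_for_tir, _ = cross_modal_attention(tir, rgb)
    return rgb + comp_for_rgb, tir + comp_for_tir


# Toy inputs: 16 tokens with 32 channels per modality (sizes are arbitrary).
rng = np.random.default_rng(0)
rgb = rng.standard_normal((16, 32))
tir = rng.standard_normal((16, 32))
rgb_enh, tir_enh = enhance(rgb, tir)
```

Here `rgb_enh` and `tir_enh` keep the shapes of their inputs, so a later refinement or fusion stage could consume them unchanged. The dynamic-routing modules mentioned in the abstract are not modelled in this sketch.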

Original language: English
Pages (from-to): 276-280
Number of pages: 5
Journal: IEEE Signal Processing Letters
Volume: 33
Publication status: Published - Nov 2025

Keywords

  • Cross-modal attention
  • RGB-T tracking
  • dynamic routing
  • multi-modal fusion
