TY - JOUR
T1 - HAResformer
T2 - A Hybrid ResNet-Transformer Hierarchical Aggregation Architecture for Visible-Infrared Person Re-Identification
AU - Qian, Yongheng
AU - Tang, Su Kit
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Modality differences and intra-modality variations make visible-infrared person re-identification (VI-ReID) highly challenging. Most existing methods build network frameworks on convolutional neural networks (CNNs) or pure vision transformers (ViTs) to extract discriminative features and address these challenges. However, these methods overlook several key observations: deeply fusing local features with global spatial information yields a more comprehensive discriminative representation, patch tokens carry rich semantic information, and different feature extraction stages of a network emphasize different semantic elements. To address these issues, we propose a novel hybrid ResNet-transformer hierarchical aggregation architecture named HAResformer. HAResformer comprises three key components: a hierarchical feature extraction (HFE) framework, deeply supervised aggregation (DSA), and a hierarchical global aggregate encoder (HGAE). Specifically, HFE introduces a lightweight cross-encoder feature fusion module (CFFM) to deeply integrate the local features and global spatial information of a person extracted by the ResNet encoder (RE) and the transformer encoder (TE). The fused features are then fed as global priors into the next-stage TE for deep interaction, with the aim of extracting specific local features and global contextual cues. Additionally, DSA and HGAE provide auxiliary supervision and aggregation over multi-scale features to enhance multi-granularity feature representation. HAResformer effectively alleviates modality differences and reduces intra-modality variations. Extensive experiments on three benchmarks demonstrate the effectiveness and generalization of our architecture, which outperforms most state-of-the-art methods. HAResformer has the potential to become a new VI-ReID baseline, promoting high-quality future research.
AB - Modality differences and intra-modality variations make visible-infrared person re-identification (VI-ReID) highly challenging. Most existing methods build network frameworks on convolutional neural networks (CNNs) or pure vision transformers (ViTs) to extract discriminative features and address these challenges. However, these methods overlook several key observations: deeply fusing local features with global spatial information yields a more comprehensive discriminative representation, patch tokens carry rich semantic information, and different feature extraction stages of a network emphasize different semantic elements. To address these issues, we propose a novel hybrid ResNet-transformer hierarchical aggregation architecture named HAResformer. HAResformer comprises three key components: a hierarchical feature extraction (HFE) framework, deeply supervised aggregation (DSA), and a hierarchical global aggregate encoder (HGAE). Specifically, HFE introduces a lightweight cross-encoder feature fusion module (CFFM) to deeply integrate the local features and global spatial information of a person extracted by the ResNet encoder (RE) and the transformer encoder (TE). The fused features are then fed as global priors into the next-stage TE for deep interaction, with the aim of extracting specific local features and global contextual cues. Additionally, DSA and HGAE provide auxiliary supervision and aggregation over multi-scale features to enhance multi-granularity feature representation. HAResformer effectively alleviates modality differences and reduces intra-modality variations. Extensive experiments on three benchmarks demonstrate the effectiveness and generalization of our architecture, which outperforms most state-of-the-art methods. HAResformer has the potential to become a new VI-ReID baseline, promoting high-quality future research.
KW - CNN
KW - Cross-Modality
KW - Feature Fusion
KW - Multi-Scale Supervision
KW - Person Re-Identification
KW - Vision Transformer
UR - http://www.scopus.com/inward/record.url?scp=86000451387&partnerID=8YFLogxK
U2 - 10.1109/JIOT.2025.3547920
DO - 10.1109/JIOT.2025.3547920
M3 - Article
AN - SCOPUS:86000451387
SN - 2327-4662
JO - IEEE Internet of Things Journal
JF - IEEE Internet of Things Journal
ER -