TY - JOUR
T1 - A simple transformer-based baseline for crowd tracking with Sequential Feature Aggregation and Hybrid Group Training
AU - Wang, Cui
AU - Wu, Zewei
AU - Ke, Wei
AU - Xiong, Zhang
N1 - Publisher Copyright: © 2024 Elsevier Inc.
PY - 2024/4
Y1 - 2024/4
N2 - Tracking pedestrians in crowded scenes is a challenging task. Existing transformer-based tracking methods integrate detection and tracking into a unified model, which simplifies the tracking process. However, these methods also introduce complicated attention mechanisms that increase the model complexity and cost. To address this issue, we propose SOTTrack, a simple online transformer-based method for crowd tracking. Our method enhances feature learning and training strategies without sacrificing simplicity and efficiency. Specifically, we introduce the Sequential Feature Aggregation (SFA) module and the Hybrid Group Training (HGT) approach. The SFA module fuses features from sequential images to improve the temporal consistency of visual features within short time intervals. The HGT approach assigns different queries to multiple guided tasks, such as label assignment and de-noising, which are only used during training and do not incur any inference cost. We evaluate our method on the MOT17 and MOT20 datasets and demonstrate its competitive performance.
KW - Crowd tracking
KW - Hybrid Group Training
KW - Temporal enhanced representation
KW - Transformer-based tracking
UR - http://www.scopus.com/inward/record.url?scp=85189940879&partnerID=8YFLogxK
U2 - 10.1016/j.jvcir.2024.104144
DO - 10.1016/j.jvcir.2024.104144
M3 - Article
AN - SCOPUS:85189940879
SN - 1047-3203
VL - 100
JO - Journal of Visual Communication and Image Representation
JF - Journal of Visual Communication and Image Representation
M1 - 104144
ER -