TY - JOUR
T1 - GTFPose
T2 - a unified framework with double-chain GCN–transformer fusion for 3D human pose estimation
AU - Zhang, Junjia
AU - Song, Jucheng
AU - Yang, Xu
AU - Wang, Yapeng
AU - Im, Sio Kei
N1 - Publisher Copyright:
© The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2026.
PY - 2026/3
Y1 - 2026/3
N2 - Monocular 3D human pose estimation faces numerous challenges, including depth ambiguity, self-occlusion, and significant pose variability. Existing methods typically rely on Graph Convolutional Networks (GCNs) to model local structure or employ Transformers to capture global relationships, yet both approaches suffer from fundamental limitations: GCNs struggle to capture global information, while Transformers are weak at extracting local details. To address these shortcomings and fuse their strengths, this study proposes GTFPose, a unified dual-chain architecture. Through an adaptive fusion mechanism, it dynamically balances GCNs and Transformers, leveraging both models’ advantages to ensure efficient modeling of local and global contexts. We further observe that Transformers cannot effectively extract both spatial and temporal information from positional encodings. To address this, we introduce a novel method, TJ-RoPE, which enhances long-term spatiotemporal reasoning by rotating positional embeddings along both the joint and temporal axes. Comprehensive evaluations on the Human3.6M and MPI-INF-3DHP datasets demonstrate that GTFPose surpasses existing methods on the MPJPE and P-MPJPE metrics, setting new records and validating the effectiveness of the dual-chain fusion strategy for accurate and efficient 3D human pose estimation. Our code is available at: https://github.com/pray0915/GTFPose.git.
AB - Monocular 3D human pose estimation faces numerous challenges, including depth ambiguity, self-occlusion, and significant pose variability. Existing methods typically rely on Graph Convolutional Networks (GCNs) to model local structure or employ Transformers to capture global relationships, yet both approaches suffer from fundamental limitations: GCNs struggle to capture global information, while Transformers are weak at extracting local details. To address these shortcomings and fuse their strengths, this study proposes GTFPose, a unified dual-chain architecture. Through an adaptive fusion mechanism, it dynamically balances GCNs and Transformers, leveraging both models’ advantages to ensure efficient modeling of local and global contexts. We further observe that Transformers cannot effectively extract both spatial and temporal information from positional encodings. To address this, we introduce a novel method, TJ-RoPE, which enhances long-term spatiotemporal reasoning by rotating positional embeddings along both the joint and temporal axes. Comprehensive evaluations on the Human3.6M and MPI-INF-3DHP datasets demonstrate that GTFPose surpasses existing methods on the MPJPE and P-MPJPE metrics, setting new records and validating the effectiveness of the dual-chain fusion strategy for accurate and efficient 3D human pose estimation. Our code is available at: https://github.com/pray0915/GTFPose.git.
KW - 3D human pose estimation
KW - Double-chain fusion
KW - Graph convolutional networks
KW - Rotary position embedding
KW - Transformer
UR - https://www.scopus.com/pages/publications/105033332447
U2 - 10.1007/s00371-026-04417-x
DO - 10.1007/s00371-026-04417-x
M3 - Article
AN - SCOPUS:105033332447
SN - 0178-2789
VL - 42
JO - Visual Computer
JF - Visual Computer
IS - 5
M1 - 210
ER -