
GTFPose: a unified framework with double-chain GCN–Transformer fusion for 3D human pose estimation

Research output: Contribution to journal › Article › peer-review

Abstract

Monocular 3D human pose estimation faces numerous challenges, including depth ambiguity, self-occlusion, and significant pose variability. Existing methods typically rely on Graph Convolutional Networks (GCNs) to model local structure or employ Transformers to capture global relationships, yet both approaches suffer from fundamental limitations: GCNs struggle to capture global information, while Transformers are weak at extracting local details. To address these shortcomings and fuse their strengths, this study proposes GTFPose, a unified dual-chain architecture. Through an adaptive fusion mechanism, it dynamically balances the GCN and Transformer branches, leveraging both models' advantages to model local and global context efficiently. We further observe that Transformers with conventional positional encodings cannot effectively capture both spatial and temporal information. To address this, we introduce TJ-RoPE, a novel method that enhances long-range spatiotemporal reasoning by rotating positional embeddings along both the joint and temporal axes. Comprehensive evaluations on the Human3.6M and MPI-INF-3DHP datasets demonstrate that GTFPose surpasses existing methods on the MPJPE and P-MPJPE metrics, setting new records and validating the effectiveness of the dual-chain fusion strategy for accurate and efficient 3D human pose estimation. Our code is available at: https://github.com/pray0915/GTFPose.git.
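The abstract does not give TJ-RoPE's exact formulation. As a rough illustration of the stated idea of rotating positional embeddings along both the joint and temporal axes, the following is a minimal NumPy sketch: half of the feature channels are rotated by the frame (temporal) index and the other half by the joint index. The function names, channel split, and frequency schedule are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    """Standard rotary position embedding along the last axis.

    x: (..., P, D) features with D even; positions: (P,) integer positions.
    Channel pairs (2i, 2i+1) are rotated by angle positions * theta_i.
    """
    D = x.shape[-1]
    assert D % 2 == 0
    theta = base ** (-np.arange(0, D, 2) / D)        # (D/2,) frequencies
    ang = positions[:, None] * theta[None, :]        # (P, D/2) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin             # 2D rotation per channel pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def tj_rope(x):
    """Rotate half the channels by temporal index, half by joint index.

    x: (T, J, D) with D divisible by 4 (T frames, J joints).
    Hypothetical reading of a joint-and-time rotary embedding.
    """
    T, J, D = x.shape
    half = D // 2
    # Temporal half: move time to the second-to-last axis, rotate, move back.
    t_part = rope_rotate(np.swapaxes(x[..., :half], 0, 1), np.arange(T))
    # Joint half: joints are already the second-to-last axis.
    j_part = rope_rotate(x[..., half:], np.arange(J))
    return np.concatenate([np.swapaxes(t_part, 0, 1), j_part], axis=-1)
```

Because each channel pair undergoes a pure rotation, the per-token feature norm is preserved, and, as with standard RoPE, dot products between rotated features depend only on relative (not absolute) offsets along each axis.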

Original language: English
Article number: 210
Journal: Visual Computer
Volume: 42
Issue number: 5
Publication status: Published - Mar 2026

Keywords

  • 3D human pose estimation
  • Double-chain fusion
  • Graph convolutional networks
  • Rotary position embedding
  • Transformer

