ECTFormer: Efficient CNN-Transformer Network for Uncalibrated Multi-view 3D Human Pose Estimation

Research output: Contribution to journal › Article › peer-review

Abstract

Single-view camera sensors suffer from inherent depth ambiguity, which hinders progress in 3D human pose estimation (HPE) and has sparked widespread research interest in multi-view camera sensor systems. However, existing multi-view methods typically rely on complex camera calibration and are sensitive to dynamic environments. To address these limitations, we propose ECTFormer, a calibration-free multi-view 3D HPE framework that seamlessly combines Transformer-based spatio-temporal modeling with CNN-based local feature extraction. The main contributions of this paper are as follows. First, we introduce a hierarchical multi-view spatio-temporal feature extraction network: through hierarchical learning it avoids interference between noise from different viewpoints, and it employs a Transformer to capture spatio-temporal features within each view for subsequent fusion. Second, we design a CNN-Transformer fusion module (CTFM) that efficiently aggregates multi-view features, enabling accurate 3D pose regression. Extensive experiments on a public 3D human pose benchmark demonstrate that our approach attains superior performance without relying on calibration. In real-world environments equipped with dynamic camera sensors, ECTFormer can therefore perform human pose estimation efficiently.
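The paper's full architecture is not given in this abstract, but the described pipeline (per-view Transformer encoding of spatio-temporal features, followed by a CNN-based fusion module that aggregates views and regresses 3D joints) can be illustrated with a minimal PyTorch sketch. All layer sizes, the class name `CTFMSketch`, and the pooling/fusion choices below are assumptions for illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn

class CTFMSketch(nn.Module):
    """Illustrative CNN-Transformer fusion sketch (NOT the paper's ECTFormer)."""
    def __init__(self, num_views=4, num_joints=17, feat_dim=64):
        super().__init__()
        # Per-view temporal Transformer: each view is encoded separately
        # (hierarchical learning), so noise from one viewpoint does not
        # interfere with another at this stage.
        enc_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=4, batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # CNN-based local fusion across the view axis (assumed 1x1 conv).
        self.view_cnn = nn.Conv1d(num_views, 1, kernel_size=1)
        # Regress 3D joint coordinates from the fused feature.
        self.head = nn.Linear(feat_dim, num_joints * 3)
        self.num_joints = num_joints

    def forward(self, x):
        # x: (batch, views, frames, feat_dim) -- per-view pose features
        b, v, t, d = x.shape
        x = x.reshape(b * v, t, d)
        x = self.temporal_encoder(x)         # spatio-temporal features per view
        x = x.mean(dim=1).reshape(b, v, d)   # pool over the time axis
        fused = self.view_cnn(x).squeeze(1)  # aggregate views -> (batch, feat_dim)
        return self.head(fused).reshape(b, self.num_joints, 3)

model = CTFMSketch()
out = model(torch.randn(2, 4, 9, 64))       # 2 clips, 4 views, 9 frames
print(out.shape)  # torch.Size([2, 17, 3])
```

Note that no camera parameters appear anywhere in the sketch, which is the point of the calibration-free design: view aggregation is learned rather than derived from extrinsics.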

Original language: English
Journal: IEEE Sensors Journal
Publication status: Accepted/In press, 2026

Keywords

  • CNN
  • Multi-view 3D Human Pose Estimation
  • Transformer
  • Uncalibrated Camera
