Abstract
Single-view camera sensors suffer from inherent depth ambiguity that hinders progress in 3D human pose estimation (HPE), sparking widespread research interest in multi-view camera sensor systems. However, existing methods typically rely on complex camera calibration procedures and are sensitive to dynamic environments. To address these limitations, we propose ECTFormer, a calibration-free multi-view 3D HPE framework that seamlessly combines Transformer-based spatiotemporal modeling with CNN-based local feature extraction. Our main contributions are as follows. First, we introduce a hierarchical multi-view spatio-temporal feature extraction network. Through hierarchical learning, this network prevents noise from different viewpoints from interfering with one another, and it employs a Transformer to capture spatio-temporal features within each view for subsequent fusion. Second, we design a CNN-Transformer fusion module (CTFM) that efficiently aggregates multi-view features, enabling accurate 3D pose regression. Extensive experiments on a public 3D human pose benchmark demonstrate that our approach attains superior performance without relying on calibration. In real-world environments equipped with dynamic camera sensors, ECTFormer can efficiently perform human pose estimation.
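The pipeline described in the abstract (per-view spatio-temporal encoding, CNN-Transformer fusion, 3D regression) can be sketched as follows. This is a minimal illustrative NumPy sketch, not the paper's actual architecture: all layer names, feature sizes, and the specific fusion order are assumptions made for demonstration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, Wq, Wk, Wv):
    """Single-head self-attention over a (T, d) temporal sequence."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    w = softmax(q @ k.T / np.sqrt(x.shape[-1]))
    return w @ v

def temporal_conv(x, kernel):
    """Kernel-size-3 'same' depthwise convolution over time: (T, d) -> (T, d)."""
    T, _ = x.shape
    pad = np.pad(x, ((1, 1), (0, 0)))
    return sum(pad[i:i + T] * kernel[i] for i in range(3))

def ectformer_sketch(views, seed=0):
    """views: (V, T, J, 2) 2D keypoints from V uncalibrated cameras,
    T frames, J joints. Returns a (T, J, 3) 3D pose sequence.
    All weights are random here; a real model would train them."""
    rng = np.random.default_rng(seed)
    V, T, J, _ = views.shape
    d = 16  # hypothetical feature width
    p = lambda *s: rng.standard_normal(s) * 0.1
    Win, Wq, Wk, Wv = p(J * 2, d), p(d, d), p(d, d), p(d, d)
    Wmix = p(V * d, d)                 # pointwise (1x1-conv-style) view mixing
    kernel = p(3, d)                   # small temporal conv kernel
    Wq2, Wk2, Wv2, Wout = p(d, d), p(d, d), p(d, d), p(d, J * 3)

    # 1) Hierarchical per-view encoding: each view gets its own temporal
    #    attention pass, so noise in one view does not leak into another.
    feats = [attention(views[v].reshape(T, J * 2) @ Win, Wq, Wk, Wv)
             for v in range(V)]
    # 2) CTFM-style fusion: concatenate views, mix with a pointwise layer,
    #    refine locally with a temporal conv (CNN branch) and globally with
    #    attention (Transformer branch).
    fused = np.concatenate(feats, axis=-1) @ Wmix      # (T, d)
    fused = temporal_conv(fused, kernel)
    fused = attention(fused, Wq2, Wk2, Wv2)
    # 3) Per-frame regression to 3D joints.
    return (fused @ Wout).reshape(T, J, 3)
```

Note that no camera parameters appear anywhere in the sketch: all cross-view reasoning happens through learned fusion, which is what makes the approach calibration-free.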
| Original language | English |
|---|---|
| Journal | IEEE Sensors Journal |
| DOIs | |
| Publication status | Accepted/In press - 2026 |
Keywords
- CNN
- Multi-view 3D Human Pose Estimation
- Transformer
- Uncalibrated Camera
Title
ECTFormer: Efficient CNN-Transformer Network for Uncalibrated Multi-view 3D Human Pose Estimation