TY - JOUR
T1 - ECTFormer
T2 - Efficient CNN-Transformer Network for Uncalibrated Multiview 3-D Human Pose Estimation
AU - Song, Jucheng
AU - Yang, Xu
AU - Wang, Yapeng
AU - Zhang, Jie
AU - Im, Sio Kei
N1 - Publisher Copyright:
© 2001-2012 IEEE.
PY - 2026/3/15
Y1 - 2026/3/15
N2 - Single-view camera sensors suffer from inherent depth ambiguity, which hinders progress in 3-D human pose estimation (HPE) and has sparked widespread research interest in multiview camera sensor systems. However, existing methods typically rely on complex camera calibration procedures and are sensitive to dynamic environments. To address these limitations, we propose ECTFormer, a calibration-free multiview 3-D HPE framework that combines Transformer-based spatiotemporal modeling with convolutional neural network (CNN)-based local feature extraction. First, we introduce a hierarchical multiview spatiotemporal feature extraction network that avoids interference between noise from different viewpoints through hierarchical learning and employs a Transformer to capture spatiotemporal features within each view for subsequent fusion. Second, we design a CNN-Transformer fusion module (CTFM) that efficiently aggregates multiview features, enabling accurate 3-D pose regression. Extensive experiments on a public 3-D human pose benchmark demonstrate that our approach attains superior performance without relying on calibration, and that ECTFormer can efficiently perform HPE in real-world environments equipped with dynamic camera sensors.
AB - Single-view camera sensors suffer from inherent depth ambiguity, which hinders progress in 3-D human pose estimation (HPE) and has sparked widespread research interest in multiview camera sensor systems. However, existing methods typically rely on complex camera calibration procedures and are sensitive to dynamic environments. To address these limitations, we propose ECTFormer, a calibration-free multiview 3-D HPE framework that combines Transformer-based spatiotemporal modeling with convolutional neural network (CNN)-based local feature extraction. First, we introduce a hierarchical multiview spatiotemporal feature extraction network that avoids interference between noise from different viewpoints through hierarchical learning and employs a Transformer to capture spatiotemporal features within each view for subsequent fusion. Second, we design a CNN-Transformer fusion module (CTFM) that efficiently aggregates multiview features, enabling accurate 3-D pose regression. Extensive experiments on a public 3-D human pose benchmark demonstrate that our approach attains superior performance without relying on calibration, and that ECTFormer can efficiently perform HPE in real-world environments equipped with dynamic camera sensors.
KW - Convolutional neural network (CNN)
KW - multiview 3-D human pose estimation (HPE)
KW - Transformer
KW - uncalibrated camera
UR - https://www.scopus.com/pages/publications/105029549086
U2 - 10.1109/JSEN.2026.3658078
DO - 10.1109/JSEN.2026.3658078
M3 - Article
AN - SCOPUS:105029549086
SN - 1530-437X
VL - 26
SP - 8487
EP - 8498
JO - IEEE Sensors Journal
JF - IEEE Sensors Journal
IS - 6
ER -