
ECTFormer: Efficient CNN-Transformer Network for Uncalibrated Multiview 3-D Human Pose Estimation

Research output: Article › peer-review

Abstract

Single-view camera sensors suffer from inherent depth ambiguity, which hinders progress in 3-D human pose estimation (HPE) and has sparked widespread research interest in multiview camera sensor systems. However, existing methods typically rely on complex camera calibration processes and are sensitive to dynamic environments. To address these limitations, we propose ECTFormer, an innovative calibration-free multiview 3-D HPE framework that seamlessly combines Transformer-based spatiotemporal modeling with convolutional neural network (CNN)-based local feature extraction. The primary contributions of this article are as follows. First, we introduce a hierarchical multiview spatiotemporal feature extraction network; through hierarchical learning, it avoids interference between noise from different viewpoints and employs a transformer to capture spatiotemporal features within each view for subsequent fusion. Second, we design a CNN-transformer fusion module (CTFM) that efficiently aggregates multiview features, enabling accurate 3-D pose regression. Extensive experiments on a public 3-D human pose benchmark demonstrate that our approach attains superior performance without relying on calibration. In real-world environments equipped with dynamic camera sensors, ECTFormer performs HPE efficiently.
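The abstract describes the architecture only at a high level and does not specify the internals of the CTFM. The sketch below is a minimal, hypothetical PyTorch reading of that two-stage design (a CNN extracting local per-view joint features, a transformer attending across views, and a linear head regressing 3-D joints); the `CTFMSketch` name, all layer sizes, and the fusion order are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a CNN-transformer fusion module for uncalibrated
# multiview 3-D pose regression. All design choices here are assumptions
# made for illustration; they are not the published ECTFormer design.
import torch
import torch.nn as nn

class CTFMSketch(nn.Module):
    """Hypothetical CTFM: a 1-D CNN extracts local per-view joint
    features, a transformer encoder fuses features across views, and a
    linear head regresses per-joint 3-D coordinates."""

    def __init__(self, num_joints=17, feat_dim=128):
        super().__init__()
        # CNN branch: local feature extraction along the joint dimension.
        self.local_cnn = nn.Sequential(
            nn.Conv1d(2, feat_dim, kernel_size=3, padding=1),  # 2-D keypoints in
            nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1),
        )
        # Transformer branch: each view is one token, so attention
        # aggregates evidence across cameras without calibration.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=4, batch_first=True)
        self.view_fusion = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.head = nn.Linear(feat_dim, 3)  # per-joint 3-D regression

    def forward(self, kpts_2d):
        # kpts_2d: (batch, views, joints, 2) uncalibrated 2-D detections.
        b, v, j, _ = kpts_2d.shape
        x = kpts_2d.reshape(b * v, j, 2).transpose(1, 2)   # (b*v, 2, j)
        x = self.local_cnn(x).transpose(1, 2)              # (b*v, j, feat)
        x = x.reshape(b, v, j, -1).permute(0, 2, 1, 3)     # (b, j, v, feat)
        x = x.reshape(b * j, v, -1)                        # views as tokens
        x = self.view_fusion(x).mean(dim=1)                # fuse across views
        return self.head(x).reshape(b, j, 3)               # (b, joints, 3)

# Usage: 2 clips, 4 views, 17 joints of 2-D input -> 3-D pose per joint.
pose_3d = CTFMSketch()(torch.randn(2, 4, 17, 2))
print(pose_3d.shape)  # torch.Size([2, 17, 3])
```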

Original language: English
Pages (from-to): 8487-8498
Number of pages: 12
Journal: IEEE Sensors Journal
Volume: 26
Issue number: 6
DOIs
Publication status: Published - 15 Mar. 2026
