TY - JOUR
T1 - SQ-ViT
T2 - A Multi-Scale Vision Transformer With Quaternion For Endoscopic Images Classification
AU - Jin, Zhanjun
AU - Huang, Guoheng
AU - Zhang, Feng
AU - Yuan, Xiaochen
AU - Zhu, Dingzhou
AU - Tan, Zhe
AU - Pun, Chi Man
AU - Zhong, Guo
N1 - Publisher Copyright:
© 1975-2011 IEEE.
PY - 2024
Y1 - 2024
N2 - In the field of medical consumer electronics, endoscopic imaging technology, especially electronic nasopharyngoscope imaging, often suffers from low resolution, and the resulting loss of image detail makes endoscopic image classification difficult. Recent advancements in Vision Transformer (ViT) based methods have shown promise in addressing this problem. However, ViT relies heavily on global context information to maintain performance, and the limited pixel count of low-resolution images makes it hard to capture adequate global context. To address these challenges, we propose the Sequential Quaternion Vision Transformer (SQ-ViT), which improves multi-scale feature utilization by feeding sampled features into the subsequent encoder layers. Specifically, we introduce the Multi-scale Visual Feature Fusion (MVFF) module, which segments the image into multiple superpixel blocks and refines the contour and color information of the processed image, enhancing the representation of visual features. Additionally, our proposed Quaternion Interactive Encoder (QIE) captures visual information more effectively. Experiments demonstrate the effectiveness of SQ-ViT in improving multi-scale feature utilization and in addressing the challenges that low-resolution endoscopic imaging poses for endoscopic image classification. The source code will be released at https://github.com/jinzhanjun625/SQViT.
AB - In the field of medical consumer electronics, endoscopic imaging technology, especially electronic nasopharyngoscope imaging, often suffers from low resolution, and the resulting loss of image detail makes endoscopic image classification difficult. Recent advancements in Vision Transformer (ViT) based methods have shown promise in addressing this problem. However, ViT relies heavily on global context information to maintain performance, and the limited pixel count of low-resolution images makes it hard to capture adequate global context. To address these challenges, we propose the Sequential Quaternion Vision Transformer (SQ-ViT), which improves multi-scale feature utilization by feeding sampled features into the subsequent encoder layers. Specifically, we introduce the Multi-scale Visual Feature Fusion (MVFF) module, which segments the image into multiple superpixel blocks and refines the contour and color information of the processed image, enhancing the representation of visual features. Additionally, our proposed Quaternion Interactive Encoder (QIE) captures visual information more effectively. Experiments demonstrate the effectiveness of SQ-ViT in improving multi-scale feature utilization and in addressing the challenges that low-resolution endoscopic imaging poses for endoscopic image classification. The source code will be released at https://github.com/jinzhanjun625/SQViT.
KW - Endoscopic Images Classification
KW - Endoscopy
KW - Interpretability
KW - Quaternion Convolution
KW - Superpixel
KW - Vision Transformer
UR - http://www.scopus.com/inward/record.url?scp=85213027035&partnerID=8YFLogxK
U2 - 10.1109/TCE.2024.3518755
DO - 10.1109/TCE.2024.3518755
M3 - Article
AN - SCOPUS:85213027035
SN - 0098-3063
JO - IEEE Transactions on Consumer Electronics
JF - IEEE Transactions on Consumer Electronics
ER -