TY - GEN
T1 - Enhanced Video Caption Generation Based on Multimodal Features
AU - Huang, Xuefei
AU - Ke, Wei
AU - Sheng, Hao
N1 - Publisher Copyright: © 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - Video captioning is the automatic generation of abstract descriptions of the content contained in videos. It involves two important fields, computer vision and natural language processing, and has become a significant research topic in smart life. Deep learning has contributed to this task with good results. Videos contain information in multiple modalities, yet most existing solutions start from the visual perspective of the video and ignore the equally important audio modality. How to benefit from cues other than visual information is therefore a major challenge. In this work, we propose a video caption generation method that fuses the multimodal features in videos and adds an attention mechanism to improve the quality of the generated description sentences. Experimental results on the MSR-VTT dataset validate the method.
AB - Video captioning is the automatic generation of abstract descriptions of the content contained in videos. It involves two important fields, computer vision and natural language processing, and has become a significant research topic in smart life. Deep learning has contributed to this task with good results. Videos contain information in multiple modalities, yet most existing solutions start from the visual perspective of the video and ignore the equally important audio modality. How to benefit from cues other than visual information is therefore a major challenge. In this work, we propose a video caption generation method that fuses the multimodal features in videos and adds an attention mechanism to improve the quality of the generated description sentences. Experimental results on the MSR-VTT dataset validate the method.
KW - deep learning
KW - feature extraction
KW - multimodal feature fusion
KW - video caption generation
UR - http://www.scopus.com/inward/record.url?scp=85167811188&partnerID=8YFLogxK
U2 - 10.1109/UV56588.2022.10185501
DO - 10.1109/UV56588.2022.10185501
M3 - Conference contribution
AN - SCOPUS:85167811188
T3 - 6th IEEE International Conference on Universal Village, UV 2022
BT - 6th IEEE International Conference on Universal Village, UV 2022
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 6th IEEE International Conference on Universal Village, UV 2022
Y2 - 22 October 2022 through 25 October 2022
ER -