Enhanced Video Caption Generation Based on Multimodal Features

Xuefei Huang, Wei Ke, Hao Sheng

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Video captioning is the automatic generation of abstract textual descriptions of the content of videos. It spans two important fields, computer vision and natural language processing, and has become a significant research topic in smart life. Deep learning has contributed to this task with good results. Video contains information in multiple modalities, yet most existing solutions approach the problem from the visual perspective alone, ignoring the equally important audio modality. How to benefit from cues beyond visual information is therefore a major challenge. In this work, we propose a video caption generation method that fuses multimodal features from videos and adds an attention mechanism to improve the quality of the generated description sentences. Experimental results demonstrate the effectiveness of the method on the MSR-VTT dataset.
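The abstract describes fusing visual and audio features and applying attention when decoding captions. The paper's actual architecture is not given here, so the following is only a generic sketch of that fusion-plus-attention idea; all names, dimensions, and the random "features" are hypothetical illustrations, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax for attention weights.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, features):
    """Scaled dot-product attention: weight per-timestep features by a decoder query."""
    d = features.shape[-1]
    scores = features @ query / np.sqrt(d)   # one score per time step, shape (T,)
    weights = softmax(scores)                # attention distribution over time
    return weights @ features                # context vector, shape (d,)

# Hypothetical per-timestep visual and audio features for one video clip.
rng = np.random.default_rng(0)
T, d = 8, 16
visual = rng.normal(size=(T, d))   # stand-in for CNN frame features
audio = rng.normal(size=(T, d))    # stand-in for audio features

# Early fusion: concatenate modalities per time step, then attend with a
# (here random) decoder hidden state playing the role of the query.
fused = np.concatenate([visual, audio], axis=-1)   # shape (T, 2d)
query = rng.normal(size=2 * d)
context = attend(query, fused)
print(context.shape)  # (32,)
```

In a full captioning model, `context` would be recomputed at every decoding step from the current decoder state, letting the language model focus on different moments of the video per generated word.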

Original language: English
Title of host publication: 6th IEEE International Conference on Universal Village, UV 2022
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781665474771
DOIs
Publication status: Published - 2022
Event: 6th IEEE International Conference on Universal Village, UV 2022 - Hybrid, Boston, United States
Duration: 22 Oct 2022 - 25 Oct 2022

Publication series

Name: 6th IEEE International Conference on Universal Village, UV 2022

Conference

Conference: 6th IEEE International Conference on Universal Village, UV 2022
Country/Territory: United States
City: Hybrid, Boston
Period: 22/10/22 - 25/10/22

Keywords

  • deep learning
  • feature extraction
  • multimodal feature fusion
  • video caption generation
