Abstract
Dense video captioning aims to generate detailed natural-language descriptions for an untrimmed video, which requires deep analysis of the video's semantics to identify and describe the events it contains. Existing methods typically follow a localisation-then-captioning pipeline over the given frame sequence, so caption generation is highly dependent on which objects have been detected. This work proposes a parallel dense video captioning method that generates event proposals and captions simultaneously, addressing the mutual constraint between the two tasks. In addition, a deformable Transformer framework is introduced to reduce or eliminate the manual threshold tuning that such methods usually require. An information transfer station is also added as a representation organiser: it receives the hidden features extracted from the frames and implicitly generates multiple event proposals. The proposed method adopts an LSTM (long short-term memory) network with deformable attention as the main layer for caption generation. Experimental results on the ActivityNet Captions dataset show that the proposed method outperforms other methods in this area to a certain degree and provides competitive results.
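To make the parallel-decoding idea concrete, the following minimal PyTorch sketch shows one way a shared set of event-query features could feed a localisation head and an LSTM captioning head in parallel, rather than localising first and captioning second. All module names, tensor sizes, and the use of standard multi-head attention in place of the paper's deformable attention are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class ParallelCaptionHead(nn.Module):
    """Decodes event queries into (centre, length) spans and captions in parallel."""

    def __init__(self, d_model=256, vocab_size=10000, max_caption_len=20):
        super().__init__()
        # Cross-attention between event queries and frame features
        # (a stand-in for the deformable attention used in the paper).
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        # Localisation head: predicts a normalised (centre, length) pair per event query.
        self.loc_head = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                      nn.Linear(d_model, 2), nn.Sigmoid())
        # Captioning head: an LSTM conditioned on the attended query feature.
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)
        self.word_proj = nn.Linear(d_model, vocab_size)
        self.max_caption_len = max_caption_len

    def forward(self, event_queries, frame_features):
        # event_queries: (B, N_events, d_model); frame_features: (B, T, d_model)
        attended, _ = self.cross_attn(event_queries, frame_features, frame_features)

        # Both heads consume the same attended features, so there is no fixed
        # ordering between proposal generation and caption generation.
        spans = self.loc_head(attended)                       # (B, N_events, 2)

        B, N, D = attended.shape
        # Repeat each event feature as the LSTM input sequence; autoregressive
        # decoding / teacher forcing is omitted for brevity.
        lstm_in = attended.reshape(B * N, 1, D).repeat(1, self.max_caption_len, 1)
        lstm_out, _ = self.lstm(lstm_in)
        word_logits = self.word_proj(lstm_out).reshape(B, N, self.max_caption_len, -1)
        return spans, word_logits


if __name__ == "__main__":
    # Random tensors stand in for encoded video features and learned event queries.
    model = ParallelCaptionHead()
    queries = torch.randn(2, 10, 256)   # 10 event queries per video
    frames = torch.randn(2, 100, 256)   # 100 encoded frames
    spans, logits = model(queries, frames)
    print(spans.shape, logits.shape)    # (2, 10, 2) (2, 10, 20, 10000)
```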
Original language | English |
---|---|
Article number | 3685 |
Journal | Mathematics |
Volume | 11 |
Issue number | 17 |
DOIs | |
Publication status | Published - Sep. 2023 |
News/Media

Findings in Mathematics Reported from Faculty of Applied Sciences (Parallel Dense Video Caption Generation with Multi-Modal Features)
12/09/23
1 item of media coverage (Press/Media)