End-to-End Dual-Stream Transformer with a Parallel Encoder for Video Captioning

Journal of Circuits, Systems and Computers (2024)

Abstract
In this paper, we propose an end-to-end dual-stream transformer with a parallel encoder (DST-PE) for video captioning, which combines multimodal features and global-local representations to generate coherent captions. First, we design a parallel encoder, consisting of a local visual encoder and a bridge module, that simultaneously generates refined local and global visual features. Second, we devise a multimodal encoder to enhance the representation ability of our model. Finally, we adopt a transformer decoder that takes the multimodal features as input and fuses the local visual features with textual features through a cross-attention block. Extensive experimental results demonstrate that our model achieves state-of-the-art performance at a low training cost on several widely used datasets.
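The fusion step in the decoder can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes single-head scaled dot-product attention with no learned projections, treats textual features as queries and local visual features as keys and values, and adds a residual connection. The function names `softmax` and `cross_attention` and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_feats, visual_feats):
    """Single-head scaled dot-product cross-attention (assumed form).

    text_feats:   query vectors, one per caption token
    visual_feats: key/value vectors, one per local visual feature
    Returns (fused vectors, attention weights).
    """
    d = len(visual_feats[0])
    fused, weights = [], []
    for q in text_feats:
        # Similarity of this text token to every local visual feature.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in visual_feats
        ]
        w = softmax(scores)
        # Weighted sum of visual features: the visual context for this token.
        ctx = [sum(wj * v[i] for wj, v in zip(w, visual_feats)) for i in range(d)]
        # Residual connection (an assumption): add the context to the query.
        fused.append([qi + ci for qi, ci in zip(q, ctx)])
        weights.append(w)
    return fused, weights


# Toy inputs: 2 text tokens and 3 local visual features, each of dimension 2.
text = [[1.0, 0.0], [0.0, 1.0]]
visual = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
fused, weights = cross_attention(text, visual)
```

Each row of `weights` is a probability distribution over the visual features, so a text token draws most of its context from the visual features it aligns with best. A real model would add learned query, key, and value projections and multiple heads.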
Keywords
Video captioning, parallel encoder, multimodal encoder, end-to-end, global-local representations, transformer