Edit As You Wish: Video Caption Editing with Multi-grained User Control
arXiv (2023)
Abstract
Automatically narrating videos in natural language in compliance with user
requests, i.e. the Controllable Video Captioning task, can help people manage
massive videos according to their intentions. However, existing works suffer
from two shortcomings: 1) the control signal is single-grained, which cannot
satisfy diverse user intentions; 2) the video description is generated in a
single round, so it cannot be further edited to meet dynamic needs. In this
paper, we propose a novel Video Caption Editing (VCE)
task to automatically revise an existing video description guided by
multi-grained user requests. Inspired by human writing-revision habits, we
design the user command as a pivotal triplet {operation, position,
attribute} to cover diverse user needs from coarse-grained to fine-grained.
To facilitate the VCE task, we automatically construct an open-domain
benchmark dataset named VATEX-EDIT and manually collect an e-commerce
dataset called EMMAD-EDIT. We further propose a specialized small-scale model
(i.e., OPA) and compare it with two generalist Large Multi-modal Models to
perform an exhaustive analysis of the novel task. For evaluation, we adopt
comprehensive metrics covering caption fluency, command-caption consistency,
and video-caption alignment. Experiments reveal that the task challenges
fine-grained multi-modal semantic understanding and processing. Our datasets,
code, and evaluation tools are ready to be open-sourced.