Edit3K: Universal Representation Learning for Video Editing Components
arXiv (2024)
Abstract
This paper focuses on understanding the predominant video creation pipeline,
i.e., compositional video editing with six main types of editing components,
including video effects, animation, transition, filter, sticker, and text. In
contrast to existing visual representation learning on visual materials (i.e.,
images/videos), we aim to learn visual representations of the editing
actions/components that are applied to raw materials. We start by
proposing the first large-scale dataset for editing components in video
creation, which covers 3,094 editing components with 618,800 videos.
Each video in our dataset is rendered from various image/video materials with a
single editing component, which supports atomic visual understanding of
different editing components. It can also benefit several downstream tasks,
e.g., editing component recommendation, editing component
recognition/retrieval, etc. Existing visual representation methods perform
poorly on this task because it is difficult to disentangle the visual appearance of editing
components from the raw materials. To address this, we benchmark popular alternative
solutions and propose a novel method that learns to attend to the appearance of
editing components regardless of raw materials. Our method achieves favorable
results on editing component retrieval/recognition compared to the alternative
solutions. A user study is also conducted to show that our representations
cluster visually similar editing components better than other alternatives.
Furthermore, when applied to the transition recommendation task, our learned
representations achieve state-of-the-art results on the AutoTransition dataset. The code
and dataset will be released for academic use.
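The abstract does not spell out the training objective, but the dataset design (each editing component rendered over many different raw materials) naturally suggests a material-invariant contrastive setup. The sketch below is a minimal, hypothetical PyTorch illustration of that idea, not the authors' actual architecture; ComponentEncoder, component_contrastive_loss, and the pairing scheme are assumptions for illustration only.

    # Minimal sketch (assumed, not the paper's exact method): a contrastive
    # objective that treats two renders of the SAME editing component on
    # DIFFERENT raw materials as a positive pair, so the encoder is pushed to
    # ignore material appearance and attend to the editing component itself.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ComponentEncoder(nn.Module):
        """Hypothetical video encoder; any clip backbone could stand in here."""
        def __init__(self, backbone: nn.Module, embed_dim: int = 256):
            super().__init__()
            self.backbone = backbone            # maps (B, C, T, H, W) -> (B, feat_dim)
            self.proj = nn.LazyLinear(embed_dim)

        def forward(self, clips: torch.Tensor) -> torch.Tensor:
            feats = self.backbone(clips)
            return F.normalize(self.proj(feats), dim=-1)

    def component_contrastive_loss(z_a, z_b, temperature: float = 0.07):
        """InfoNCE over a batch where z_a[i] and z_b[i] show the same editing
        component applied to different raw materials; other rows are negatives."""
        logits = z_a @ z_b.t() / temperature              # (B, B) similarity matrix
        targets = torch.arange(z_a.size(0), device=z_a.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

Under this (assumed) pairing scheme, the embedding of a clip is driven toward whatever is shared across renders of the same component, which is the editing action rather than the underlying image/video material.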