AutoAD III: The Prequel -- Back to the Pixels
CVPR 2024
Abstract
Generating Audio Description (AD) for movies is a challenging task that
requires fine-grained visual understanding and an awareness of the characters
and their names. Currently, visual language models for AD generation are
limited by a lack of suitable training data, and their evaluation is
hampered by performance measures that are not specialized to the AD domain. In
this paper, we make three contributions: (i) We propose two approaches for
constructing AD datasets with aligned video data, and build training and
evaluation datasets using these. These datasets will be publicly released; (ii)
We develop a Q-former-based architecture which ingests raw video and generates
AD, using frozen pre-trained visual encoders and large language models; and
(iii) We provide new evaluation metrics to benchmark AD quality that are
well-matched to human performance. Taken together, we improve the state of the
art on AD generation.
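Contribution (ii) describes a Q-former-based architecture that bridges a frozen visual encoder and a frozen large language model. A minimal sketch of that bridging idea is below: a fixed set of learnable queries cross-attends to frozen frame features, and the resulting tokens are projected into the language model's embedding space. All dimensions, the class name, and the single-block design are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn


class QFormerBridge(nn.Module):
    """Sketch of a Q-Former-style bridge (assumed sizes, single block).

    Learnable queries attend to frozen visual features; outputs are
    projected to the LLM embedding dimension as soft prompt tokens.
    """

    def __init__(self, vis_dim=768, hidden_dim=256, llm_dim=1024,
                 num_queries=32, num_heads=8):
        super().__init__()
        # Learnable query tokens (the only trainable "prompt" state).
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim) * 0.02)
        # Project frozen visual features into the bridge's hidden space.
        self.vis_proj = nn.Linear(vis_dim, hidden_dim)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads,
                                                batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(hidden_dim, 4 * hidden_dim),
            nn.GELU(),
            nn.Linear(4 * hidden_dim, hidden_dim),
        )
        self.norm1 = nn.LayerNorm(hidden_dim)
        self.norm2 = nn.LayerNorm(hidden_dim)
        # Map query outputs into the (frozen) LLM's token-embedding space.
        self.llm_proj = nn.Linear(hidden_dim, llm_dim)

    def forward(self, frame_feats):
        # frame_feats: (batch, num_patches, vis_dim) from a frozen encoder.
        b = frame_feats.size(0)
        kv = self.vis_proj(frame_feats)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        attn_out, _ = self.cross_attn(q, kv, kv)
        x = self.norm1(q + attn_out)
        x = self.norm2(x + self.ffn(x))
        # (batch, num_queries, llm_dim): soft prompts fed to the frozen LLM.
        return self.llm_proj(x)


# Usage: ViT-like patch features for two clips -> 32 soft-prompt tokens each.
feats = torch.randn(2, 196, 768)
tokens = QFormerBridge()(feats)
print(tokens.shape)  # torch.Size([2, 32, 1024])
```

In this arrangement only the bridge is trained; the visual encoder and the language model stay frozen, which is what lets the approach work with limited AD training data.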