TV-TREES: Multimodal Entailment Trees for Neuro-Symbolic Video Reasoning
CoRR (2024)
Abstract
It is challenging to perform question-answering over complex, multimodal
content such as television clips. This is in part because current
video-language models rely on single-modality reasoning, suffer degraded
performance on long inputs, and lack interpretability. We propose TV-TREES, the
first multimodal entailment tree generator. TV-TREES serves as an approach to
video understanding that promotes interpretable joint-modality reasoning by
producing trees of entailment relationships between simple premises directly
entailed by the videos and higher-level conclusions. We then introduce the task
of multimodal entailment tree generation to evaluate the reasoning quality of
such methods. Our method's experimental results on the challenging TVQA dataset
demonstrate interpretable, state-of-the-art zero-shot performance on full video
clips, illustrating a best-of-both-worlds contrast to black-box methods.