DARTScore: DuAl-Reconstruction Transformer for Video Captioning Evaluation

IEEE Transactions on Circuits and Systems for Video Technology (2023)

Abstract
Video captioning evaluation aims to assess the semantic consistency between a video and a candidate caption, and should measure two aspects: faithfulness (whether the information conveyed by the candidate is correct with respect to the video) and comprehensiveness (whether the main video content is covered by the candidate). However, previous approaches have difficulty evaluating faithfulness and comprehensiveness, either because they rely heavily on references or because of the heterogeneity of visual and textual data. In this paper, we propose a vision-involved evaluation metric based on a novel DuAl-Reconstruction Transformer, named DARTScore. DARTScore formulates the caption evaluation task as a dual-reconstruction problem in order to evaluate both faithfulness and comprehensiveness explicitly. Since a word in a candidate is usually related to several frames, DARTScore adaptively collects the relevant frames to reconstruct the word and uses the reconstruction accuracy as the faithfulness score, which inherently reflects whether the word's information is contained in the video. Conversely, DARTScore reconstructs each frame from the relevant words to evaluate comprehensiveness. By integrating these fine-grained bidirectional reconstruction accuracies, DARTScore drills into each word in the candidate and each frame in the video to fully evaluate their semantic consistency. Furthermore, we collect and annotate two Chinese datasets with a large domain gap between them, named CRAETE-EVAL and VATEX-ZH-EVAL, to systematically evaluate existing metrics and fill the gap in Chinese video captioning evaluation. Experimental results show that DARTScore achieves higher correlation with human judgments, relies less on references, and generalizes well to data from different domains.
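The dual-reconstruction idea described in the abstract can be illustrated with a toy sketch. This is not the DARTScore model: the paper uses a trained transformer, whereas the sketch below assumes pre-computed word and frame vectors in a shared space and substitutes untrained softmax attention for the learned reconstruction. The function names, the temperature value, and the use of a simple average to combine the two directions are all hypothetical choices made for illustration.

```python
import math

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cosine(a, b):
    na, nb = math.sqrt(_dot(a, a)), math.sqrt(_dot(b, b))
    return _dot(a, b) / (na * nb) if na and nb else 0.0

def _reconstruct(query, sources, temperature=0.1):
    """Rebuild `query` as a softmax-weighted sum of `sources`.

    Attention weights come from cosine similarity (an assumption of this
    sketch; the paper learns its reconstruction with a transformer).
    """
    sims = [_cosine(query, s) / temperature for s in sources]
    peak = max(sims)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in sims]
    total = sum(weights)
    dim = len(sources[0])
    return [sum(w * s[i] for w, s in zip(weights, sources)) / total
            for i in range(dim)]

def directional_score(targets, sources, temperature=0.1):
    """Mean cosine similarity between each target and its reconstruction."""
    scores = [_cosine(t, _reconstruct(t, sources, temperature)) for t in targets]
    return sum(scores) / len(scores)

def toy_dual_score(word_vecs, frame_vecs):
    # Faithfulness: can each word be rebuilt from the video frames?
    faithfulness = directional_score(word_vecs, frame_vecs)
    # Comprehensiveness: can each frame be rebuilt from the caption words?
    comprehensiveness = directional_score(frame_vecs, word_vecs)
    # Hypothetical combination rule: plain average of the two directions.
    combined = (faithfulness + comprehensiveness) / 2
    return faithfulness, comprehensiveness, combined

if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.0, 1.0]]  # two frames with different content
    words = [[1.0, 0.0]]               # caption mentions only the first
    print(toy_dual_score(words, frames))
```

In this toy input the single word matches the first frame, so faithfulness comes out close to 1, while only one of the two frames can be rebuilt from the caption, so comprehensiveness is 0.5. That asymmetry is the reason for scoring the two directions separately.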
Keywords
Chinese video captioning evaluation, dual-reconstruction transformer