Entity Resolution in Situated Dialog With Unimodal and Multimodal Transformers

IEEE/ACM Transactions on Audio, Speech, and Language Processing (2024)

Abstract
In this work, we address the entity resolution task for situated multimodal dialog, investigating how a unimodal approach, which uses only textual information as input (representing visual attributes as text), compares to a multimodal system, which processes both text and visual information. We analyze two of the top-performing models presented in the Tenth Dialog Systems Technology Challenge and propose modifications that enhance their performance on the multimodal coreference resolution task. We evaluate these approaches in in-domain and out-of-domain settings by training the models on the fashion domain and testing on the furniture domain, and vice versa, to assess their generalizability. Through systematic analysis, we show that while both systems achieve similar performance in in-domain scenarios, the multimodal system generalizes better to out-of-domain settings. A combination strategy of the enhanced unimodal and multimodal systems achieves F1 = 0.80 (a 5% absolute gain over the best-performing system). Finally, human performance on the same task is evaluated on a small subset, suggesting that the current automatic models are on par with people on this task.
Keywords
BART, DSTC10, multimodal coreference resolution, SIMMC2.0, transformers, UNITER