Finetuning Language Models for Multimodal Question Answering

Xin Zhang, Wen Xie, Ziqi Dai, Jun Rao, Haokun Wen, Xuan Luo, Meishan Zhang, Min Zhang

MM '23: Proceedings of the 31st ACM International Conference on Multimedia (2023)

Abstract
To achieve multimodal intelligence, AI must be able to process and respond to inputs from multimodal sources. However, many current question answering models are limited to specific answer types, such as yes/no or numeric answers, and require additional human assessment. Recently, the Visual Text Question Answering (VTQA) dataset was proposed to fill this gap. In this paper, we conduct an exhaustive analysis and exploration of this task. Specifically, we implement a T5-based multimodal generative network that overcomes the limitations of a traditional label space and allows more freedom in responses. Our approach achieves the best performance in both the English and Chinese tracks of the VTQA challenge.