Visual Question Answering With a Hybrid Convolution Recurrent Model.

ICMR '18: International Conference on Multimedia Retrieval, Yokohama, Japan, June 2018

Abstract
Visual Question Answering (VQA) is a relatively new task that aims to infer an answer sentence for an input image coupled with a corresponding question. Instead of generating answers dynamically, answers are usually inferred by selecting the most probable one from a fixed set of candidate answers. Previous work thus modeled the answering part of VQA only as a classification task and did not address the problem of producing arbitrary answers. To tackle this problem, we infer answer sentences with a Long Short-Term Memory (LSTM) network, which allows us to dynamically generate answers for (image, question) pairs. In a series of experiments, we develop an end-to-end Deep Neural Network structure that dynamically answers questions about a given input image by means of an LSTM decoder network. With this approach, we are able to generate both less common answers, which classification models do not consider, and, as datasets containing answers of more than three words appear, more complex answers.
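To illustrate the kind of hybrid convolution-recurrent architecture the abstract describes, the sketch below (not the authors' code) combines a CNN image encoder, an LSTM question encoder, and an LSTM decoder that generates the answer word by word rather than picking it from a fixed answer set. All layer sizes, the vocabulary size, the fusion scheme, and the class names are illustrative assumptions.

```python
# Minimal hybrid CNN + LSTM VQA sketch (assumed details, for illustration only).
import torch
import torch.nn as nn


class HybridVQA(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Small CNN image encoder (in practice a pretrained backbone would be used).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, hidden_dim),
        )
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # LSTM encoder for the question tokens.
        self.q_encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # LSTM decoder that generates the answer sequence token by token.
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image, question, answer_in):
        img_feat = self.cnn(image)                        # (B, hidden_dim)
        _, (q_h, q_c) = self.q_encoder(self.embed(question))
        # Multimodal fusion (one common choice): element-wise product of the
        # question state and image feature initializes the decoder.
        h0 = (q_h[-1] * img_feat).unsqueeze(0)
        dec_out, _ = self.decoder(self.embed(answer_in), (h0, q_c))
        return self.out(dec_out)                          # (B, T, vocab_size)


if __name__ == "__main__":
    model = HybridVQA()
    image = torch.randn(2, 3, 224, 224)
    question = torch.randint(0, 10000, (2, 12))   # question token ids
    answer_in = torch.randint(0, 10000, (2, 5))   # shifted answer tokens (teacher forcing)
    print(model(image, question, answer_in).shape)  # torch.Size([2, 5, 10000])
```

At inference time, such a decoder would be run autoregressively, feeding each generated token back in until an end-of-sentence token is produced, which is what allows answers longer than the fixed classification vocabulary.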
Keywords
VQA, Visual Question Answering, multimodal retrieval, natural language generation, LSTM, multimodal fusion