Internship: probing joint vision-and-language representations
Semantic Scholar (2020)
Abstract
Context: Recent advances in deep learning have enabled exciting applications in multimodal processing involving images and text, such as visual question answering [1], visual dialog [4], image captioning [14], and text understanding in multimodal context [5]. This internship focuses on exploring representations trained to perform such tasks. Vision-and-language representations are typically extracted with neural networks based on the transformer architecture, pre-trained with self-supervision on large datasets such as Conceptual Captions [11] or MSCOCO [10]. The resulting family of architectures generally represents objects extracted from the image as embeddings, concatenates them with the word embeddings of the text, and feeds the combined sequence through multiple layers of attention mechanisms [12, 13, 9, 2].
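The single-stream processing described above (object embeddings concatenated with word embeddings, then attention layers) can be sketched as follows. This is a toy illustration, not the actual models of [12, 13, 9, 2]: the embedding dimension, the number of regions and tokens, and the single unmasked attention layer are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """One scaled dot-product self-attention layer over a token sequence x."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])  # (n_tokens, n_tokens) attention
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d = 16  # hypothetical shared embedding dimension

# Stand-ins for the real inputs: embeddings of detected image regions
# and word embeddings of the accompanying text.
object_embeddings = rng.normal(size=(4, d))  # e.g. 4 detected regions
word_embeddings = rng.normal(size=(6, d))    # e.g. 6 text tokens

# Concatenate both modalities into one token sequence.
tokens = np.concatenate([object_embeddings, word_embeddings], axis=0)

# Random projection weights stand in for learned parameters.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(tokens, Wq, Wk, Wv)
print(out.shape)  # (10, 16): one contextualized vector per region/word token
```

In the real architectures this layer is stacked many times, with multi-head attention, residual connections, and modality/position encodings added to the inputs before concatenation.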