Internship: probing joint vision-and-language representations

Emmanuelle Salin, Stephane Ayache, Benoit Favre, Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, Jingjing Liu, Kevin Clark, Urvashi Khandelwal, Omer Levy, Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh

Semantic Scholar (2020)

Abstract
Context: Recent advances in deep learning have enabled exciting applications in multimodal processing of images and text, such as visual question answering [1], visual dialog [4], image captioning [14], and text understanding in multimodal contexts [5]. This internship focuses on exploring the representations trained to perform such tasks. Vision-and-language representations are typically extracted with neural networks based on the Transformer architecture, pre-trained with self-supervision on large datasets such as Conceptual Captions [11] or MSCOCO [10]. The resulting family of architectures generally represents objects extracted from the image as embeddings, concatenates them with the word embeddings of the accompanying text, and feeds the joint sequence through multiple layers of attention mechanisms [12, 13, 9, 2].
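To make the described architecture concrete, below is a minimal PyTorch sketch of a single-stream vision-and-language encoder: detector-style region features are projected into the word-embedding space, concatenated with the text tokens, and passed through a stack of self-attention layers. All class names, dimensions, and hyper-parameters here are illustrative assumptions, not values taken from any specific model in the cited papers.

```python
import torch
import torch.nn as nn

class MinimalVLTransformer(nn.Module):
    """Illustrative single-stream vision-and-language encoder (a sketch,
    not the architecture of any particular cited paper)."""

    def __init__(self, vocab_size=30522, hidden_dim=768,
                 obj_feat_dim=2048, num_layers=4, num_heads=8):
        super().__init__()
        # Word-piece embeddings for the text side.
        self.word_emb = nn.Embedding(vocab_size, hidden_dim)
        # Project object-detector region features into the same space.
        self.obj_proj = nn.Linear(obj_feat_dim, hidden_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, token_ids, obj_feats):
        # token_ids: (batch, num_tokens)      ids of the caption/question tokens
        # obj_feats: (batch, num_objs, 2048)  region features from a detector
        text = self.word_emb(token_ids)
        objs = self.obj_proj(obj_feats)
        # Concatenate the two modalities along the sequence axis so that
        # self-attention can mix visual and textual positions freely.
        joint = torch.cat([objs, text], dim=1)
        return self.encoder(joint)  # contextualised joint representations


# Example with random inputs (36 regions, 12 text tokens).
model = MinimalVLTransformer()
tokens = torch.randint(0, 30522, (2, 12))
regions = torch.randn(2, 36, 2048)
out = model(tokens, regions)  # shape: (2, 48, 768)
```

Probing such a model would then amount to training lightweight classifiers on the hidden states returned by `forward`, layer by layer, to test which linguistic or visual properties they encode.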