Pair Expansion for Learning Multilingual Semantic Embeddings Using Disjoint Visually-Grounded Speech Audio Datasets.

INTERSPEECH (2020)

Abstract
We propose a data expansion method for learning a multilingual semantic embedding model using disjoint datasets containing images and their multilingual audio captions. Here, disjoint means that no images are shared among the datasets for the different languages, in contrast to existing work on multilingual semantic embedding based on visually grounded speech audio, which assumes that each image is associated with spoken captions in multiple languages. Although learning on disjoint datasets is more challenging, we consider it crucial in practical situations. Our main idea is to refer to other paired data when evaluating the loss for an anchor image. We call this scheme "pair expansion". The motivation behind this idea is to utilize even disjoint pairs by finding similarities, or commonalities, that may exist between different images. Specifically, we examine two approaches for calculating similarities: one using image embedding vectors and the other using object recognition results. Our experiments show that expanded pairs improve cross-modal and cross-lingual retrieval accuracy compared with non-expanded cases. They also show that similarities measured by image embedding vectors yield better accuracy than those based on object recognition results.
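The abstract's first similarity approach, matching images across the disjoint datasets by their image embedding vectors, can be illustrated with a minimal sketch. The function name, the threshold, and the nearest-neighbor matching policy below are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def expand_pairs(embeds_a, embeds_b, threshold=0.8):
    """Hypothetical pair-expansion step: for each anchor image in
    dataset A, find the most similar image in disjoint dataset B by
    cosine similarity of their image embeddings, and treat the match
    as an expanded pair if it exceeds a threshold."""
    # L2-normalize rows so the dot product equals cosine similarity
    a = embeds_a / np.linalg.norm(embeds_a, axis=1, keepdims=True)
    b = embeds_b / np.linalg.norm(embeds_b, axis=1, keepdims=True)
    sims = a @ b.T  # (num_a, num_b) cosine similarity matrix
    pairs = []
    for i in range(sims.shape[0]):
        j = int(np.argmax(sims[i]))  # nearest neighbor in dataset B
        if sims[i, j] >= threshold:
            pairs.append((i, j))
    return pairs

# Toy example: two 2-D embeddings per dataset; each anchor in A
# matches its near-duplicate in B.
pairs = expand_pairs(np.array([[1.0, 0.0], [0.0, 1.0]]),
                     np.array([[0.9, 0.1], [0.1, 0.9]]))
print(pairs)  # [(0, 0), (1, 1)]
```

Each expanded pair would then let a loss term for an anchor image in one language's dataset also draw on the matched image's caption from the other language's dataset.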
Keywords
Vision and spoken language, multilingual semantic embeddings, disjoint datasets, pair expansion, cross-lingual retrieval