Cross-Modal Generalization: Learning in Low Resource Modalities via Meta-Alignment

International Multimedia Conference (2021)

Citations: 25 | Views: 567
Abstract
How can we generalize to a new prediction task at test time when it also uses a new modality as input? More importantly, how can we do this with as little annotated data as possible? This problem of cross-modal generalization is a new research milestone with concrete impact on real-world applications. For example, can an AI system start understanding spoken language from mostly written text? Or can it learn the visual steps of a new recipe from only text descriptions? In this work, we formalize cross-modal generalization as a learning paradigm to train a model that can (1) quickly perform new tasks (from new domains) while (2) being originally trained on a different input modality. Such a learning paradigm is crucial for generalization to low-resource modalities such as speech in rare languages while utilizing a different high-resource modality such as text. One key technical challenge that distinguishes it from other learning paradigms such as meta-learning and domain adaptation is the presence of different source and target modalities, which require different encoders. We propose an effective solution based on meta-alignment, a novel method to align representation spaces using strongly and weakly paired cross-modal data while ensuring quick generalization to new tasks across different modalities. This approach uses key ideas from cross-modal learning and meta-learning, and presents strong results on the cross-modal generalization problem. We benchmark several approaches on three real-world classification tasks: few-shot recipe classification from text to images of recipes, object classification from images to audio of objects, and language classification from text to speech across 100 languages spanning many rare languages. Our results demonstrate strong performance even when the new target modality has only a few (1-10) labeled samples and in the presence of noisy labels, a scenario particularly prevalent in low-resource modalities.
Keywords
generalization,low resource modalities,learning,cross-modal,meta-alignment
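To make the idea concrete, here is a minimal toy sketch of the two-stage recipe the abstract describes: align a target-modality embedding space onto a source-modality space using strongly paired data, then classify target-modality samples from only one labeled example per class via nearest prototypes. This is *not* the paper's meta-alignment method; as a simplified stand-in it uses orthogonal Procrustes for the alignment step, and all data, dimensions, and function names below are synthetic assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # shared embedding dimension (arbitrary for this toy example)

# Synthetic latent class structure: 4 classes in a shared latent space.
centers = rng.normal(size=(4, d))
labels = rng.integers(0, 4, size=200)
latent = centers[labels] + 0.1 * rng.normal(size=(200, d))

# Two "modalities": each observes the latent space through its own view.
# The target modality is rotated by an unknown orthogonal map R plus noise.
R = np.linalg.qr(rng.normal(size=(d, d)))[0]
src = latent                                          # source-modality embeddings
tgt = latent @ R + 0.05 * rng.normal(size=(200, d))   # target-modality embeddings

def procrustes_align(b, a):
    """Orthogonal map W minimizing ||b @ W - a||_F (stand-in for meta-alignment)."""
    u, _, vt = np.linalg.svd(b.T @ a)
    return u @ vt

# Fit the alignment on the strongly paired half of the data.
W = procrustes_align(tgt[:100], src[:100])

def prototype_predict(support, support_y, query, n_classes):
    """Few-shot classification: assign each query to its nearest class prototype."""
    protos = np.stack([support[support_y == c].mean(0) for c in range(n_classes)])
    dists = ((query[:, None, :] - protos[None]) ** 2).sum(-1)
    return dists.argmin(1)

# Map held-out target-modality samples into the source space, then do
# 1-shot classification: one labeled support sample per class.
aligned = tgt[100:] @ W
y_test = labels[100:]
support_idx = [np.flatnonzero(y_test == c)[0] for c in range(4)]
preds = prototype_predict(aligned[support_idx], np.arange(4), aligned, 4)
acc = (preds == y_test).mean()
```

In this toy setup the alignment recovers the unknown rotation almost exactly, so 1-shot prototype accuracy on the held-out split is near perfect; the paper's benchmarks instead pair nonlinear modality-specific encoders with meta-learned alignment so that the same few-shot transfer works across genuinely different modalities (text to image, image to audio, text to speech).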