Switchable Novel Object Captioner

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (2023)

Citations: 24 | Views: 102
Abstract
Image captioning aims to automatically describe images with sentences and typically requires large amounts of paired image-sentence data for training. However, trained captioning models can hardly be applied to new domains that contain novel words. In this paper, we introduce the zero-shot novel object captioning task, where the machine generates descriptions of novel objects without any extra training sentences. To tackle this challenging task, we mimic the way babies talk about something unknown, i.e., by using the word for a similar known object. Following this motivation, we build a key-value object memory from detection models, which contains the visual information and corresponding words of the objects in an image. For novel objects, we use the words of the most similar seen objects as proxy visual words to solve the out-of-vocabulary issue. We then propose a Switchable LSTM that incorporates knowledge from the object memory into sentence generation. The model has two switchable working modes: 1) generating sentences like a standard LSTM, and 2) retrieving proper nouns from the key-value memory. Thus, our model learns to fully disentangle language generation from the training objects and requires zero training sentences to describe novel objects. Experiments on three large-scale datasets demonstrate the ability of our method to describe novel concepts.
Keywords
Image captioning, novel object captioning, zero-shot learning
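The following is a minimal, illustrative sketch (not the authors' released code) of the mechanism the abstract describes: a key-value object memory whose keys are detected-object visual features and whose values are words, proxy visual words for out-of-vocabulary objects, and a two-mode decoding step that either generates a word as a standard LSTM would or retrieves one from the memory. All names, feature shapes, and the threshold-based switch are assumptions made for illustration only.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two 1-D feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

class KeyValueObjectMemory:
    """Keys: visual features of detected objects. Values: their (proxy) words."""
    def __init__(self):
        self.keys = []
        self.values = []

    def add(self, feature, word, seen_vocab, seen_features):
        # For a novel word, substitute the word of the most similar seen
        # object (a "proxy visual word") so decoding stays in-vocabulary.
        if word not in seen_vocab:
            sims = [cosine(feature, f) for f in seen_features]
            word = seen_vocab[int(np.argmax(sims))]
        self.keys.append(np.asarray(feature, dtype=np.float32))
        self.values.append(word)

    def retrieve(self, query):
        # Return the value whose key best matches the query feature.
        sims = np.array([cosine(query, k) for k in self.keys])
        return self.values[int(sims.argmax())]

def decode_step(switch_prob, lstm_word, memory, query, threshold=0.5):
    """Two switchable modes: retrieve an object word from the memory when the
    switch fires, otherwise emit the word generated by the language model."""
    if memory.keys and switch_prob > threshold:
        return memory.retrieve(query)   # mode 2: copy a (proxy) object word
    return lstm_word                    # mode 1: ordinary generation

# Toy usage with made-up features and words.
seen_vocab = ["dog", "cat"]
seen_feats = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
mem = KeyValueObjectMemory()
mem.add(np.array([0.9, 0.1]), "okapi", seen_vocab, seen_feats)  # novel word -> proxy "dog"
print(decode_step(0.9, "animal", mem, np.array([0.95, 0.05])))  # prints "dog"
```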