Improving Diversity of Image Captioning Through Variational Autoencoders and Adversarial Learning

2019 IEEE Winter Conference on Applications of Computer Vision (WACV)

Cited by 6 | 1449 views
Abstract
Learning to translate images into human-readable natural language has become a major challenge in computer vision research in recent years. Existing works explore the semantic correlation between the visual and language domains via encoder-to-decoder learning frameworks that classify visual features in the language domain. This approach, however, is criticized for its lack of naturalness and diversity. In this paper, we demonstrate a novel way to learn a semantic connection between visual information and natural language directly, based on a Variational Autoencoder (VAE) trained in an adversarial routine. Instead of using a classification-based discriminator, our method directly learns to estimate the diversity between a hidden vector embedded by a text encoder and an informative feature sampled from the learned distribution of the autoencoder. We show that the sentences learned from this matching contain accurate semantic meaning with high diversity in the image captioning task. Our experiments on the popular MSCOCO dataset indicate that our method learns to generate high-quality natural language with competitive scores on both correctness and diversity.
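The abstract's core idea, a discriminator that scores the distance between a text embedding and a feature sampled from the VAE's learned distribution rather than classifying real vs. fake, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the dimensions, the `reparameterize` and `diversity_score` helpers, and the use of cosine distance as the "diversity" estimate are all illustrative assumptions standing in for learned encoders and a trained critic.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var, rng):
    # Sample z ~ N(mu, diag(sigma^2)) via the reparameterization trick,
    # so that in a real framework gradients could flow through mu/log_var.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def diversity_score(text_vec, visual_vec):
    # Hypothetical stand-in for the paper's discriminator: instead of a
    # real/fake classification, output a distance between the text-encoder
    # hidden vector and the sampled visual feature (here 1 - cosine sim).
    cos = text_vec @ visual_vec / (
        np.linalg.norm(text_vec) * np.linalg.norm(visual_vec)
    )
    return 1.0 - cos

# Toy dimensions and inputs (stand-ins for learned image/text encoders).
d = 8
mu = rng.standard_normal(d)          # mean of the learned latent distribution
log_var = rng.standard_normal(d) * 0.1
z = reparameterize(mu, log_var, rng) # feature sampled from that distribution
h = rng.standard_normal(d)           # hidden vector from a text encoder

print(diversity_score(h, z))
```

In an actual adversarial training loop, this score would drive both players: the critic learns to estimate the mismatch, while the caption generator is updated to produce embeddings the critic deems well matched, without collapsing to a single maximum-likelihood caption.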
Keywords
Visualization, Semantics, Gallium nitride, Generative adversarial networks, Training, Generators, Maximum likelihood estimation