Hierarchical Scene Graph Encoder-Decoder for Image Paragraph Captioning

MM '20: The 28th ACM International Conference on Multimedia, Seattle, WA, USA, October 2020

Abstract
When we humans tell a long paragraph about an image, we usually first implicitly compose a mental "script" and then follow it to generate the paragraph. Inspired by this, we endow the modern encoder-decoder based image paragraph captioning model with this ability by proposing the Hierarchical Scene Graph Encoder-Decoder (HSGED) for generating coherent and distinctive paragraphs. In particular, we use the image scene graph as the "script" to incorporate rich semantic knowledge and, more importantly, the hierarchical constraints into the model. Specifically, we design a sentence scene graph RNN (SSG-RNN) to generate sub-graph level topics, which constrain the word scene graph RNN (WSG-RNN) to generate the corresponding sentences. We propose irredundant attention in SSG-RNN to improve the possibility of abstracting topics from rarely described sub-graphs, and inheriting attention in WSG-RNN to generate more grounded sentences with the abstracted topics; both give rise to more distinctive paragraphs. An efficient sentence-level loss is also proposed to encourage the sequence of generated sentences to be similar to that of the ground-truth paragraphs. We validate HSGED on the Stanford image paragraph dataset and show that it not only achieves a new state-of-the-art 36.02 CIDEr-D, but also generates more coherent and distinctive paragraphs under various metrics.
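The two attention ideas named in the abstract can be illustrated with a toy sketch. This is not the authors' formulation, which the abstract does not specify. It assumes that "irredundant" attention can be approximated by penalizing sub-graphs in proportion to the attention they have already received, and that "inheriting" attention can be approximated by blending the sentence-level topic attention into the word-level attention. All function names, the `penalty` and `mix` parameters, and the example scores are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def irredundant_attention(scores, coverage, penalty=5.0):
    """Attend over sub-graphs while down-weighting already covered ones.

    Assumption: redundancy is discouraged by subtracting a penalty
    proportional to the attention accumulated in earlier sentences.
    """
    adjusted = [s - penalty * c for s, c in zip(scores, coverage)]
    return softmax(adjusted)


def inherit_attention(topic_attn, word_scores, mix=0.5):
    """Blend word-level attention with the sentence-level topic attention.

    Assumption: "inheriting" is modeled as a convex combination, so each
    word stays grounded in the sub-graphs chosen for its sentence.
    """
    word_attn = softmax(word_scores)
    blended = [mix * t + (1.0 - mix) * w for t, w in zip(topic_attn, word_attn)]
    total = sum(blended)
    return [b / total for b in blended]


def plan_topics(scores, num_sentences, penalty=5.0):
    """Sentence-level loop: one topic attention vector per sentence."""
    coverage = [0.0] * len(scores)
    topics = []
    for _ in range(num_sentences):
        attn = irredundant_attention(scores, coverage, penalty)
        topics.append(attn)
        # Accumulate attention so later sentences look elsewhere.
        coverage = [c + a for c, a in zip(coverage, attn)]
    return topics


# Hypothetical relevance scores for three sub-graphs of a scene graph.
subgraph_scores = [2.0, 1.0, 0.5]
topics = plan_topics(subgraph_scores, num_sentences=2)
picked = [max(range(len(a)), key=a.__getitem__) for a in topics]

# Word-level attention for the first sentence inherits its topic attention.
word_attn = inherit_attention(topics[0], word_scores=[0.1, 0.1, 3.0])
```

With the penalty in place, the second sentence's strongest sub-graph differs from the first sentence's, which is the behavior the abstract attributes to irredundant attention; setting `penalty=0.0` would make every sentence attend to the same sub-graph.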
Keywords
Image Paragraph Generation, Scene Graph, Hierarchical Constraint, Hierarchical Scene Graph Encoder-Decoder