Stylenet: Generating Attractive Visual Captions With Styles

30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017)

Citations: 343 | Views: 169
Abstract
We propose a novel framework named StyleNet to address the task of generating attractive captions for images and videos with different styles. To this end, we devise a novel model component, named factored LSTM, which automatically distills the style factors in the monolingual text corpus. Then at runtime, we can explicitly control the style in the caption generation process so as to produce attractive visual captions with the desired style. Our approach achieves this goal by leveraging two sets of data: 1) factual image/video-caption paired data, and 2) stylized monolingual text data (e.g., romantic and humorous sentences). We show experimentally that StyleNet outperforms existing approaches for generating visual captions with different styles, as measured by both automatic and human evaluation metrics on the newly collected FlickrStyle10K image caption dataset, which contains 10K Flickr images with corresponding humorous and romantic captions.
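To make the factored LSTM idea concrete, here is a minimal PyTorch sketch of a style-factored LSTM cell in the spirit of the paper: the input-to-hidden weights are composed as W_x = U S_style V, where U and V are shared across styles and S_style is a per-style factor matrix. The class name, dimensions, factor rank, and initialization are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class FactoredLSTMCell(nn.Module):
    """Sketch of a factored LSTM cell (assumed structure, after StyleNet).

    Input-to-hidden weights are factored as W_x = U @ S_style @ V.
    U, V, the hidden-to-hidden weights, and the bias are shared across
    styles; only the square factor S_style differs per style.
    """

    def __init__(self, input_size, hidden_size, factor_size, num_styles):
        super().__init__()
        # Shared factors U and V, covering all four gates (i, f, o, g) at once.
        self.U = nn.Parameter(torch.randn(4 * hidden_size, factor_size) * 0.01)
        self.V = nn.Parameter(torch.randn(factor_size, input_size) * 0.01)
        # One style-specific factor matrix S per style.
        self.S = nn.Parameter(torch.randn(num_styles, factor_size, factor_size) * 0.01)
        # Shared hidden-to-hidden weights and bias.
        self.W_h = nn.Parameter(torch.randn(4 * hidden_size, hidden_size) * 0.01)
        self.bias = nn.Parameter(torch.zeros(4 * hidden_size))

    def forward(self, x, state, style_id):
        h, c = state
        # Compose the style-conditioned input weights: W_x = U S_style V.
        W_x = self.U @ self.S[style_id] @ self.V
        gates = x @ W_x.t() + h @ self.W_h.t() + self.bias
        i, f, o, g = gates.chunk(4, dim=-1)
        c_new = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h_new = torch.sigmoid(o) * torch.tanh(c_new)
        return h_new, c_new
```

Under the paper's two-dataset scheme, the shared parameters would be trained on the factual image/video-caption pairs, while only the style-specific S matrices are updated on the stylized monolingual text; at caption time, switching style_id selects factual, romantic, or humorous output.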
Keywords
caption generation process, romantic sentences, humorous sentences, StyleNet, automatic evaluation metrics, human evaluation metrics, romantic captions, style factors, monolingual text corpus, model component, attractive visual caption generation, factored LSTM, factual image/video-caption paired data, stylized monolingual text data, FlickrStyle10K image caption dataset, humorous captions