Residual attention-based LSTM for video captioning

World Wide Web (2018)

Abstract
Recently, frameworks with hierarchical LSTMs, such as stacked LSTM networks, have achieved great success in video captioning. However, once deeper LSTM layers start to converge, a degradation problem is exposed: as the number of LSTM layers increases, accuracy saturates and then degrades rapidly, much as in standard deep convolutional networks such as VGG. In this paper, we propose a novel attention-based framework, Residual Attention-based LSTM (Res-ATT), which not only takes advantage of the existing attention mechanism but also preserves sentence-internal information that is usually lost during transmission between layers. Our key novelty is showing how to integrate residual mapping into a hierarchical LSTM network to address the degradation problem. More specifically, our hierarchical architecture builds on two LSTM layers, and residual mapping is introduced to avoid losing information about previously generated words (i.e., both content and relationship information). Experimental results on the mainstream MSVD and MSR-VTT datasets show that our framework outperforms state-of-the-art approaches. Furthermore, our automatically generated sentences provide more detailed information that precisely describes a video.
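The decoder described above can be pictured as two stacked LSTM layers with temporal attention over video features, where a residual path re-injects the previous-word information past the first layer. The following is a minimal sketch of that idea, not the authors' implementation; the module name ResATTDecoder, the dimensions, and the exact placement of the residual connections are illustrative assumptions.

```python
# Hedged sketch of a residual attention-based two-layer LSTM decoder step.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResATTDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, feat_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm1 = nn.LSTMCell(embed_dim, hidden_dim)               # first (sentence) LSTM layer
        self.lstm2 = nn.LSTMCell(hidden_dim + feat_dim, hidden_dim)   # second layer fuses the attended video feature
        self.att_w = nn.Linear(hidden_dim + feat_dim, 1)              # additive attention score per frame
        self.proj_word = nn.Linear(embed_dim, hidden_dim)             # residual path for the previous-word embedding
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, prev_word, state1=None, state2=None):
        # feats: (batch, n_frames, feat_dim); prev_word: (batch,) token ids
        x = self.embed(prev_word)                                     # (batch, embed_dim)
        h1, c1 = self.lstm1(x, state1)

        # Temporal attention: score each frame feature against the first-layer state.
        h1_exp = h1.unsqueeze(1).expand(-1, feats.size(1), -1)
        scores = self.att_w(torch.cat([h1_exp, feats], dim=-1)).squeeze(-1)
        alpha = F.softmax(scores, dim=1)                              # (batch, n_frames)
        context = (alpha.unsqueeze(-1) * feats).sum(dim=1)            # (batch, feat_dim)

        # Residual mapping: re-inject previous-word information that would
        # otherwise be diluted after the first LSTM layer.
        h1_res = h1 + self.proj_word(x)

        h2, c2 = self.lstm2(torch.cat([h1_res, context], dim=-1), state2)
        logits = self.out(h2 + h1_res)                                # residual skip into the output projection
        return logits, (h1, c1), (h2, c2)
```

At generation time this step would be unrolled word by word, feeding back the sampled token and the two LSTM states; the residual term is what keeps the previously generated word visible to the second layer and the output projection.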
Keywords
LSTM, Attention mechanism, Residual thought, Video captioning