Transformer-based Video Saliency Prediction with High Temporal Dimension Decoding
CoRR (2024)
Abstract
In recent years, finding an effective and efficient strategy for exploiting
spatial and temporal information has been a hot research topic in video
saliency prediction (VSP). With the emergence of spatio-temporal transformers,
the weakness of prior strategies, e.g., 3D convolutional networks and
LSTM-based networks, in capturing long-range dependencies has been effectively
compensated for. While VSP has benefited from spatio-temporal transformers,
finding the most effective way of aggregating temporal features remains
challenging. To address this concern, we propose a transformer-based video
saliency prediction approach with a high-temporal-dimension decoding network
(THTD-Net). This strategy accounts for the lack of complex hierarchical
interactions among the features extracted by the transformer-based
spatio-temporal encoder: in particular, it does not require multiple decoders
and instead gradually reduces the temporal dimension of the features within a
single decoder. This decoder-centric architecture achieves performance
comparable to that of multi-branch and more complex models on common
benchmarks such as DHF1K, UCF-Sports, and Hollywood-2.
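
The abstract's central design point, feeding the decoder features that retain the full temporal dimension and collapsing that dimension gradually across decoder stages rather than pooling it away at once, can be sketched as follows. This is a minimal, hypothetical PyTorch illustration under stated assumptions, not the paper's actual THTD-Net: the module name HighTemporalDecoder, the channel widths, the number of stages, and the (B, C, T, H, W) input layout are all placeholders chosen for illustration.

import torch
import torch.nn as nn

class HighTemporalDecoder(nn.Module):
    """Hypothetical sketch of a single decoder that gradually shrinks
    the temporal axis. Assumes encoder features shaped (B, C, T, H, W),
    e.g. from a spatio-temporal transformer backbone. Channel sizes and
    stage count are illustrative, not the paper's configuration."""

    def __init__(self, in_channels=768):
        super().__init__()
        # Each stage halves the temporal dimension (stride 2 on T)
        # while upsampling spatially, so temporal information is
        # merged progressively instead of being pooled away at once.
        channels = [in_channels, 384, 192, 96, 48]
        self.stages = nn.ModuleList(
            nn.Sequential(
                nn.Conv3d(c_in, c_out, kernel_size=3,
                          stride=(2, 1, 1), padding=1),
                nn.ReLU(inplace=True),
                nn.Upsample(scale_factor=(1, 2, 2), mode='trilinear',
                            align_corners=False),
            )
            for c_in, c_out in zip(channels[:-1], channels[1:])
        )
        # 1x1 head producing a single-channel saliency map.
        self.head = nn.Conv2d(channels[-1], 1, kernel_size=1)

    def forward(self, x):               # x: (B, C, T, H, W)
        for stage in self.stages:
            x = stage(x)                # T shrinks ~2x per stage
        x = x.mean(dim=2)               # collapse residual frames
        return torch.sigmoid(self.head(x))  # (B, 1, H', W')

# Example (hypothetical shapes): a 16-frame feature clip of size
# (2, 768, 16, 7, 12) yields a (2, 1, 112, 192) saliency map, with
# T reduced 16 -> 8 -> 4 -> 2 -> 1 across the four stages.
# sal = HighTemporalDecoder()(torch.randn(2, 768, 16, 7, 12))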