SwinVI: 3D Swin Transformer Model with U-net for Video Inpainting

IJCNN (2023)

Abstract
The goal of video inpainting is to fill in locally missing regions of a given video as realistically as possible; it remains a challenging task even for powerful deep learning methods. In recent years, Transformers have been introduced to video inpainting and have achieved remarkable improvements. However, they still suffer from two problems: blurry generated textures and high computational cost. To address these problems, we propose a new 3D Swin Transformer model (SwinVI) with U-net to improve the quality of video inpainting efficiently. We modify the vanilla Swin Transformer by extending the standard self-attention mechanism to a 3D self-attention mechanism, which enables the model to process spatio-temporal information simultaneously. SwinVI consists of a U-net implemented with 3D Patch Merge and a CNN-equipped upsampling module, which provides an end-to-end learning framework. This structural design enables SwinVI to focus fully on background textures and moving objects, learning robust and more representative token vectors, and thereby improving the quality of video inpainting efficiently. We experimentally compare SwinVI with multiple methods on two challenging benchmarks. The results demonstrate that SwinVI outperforms state-of-the-art methods in RMSE, SSIM, and PSNR.
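
The key idea named in the abstract, extending Swin-style windowed self-attention from 2D spatial windows to 3D spatio-temporal windows, can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the abstract does not specify window sizes, window shifting, relative position bias, or module names, so the WindowAttention3D name, the (2, 7, 7) window, and the head count below are all illustrative assumptions.

import torch
import torch.nn as nn

class WindowAttention3D(nn.Module):
    # Multi-head self-attention over non-overlapping 3D (time, height, width)
    # windows. Illustrative sketch only: window size and head count are
    # assumptions, and shifted windows / relative position bias are omitted.
    def __init__(self, dim, window_size=(2, 7, 7), num_heads=4):
        super().__init__()
        self.window_size = window_size
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, T, H, W, C) token grid; T, H, W must be divisible by the window size.
        B, T, H, W, C = x.shape
        wt, wh, ww = self.window_size
        # Partition the spatio-temporal grid into 3D windows of wt*wh*ww tokens each.
        x = x.view(B, T // wt, wt, H // wh, wh, W // ww, ww, C)
        x = x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, wt * wh * ww, C)
        Bw, N, _ = x.shape
        # Standard scaled dot-product attention, applied inside each window.
        qkv = self.qkv(x).reshape(Bw, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)
        attn = ((q @ k.transpose(-2, -1)) * self.scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(Bw, N, C)
        out = self.proj(out)
        # Undo the window partition back to the (B, T, H, W, C) grid.
        out = out.view(B, T // wt, H // wh, W // ww, wt, wh, ww, C)
        return out.permute(0, 1, 4, 2, 5, 3, 6, 7).reshape(B, T, H, W, C)

x = torch.randn(1, 4, 14, 14, 96)      # 4 frames of 14x14 tokens, 96 channels
print(WindowAttention3D(96)(x).shape)  # torch.Size([1, 4, 14, 14, 96])

Restricting attention to fixed-size 3D windows keeps the cost linear in the number of tokens rather than quadratic in the full clip, which is the efficiency rationale for a Swin-style design on video.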
Keywords
Transformer, Video inpainting, Spatio-temporal