DeViT: Deformed Vision Transformers in Video Inpainting

International Multimedia Conference (2022)

Abstract
This paper presents a novel video inpainting architecture named Deformed Vision Transformers (DeViT). We make three significant contributions to this task. First, we extend previous Transformers' patch alignment by introducing the Deformed Patch-based Homography Estimator (DePtH), which enriches patch-level feature alignment in the key and query with additional offsets learned from patch pairs without extra supervision. DePtH enables our method to handle challenging scenes and agile motion with in-plane or out-of-plane deformation, on which previous methods usually fail. Second, we introduce Mask Pruning-based Patch Attention (MPPA), which improves standard patch-wise feature matching by pruning out less essential features and incorporating a saliency map; MPPA enhances matching accuracy between warped tokens that contain invalid pixels. Third, we introduce the Spatial-Temporal weighting Adaptor (STA) module, which assigns more accurate attention to spatial-temporal tokens under the guidance of the Deformation Factor learned by DePtH, especially for videos with agile motion. Experimental results demonstrate that our method outperforms previous state-of-the-art methods both qualitatively and quantitatively, achieving a new state of the art for video inpainting.
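The idea behind mask-pruning-based attention can be illustrated with a minimal sketch. This is not the paper's implementation: the function name, the scalar invalid-pixel ratio per patch token, and the pruning threshold are all assumptions introduced here for illustration. The sketch prunes key tokens dominated by hole (invalid) pixels before the softmax, so the remaining attention weights are distributed only over valid content.

```python
import numpy as np

def mask_pruned_attention(query, key, value, invalid_ratio, prune_thresh=0.5):
    """Hypothetical simplification of mask-pruned patch attention.

    query: (Nq, d) patch tokens to fill; key/value: (Nk, d) candidate tokens.
    invalid_ratio: (Nk,) fraction of hole pixels in each key patch (assumed
    available from the inpainting mask). Keys whose ratio exceeds
    `prune_thresh` are excluded from matching.
    """
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)        # scaled dot-product similarity
    pruned = invalid_ratio > prune_thresh      # keys dominated by hole pixels
    scores[:, pruned] = -1e9                   # effectively zero weight after softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ value, weights
```

Because the pruned scores are driven to a large negative value rather than dropped, the output keeps a fixed shape and each attention row still sums to one over the surviving tokens.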