Unsupervised Multimodal Video-to-Video Translation via Self-Supervised Learning

2021 IEEE Winter Conference on Applications of Computer Vision (WACV 2021)

Abstract
Existing unsupervised video-to-video translation methods fail to produce translated videos that are frame-wise realistic, preserve semantic information, and remain consistent at the video level. In this work, we propose UVIT, a novel unsupervised video-to-video translation model. Our model decomposes style and content, uses a specialized encoder-decoder structure, and propagates inter-frame information through bidirectional recurrent neural network (RNN) units. The style-content decomposition mechanism enables style-consistent video translation results and provides a clean interface for modality-flexible translation. In addition, by varying the input frames and style codes used during translation, we propose a video interpolation loss that captures temporal information within the sequence and trains our building blocks in a self-supervised manner. Our model produces photo-realistic, spatio-temporally consistent translated videos in a multimodal way. Subjective and objective experimental results validate the superiority of our model over existing methods.
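The video interpolation loss is only described at a high level above. Below is a minimal Python/PyTorch sketch of the idea: encode neighbouring frames into content codes, propagate them through a bidirectional GRU so each time step sees both past and future context, and decode each middle frame conditioned on a style code, supervised by the real middle frame. All module names, layer sizes, and the fusion scheme are illustrative assumptions, not the authors' implementation.

# Minimal sketch (not the authors' code): a toy content encoder, a
# bidirectional GRU for inter-frame propagation, and a style-conditioned
# decoder, trained to reconstruct each middle frame from its neighbours.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContentEncoder(nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 4, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    def __init__(self, dim=16, style_dim=8):
        super().__init__()
        self.fuse = nn.Conv2d(dim + style_dim, dim, 3, padding=1)
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.Conv2d(dim, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, content, style):
        # Broadcast the style vector over spatial positions, then fuse.
        s = style[:, :, None, None].expand(-1, -1, *content.shape[2:])
        return self.up(F.relu(self.fuse(torch.cat([content, s], dim=1))))

def interpolation_loss(enc, rnn, dec, frames, style):
    """frames: (B, T, 3, H, W); reconstruct each middle frame from context."""
    b, t, c, h, w = frames.shape
    codes = enc(frames.reshape(b * t, c, h, w))
    d, hh, ww = codes.shape[1:]
    feat = d * hh * ww
    # Flatten spatial dims so the bidirectional GRU can run along time.
    fused, _ = rnn(codes.reshape(b, t, feat))  # (B, T, 2 * feat)
    loss = 0.0
    for i in range(1, t - 1):
        # Average forward (past context) and backward (future context) states.
        content = ((fused[:, i, :feat] + fused[:, i, feat:]) / 2).reshape(b, d, hh, ww)
        loss = loss + F.l1_loss(dec(content, style), frames[:, i])
    return loss / max(t - 2, 1)

# Toy usage: 5-frame clips of 32x32 images, one style code per clip.
enc, dec = ContentEncoder(dim=16), Decoder(dim=16)
feat = 16 * 8 * 8  # content-code size for 32x32 inputs (two stride-2 convs)
rnn = nn.GRU(feat, feat, batch_first=True, bidirectional=True)
frames, style = torch.randn(2, 5, 3, 32, 32), torch.randn(2, 8)
interpolation_loss(enc, rnn, dec, frames, style).backward()

The sketch keeps only the self-supervised reconstruction term; the full model would combine it with the usual unsupervised translation objectives.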
Keywords
unsupervised multimodal video-to-video translation, video-to-video translation methods, semantic information, video-level consistency, video-to-video translation model, specialized encoder-decoder structure, inter-frame information, bidirectional recurrent neural network units, style-content decomposition mechanism, style-consistent video translation results, modality-flexible translation, video interpolation loss, spatio-temporally consistent translated videos