Refining Pre-Trained Motion Models
CoRR(2024)
摘要
Given the difficulty of manually annotating motion in video, the current best
motion estimation methods are trained with synthetic data, and therefore
struggle somewhat due to a train/test gap. Self-supervised methods hold the
promise of training directly on real video, but typically perform worse. These
include methods trained with warp error (i.e., color constancy) combined with
smoothness terms, and methods that encourage cycle-consistency in the estimates
(i.e., tracking backwards should yield the opposite trajectory as tracking
forwards). In this work, we take on the challenge of improving state-of-the-art
supervised models with self-supervised training. We find that when the
initialization is supervised weights, most existing self-supervision techniques
actually make performance worse instead of better, which suggests that the
benefit of seeing the new data is overshadowed by the noise in the training
signal. Focusing on obtaining a “clean” training signal from real-world
unlabelled video, we propose to separate label-making and training into two
distinct stages. In the first stage, we use the pre-trained model to estimate
motion in a video, and then select the subset of motion estimates which we can
verify with cycle-consistency. This produces a sparse but accurate
pseudo-labelling of the video. In the second stage, we fine-tune the model to
reproduce these outputs, while also applying augmentations on the input. We
complement this boot-strapping method with simple techniques that densify and
re-balance the pseudo-labels, ensuring that we do not merely train on “easy”
tracks. We show that our method yields reliable gains over fully-supervised
methods in real videos, for both short-term (flow-based) and long-range
(multi-frame) pixel tracking.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要