Kick Back & Relax++: Scaling Beyond Ground-Truth Depth with SlowTV & CribsTV
arXiv (2024)
Abstract
Self-supervised learning is the key to unlocking generic computer vision
systems. By eliminating the reliance on ground-truth annotations, it allows
scaling to much larger data quantities. Unfortunately, self-supervised
monocular depth estimation (SS-MDE) has been limited by the absence of diverse
training data. Existing datasets have focused exclusively on urban driving in
densely populated cities, resulting in models that fail to generalize beyond
this domain.
To address these limitations, this paper proposes two novel datasets: SlowTV
and CribsTV. These are large-scale datasets curated from publicly available
YouTube videos, containing a total of 2M training frames. They offer an
incredibly diverse set of environments, ranging from snowy forests to coastal
roads, luxury mansions and even underwater coral reefs. We leverage these
datasets to tackle the challenging task of zero-shot generalization,
outperforming every existing SS-MDE approach and even some state-of-the-art
supervised methods.
The generalization capabilities of our models are further enhanced by a range
of components and contributions: 1) learning the camera intrinsics, 2) a
stronger augmentation regime targeting aspect ratio changes, 3) support frame
randomization, 4) flexible motion estimation, and 5) a modern transformer-based
architecture. We demonstrate the effectiveness of each component in extensive
ablation experiments. To facilitate the development of future research, we make
the datasets, code and pretrained models available to the public at
https://github.com/jspenmar/slowtv_monodepth.
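
To make the first component more concrete, the following is a minimal sketch of how learned camera intrinsics could enter a self-supervised depth pipeline. It is an illustration only, not the authors' implementation: the abstract does not state how the intrinsics are parameterized, so the normalized focal lengths and principal point used here, and the function names `intrinsics_from_normalized` and `backproject`, are assumptions. In this sketch a network would output the four normalized values per image, and the resulting matrix would be used to lift predicted depth to 3D before warping into a support frame.

```python
import numpy as np


def intrinsics_from_normalized(fx, fy, cx, cy, width, height):
    """Build a 3x3 pinhole matrix from normalized parameters.

    Assumption: fx, fy, cx, cy are fractions of the image size, as a
    network might predict them; this is a hypothetical parameterization.
    """
    return np.array([
        [fx * width, 0.0, cx * width],
        [0.0, fy * height, cy * height],
        [0.0, 0.0, 1.0],
    ])


def backproject(depth, K):
    """Lift a depth map of shape (H, W) to camera-frame points (H, W, 3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Homogeneous pixel coordinates, one row vector per pixel.
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    # Row-vector form of K^{-1} @ p for every pixel.
    rays = pix @ np.linalg.inv(K).T
    return rays * depth[..., None]


# Usage: a 640x480 image with the principal point at the image center.
K = intrinsics_from_normalized(1.0, 1.0, 0.5, 0.5, width=640, height=480)
points = backproject(np.full((480, 640), 2.0), K)
```

Predicting normalized values keeps the same parameters meaningful when an image is resized, which is one reason a parameterization of this kind pairs naturally with augmentations that change resolution; whether the paper does this is not stated in the abstract.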