CLIP Can Understand Depth
CoRR (2024)
Abstract
Recent studies on generalizing CLIP to monocular depth estimation reveal
that CLIP pre-trained on web-crawled data is ineffective at deriving proper
similarities between image patches and depth-related prompts. In this paper, we
adapt CLIP to dense monocular depth estimation of meaningful quality, without
fine-tuning its original vision-language alignment. CLIP is enabled to
understand depth by jointly training a compact deconvolutional decoder with a
tiny learnable embedding matrix, named mirror, which serves as a static prompt
for its text encoder. With this approach, our model matches the performance of
several previous state-of-the-art vision-only models on the NYU Depth v2 and
KITTI datasets, outperforming every CLIP-based depth estimation model by a
large margin. Experiments on temporal depth consistency and spatial continuity
demonstrate that the prior knowledge of CLIP can be effectively refined by our
proposed framework. Furthermore, an ablation study on mirror shows that the
resulting model estimates depth using knowledge from not only the image encoder
but also the text encoder, despite not being given any human-written prompt.
This research demonstrates that, with minimal adjustments, the prior knowledge
of vision-language foundation models such as CLIP can be generalized even to
domains where learning during pretraining is challenging. We hope this work
facilitates future research on methods that adjust the suboptimal prior
knowledge of vision-language models using non-human-language prompts, achieving
performance on par with task-specific state-of-the-art methodologies.
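
To make the described pipeline concrete, below is a minimal PyTorch sketch of the idea stated in the abstract: a tiny learnable "mirror" embedding matrix fed as a static prompt to a frozen text encoder, whose per-bin embeddings are compared with frozen image patch features, and a compact deconvolutional decoder that turns the resulting similarity volume into a dense depth map. All names, dimensions, the number of depth bins, and the decoder layout here are illustrative assumptions, not the paper's exact configuration; the frozen CLIP encoders are represented by placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MirrorDepthHead(nn.Module):
    """Hypothetical head sketching the abstract's setup: frozen CLIP features +
    learnable 'mirror' prompt + compact deconvolutional decoder."""

    def __init__(self, embed_dim=512, num_bins=16, prompt_len=8, patch_grid=14):
        super().__init__()
        # "mirror": a tiny learnable embedding matrix used as a static, non-human
        # prompt for the (frozen) text encoder; one prompt sequence per depth bin.
        self.mirror = nn.Parameter(torch.randn(num_bins, prompt_len, embed_dim) * 0.02)
        self.patch_grid = patch_grid
        # Compact deconvolutional decoder: upsamples the patch-vs-prompt similarity
        # volume (num_bins channels on the patch grid) into a dense depth map.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(num_bins, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, patch_tokens, text_encoder):
        # patch_tokens: (B, N, D) frozen image patch embeddings, N = patch_grid**2
        # text_encoder: frozen text encoder applied to the mirror prompts,
        #               returning one embedding per depth bin, shape (num_bins, D)
        prompt_emb = text_encoder(self.mirror)                    # (num_bins, D)
        patch_tokens = F.normalize(patch_tokens, dim=-1)
        prompt_emb = F.normalize(prompt_emb, dim=-1)
        sim = patch_tokens @ prompt_emb.t()                       # (B, N, num_bins)
        b, n, k = sim.shape
        sim = sim.transpose(1, 2).reshape(b, k, self.patch_grid, self.patch_grid)
        return self.decoder(sim)                                  # (B, 1, H, W)


# Usage with dummy stand-ins for the frozen CLIP encoders (assumptions only):
text_encoder = lambda prompts: prompts.mean(dim=1)  # placeholder for CLIP text encoder
head = MirrorDepthHead()
patch_tokens = torch.randn(2, 14 * 14, 512)         # stand-in for frozen patch features
depth = head(patch_tokens, text_encoder)            # (2, 1, 112, 112) relative depth
```

In this sketch only the mirror matrix and the decoder carry gradients, which mirrors the abstract's claim that the original vision-language alignment is left untouched; how the mirror prompts are actually injected into CLIP's text encoder follows the paper, not this placeholder.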