Track Everything Everywhere Fast and Robustly
arXiv (2024)
Abstract
We propose a novel test-time optimization approach for efficiently and
robustly tracking any pixel at any time in a video. The latest state-of-the-art
optimization-based tracking technique, OmniMotion, requires a prohibitively
long optimization time, rendering it impractical for downstream applications.
Moreover, OmniMotion is sensitive to the choice of random seed, leading to unstable
convergence. To improve efficiency and robustness, we introduce a novel
invertible deformation network, CaDeX++, which factorizes the function
representation into a local spatial-temporal feature grid and enhances the
expressivity of the coupling blocks with non-linear functions. While CaDeX++
incorporates a stronger geometric bias within its architectural design, it also
takes advantage of the inductive bias provided by the vision foundation models.
Our system utilizes monocular depth estimation to represent scene geometry and
augments the objective with DINOv2 long-term semantics to regularize
the optimization process. Our experiments demonstrate a substantial improvement
in training speed (more than 10 times faster), robustness, and
accuracy in tracking over the SoTA optimization-based method OmniMotion.
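The abstract mentions that CaDeX++ increases the expressivity of its invertible coupling blocks with non-linear functions. The following is a minimal, hypothetical sketch of the general affine-coupling idea such networks build on (not the authors' actual CaDeX++ architecture): one half of the input passes through unchanged, while the other half is scaled and shifted by non-linear functions of the first half, which keeps the block exactly invertible. All function names here are illustrative assumptions.

```python
import math

def coupling_forward(x1, x2, scale_fn, shift_fn):
    """Affine coupling: x1 passes through; x2 is transformed
    conditioned on x1. Invertible by construction."""
    y1 = x1
    y2 = x2 * math.exp(scale_fn(x1)) + shift_fn(x1)
    return y1, y2

def coupling_inverse(y1, y2, scale_fn, shift_fn):
    """Exact inverse of coupling_forward, reusing the same
    conditioner functions (no function inversion needed)."""
    x1 = y1
    x2 = (y2 - shift_fn(y1)) * math.exp(-scale_fn(y1))
    return x1, x2

# Hypothetical non-linear conditioners; tanh keeps the
# log-scale bounded for numerical stability.
scale = lambda u: math.tanh(u)
shift = lambda u: 0.5 * u

y1, y2 = coupling_forward(0.3, -1.2, scale, shift)
x1, x2 = coupling_inverse(y1, y2, scale, shift)
# (x1, x2) recovers the original input up to floating-point error.
```

Because inversion only re-evaluates `scale_fn` and `shift_fn` in the forward direction, the conditioners can be arbitrarily expressive non-linear functions without breaking invertibility, which is the property the abstract alludes to.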