RDIR: Capturing Temporally-Invariant Representations of Multiple Objects in Videos

Piotr Zielinski, Tomasz Kajdanowicz

IEEE/CVF Winter Conference on Applications of Computer Vision (2024)

Abstract
Learning temporally coherent representations of multiple objects in videos is crucial for understanding their complex dynamics and interactions over time. In this paper, we present a deep generative neural network that learns such representations by leveraging pretraining. Our model builds upon a scale-invariant structured autoencoder, extending it with a convolutional recurrent module that refines the learned representations through time and enables information sharing among the cells of multi-scale grids. This approach provides a framework for learning per-object representations from a pretrained object detection model, making it possible to infer predefined types of objects without supervision. Through experiments on benchmark datasets and real-life video footage, we demonstrate the spatial and temporal coherence of the learned representations and showcase their applicability in downstream tasks such as object tracking. We analyze the method's robustness through an ablation study and compare it to other methods, highlighting the importance of the quality of object representations.
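The recurrent refinement step described above can be pictured with a short sketch. This is not the authors' implementation: it is a minimal, hypothetical illustration of a convolutional GRU cell applied per time step to each scale of a multi-scale latent grid, where the convolutional gates let neighbouring grid cells exchange information. All names, shapes, and hyperparameters below are illustrative assumptions.

```python
# Illustrative sketch only (not the RDIR code): a convolutional GRU that
# refines per-cell latent representations on multi-scale grids over time.
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """GRU whose gates are 3x3 convolutions, so each grid cell's hidden
    state is updated using its spatial neighbours, sharing information
    among cells of the grid."""
    def __init__(self, in_ch: int, hid_ch: int, k: int = 3):
        super().__init__()
        pad = k // 2
        self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=pad)  # update/reset gates
        self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=pad)       # candidate state

    def forward(self, x, h):
        # x: (B, in_ch, H, W) latent grid at time t; h: (B, hid_ch, H, W) hidden state
        z, r = torch.chunk(torch.sigmoid(self.gates(torch.cat([x, h], 1))), 2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], 1)))
        return (1 - z) * h + z * h_tilde

# Hypothetical usage: refine a sequence of multi-scale latent grids (e.g.
# produced from a pretrained detector's feature pyramid), one cell per scale.
if __name__ == "__main__":
    B, T, C = 2, 5, 32
    scales = [(16, 16), (8, 8)]                      # assumed grid sizes
    cells = [ConvGRUCell(C, C) for _ in scales]
    seq = [[torch.randn(B, C, *s) for s in scales] for _ in range(T)]
    hidden = [torch.zeros(B, C, *s) for s in scales]
    for t in range(T):
        hidden = [cell(x, h) for cell, x, h in zip(cells, seq[t], hidden)]
    print([h.shape for h in hidden])
```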
Keywords
Video Object, Object Detection, Representation Learning, Object Tracking, Quality Of Representations, Temporal Coherence, Object Detection Model, Feature Maps, Bounding Box, Latent Space, Part Of The Image, Object Size, Latent Representation, Variational Autoencoder, Objects In The Scene, Relative Depth, MNIST Dataset, Scene Understanding, Object Appearance, COCO Dataset, Single Shot Detector, Sequence Of Objects, Stable Representation, Sequence Encoding, Multi-scale Feature Maps, Scene Depth, Validation Subset, Classification Confidence, Number Of Objects, Input Image