Time-Contrastive Networks: Self-Supervised Learning from Multi-view Observation

2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

Abstract
We propose a self-supervised approach for learning representations of relationships between humans and their environment, including object interactions, attributes, and body pose, entirely from unlabeled videos recorded from multiple viewpoints (Fig. 2). We train an embedding with a triplet loss that contrasts a pair of simultaneous frames from different viewpoints with temporally adjacent and visually similar frames (Fig. 1). We call this model Time-Contrastive Networks (TCN). The contrastive signal encourages the model to discover meaningful dimensions and attributes that can explain the changing state of objects and the world from visually similar frames, while learning invariance to viewpoint, occlusions, motion blur, lighting, and background. The experimental evaluation of our multi-viewpoint embedding technique examines its application to reasoning about object interactions, as well as human pose imitation with a real robot. We demonstrate that our model can correctly identify corresponding steps in complex object interactions, such as pouring (Table 1), between different videos and with different instances. We also show what are, to the best of our knowledge, the first self-supervised results for end-to-end imitation learning of human motions with a real robot (Table 2). Results are best visualized in videos available at 1 and the full paper is available at 2.
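The core idea is a triplet loss in which the anchor and positive are frames captured at the same instant from two different viewpoints, while the negative is a temporally distant frame from the same viewpoint. Below is a minimal sketch of such a time-contrastive triplet loss, assuming precomputed per-frame embeddings; the function name, margin, and minimum temporal gap are illustrative assumptions, not the authors' exact implementation.

```python
# Hypothetical sketch of a multi-view time-contrastive triplet loss (not the
# authors' code). Assumes two synchronized viewpoints of the same sequence.
import torch
import torch.nn.functional as F


def tcn_triplet_loss(emb_view1, emb_view2, margin=0.2, min_gap=30):
    """emb_view1, emb_view2: (T, D) embeddings of the same sequence
    recorded simultaneously from two viewpoints."""
    T = emb_view1.shape[0]
    t = torch.randint(0, T, (1,)).item()   # anchor time step
    anchor = emb_view1[t]                   # viewpoint 1 at time t
    positive = emb_view2[t]                 # viewpoint 2, same instant

    # Negative: a temporally distant frame from the same viewpoint, so it is
    # visually similar to the anchor but shows a different world state.
    far = [i for i in range(T) if abs(i - t) >= min_gap]
    if not far:                             # short-clip fallback: any other frame
        far = [i for i in range(T) if i != t]
    negative = emb_view1[far[torch.randint(0, len(far), (1,)).item()]]

    d_pos = F.pairwise_distance(anchor.unsqueeze(0), positive.unsqueeze(0))
    d_neg = F.pairwise_distance(anchor.unsqueeze(0), negative.unsqueeze(0))
    return F.relu(d_pos - d_neg + margin).mean()
```

Pulling the positive from a second camera and the negative from the anchor's own camera is what pushes the embedding to encode the scene's state (e.g. how full the cup is during pouring) rather than viewpoint, lighting, or background.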
Keywords
time-contrastive networks,TCN,self-supervised learning,multiview observation,relationship representation learning,object interactions,attributes,body pose,unlabeled videos,contrastive signal,occlusions,motion blur,lighting,invariance learning,multiviewpoint embedding technique,human pose imitation,real robot,end-to-end imitation learning