Self-supervised learning of video representations from a child's perspective
CoRR (2024)
Abstract
Children learn powerful internal models of the world around them from a few
years of egocentric visual experience. Can such internal models be learned from
a child's visual experience with highly generic learning algorithms or do they
require strong inductive biases? Recent advances in collecting large-scale,
longitudinal, developmentally realistic video datasets and generic
self-supervised learning (SSL) algorithms are allowing us to begin to tackle
this nature vs. nurture question. However, existing work typically focuses on
image-based SSL algorithms and visual capabilities that can be learned from
static images (e.g., object recognition), thus ignoring temporal aspects of the
world. To close this gap, here we train self-supervised video models on
longitudinal, egocentric headcam recordings collected from a child over a
two-year period in their early development (6-31 months). The resulting models are
highly effective at facilitating the learning of action concepts from a small
number of labeled examples; they have favorable data size scaling properties;
and they display emergent video interpolation capabilities. Video models also
learn more robust object representations than image-based models trained with
the exact same data. These results suggest that important temporal aspects of a
child's internal model of the world may be learnable from their visual
experience using highly generic learning algorithms and without strong
inductive biases.
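The abstract does not spell out the specific training objective, but one common generic video SSL recipe is masked autoencoding over spatiotemporal patches: most of a clip is hidden and the model learns to reconstruct it from the visible remainder, with no labels required. The sketch below illustrates that idea in PyTorch; the `TinyVideoMAE` class, its hyperparameters, and the random stand-in clip are illustrative assumptions, not the paper's actual architecture or data pipeline.

```python
# Minimal sketch of masked video pretraining (a generic SSL objective).
# Everything here is an illustrative assumption, not the paper's model.
import torch
import torch.nn as nn

class TinyVideoMAE(nn.Module):
    """Toy masked autoencoder over video clips.

    A clip is split into spatiotemporal patches ("tubelets"); most are
    masked, and the model reconstructs the masked pixels from the
    visible ones. Positional embeddings are omitted for brevity.
    """
    def __init__(self, patch=16, tube=2, dim=256, depth=4, heads=4):
        super().__init__()
        self.patch, self.tube = patch, tube
        patch_dim = 3 * tube * patch * patch  # pixels per tubelet
        self.embed = nn.Linear(patch_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.decoder = nn.Linear(dim, patch_dim)      # predict raw pixels
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))

    def tubelets(self, video):
        # video: (B, C, T, H, W) -> (B, N, patch_dim)
        B, C, T, H, W = video.shape
        p, t = self.patch, self.tube
        x = video.reshape(B, C, T // t, t, H // p, p, W // p, p)
        x = x.permute(0, 2, 4, 6, 1, 3, 5, 7)
        return x.reshape(B, -1, C * t * p * p)

    def forward(self, video, mask_ratio=0.75):
        x = self.tubelets(video)                      # (B, N, patch_dim)
        tokens = self.embed(x)                        # (B, N, D)
        B, N, D = tokens.shape
        n_mask = int(N * mask_ratio)
        perm = torch.rand(B, N, device=x.device).argsort(dim=1)
        masked, keep = perm[:, :n_mask], perm[:, n_mask:]
        # Encode only the visible (unmasked) tokens.
        visible = tokens.gather(1, keep.unsqueeze(-1).expand(-1, -1, D))
        encoded = self.encoder(visible)
        # Rebuild the full sequence: mask tokens at masked positions.
        full = self.mask_token.expand(B, N, D).clone()
        full.scatter_(1, keep.unsqueeze(-1).expand(-1, -1, D), encoded)
        pred = self.decoder(full)
        # Reconstruction loss on the masked tubelets only.
        idx = masked.unsqueeze(-1).expand(-1, -1, x.shape[-1])
        return nn.functional.mse_loss(pred.gather(1, idx), x.gather(1, idx))

# Usage: one optimization step on a random stand-in for a headcam clip.
model = TinyVideoMAE()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
clip = torch.randn(2, 3, 8, 64, 64)   # (batch, channels, frames, H, W)
loss = model(clip)
loss.backward()
opt.step()
```

After pretraining with an objective of this kind, the frozen encoder's features can be probed with a small number of labeled clips, which is the few-shot action-recognition setup the abstract evaluates.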