Inter-Intra Cross-Modality Self-Supervised Video Representation Learning by Contrastive Clustering.

ICPR (2022)

Abstract
This paper introduces an online self-supervised method that leverages inter- and intra-level variance for video representation learning. Most existing methods focus on instance-level or inter-variance encoding but ignore the intra-variance within clips. The key observation for solving this problem is the underlying correlation between the visual and audio modalities: the distribution of flow patterns in feature space is diverse, yet the modalities express complementary, similar semantics. In the semantic feature space, the horizontal dimension of the feature matrix can be regarded as cluster labels, and these cluster labels should be consistent across the different modalities of the same video clip. Based on this idea, we propose an end-to-end inter-intra cross-modality contrastive clustering scheme that simultaneously optimizes the inter- and intra-level contrastive losses. Experiments show that the proposed approach considerably outperforms previous self-supervised learning methods on HMDB51 and UCF101 when applied to video retrieval and action recognition tasks.
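The abstract does not give the loss formulation, so the following is only a minimal sketch of how a row-wise and column-wise cross-modal contrastive objective could look, in the spirit of generic contrastive clustering. Everything in it is an assumption for illustration: the function names (`info_nce`, `cross_modal_clustering_loss`), the use of InfoNCE with cosine similarity, the `temperature` and `cluster_weight` parameters, and the reading of rows as clips and columns as cluster labels. It is not the authors' implementation.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(anchors, positives, temperature):
    """Mean InfoNCE loss: anchors[i] should match positives[i]
    and be dissimilar to every other entry of positives."""
    anchors = [_normalize(v) for v in anchors]
    positives = [_normalize(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(anchors):
        # Cosine similarity to every candidate, scaled by the temperature.
        logits = [
            sum(a * p for a, p in zip(anchor, pos)) / temperature
            for pos in positives
        ]
        # Numerically stable log-sum-exp.
        top = max(logits)
        log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_sum - logits[i]
    return total / len(anchors)


def _transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def cross_modal_clustering_loss(visual, audio, temperature=0.5, cluster_weight=1.0):
    """visual, audio: N x K matrices (N clips, K soft cluster scores).

    Assumed reading of the abstract: rows are clips, columns are cluster labels.
    """
    # Instance-level term: a clip's visual row should match its own audio row.
    instance_loss = 0.5 * (
        info_nce(visual, audio, temperature) + info_nce(audio, visual, temperature)
    )
    # Cluster-level term: each cluster column should agree across modalities.
    v_cols, a_cols = _transpose(visual), _transpose(audio)
    cluster_loss = 0.5 * (
        info_nce(v_cols, a_cols, temperature) + info_nce(a_cols, v_cols, temperature)
    )
    return instance_loss + cluster_weight * cluster_loss
```

Under these assumptions, the loss is small when the two modalities assign the same clip to the same cluster and grows when the assignments disagree. How the two terms map onto the paper's "inter" and "intra" losses is not specified in the abstract.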
Keywords
action recognition tasks, cluster labels, contrastive clustering, end-to-end inter-intra cross-modality contrastive clustering scheme, feature matrix, flow pattern distribution, horizontal dimension, instance-level encoding, inter-intra cross-modality self-supervised video representation learning, inter-level contrastive loss optimization, inter-variance encoding, intra-level contrastive loss optimization, intra-level variance encoding, self-supervised learning, self-supervised method, semantic feature space, video clip, video retrieval