AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations
arXiv (2023)
Abstract
Self-supervision has shown great potential for audio-visual speech
recognition by vastly reducing the amount of labeled data required to build
good systems. However, existing methods are either not entirely end-to-end or
do not train joint representations of both modalities. In this paper, we
introduce AV-data2vec, which addresses these challenges and builds audio-visual
representations by predicting contextualized representations, an approach that
has been successful in the uni-modal case. The model uses a shared transformer
encoder for both audio and video and can combine both modalities to improve
speech recognition. Results on LRS3 show that AV-data2vec consistently
outperforms existing methods under all settings with the same amount of data
and model size.
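
The abstract does not spell out the training objective, so the following is only a minimal sketch of the general data2vec-style recipe it alludes to, under stated assumptions: a teacher whose weights are an exponential moving average (EMA) of the student's weights produces target representations from the unmasked input, and the student regresses those targets at masked time steps. The function names (`ema_update`, `masked_regression_loss`), the decay value `tau`, and the toy vectors are hypothetical and are not taken from the paper.

```python
def ema_update(teacher, student, tau):
    """Assumed EMA rule: teacher <- tau * teacher + (1 - tau) * student."""
    return [tau * t + (1.0 - tau) * s for t, s in zip(teacher, student)]


def masked_regression_loss(student_pred, teacher_target, mask):
    """Mean squared error over masked time steps only.

    student_pred, teacher_target: one feature vector (list of floats) per time step.
    mask: one boolean per time step; True means the student's input was masked there.
    """
    total, count = 0.0, 0
    for pred, target, is_masked in zip(student_pred, teacher_target, mask):
        if is_masked:
            total += sum((p - t) ** 2 for p, t in zip(pred, target))
            count += len(pred)
    # With no masked steps there is nothing to predict, so the loss is zero.
    return total / count if count else 0.0


# Toy usage: two time steps, two feature dimensions, only the first step masked.
student_out = [[1.0, 1.0], [0.0, 0.0]]
teacher_out = [[0.0, 0.0], [5.0, 5.0]]
loss = masked_regression_loss(student_out, teacher_out, [True, False])

# After a (hypothetical) student gradient step, the teacher tracks the student slowly.
new_teacher = ema_update([1.0, 0.0], [0.0, 1.0], tau=0.9)
```

In this toy setup the unmasked second step contributes nothing to the loss, which is the property that makes the targets "contextualized": the student must infer the masked step from its surrounding context. In an audio-visual setting, the per-step vectors would presumably come from a shared encoder over fused audio and video features.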
Keywords
contextualized target representations,av-data,self-supervised,audio-visual