Simultaneous or Sequential Training? How Speech Representations Cooperate in a Multi-Task Self-Supervised Learning System
arXiv (2023)
Abstract
Speech representation learning with self-supervised algorithms has resulted
in notable performance boosts in many downstream tasks. Recent work combined
self-supervised learning (SSL) and visually grounded speech (VGS) processing
mechanisms for representation learning. The joint training with SSL and VGS
mechanisms provides the opportunity to utilize both unlabeled speech and
speech-related visual information based on data availability. This has been
shown to enhance the quality of the learned representations, especially in
encoding semantic- and lexical-level knowledge. In this work, we further study the joint
optimization of wav2vec 2.0-based SSL and transformer-based VGS as a multi-task
learning system. We explore a set of training scenarios to understand how
speech representations are shared or transferred between the two tasks, and
what the optimal training strategy is for cross-modal semantic retrieval and
phoneme discrimination performance. As a result, we find that sequential
training with wav2vec 2.0 first and VGS next provides higher performance on
audio-visual retrieval compared to simultaneous optimization of both learning
mechanisms. However, the parallel SSL-VGS training reduces the effects of
catastrophic forgetting when switching between optimization criteria. Moreover,
the results suggest that phonemic representations learned through the VGS
mechanism may generalize better across datasets compared to those learned with
SSL.
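The contrast between the two training scenarios can be illustrated with a toy sketch (not the paper's implementation): a single shared parameter is optimized either for two quadratic task losses at every step (simultaneous, the joint SSL + VGS analogue) or for one task fully and then the other (sequential, the wav2vec 2.0-first, VGS-next analogue). The losses, learning rate, and step counts are illustrative assumptions; the sequential run ends at the second task's optimum, mimicking catastrophic forgetting of the first task, while the simultaneous run settles on a compromise.

```python
# Toy illustration of simultaneous vs. sequential multi-task optimization.
# All quantities here are hypothetical and chosen only to make the contrast visible.

def grad_a(w):
    # Gradient of task A loss (w - 1)^2, whose minimum is at w = 1.
    return 2 * (w - 1)

def grad_b(w):
    # Gradient of task B loss (w + 1)^2, whose minimum is at w = -1.
    return 2 * (w + 1)

def simultaneous(w=0.0, lr=0.1, steps=100):
    # Optimize both criteria at every step: gradient of the summed loss.
    for _ in range(steps):
        w -= lr * (grad_a(w) + grad_b(w))
    return w  # settles at the compromise point w = 0

def sequential(w=0.0, lr=0.1, steps=100):
    # Optimize task A to convergence first, then switch to task B.
    for _ in range(steps):
        w -= lr * grad_a(w)
    for _ in range(steps):
        w -= lr * grad_b(w)
    return w  # ends near w = -1: task A's optimum is "forgotten"
```

In this caricature, sequential training serves the later objective best, while the simultaneous schedule retains both, echoing the abstract's finding that parallel SSL-VGS training mitigates forgetting when switching between optimization criteria.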