On Robustness to Missing Video for Audiovisual Speech Recognition
Trans. Mach. Learn. Res. (2023)
Abstract
It has been shown that learning audiovisual features can lead to improved
speech recognition performance over audio-only features, especially for noisy
speech. However, in many common applications, the visual features are partially
or entirely missing, e.g., the speaker might move off screen. Multi-modal models
need to be robust: missing video frames should not degrade the performance of
an audiovisual model to be worse than that of a single-modality audio-only
model. While there have been many attempts at building robust models, there is
little consensus on how robustness should be evaluated. To address this, we
introduce a framework that allows claims about robustness to be evaluated in a
precise and testable way. We also conduct a systematic empirical study of the
robustness of common audiovisual speech recognition architectures on a range of
acoustic noise conditions and test suites. Finally, we show that an
architecture-agnostic solution based on cascades can consistently achieve
robustness to missing video, even in settings where existing techniques for
robustness like dropout fall short.