Self-Motion As Supervision For Egocentric Audiovisual Localization

ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Abstract
Sound source localization is a key requirement for many assistive applications of augmented reality, such as speech enhancement. In conversational settings, potential sources of interest may be approximated by active speaker detection. However, localizing speakers in crowded, noisy environments is challenging, particularly without extensive ground truth annotations. Still, people are often able to communicate effectively in these scenarios through orienting behavioral responses, such as head motion and eye gaze, which have been shown to correlate with directions of auditory sources. In the absence of ground truth annotations, we propose joint training of egocentric audiovisual localization with behavioral pseudolabels to relate audiovisual stimuli with directional information extracted from future behavior. We evaluate this method as a technique for unsupervised egocentric active speaker localization and compare pseudolabels derived from head and gaze directions against fully-supervised alternatives.
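
As a rough illustration of the pseudolabeling idea in the abstract, the sketch below shows how a future head-yaw angle could be discretized into azimuth bins and used as a classification target for an audiovisual localizer. This is a minimal, hypothetical sketch, not the authors' method: the names (`AVLocalizer`, `yaw_to_bin`), the bin count, and the use of a single future yaw sample are all illustrative assumptions; the paper's actual architecture and label construction are not specified here.

```python
# Hypothetical sketch: supervising an audiovisual localizer with a pseudolabel
# derived from future head orientation. Names and dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

N_BINS = 36  # assumed: discretize azimuth into 10-degree bins

def yaw_to_bin(yaw_rad: torch.Tensor) -> torch.Tensor:
    """Map a head-yaw angle in radians, in (-pi, pi], to an azimuth-bin index."""
    frac = (yaw_rad + torch.pi) / (2 * torch.pi)        # normalize to [0, 1)
    return (frac * N_BINS).long().clamp_(0, N_BINS - 1)

class AVLocalizer(nn.Module):
    """Toy stand-in for an egocentric audiovisual localization network."""
    def __init__(self, audio_dim=128, video_dim=256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(audio_dim + video_dim, 256), nn.ReLU(),
            nn.Linear(256, N_BINS),                     # logits over azimuth bins
        )

    def forward(self, audio_feat, video_feat):
        return self.head(torch.cat([audio_feat, video_feat], dim=-1))

model = AVLocalizer()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# Dummy batch: current audiovisual features plus the head yaw observed shortly
# after the stimulus, serving as the behavioral pseudolabel.
audio_feat = torch.randn(8, 128)
video_feat = torch.randn(8, 256)
future_yaw = torch.empty(8).uniform_(-torch.pi, torch.pi)

logits = model(audio_feat, video_feat)
loss = F.cross_entropy(logits, yaw_to_bin(future_yaw))  # pseudolabel supervision
loss.backward()
opt.step()
```

In this toy setup the "supervision" is simply where the head points a moment later; the abstract's framing suggests richer behavioral signals (head motion and eye gaze over time) and joint training, which this sketch does not attempt to reproduce.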
Keywords
active speaker localization,conversational understanding,audiovisual learning,egocentric learning,eye tracking