A multi-purpose audio-visual corpus for multi-modal Persian speech recognition: The Arman-AV dataset

Expert Systems with Applications (2024)

Abstract
Automatic lip reading has advanced significantly in recent years. However, lip reading methods require large-scale datasets, which are scarce for many low-resource languages. In this paper, we introduce a new multipurpose audio-visual dataset for Persian. The dataset contains approximately 220 hours of video from 1,760 speakers and can be used for multiple tasks, such as lip reading, automatic speech recognition, audio-visual speech recognition, and speaker recognition. It is also the first large-scale lip reading dataset for Persian. We provide a baseline method for each task and propose a technique to identify visemes (visual units of speech) in Persian. The visemes obtained by this technique yield a 7% relative improvement in lip reading accuracy over previously proposed visemes, and the technique can be generalized to other languages as well.
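The two quantitative claims in the abstract can be unpacked with a little arithmetic. The sketch below is only illustrative: the `relative_improvement` helper and the two accuracy values are hypothetical and are not taken from the paper; only the 220 hours, 1,760 speakers, and the 7% relative figure come from the abstract.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Gain of `improved` over `baseline`, expressed as a fraction of the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline


# Hypothetical accuracies, chosen only to illustrate what "7% relative" means.
baseline_acc = 0.50   # assumed accuracy with previously proposed visemes
improved_acc = 0.535  # assumed accuracy with the new visemes

rel_gain = relative_improvement(baseline_acc, improved_acc)  # about 0.07
abs_gain = improved_acc - baseline_acc                       # about 0.035

# Average footage per speaker, using the figures stated in the abstract.
avg_minutes_per_speaker = 220 * 60 / 1760

print(f"relative gain: {rel_gain:.1%}, absolute gain: {abs_gain:.1%}")
print(f"average video per speaker: {avg_minutes_per_speaker} minutes")
```

Under these assumed numbers, a 7% relative improvement corresponds to an absolute gain of only 3.5 percentage points, which is why the distinction matters when comparing results. The per-speaker average is an approximation, since the abstract gives the total duration only as "approximately 220 hours".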
Keywords
Persian dataset, Audio-visual speech recognition, Lip reading, Viseme