IndicVoices: Towards building an Inclusive Multilingual Speech Dataset for Indian Languages
arXiv (2024)
Abstract
We present INDICVOICES, a dataset of natural and spontaneous speech
containing a total of 7348 hours of read (9%), extempore (74%) and
conversational (17%) audio from 16237 speakers covering 145 Indian districts
and 22 languages. Of these 7348 hours, 1639 hours have already been
transcribed, with a median of 73 hours per language. Through this paper, we
share our journey of capturing the cultural, linguistic and demographic
diversity of India to create a one-of-its-kind inclusive and representative
dataset. More specifically, we share an open-source blueprint for data
collection at scale comprising standardised protocols, centralised tools, a
repository of engaging questions, prompts and conversation scenarios spanning
multiple domains and topics of interest, quality control mechanisms,
comprehensive transcription guidelines and transcription tools. We hope that
this open-source blueprint will serve as a comprehensive starter kit for data
collection efforts in other multilingual regions of the world. Using
INDICVOICES, we build IndicASR, the first ASR model to support all 22
languages listed in the Eighth Schedule of the Constitution of India. All the
data, tools, guidelines, models and other materials developed as a part of this
work will be made publicly available.