Czech spontaneous speech corpus with structural metadata

INTERSPEECH (2005)

Abstract
This paper describes a Czech spontaneous speech corpus consisting of radio talk show recordings. As the first complete non-English MDE corpus, it has been annotated with structural metadata: information beyond the words that is critical both to increasing transcript readability and to allowing the application of downstream NLP methods. Metadata annotation involves partitioning verbatim transcripts into syntactic/semantic units (SUs), each of which expresses a complete idea, and identifying fillers and edit disfluencies. Annotation guidelines for English metadata developed by the Linguistic Data Consortium were taken as the starting point, with changes applied to accommodate specific phenomena of Czech. In addition to the necessary language-dependent modifications, we further propose some language-independent modifications, including limited prosodic labeling at SU boundaries. Statistics about the structural metadata annotation present in the corpus and inter-annotator agreement numbers are also presented.
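To illustrate how structural metadata of this kind can improve transcript readability, here is a minimal Python sketch. It is not the corpus format or the authors' tooling: the `Token` class, its `kind` labels, the `su_end` flag, and the English toy utterance are all hypothetical and chosen only for illustration. The sketch drops fillers and the deleted part of edit disfluencies, then splits the remaining words at SU boundaries.

```python
from dataclasses import dataclass


@dataclass
class Token:
    # Hypothetical token representation, not the corpus's actual annotation format.
    text: str
    kind: str = "word"     # "word", "filler", or "deleted" (discarded part of an edit disfluency)
    su_end: bool = False   # True if an SU boundary follows this token


def clean_transcript(tokens):
    """Return one cleaned string per SU, keeping only fluent words."""
    units, current = [], []
    for tok in tokens:
        if tok.kind == "word":
            current.append(tok.text)
        if tok.su_end:
            # Close the current SU; skip it if nothing fluent remains.
            if current:
                units.append(" ".join(current))
            current = []
    if current:
        # Keep trailing words that were never closed by an SU boundary.
        units.append(" ".join(current))
    return units


# Invented toy utterance: "uh we went we drove home you know it rained"
tokens = [
    Token("uh", kind="filler"),
    Token("we", kind="deleted"),
    Token("went", kind="deleted"),
    Token("we"),
    Token("drove"),
    Token("home", su_end=True),
    Token("you know", kind="filler"),
    Token("it"),
    Token("rained", su_end=True),
]
cleaned = clean_transcript(tokens)
print(cleaned)
```

Keeping the annotation as labels on tokens, rather than editing the verbatim text, means the original transcript stays recoverable while downstream tools can work on the cleaned units.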