Speech retrieval from unsegmented Finnish audio using statistical morpheme-like units for segmentation, recognition, and retrieval

ACM Transactions on Speech and Language Processing (2011)

Abstract
This article examines the use of statistically discovered morpheme-like units for Spoken Document Retrieval (SDR). The morpheme-like units (morphs) are used both for language modeling in speech recognition and as index terms. Traditional word-based methods suffer from out-of-vocabulary words: if a word is not in the recognizer vocabulary, every occurrence of the word in speech will be missing from the transcripts. The problem is especially severe for languages with a high number of distinct word forms, such as Finnish. With the morph language model, even previously unseen words can be recognized by identifying their component morphs. Similarly, in information retrieval queries, complex word forms, even unseen ones, can be matched to data after segmenting them into morphs. Retrieval performance can be further improved by expanding the transcripts with alternative recognition results from confusion networks. In this article, a novel retrieval evaluation corpus consisting of unsegmented Finnish radio programs, 25 queries, and corresponding human relevance assessments was constructed. Previous results on using morphs and confusion networks for Finnish SDR are confirmed and extended to the unsegmented case. As before, using morphs or base forms as index terms yields about equal performance, but combination methods, including a new one, are found to work better than either alone. Using alternative morph segmentations of the query words is found to further improve the results. Lexical similarity-based story segmentation was applied, and performance using morphs, base forms, and their combinations was compared for the first time.
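To illustrate why morph index terms let an unseen word form match indexed speech transcripts, here is a minimal Python sketch. It is not the article's method: the morph lexicon `MORPHS` is a hand-written toy (the article discovers its morphs statistically), the greedy longest-match `segment` function is a stand-in for a real segmenter, and `build_index` and `search` are hypothetical helper names introduced only for this example.

```python
from collections import defaultdict

# Hypothetical toy morph lexicon (assumption); the article learns morphs statistically.
MORPHS = {"talo", "auto", "ssa", "i", "kin", "n"}


def segment(word, morphs=MORPHS):
    """Greedy longest-match segmentation (a stand-in, not the article's segmenter).

    Stretches not covered by the lexicon fall back to single characters.
    """
    out, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in morphs:
                out.append(word[i:j])
                i = j
                break
        else:
            # No lexicon entry starts here: emit one character and move on.
            out.append(word[i])
            i += 1
    return out


def build_index(docs):
    """Build an inverted index mapping each morph to the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            for morph in segment(word):
                index[morph].add(doc_id)
    return index


def search(index, query):
    """Rank documents by the number of query morphs they share with the query."""
    scores = defaultdict(int)
    for word in query.lower().split():
        for morph in segment(word):
            for doc_id in index.get(morph, ()):
                scores[doc_id] += 1
    return sorted(scores, key=lambda d: (-scores[d], d))


docs = {"d1": "talossa", "d2": "autossa"}
index = build_index(docs)
# "taloissakin" never occurs in the documents, yet it shares morphs with "talossa".
print(segment("taloissakin"))
print(search(index, "taloissakin"))
```

In this toy setting the query word form does not appear in either document, but its morphs overlap most with the first document, so that document ranks first. A word-level index would return nothing for the same query.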
Keywords
Finnish SDR, complex word form, distinct word form, confusion network, unsegmented Finnish audio, query word, unseen word, statistical morpheme-like unit, component morphs, base form, speech retrieval, out-of-vocabulary word, morpheme-like unit, indexing terms, information retrieval, speech recognition, lattices, indexation, morphemes, language model