High-precision Voice Search Query Correction via Retrievable Speech-text Embeddings
CoRR(2024)
Abstract
Automatic speech recognition (ASR) systems can suffer from poor recall for
various reasons, such as noisy audio or a lack of sufficient training data.
Previous work has shown that recall can be improved by retrieving rewrite
candidates from a large database of likely, contextually relevant alternatives
to the hypothesis text, using nearest-neighbor search over embeddings of the
ASR hypothesis text and of the candidate corrections.
However, ASR-hypothesis-based retrieval can yield poor precision if the
textual hypotheses are too phonetically dissimilar to the transcript truth. In
this paper, we eliminate the hypothesis-audio mismatch problem by querying the
correction database directly using embeddings derived from the utterance audio;
the embeddings of the utterance audio and candidate corrections are produced by
multimodal speech-text embedding networks trained to place the embedding of the
audio of an utterance and the embedding of its corresponding textual transcript
close together.
After locating an appropriate correction candidate using nearest-neighbor
search, we score the candidate with its speech-text embedding distance before
adding the candidate to the original n-best list.
We show a relative word error rate (WER) reduction of 6% on utterances whose
transcripts appear in the candidate set, without increasing WER on general
utterances.
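
The retrieve-then-score step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the hand-written 3-dimensional vectors stand in for the outputs of the speech-text embedding networks, cosine distance is an assumed choice of metric, a brute-force scan replaces a real nearest-neighbor index, and the function names and the `max_distance` threshold are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest_candidate(audio_emb, candidates):
    """Brute-force nearest-neighbor search over (text, embedding) pairs.

    Returns the closest candidate text and its distance to the audio embedding.
    """
    scored = ((text, cosine_distance(audio_emb, emb)) for text, emb in candidates)
    return min(scored, key=lambda pair: pair[1])


def augment_nbest(nbest, audio_emb, candidates, max_distance=0.2):
    """Append the nearest correction candidate to a copy of the n-best list.

    nbest is a list of (text, score) pairs. The appended candidate is scored
    with its speech-text embedding distance. The max_distance cutoff is a
    hypothetical precision guard, not a value taken from the paper.
    """
    text, dist = nearest_candidate(audio_emb, candidates)
    already_present = any(text == hyp for hyp, _ in nbest)
    if dist > max_distance or already_present:
        return list(nbest)
    return list(nbest) + [(text, dist)]


# Toy data: embeddings are invented for illustration only.
candidates = [
    ("play taylor swift", [0.9, 0.1, 0.0]),
    ("play tailor shift", [0.1, 0.9, 0.0]),
]
audio_emb = [0.8, 0.2, 0.0]            # stand-in for the utterance audio embedding
nbest = [("play tailor swift", 0.3)]   # original ASR hypothesis with its score
result = augment_nbest(nbest, audio_emb, candidates)
```

In this sketch an utterance whose audio embedding is far from every candidate leaves the n-best list unchanged, which mirrors the stated goal of not increasing WER on general utterances.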