Simultaneous Interpretation Corpus Construction by Large Language Models in Distant Language Pair
arXiv (2024)
Abstract
In Simultaneous Machine Translation (SiMT) systems, training with a
simultaneous interpretation (SI) corpus is an effective method for achieving
high-quality yet low-latency systems. However, curating such a corpus is very
challenging because it demands annotators with specialized interpretation
skills, and hence existing SI corpora are limited. Therefore, we propose a method to convert
existing speech translation corpora into interpretation-style data, maintaining
the original word order and preserving the entire source content using Large
Language Models (LLM-SI-Corpus). We demonstrate that fine-tuning SiMT models in
text-to-text and speech-to-text settings with the LLM-SI-Corpus reduces
latencies while maintaining the same level of quality as the models trained
with offline datasets. The LLM-SI-Corpus is available at
.
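The core of the proposed conversion is prompting an LLM to rewrite an offline translation so that it follows the source word order while keeping all content. A minimal sketch of such a prompt builder is below; the function name and the prompt wording are illustrative assumptions, not the paper's actual prompt.

```python
def build_si_prompt(source_sentence: str, offline_translation: str) -> str:
    """Build a prompt asking an LLM to convert an offline translation into
    simultaneous-interpretation style: monotonic word order, full content
    preserved. The wording here is a hypothetical sketch, not the paper's
    published prompt."""
    return (
        "Rewrite the translation so that it follows the source word order "
        "as closely as possible (simultaneous interpretation style), "
        "without omitting any content from the source.\n"
        f"Source: {source_sentence}\n"
        f"Offline translation: {offline_translation}\n"
        "SI-style translation:"
    )


# Example: an English-Japanese pair, a typical distant language pair
# where offline translations heavily reorder the source.
prompt = build_si_prompt(
    "I went to the store because it was raining.",
    "雨が降っていたので、店に行きました。",
)
print(prompt)
```

The resulting prompt would be sent to an LLM, and the generated SI-style translation paired with the original source to form the interpretation-style corpus.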