Transforming LLMs into Cross-modal and Cross-lingual Retrieval Systems
CoRR (2024)
Abstract
Large language models (LLMs) are trained on text-only data whose language
coverage goes far beyond that of paired speech and text data. At the same time,
Dual Encoder (DE)-based retrieval systems project queries and documents into the
same embedding space and have demonstrated their success in retrieval and bi-text
mining. To match speech and text in many languages, we propose using LLMs to
initialize multi-modal DE retrieval systems. Unlike traditional methods, our
system does not require speech data during LLM pre-training and can exploit the
LLM's multilingual text understanding capabilities to match speech and text in
languages unseen during retrieval training. Our multi-modal LLM-based retrieval
system is capable of matching speech and text in 102 languages despite being
trained on only 21 languages. Our system outperforms previous systems trained
explicitly on all 102 languages, achieving a 10% absolute improvement in
Recall@1 averaged across these languages. Additionally, our model demonstrates
cross-lingual speech and text matching, which is further enhanced by readily
available machine translation data.
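The dual-encoder idea (queries and documents projected into one embedding space) and the Recall@1 metric cited above can be illustrated with a minimal sketch. This is not the authors' implementation: the vectors below are hand-picked toy values standing in for speech-encoder and text-encoder outputs, the LLM-based initialization is not modeled at all, and using cosine similarity as the scoring function is an assumption. The helper names (`l2_normalize`, `cosine_scores`, `recall_at_1`) are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def l2_normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return [0.0 for _ in v]
    return [x / norm for x in v]


def cosine_scores(queries: List[Vector], docs: List[Vector]) -> List[List[float]]:
    """Score every query against every document in the shared embedding space."""
    qn = [l2_normalize(q) for q in queries]
    dn = [l2_normalize(d) for d in docs]
    return [[sum(a * b for a, b in zip(q, d)) for d in dn] for q in qn]


def recall_at_1(queries: List[Vector], docs: List[Vector], gold: List[int]) -> float:
    """Fraction of queries whose top-scoring document is the correct match.

    gold[i] is the index in `docs` of the correct match for queries[i].
    """
    scores = cosine_scores(queries, docs)
    hits = 0
    for i, row in enumerate(scores):
        best = max(range(len(row)), key=lambda j: row[j])
        if best == gold[i]:
            hits += 1
    return hits / len(queries)


# Hypothetical toy embeddings: speech utterances as queries, transcripts as documents.
speech = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.3, 0.0, 1.0]]
text = [[0.9, 0.2, 0.0], [0.1, 0.0, 1.0], [0.0, 0.8, 0.1]]
gold = [0, 2, 1]  # speech[i] should retrieve text[gold[i]]

print(recall_at_1(speech, text, gold))
```

In this toy setup every speech query's nearest text vector is its paired transcript, so Recall@1 is 1.0; with a mismatched `gold` list the score drops accordingly. Normalizing both sides first keeps the score independent of embedding magnitude, which is a common choice for dual encoders.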