A Data Selection Method for Domain-specific Machine Translation

Na Ye, Ling Jiang, Dongfeng Cai

2023 3rd International Conference on Frontiers of Electronics, Information and Computation Technologies (ICFEICT), 2023

Abstract
Neural machine translation performs well in general domains where large-scale bilingual corpora are available, but it is much less effective in specific domains where only a small amount of bilingual data exists. Data selection is one of the main ways to alleviate this scarcity: sentences that are close to the in-domain corpus are selected from a large-scale general-domain corpus and used to expand the training data. Because the general-domain data is so large, traditional dimension reduction methods are inefficient at this scale. This paper proposes a two-stage data selection strategy that improves both the efficiency of data selection and the quality of the selected data. In the first stage, the Inverted File System Product Quantization (IVFPQ) method is used to make a preliminary selection, which improves computational efficiency and the quality of the selected data. In the second stage, a domain classifier is trained to take the domain characteristics fully into account, and it further filters the preliminary selection to obtain higher-quality data. Experimental results show that, compared with the baseline method, the expanded data obtained in this way trains a better domain-specific machine translation model.
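The two-stage strategy described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: sentences are represented by toy 2-D vectors, stage 1 uses only the inverted-file (coarse clustering) half of IVFPQ and omits product quantization, and stage 2 replaces the trained domain classifier with a hypothetical nearest-centroid scorer. All function names and parameters (`nprobe`, `top_k`, `threshold`) are illustrative.

```python
def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def nearest(v, centroids):
    """Index of the centroid closest to v."""
    return min(range(len(centroids)), key=lambda c: dist2(v, centroids[c]))

def train_coarse_quantizer(vectors, k, iters=5):
    """Plain k-means with deterministic farthest-point initialisation."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        far = max(vectors, key=lambda v: min(dist2(v, c) for c in centroids))
        centroids.append(list(far))
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in vectors:
            buckets[nearest(v, centroids)].append(v)
        centroids = [mean(b) if b else c for b, c in zip(buckets, centroids)]
    return centroids

def build_inverted_lists(vectors, centroids):
    """Map each coarse cell to the indices of the vectors assigned to it."""
    lists = {c: [] for c in range(len(centroids))}
    for idx, v in enumerate(vectors):
        lists[nearest(v, centroids)].append(idx)
    return lists

def stage1_select(queries, vectors, centroids, lists, nprobe=2, top_k=4):
    """Stage 1: for each in-domain query, search only the nprobe closest cells."""
    chosen = set()
    for q in queries:
        cells = sorted(range(len(centroids)),
                       key=lambda c: dist2(q, centroids[c]))[:nprobe]
        pool = [i for c in cells for i in lists[c]]
        pool.sort(key=lambda i: dist2(q, vectors[i]))
        chosen.update(pool[:top_k])
    return sorted(chosen)

def make_domain_scorer(in_domain, general):
    """Hypothetical stand-in for a trained domain classifier (nearest centroid)."""
    c_in, c_gen = mean(in_domain), mean(general)
    def score(v):
        d_in, d_gen = dist2(v, c_in), dist2(v, c_gen)
        return d_gen / (d_in + d_gen + 1e-12)  # near 1.0 means in-domain
    return score

def stage2_filter(candidates, vectors, score, threshold=0.5):
    """Stage 2: keep only candidates the domain scorer accepts."""
    return [i for i in candidates if score(vectors[i]) >= threshold]

# Toy data: three general-domain clusters; the in-domain sentences sit near (5, 5).
general = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
           [5.0, 5.0], [5.1, 4.9], [4.9, 5.2],
           [9.0, 0.0], [9.1, 0.2]]
in_domain = [[5.0, 5.1], [5.2, 5.0]]

centroids = train_coarse_quantizer(general, k=3)
lists = build_inverted_lists(general, centroids)
candidates = stage1_select(in_domain, general, centroids, lists)
selected = stage2_filter(candidates, general, make_domain_scorer(in_domain, general))
print("stage 1 candidates:", candidates)
print("stage 2 selection: ", selected)
```

On this toy data, stage 1 keeps one out-of-domain neighbour because `top_k` exceeds the size of the closest cell, and stage 2 then drops it. At real scale, a library index that combines inverted lists with product-quantized codes would take the place of the hand-written stage 1.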
Keywords
Domain adaptation, Data selection, Domain classifier