An Analysis of BPE Vocabulary Trimming in Neural Machine Translation
arXiv (2024)
Abstract
We explore threshold vocabulary trimming in Byte-Pair Encoding subword
tokenization, a postprocessing step that replaces rare subwords with their
component subwords. The technique is available in popular tokenization
libraries but has not been subjected to rigorous scientific scrutiny. While the
removal of rare subwords is suggested as best practice in machine translation
implementations, both as a means to reduce model size and for improving model
performance through robustness, our experiments indicate that, across a large
space of hyperparameter settings, vocabulary trimming fails to improve
performance, and is even prone to incurring heavy degradation.
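The trimming step described above can be sketched as follows. This is a minimal illustration of threshold-based vocabulary trimming, not the paper's or any particular library's implementation; the function names, data layout, and threshold are assumptions for the example.

```python
# Illustrative sketch of BPE vocabulary trimming (names and data layout assumed,
# not taken from the paper or a specific tokenization library).

def trim_vocab(token_freq, merges, min_freq):
    """Drop merged subwords rarer than min_freq from the vocabulary.

    token_freq: subword -> corpus frequency
    merges: merged subword -> the pair of component subwords it was built from
    Base (unmergeable) tokens are always kept so every token stays encodable.
    """
    return {tok: f for tok, f in token_freq.items()
            if f >= min_freq or tok not in merges}

def encode(token, vocab, merges):
    """Recursively fall back to component subwords for trimmed tokens."""
    if token in vocab or token not in merges:
        return [token]
    left, right = merges[token]
    return encode(left, vocab, merges) + encode(right, vocab, merges)
```

With a trimmed vocabulary, a rare merged subword such as a low-frequency `"low"` would be emitted as its components `["lo", "w"]` instead of a single token, which is the robustness-oriented behavior the paper evaluates.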