Different Tokenization Schemes Lead to Comparable Performance in Spanish Number Agreement
CoRR (2024)
Abstract
The relationship between language model tokenization and performance is an
open area of research. Here, we investigate how different tokenization schemes
impact number agreement in Spanish plurals. We find that
morphologically-aligned tokenization performs similarly to other tokenization
schemes, even when induced artificially for words that would not be tokenized
that way during training. We then present exploratory analyses demonstrating
that language model embeddings for different plural tokenizations have similar
distributions along the embedding space axis that maximally distinguishes
singular and plural nouns. Our results suggest that morphologically-aligned
tokenization is a viable tokenization approach, and existing models already
generalize some morphological patterns to new items. However, our results
indicate that morphological tokenization is not strictly required for
performance.
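The embedding analysis described above can be illustrated with a minimal sketch. One simple way to find an axis that maximally distinguishes singular from plural noun embeddings is the difference of class means; this is an assumption for illustration, not necessarily the paper's exact method, and the embeddings here are synthetic stand-ins rather than real model outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for model embeddings (dim=8): plurals are shifted from
# singulars along a fixed hidden "number" direction plus noise.
dim = 8
number_dir = rng.normal(size=dim)
number_dir /= np.linalg.norm(number_dir)
singulars = rng.normal(size=(50, dim))
plurals = rng.normal(size=(50, dim)) + 2.0 * number_dir

# Candidate separating axis: normalized difference of class means.
axis = plurals.mean(axis=0) - singulars.mean(axis=0)
axis /= np.linalg.norm(axis)

# Project each embedding onto the axis; comparing the two projection
# distributions shows how well the axis separates singular from plural.
sing_proj = singulars @ axis
plur_proj = plurals @ axis
print(f"mean projection, singular: {sing_proj.mean():.2f}")
print(f"mean projection, plural:   {plur_proj.mean():.2f}")
```

Under this sketch, comparing the projection distributions of different plural tokenizations along one shared axis is what would reveal whether they occupy similar regions of the embedding space.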