Train & Constrain: Phonologically Informed Tongue-Twister Generation from Topics and Paraphrases
arXiv (2024)
Abstract
Previous work in phonologically and phonetically grounded language generation
has mainly focused on domains such as puns and poetry. In this article, we
present new work on the generation of tongue-twisters, a form of language that
must be conditioned at the phoneme level to maximize sound overlap,
whilst maintaining semantic consistency with an input topic and still being
grammatically correct. We present TwisterLister, a pipeline for generating
phonologically informed tongue-twisters from Large Language Models (LLMs) that
we use to generate TwistList 2.0, the largest annotated dataset of
tongue-twisters to date, consisting of 17K+ examples from a combination of
human and LLM authors. Our generation pipeline involves the use of a
phonologically constrained vocabulary alongside LLM prompting to generate
novel, non-derivative tongue-twister examples. We additionally present the
results of automatic and human evaluation of smaller models trained on our
generated dataset to demonstrate the extent to which phonologically motivated
language types can be generated without explicit injection of phonological
knowledge. We further introduce a Phoneme-Aware Constrained Decoding (PACD)
module that can be integrated into any causal language model and
demonstrate that this method generates good-quality tongue-twisters both with
and without fine-tuning the underlying language model. We also design and
implement a range of automatic metrics for the task of tongue-twister
generation that are phonologically motivated and capture the unique essence of
tongue-twisters based on Phonemic Edit Distance (PED).
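
To make the idea of a phoneme-level metric concrete, the following is a minimal sketch of an edit distance computed over phoneme symbols rather than characters. It is an illustration under stated assumptions, not the PED definition used in the paper: the distance here is plain unweighted Levenshtein (the paper's metric may weight substitutions differently), and `TOY_LEXICON`, `phoneme_edit_distance`, and `mean_adjacent_distance` are hypothetical names introduced only for this example.

```python
from typing import Dict, List, Sequence

# Hypothetical toy lexicon (ARPAbet-style symbols, stress marks omitted).
# A real system would use a pronouncing dictionary or a G2P model instead.
TOY_LEXICON: Dict[str, List[str]] = {
    "she": ["SH", "IY"],
    "sells": ["S", "EH", "L", "Z"],
    "sea": ["S", "IY"],
    "shells": ["SH", "EH", "L", "Z"],
}


def phoneme_edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Unweighted Levenshtein distance over phoneme symbols (an assumption;
    the paper's PED may use a different cost scheme)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, pa in enumerate(a, start=1):
        curr = [i]  # deleting the first i phonemes of a
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def mean_adjacent_distance(words: Sequence[str]) -> float:
    """Average phoneme edit distance between neighbouring words.
    Lower values mean adjacent words sound more alike."""
    if len(words) < 2:
        raise ValueError("need at least two words")
    phones = [TOY_LEXICON[w] for w in words]
    dists = [phoneme_edit_distance(x, y) for x, y in zip(phones, phones[1:])]
    return sum(dists) / len(dists)
```

In this sketch, "sells" and "shells" differ by a single phoneme substitution, so their distance is small even though their spellings differ by more than one character. That gap between spelling and sound is the motivation for measuring overlap at the phoneme level.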