On the Learnability of Watermarks for Language Models
CoRR (2023)
Abstract
Watermarking of language model outputs enables statistical detection of
model-generated text, which has many applications in the responsible deployment
of language models. Existing watermarking strategies operate by altering the
decoder of an existing language model. The ability of a language model to
learn to generate the watermark directly would have significant implications
for the real-world deployment of watermarks. First, learned watermarks could be
used to build open models that naturally generate watermarked text, allowing
for open models to benefit from watermarking. Second, if watermarking is used
to determine the provenance of generated text, an adversary can hurt the
reputation of a victim model by spoofing its watermark and generating damaging
watermarked text. To investigate the learnability of watermarks, we propose
watermark distillation, which trains a student model to behave like a teacher
model that uses decoding-based watermarking. We test our approach on three
distinct decoding-based watermarking strategies and various hyperparameter
settings, finding that models can learn to generate watermarked text with high
detectability. We also find limitations to learnability, including the loss of
watermarking capabilities under fine-tuning on normal text and high sample
complexity when learning low-distortion watermarks.
Keywords
watermarking, large language models, distillation