Explanation as a Watermark: Towards Harmless and Multi-bit Model Ownership Verification via Watermarking Feature Attribution
arXiv (2024)
Abstract
Ownership verification is currently the most critical and widely adopted
post-hoc method for safeguarding model copyright. In general, model owners use
it to determine whether a suspicious third-party model was stolen from them
by examining whether it has particular properties 'inherited' from their
released models. Backdoor-based model watermarks are currently the primary and
cutting-edge methods for implanting such properties in released models.
However, backdoor-based methods have two fatal drawbacks: harmfulness
and ambiguity. The former means that they introduce maliciously
controllable misclassification behaviors (i.e., backdoors) into the watermarked
released models. The latter means that malicious users can easily pass
verification by finding other misclassified samples, leading to ownership
ambiguity.
In this paper, we argue that both limitations stem from the 'zero-bit' nature
of existing watermarking schemes, which rely on the status (i.e.,
misclassified) of predictions for verification. Motivated by this
understanding, we design a new watermarking paradigm, Explanation as a
Watermark (EaaW), that implants verification behaviors into the feature
attribution explanation instead of the model predictions. Specifically, EaaW
embeds a 'multi-bit' watermark into the feature attribution explanation of
specific trigger samples without changing the original prediction. We
correspondingly design watermark embedding and extraction algorithms inspired
by explainable artificial intelligence. In particular, our approach can be
applied to different tasks (e.g., image classification and text generation).
Extensive experiments verify the effectiveness and harmlessness of EaaW and
its resistance to potential attacks.
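
The idea of reading a multi-bit watermark out of a feature attribution, rather than out of a prediction label, can be illustrated with a small sketch. This is not the paper's algorithm. It assumes a LIME-style setup in which the trigger sample is split into equal segments, random segments are masked to zero, and a ridge-regression surrogate is fitted to the model outputs. The sign of each surrogate weight is taken as one extracted bit. The names `attribution_bits` and `hamming_match`, the masking scheme, and the regularization constant are all hypothetical choices made for illustration.

```python
import numpy as np


def attribution_bits(predict, trigger, n_bits, n_masks=256, seed=0):
    """Extract a bit string from a LIME-style linear surrogate (illustrative only).

    predict: callable mapping a 1-D array to a scalar score (assumed interface).
    trigger: 1-D array, the trigger sample whose explanation carries the bits.
    """
    rng = np.random.default_rng(seed)
    # Split the input features into n_bits contiguous segments.
    segments = np.array_split(np.arange(trigger.size), n_bits)
    # Each row says which segments are kept (1) or zeroed out (0).
    masks = rng.integers(0, 2, size=(n_masks, n_bits))
    outputs = np.empty(n_masks)
    for i, mask in enumerate(masks):
        perturbed = np.zeros_like(trigger)
        for keep, idx in zip(mask, segments):
            if keep:
                perturbed[idx] = trigger[idx]
        outputs[i] = predict(perturbed)
    # Ridge regression of outputs on masks gives one weight per segment.
    centered = masks - masks.mean(axis=0)
    gram = centered.T @ centered + 1e-3 * np.eye(n_bits)
    weights = np.linalg.solve(gram, centered.T @ (outputs - outputs.mean()))
    # The sign of each attribution weight is read as one watermark bit.
    return (weights > 0).astype(int)


def hamming_match(extracted, watermark):
    """Fraction of extracted bits that agree with the owner's watermark."""
    return float(np.mean(np.asarray(extracted) == np.asarray(watermark)))


# Toy usage: a linear "model" whose segment contributions encode the bits.
watermark = np.array([1, 0, 1, 1, 0, 0, 1, 0])
toy_weights = np.repeat(np.where(watermark == 1, 1.0, -1.0), 2)
bits = attribution_bits(lambda x: float(toy_weights @ x), np.ones(16), n_bits=8)
print(hamming_match(bits, watermark))
```

In this toy setting, verification compares a whole bit string rather than a single misclassification event, which is the sense in which a multi-bit scheme leaves less room for ambiguity than a zero-bit one. A real verifier would also need a decision threshold on the match rate, which is left out here.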