How Trustworthy are Open-Source LLMs? An Assessment under Malicious Demonstrations Shows their Vulnerabilities
arXiv (2023)
Abstract
The rapid progress in open-source Large Language Models (LLMs) is
significantly driving AI development forward. However, there is still a limited
understanding of their trustworthiness. Deploying these models at scale without
sufficient trustworthiness can pose significant risks, highlighting the need to
uncover these issues promptly. In this work, we conduct an adversarial
assessment of open-source LLMs on trustworthiness, scrutinizing them across
eight different aspects including toxicity, stereotypes, ethics, hallucination,
fairness, sycophancy, privacy, and robustness against adversarial
demonstrations. We propose advCoU, an extended Chain of Utterances-based (CoU)
prompting strategy that incorporates carefully crafted malicious demonstrations
for trustworthiness attacks. Our extensive experiments encompass recent and
representative series of open-source LLMs, including Vicuna, MPT, Falcon,
Mistral, and Llama 2. The empirical outcomes underscore the efficacy of our
attack strategy across diverse aspects. More interestingly, our analysis of the
results reveals that models with superior performance on general NLP tasks do
not always have greater trustworthiness; in fact, larger models can be more
vulnerable to attacks. Additionally, models that have undergone instruction
tuning, which focuses on instruction following, tend to be more susceptible,
although fine-tuning LLMs for safety alignment proves effective in mitigating
adversarial trustworthiness attacks.
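The abstract does not spell out the prompt format, but the core idea of advCoU, prepending malicious in-context demonstrations rendered as a chain of utterances before the test query, can be illustrated with a minimal sketch. Everything below (the `Demonstration` structure, the `build_adv_cou_prompt` helper, and the placeholder texts) is a hypothetical illustration under that reading, not the paper's actual templates or code.

```python
# Hypothetical sketch of assembling an adversarial Chain-of-Utterances (CoU)
# style prompt. Names and placeholder strings are illustrative assumptions,
# not the paper's real prompts.

from dataclasses import dataclass

@dataclass
class Demonstration:
    """One malicious in-context example: a user query and a target response."""
    query: str
    response: str

def build_adv_cou_prompt(demos: list[Demonstration], test_query: str) -> str:
    """Render the demonstrations as a multi-turn conversation, then append
    the test query so the model is nudged to continue the malicious pattern."""
    turns = []
    for d in demos:
        turns.append(f"User: {d.query}")
        turns.append(f"Assistant: {d.response}")
    turns.append(f"User: {test_query}")
    turns.append("Assistant:")  # leave the final turn open for the model
    return "\n".join(turns)

if __name__ == "__main__":
    # Placeholder demonstrations; an actual attack would craft these per
    # trustworthiness aspect (e.g., toxicity, sycophancy, privacy).
    demos = [
        Demonstration("<crafted query 1>", "<target malicious response 1>"),
        Demonstration("<crafted query 2>", "<target malicious response 2>"),
    ]
    print(build_adv_cou_prompt(demos, "<held-out test query>"))
```

Under this reading, the attack succeeds if the model follows the pattern established by the demonstrations when answering the held-out query, rather than refusing or answering safely.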