Adversarial Preference Learning for Robust LLM Alignment

Yuanfu Wang, Pengyu Wang, Chenyang Xi, Bo Tang, Junyi Zhu, Wenqiang Wei, Chen, Chao Yang, Jingfeng Zhang, Chaochao Lu, Yijun Niu, Keming Mao, Zhiyu Li, Feiyu Xiong, Jie Hu, Mingchuan Yang

Annual Meeting of the Association for Computational Linguistics (2025)

Abstract

Modern language models often rely on Reinforcement Learning from Human Feedback (RLHF) to encourage safe behaviors. However, they remain vulnerable to adversarial attacks because of three key limitations: (1) the inefficiency and high cost of human annotation, (2) the vast diversity of potential adversarial attacks, and (3) the risk of feedback bias and reward hacking. To address these challenges, we introduce Adversarial Preference Learning (APL), an iterative adversarial training method built on three key innovations. First, it uses a direct harmfulness metric based on the model's intrinsic preference probabilities, which eliminates reliance on external assessment. Second, it employs a conditional generative attacker that synthesizes input-specific adversarial variations. Third, it runs an iterative framework with automated closed-loop feedback, which enables continuous adaptation through vulnerability discovery and mitigation. Experiments on Mistral-7B-Instruct-v0.3 demonstrate that APL significantly enhances robustness, achieving 83.33 (evaluated by GPT-4o), reducing harmful outputs from 5.88 (by LLaMA-Guard), and lowering attack success rate by up to 65 on HarmBench. Notably, APL maintains competitive utility, with an MT-Bench score of 6.59 (comparable to the baseline 6.78) and an LC-WinRate of 46.52 against the base model.
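The abstract does not give the formula for the harmfulness metric, so the following is only an illustrative sketch of what a metric "based on the model's intrinsic preference probabilities" could look like. It assumes a Bradley-Terry-style comparison between the model's log-probability of a harmful response and of a safe response to the same prompt. The function names, the `beta` scale, and the toy log-probabilities are all hypothetical and are not taken from the paper.

```python
import math


def sequence_logprob(token_logprobs):
    """Sum per-token log-probabilities into one sequence log-probability."""
    return sum(token_logprobs)


def harmful_preference_prob(logp_harmful, logp_safe, beta=1.0):
    """Hypothetical harmfulness score in [0, 1].

    Assumption (not from the paper): the model's preference for the harmful
    response over the safe one is sigmoid(beta * (logp_harmful - logp_safe)).
    Values above 0.5 mean the model assigns more probability to the harmful
    response than to the safe one.
    """
    margin = beta * (logp_harmful - logp_safe)
    # Numerically stable sigmoid: avoid exp() overflow for large |margin|.
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    e = math.exp(margin)
    return e / (1.0 + e)


# Toy per-token log-probabilities (made up for illustration only).
harmful_tokens = [-0.9, -1.2, -0.7]
safe_tokens = [-0.4, -0.5, -0.6]

score = harmful_preference_prob(
    sequence_logprob(harmful_tokens),
    sequence_logprob(safe_tokens),
)
print(f"harmful-preference probability: {score:.3f}")
```

In an adversarial loop of the kind the abstract describes, a score like this could serve as the attacker's objective (find prompt variations that raise it) and as the defender's signal (train to lower it), without calling an external judge. That pairing is an interpretation of the abstract, not a description of the authors' implementation.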
