Don't Go To Extremes: Revealing the Excessive Sensitivity and Calibration Limitations of LLMs in Implicit Hate Speech Detection
CoRR (2024)
Abstract
The fairness and trustworthiness of Large Language Models (LLMs) are
receiving increasing attention. Implicit hate speech, which employs indirect
language to convey hateful intentions, accounts for a significant share of
the hate speech encountered in practice. However, the extent to which LLMs effectively address this issue
remains insufficiently examined. This paper delves into the capability of LLMs
to detect implicit hate speech (Classification Task) and express confidence in
their responses (Calibration Task). Our evaluation meticulously considers
various prompt patterns and mainstream uncertainty estimation methods. Our
findings highlight that LLMs exhibit two extremes: (1) LLMs display excessive
sensitivity towards groups or topics that may raise fairness concerns, causing
them to misclassify benign statements as hate speech. (2) Under every
uncertainty estimation method, LLMs' confidence scores are excessively
concentrated within a fixed range and remain unchanged regardless of the
dataset's complexity. Consequently, the calibration
performance is heavily reliant on primary classification accuracy. These
discoveries unveil new limitations of LLMs, underscoring the need for caution
when optimizing models to ensure they do not veer towards extremes. This serves
as a reminder to carefully consider sensitivity and confidence in the pursuit
of model fairness.
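
To make the "Calibration Task" concrete, the sketch below computes Expected Calibration Error (ECE), a standard calibration metric, over model confidence scores. This is an illustrative example only, not the paper's released code; the binning scheme, the use of verbalized confidence scores, and the toy data are assumptions introduced here.

```python
# Illustrative sketch (assumption: ECE over verbalized confidence scores;
# not the paper's exact evaluation code).
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=10):
    """ECE: bin-weighted average gap between mean confidence and accuracy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = (np.asarray(predictions) == np.asarray(labels)).astype(float)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            avg_conf = confidences[in_bin].mean()   # mean confidence in this bin
            avg_acc = correct[in_bin].mean()        # accuracy in this bin
            ece += in_bin.mean() * abs(avg_conf - avg_acc)
    return ece

# Toy example (hypothetical values): confidences clustered in a narrow high
# range, as the abstract describes, yield poor calibration whenever accuracy
# drops on harder data.
conf = [0.92, 0.95, 0.93, 0.94, 0.91, 0.96]   # hypothetical LLM confidence scores
pred = [1, 1, 0, 1, 1, 0]                     # hypothetical predicted labels
gold = [1, 0, 0, 1, 0, 0]                     # hypothetical gold labels
print(f"ECE = {expected_calibration_error(conf, pred, gold):.3f}")
```

Because the confidence scores barely move, ECE in such a setup tracks classification accuracy almost directly, which mirrors the abstract's observation that calibration performance becomes heavily reliant on primary classification accuracy.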