Towards Tracing Trustworthiness Dynamics: Revisiting Pre-training Period of Large Language Models
CoRR(2024)
摘要
Ensuring the trustworthiness of large language models (LLMs) is crucial. Most
studies concentrate on fully pre-trained LLMs to better understand and improve
LLMs' trustworthiness. In this paper, to reveal the untapped potential of
pre-training, we pioneer the exploration of LLMs' trustworthiness during this
period, focusing on five key dimensions: reliability, privacy, toxicity,
fairness, and robustness. To begin with, we apply linear probing to LLMs. The
high probing accuracy suggests that LLMs in early pre-training can
already distinguish concepts in each trustworthiness dimension. Therefore, to
further uncover the hidden possibilities of pre-training, we extract steering
vectors from a LLM's pre-training checkpoints to enhance the LLM's
trustworthiness. Finally, inspired by that mutual
information estimation is bounded by linear probing accuracy, we also probe
LLMs with mutual information to investigate the dynamics of trustworthiness
during pre-training. We are the first to observe a similar two-phase
phenomenon: fitting and compression . This research
provides an initial exploration of trustworthiness modeling during LLM
pre-training, seeking to unveil new insights and spur further developments in
the field. We will make our code publicly accessible at
.
更多查看译文
AI 理解论文
溯源树
样例
![](https://originalfileserver.aminer.cn/sys/aminer/pubs/mrt_preview.jpeg)
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要