Seeds of Stereotypes: A Large-Scale Textual Analysis of Race and Gender Associations with Diseases in Online Sources
arXiv (2024)
Abstract
Background Advancements in Large Language Models (LLMs) hold transformative
potential in healthcare; however, recent work has raised concerns about the
tendency of these models to produce outputs that display racial or gender
biases. Although training data is a likely source of such biases, exploration
of disease and demographic associations in text data at scale has been limited.
Methods We conducted a large-scale textual analysis using a dataset
comprising diverse web sources, including arXiv, Wikipedia, and Common Crawl.
We analyzed the contexts in which various diseases are discussed alongside
markers of race and gender. Given that LLMs are pre-trained on similar
datasets, this approach allowed us to examine the potential biases that LLMs
may learn and internalize. We compared these findings with actual demographic
disease prevalence as well as GPT-4 outputs to evaluate the extent of bias
representation.
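The abstract does not specify the exact procedure, but the analysis described above amounts to counting how often disease terms and demographic terms co-occur within a shared context window. The sketch below illustrates one way to do this in Python; the term lists, window size, and corpus are hypothetical placeholders, not the authors' actual configuration.

```python
# Minimal sketch of disease/demographic co-occurrence counting.
# Term lists, window size, and corpus are illustrative assumptions.
from collections import Counter

DISEASE_TERMS = {"asthma", "diabetes", "hypertension"}             # hypothetical subset
DEMOGRAPHIC_TERMS = {"black", "white", "asian", "male", "female"}  # hypothetical subset
WINDOW = 50  # tokens of context on each side of a disease mention (assumed)

def cooccurrence_counts(documents):
    """Count how often each (disease, demographic) term pair occurs
    within WINDOW tokens of each other across a corpus."""
    counts = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        for i, tok in enumerate(tokens):
            if tok in DISEASE_TERMS:
                context = set(tokens[max(0, i - WINDOW): i + WINDOW + 1])
                for term in DEMOGRAPHIC_TERMS:
                    if term in context:
                        counts[(tok, term)] += 1
    return counts

corpus = ["Studies of asthma among Black children in urban areas ..."]
print(cooccurrence_counts(corpus))  # Counter({('asthma', 'black'): 1})
```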
Results Our findings indicate that demographic terms are disproportionately
associated with specific disease concepts in online texts. Gender terms are
prominently associated with disease concepts, while racial terms are
associated far less frequently. We find widespread disparities in the
associations of specific racial and gender terms with the 18 diseases
analyzed. Most notably, we observe a significant overall overrepresentation
of Black race mentions relative to population proportions.
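As a concrete illustration of the comparison above, overrepresentation can be expressed as each group's share of race mentions divided by its share of the population. In the sketch below, the mention counts are made-up placeholders and the population shares are rough, assumed US Census-style figures, not the study's data.

```python
# Illustrative overrepresentation ratio: mention share / population share.
# All numbers here are placeholders, not results from the paper.
mention_counts = {"black": 4200, "white": 5100, "asian": 700}       # hypothetical
population_share = {"black": 0.136, "white": 0.616, "asian": 0.06}  # assumed, approximate

total_mentions = sum(mention_counts.values())
for group, count in mention_counts.items():
    mention_share = count / total_mentions
    ratio = mention_share / population_share[group]
    print(f"{group}: mention share {mention_share:.3f}, "
          f"overrepresentation ratio {ratio:.2f}")
```

A ratio above 1.0 would indicate that a group is mentioned in disease contexts more often than its population share alone would predict.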
Conclusions Our results highlight the need for critical examination and
transparent reporting of biases in LLM pretraining datasets. They also point
to the need for mitigation strategies that counteract the influence of biased
training data in LLMs, particularly in sensitive domains such as healthcare.