Fair Coresets via Optimal Transport
CoRR (2023)
Abstract
Data distillation and coresets have emerged as popular approaches to generate
a smaller representative set of samples for downstream learning tasks to handle
large-scale datasets. At the same time, machine learning is being increasingly
applied to decision-making processes at a societal level, making it imperative
for modelers to address inherent biases towards subgroups present in the data.
Current approaches create fair synthetic representative samples by optimizing
local properties relative to the original samples, but their effect on
downstream learning processes has yet to be explored. In this work, we present
fair Wasserstein coresets (FWC), a novel coreset approach that generates fair
synthetic representative samples along with sample-level weights to be used in
downstream learning tasks. FWC minimizes the Wasserstein distance between the
original dataset and the weighted synthetic samples while enforcing demographic
parity. We show that an unconstrained version of FWC is equivalent to Lloyd's
algorithm for k-medians and k-means clustering. Experiments conducted on both
synthetic and real datasets show that FWC: (i) achieves a competitive
fairness-performance tradeoff in downstream models compared to existing
approaches, (ii) improves downstream fairness when added to the existing
training data, and (iii) can be used to reduce biases in predictions from large
language models (GPT-3.5 and GPT-4).
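
To make the stated link to Lloyd's algorithm concrete, the following is a minimal sketch of plain k-means used as an unconstrained coreset builder: the cluster centers play the role of the synthetic samples, and each cluster's share of the data plays the role of its sample-level weight. This is an illustration only, not the authors' FWC implementation; it omits the demographic parity constraint entirely, and the function name `lloyd_coreset`, its parameters, and the choice of cluster proportions as weights are assumptions made for this example.

```python
import random


def lloyd_coreset(points, m, iters=50, seed=0):
    """Hypothetical helper: run Lloyd's k-means and return (centers, weights).

    centers: m synthetic samples (cluster means).
    weights: fraction of the original points assigned to each center.
    No fairness constraint is enforced here.
    """
    rng = random.Random(seed)
    # Initialize centers from m distinct data points.
    centers = [list(p) for p in rng.sample(points, m)]

    def nearest(p):
        # Index of the closest center under squared Euclidean distance.
        return min(
            range(m),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])),
        )

    for _ in range(iters):
        # Assignment step: attach every point to its nearest center.
        assign = [nearest(p) for p in points]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for j in range(m):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                dim = len(members[0])
                new_centers.append(
                    [sum(p[d] for p in members) / len(members) for d in range(dim)]
                )
            else:
                # Keep the previous center if a cluster becomes empty.
                new_centers.append(centers[j])
        if new_centers == centers:
            break
        centers = new_centers

    # Final assignment against the final centers gives the weights.
    assign = [nearest(p) for p in points]
    weights = [assign.count(j) / len(points) for j in range(m)]
    return centers, weights


if __name__ == "__main__":
    data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9)]
    centers, weights = lloyd_coreset(data, m=2)
    print(centers, weights)
```

The weights returned here sum to one, so the weighted centers form a discrete distribution that can be compared with the empirical distribution of the original data. A fairness-aware variant would additionally constrain these weights, which this sketch does not attempt.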