Quantifying and Mitigating Privacy Risks for Tabular Generative Models
arXiv (2024)
Abstract
Synthetic data from generative models is emerging as a privacy-preserving
data-sharing solution. Such a synthetic data set should resemble the original
data without revealing identifiable private information. The backbone
technology of tabular synthesizers is rooted in image generative models,
ranging from Generative Adversarial Networks (GANs) to recent diffusion models.
Prior work has shed light on the utility-privacy tradeoff on tabular data,
revealing and quantifying privacy risks on synthetic data. We first conduct an
exhaustive empirical analysis, highlighting the utility-privacy tradeoff of
five state-of-the-art tabular synthesizers, against eight privacy attacks, with
a special focus on membership inference attacks. Motivated by the observation
of high data quality but also high privacy risk in tabular diffusion, we
propose DP-TLDM, Differentially Private Tabular Latent Diffusion Model, which
is composed of an autoencoder network to encode the tabular data and a latent
diffusion model to synthesize the latent tables. Following the emerging f-DP
framework, we apply DP-SGD to train the autoencoder in combination with batch
clipping and use the separation value as the privacy metric to better capture
the privacy gain from DP algorithms. Our empirical evaluation demonstrates that
DP-TLDM is capable of achieving a meaningful theoretical privacy guarantee
while also significantly enhancing the utility of synthetic data. Specifically,
compared to other DP-protected tabular generative models, DP-TLDM improves the
synthetic quality by an average of 35% [...]
for downstream tasks, and 50% [...]
comparable level of privacy risk.
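
The abstract mentions training the autoencoder with DP-SGD combined with batch clipping. As a rough illustration only, the sketch below shows one such update step in plain Python. It assumes that "batch clipping" means clipping the L2 norm of the averaged mini-batch gradient once, rather than clipping each per-example gradient; that reading, the function names (`clip_to_norm`, `dp_sgd_batch_clip_step`), and all parameter values are hypothetical and are not taken from the paper. Privacy accounting under f-DP is not shown.

```python
import math
import random


def clip_to_norm(vec, max_norm):
    """Scale vec down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in vec]


def dp_sgd_batch_clip_step(params, per_example_grads, clip_norm,
                           noise_multiplier, lr, rng):
    """One noisy gradient step with batch-level clipping (illustrative)."""
    n = len(per_example_grads)
    dim = len(params)
    # Average the gradients over the mini-batch first.
    mean_grad = [sum(g[i] for g in per_example_grads) / n for i in range(dim)]
    # Clip the aggregated gradient once (assumed meaning of "batch clipping").
    clipped = clip_to_norm(mean_grad, clip_norm)
    # Add Gaussian noise with standard deviation noise_multiplier * clip_norm.
    sigma = noise_multiplier * clip_norm
    noisy = [c + rng.gauss(0.0, sigma) for c in clipped]
    # Plain gradient-descent update using the noisy, clipped gradient.
    return [p - lr * g for p, g in zip(params, noisy)]


rng = random.Random(0)
params = [0.0, 0.0]
grads = [[3.0, 4.0], [3.0, 4.0]]  # toy gradients; their mean has L2 norm 5
new_params = dp_sgd_batch_clip_step(params, grads, clip_norm=1.0,
                                    noise_multiplier=0.0, lr=1.0, rng=rng)
```

In this toy call the noise multiplier is set to zero so that the clipping effect is visible in isolation: the mean gradient is rescaled to norm 1 before the update. A real training run would use a positive noise multiplier chosen by a privacy accountant.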