Overcoming Challenges of Synthetic Data Generation

Kevin Fang, Vaikkunth Mugunthan, Vayd Ramkumar, Lalana Kagal

2022 IEEE International Conference on Big Data (Big Data), 2022

Abstract

Current methods of generating synthetic data with Generative Adversarial Networks (GANs) have several shortcomings. First, they tend to emulate only certain attributes of the original dataset. Second, they do not effectively model unbalanced discrete columns, long tails, or bimodal distributions of continuous columns. Lastly, they often do not consider the potential for information leakage from the generated data. We propose UniformGAN, a GAN with a novel uniform loss function, which addresses these challenges and provides strong privacy guarantees using differential privacy. UniformGAN pre-processes datasets to transform each column into a uniform distribution. We use a modified Deep Convolutional Generative Adversarial Network (DCGAN) architecture in which we replace ReLU activation functions with the more robust SeLU, which has significantly better performance and better convergence properties, and apply Dense-Sparse-Dense training to our network. We also use differential privacy to add noise to the discriminator during training. Along with UniformGAN, we provide a configurable command-line tool to generate synthetic datasets and evaluate them on numerous metrics. It allows users to generate synthetic datasets from CTGAN, TableGAN, UniformGAN, or a custom framework and to analyze the resulting datasets. This tool will help data scientists and industry users compare different synthetic dataset generation models and enable them to improve existing methods. We evaluated UniformGAN on multiple datasets, including the Adult, Covertype, and Credit Kaggle datasets, as well as two insurance-related Kaggle datasets. The results show that, on datasets containing a large number of continuous columns, UniformGAN outperforms other methods by producing synthetic data whose correlations and distributions are similar to those of the original dataset while ensuring privacy.
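
The abstract does not spell out how each column is mapped to a uniform distribution. As a minimal sketch of one common way to do this, assuming a rank-based empirical CDF (the function names `to_uniform` and `from_uniform` are hypothetical and are not taken from the paper or its tool), a continuous column could be handled as follows:

```python
import numpy as np

def to_uniform(column):
    """Map a 1-D numeric column into (0, 1) via its empirical CDF.

    Assumption: a rank transform stands in for the paper's pre-processing,
    whose exact form is not given in the abstract.
    """
    values = np.asarray(column, dtype=float)
    sorted_values = np.sort(values)
    # Average the left and right insertion ranks so tied values share one rank.
    left = np.searchsorted(sorted_values, values, side="left")
    right = np.searchsorted(sorted_values, values, side="right")
    uniform = (left + right) / (2.0 * len(values))
    return uniform, sorted_values

def from_uniform(uniform, sorted_values):
    """Map values in [0, 1] back to the original scale via empirical quantiles."""
    clipped = np.clip(np.asarray(uniform, dtype=float), 0.0, 1.0)
    return np.quantile(sorted_values, clipped)
```

Under this assumption, a long-tailed or bimodal column becomes approximately flat on (0, 1) before training, and generated values can be mapped back through the stored quantiles afterwards. The inverse interpolates between observed values, so the round trip is approximate rather than exact.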

Keywords

Synthetic data generation, Generative Adversarial Networks (GANs), Differential privacy
