Distribution-Driven, Embedded Synthetic Data Generation System and Tool for RDBMS

2019 IEEE 35th International Conference on Data Engineering Workshops (ICDEW), 2019

Abstract
Many self-managing relational database management systems (RDBMS) need to programmatically generate synthetic data to train machine learning models. This paper proposes the concept of a shadow database, together with a framework for deriving a shadow database from a production database so that it matches the distribution properties of the source data. Moreover, we have designed and implemented an embedded synthetic data generation tool that takes a data distribution profile as input and generates a shadow database according to the histograms of the source data. The distribution profile is passed into the tool either through an export-import mechanism or as a JSON string. The shadow database can be scaled to be larger or smaller than the original database and serves as a testbed for training learning models. Unlike most other data generation tools, our tool is implemented as SQL procedures that can be embedded in the underlying RDBMS.
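To make the histogram-driven idea concrete, here is a minimal sketch in Python, not the paper's SQL implementation. It assumes a hypothetical JSON profile layout (`table`, `row_count`, and per-column `edges` and `counts`), picks a bucket in proportion to its count, and then draws a uniform value inside that bucket. The names `PROFILE_JSON`, `generate_column`, and `generate_shadow_table`, and the `scale` parameter, are illustrative assumptions. Columns are sampled independently, so cross-column correlations are not preserved in this sketch.

```python
import json
import random

# Hypothetical distribution profile: one histogram per column.
# The field names are assumptions for illustration, not the paper's format.
PROFILE_JSON = """
{
  "table": "orders",
  "row_count": 1000,
  "columns": {
    "amount": {"edges": [0, 10, 50, 100], "counts": [600, 300, 100]}
  }
}
"""


def generate_column(hist, n_rows, rng):
    """Draw n_rows values that follow the given histogram."""
    edges, counts = hist["edges"], hist["counts"]
    # Choose a bucket index for each row, weighted by the bucket's count.
    buckets = rng.choices(range(len(counts)), weights=counts, k=n_rows)
    # Within a bucket, assume values are uniformly distributed.
    return [rng.uniform(edges[b], edges[b + 1]) for b in buckets]


def generate_shadow_table(profile_json, scale=1.0, seed=0):
    """Generate synthetic columns; scale grows or shrinks the row count."""
    profile = json.loads(profile_json)
    rng = random.Random(seed)
    n_rows = int(round(profile["row_count"] * scale))
    return {
        name: generate_column(hist, n_rows, rng)
        for name, hist in profile["columns"].items()
    }


if __name__ == "__main__":
    shadow = generate_shadow_table(PROFILE_JSON, scale=2.0)
    print(len(shadow["amount"]), "synthetic rows generated")
```

Calling `generate_shadow_table(PROFILE_JSON, scale=0.5)` would instead produce a shadow table half the size of the source, with the same bucket proportions in expectation.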
Keywords
data distribution, histogram, shadow database, synthetic data generation, machine learning