On integrating the number of synthetic data sets $m$ into the 'a priori' synthesis approach

arXiv (Cornell University)(2022)

引用 0|浏览0
暂无评分
摘要
Until recently, multiple synthetic data sets were always released to analysts, to allow valid inferences to be obtained. However, under certain conditions - including when saturated count models are used to synthesize categorical data - single imputation ($m=1$) is sufficient. Nevertheless, increasing $m$ causes utility to improve, but at the expense of higher risk, an example of the risk-utility trade-off. The question, therefore, is: which value of $m$ is optimal with respect to the risk-utility trade-off? Moreover, the paper considers two ways of analysing categorical data sets: as they have a contingency table representation, multiple categorical data sets can be averaged before being analysed, as opposed to the usual way of averaging post-analysis. This paper also introduces a pair of metrics, $\tau_3(k,d)$ and $\tau_4(k,d)$, that are suited for assessing disclosure risk in multiple categorical synthetic data sets. Finally, the synthesis methods are demonstrated empirically.
更多
查看译文
关键词
synthetic data sets,synthesis
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要