A comparison of synthetic data generation and federated analysis for performing an international assessment of gender effects on cardiovascular health

Z. Azizi, S. Lindner, Y. Shiba, V. Raparelli, C. Norris, K. Kublickiene, M Trinidad Herrero, A. Kautzky-Willer, P. Klimek, K. El Emam, L. Pilote, t Investigators

Canadian Journal of Cardiology (2022)

Abstract
Background
Cardiovascular diseases (CVD) are the leading cause of mortality and morbidity worldwide. Whether sex is associated with outcomes in patients with CVD differently across countries remains unknown. Assessing the interaction between sex, psycho-socio-cultural factors (gender), and country requires merging country-specific databases. Privacy concerns are barriers to data access and sharing. Therefore, we assessed the feasibility of pooling data from Canadian and Austrian populations to assess country-level differences in the role of sex and gender in cardiovascular health (CVH) using federated analysis and data synthesis.

Methods and Results
The datasets used were from the Canadian Community Health Survey (CCHS) and the Austrian Health Interview Survey (ATHIS) in 2014. Only the CCHS dataset was synthesized, using sequential classification and regression trees. The privacy of the CCHS synthetic data was assessed using a membership disclosure test and F1 score; a low F1 score means the dataset can be deemed to carry low privacy risk. Once it was deemed to be non-personal information, the synthetic dataset was sent to the Austrian team for pooling and analysis. The analysis was performed on the pooled source ATHIS data and the synthetic CCHS data. The outcome variable was CVH, calculated through a modified CANHEART index in both countries. The utility of the pooled dataset was evaluated by comparing its regression model with the model constructed from federated analysis using DataSHIELD.

Setting up the necessary servers in multiple locations, with the requisite security protocols, for the federated analysis took a significant amount of time. In the privacy risk assessment of the synthetic data, the largest membership disclosure F1 score across different attack datasets was 0.001, indicating low privacy risk. A comparison of the marginal distributions between males and females showed consistent results in the federated analysis and the pooled analysis of synthetic data. In the multivariate analysis of the main effects, the parameter estimates of the federated and pooled analyses were directionally the same as for the univariate analysis. In the multivariate analyses considering country interactions, used to determine whether country moderates the relationship between the other variables and CVH, the impact of several factors differed between countries (Table 1).
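The abstract does not describe how the membership disclosure test was carried out, so the following is only a minimal sketch of one common way such an F1 score can be computed, not the authors' procedure. It assumes a hypothetical attack dataset in which some records were in the training data and some were not, and an adversary who claims membership whenever a record lies within a Hamming distance threshold of some synthetic record. The names `hamming`, `membership_f1`, and `max_distance`, and the toy records, are illustrative assumptions.

```python
from typing import Sequence

Record = Sequence[object]


def hamming(a: Record, b: Record) -> int:
    """Number of attribute positions at which two records differ."""
    return sum(x != y for x, y in zip(a, b))


def membership_f1(
    attack: Sequence[Record],
    is_member: Sequence[bool],
    synthetic: Sequence[Record],
    max_distance: int = 0,
) -> float:
    """F1 score of a simple distance-based membership attack (assumed setup).

    A record is *claimed* to be a training member if any synthetic record
    is within `max_distance` attribute mismatches of it. The claims are
    then scored against the true membership labels.
    """
    tp = fp = fn = 0
    for record, member in zip(attack, is_member):
        claimed = any(hamming(record, s) <= max_distance for s in synthetic)
        if claimed and member:
            tp += 1          # correctly identified training member
        elif claimed and not member:
            fp += 1          # non-member wrongly claimed as member
        elif member:
            fn += 1          # training member the attack missed
    if tp == 0:
        return 0.0           # precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example with made-up records (sex, age, smoker); not study data.
synthetic = [("F", 40, "yes"), ("M", 55, "no")]
attack = [("F", 40, "yes"), ("M", 55, "no"), ("F", 30, "no"), ("M", 60, "yes")]
is_member = [True, False, True, False]
print(membership_f1(attack, is_member, synthetic))  # prints 0.5
```

In this toy case the attack is right about one member, wrong about one non-member, and misses one member, so precision and recall are both 0.5. A score near zero, such as the 0.001 reported above, would mean the adversary can identify almost no true members from the synthetic data.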