Optimizing large real-world data analysis with parquet files in R: A step-by-step tutorial

Pharmacoepidemiology and Drug Safety (2024)

Abstract
Purpose: The use of open-source programming languages can facilitate open science practices in real-world evidence (RWE) studies. Real-world studies often rely on big data, which complicates the use of such languages. We demonstrate an efficient approach that enables RWE researchers to use R for RWE analysis tasks, from cohort building to reporting.

Methods: Using the Merative MarketScan data (2017-2019), we developed an R function that transforms the data into parquet format for use in R. We then compared data size before and after transformation. We compared the transformed data in R with the original data in terms of numerical consistency and the running time required to complete simple exploratory tasks. To show how the transformed databases can be used in practice, we conducted a simplified replication of an active comparator new user study from the literature. All code is available on GitHub.

Results: Our approach was highly efficient in data storage: the converted files ranged from 10% to 43% of the size of the original data files. Exploratory tasks generally ran faster on the converted data in R than on the original data in SAS. We showed, through example, how the converted data can be used efficiently to implement an RWE study.

Conclusion: We demonstrate a free and efficient solution for using open-source programming languages with large real-world databases, which can facilitate the adoption of open science practices.
Keywords
big data, cohort building, open science, pharmacoepidemiology, R, real-world data