Towards Scalable Dataframe Systems

Proceedings of the VLDB Endowment (2020)

Abstract
Dataframes are a popular and convenient abstraction to represent, structure, clean, and analyze data during exploratory data analysis. Despite the success of dataframe libraries in R and Python (pandas), dataframes face performance issues even on moderately large datasets. In this vision paper, we take the first steps towards formally defining dataframes, characterizing their properties, and outlining a research agenda towards making dataframes more interactive at scale. We draw on tools and techniques from the database community, and describe ways they may be adapted to serve dataframe systems, as well as the new challenges therein. We also describe our current progress toward a scalable dataframe system, Modin, which is already up to 30$\times$ faster than pandas in preliminary case studies, while enabling unmodified pandas code to run as-is. In its first 18 months, Modin is already used by over 60 downstream projects, has over 250 forks, and 3,900 stars on GitHub, indicating the pressing need for pursuing this agenda.
Keywords
scalable dataframe systems