AggFirstJoin: Optimizing Geo-Distributed Joins using Aggregation-Based Transformations

2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid), 2023

Geo-distributed analytics (GDA) processes data stored across geographically distributed sites, which requires transferring data over wide area network (WAN) links. WAN links are highly constrained and heterogeneous, making such transfers slow and costly. To tackle this issue, recent approaches have proposed WAN-aware scheduling and placement of geo-distributed analytics tasks. However, computing joins in a geo-distributed setting remains a challenging problem. In this work, we propose AggFirstJoin, an approach that minimizes the cost of geo-distributed joins using a theoretically sound query transformation technique. Our optimization takes a combined view of the join and aggregation operations, which are often part of the same query, and pushes a transformed aggregation ahead of the join in a way that produces the same results as the original query. We augment the query transformation with WAN-aware task placement to further reduce query execution time and with Bloom filtering to further reduce WAN usage. We implement our technique on top of Apache Spark, a popular engine for big data analytics. We evaluate it extensively using synthetic, TPC-H, and AMPLab Big Data Benchmark datasets on a real geo-distributed testbed on AWS as well as on an emulated testbed. Our evaluations show that the technique achieves up to a 300x reduction in query execution time and a 200x reduction in WAN usage compared to state-of-the-art GDA techniques.
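The idea of pushing an aggregation ahead of a join can be illustrated with a minimal sketch. This is not the AggFirstJoin transformation itself, whose details the abstract does not give; it is the textbook special case of a `SUM` grouped by the join key, written in plain Python rather than Spark. The function names and the sample rows are hypothetical. Each site reduces its table to one row per key (a partial sum on one side, a row count on the other), so only those compact summaries would need to cross the WAN, and the final result matches the join-then-aggregate baseline.

```python
from collections import defaultdict


def join_then_aggregate(left, right):
    """Baseline: equi-join on the key, then SUM(left value) per key."""
    totals = defaultdict(int)
    for left_key, left_value in left:
        for right_key, _ in right:
            if left_key == right_key:
                totals[left_key] += left_value
    return dict(totals)


def aggregate_then_join(left, right):
    """Pre-aggregate each side per key, then join the small summaries.

    Assumption: the query is SUM(left value) grouped by the join key,
    so each left row contributes once per matching right row.
    """
    # Partial sum per key, computed where the left table lives.
    left_sum = defaultdict(int)
    for key, value in left:
        left_sum[key] += value

    # Row count per key, computed where the right table lives.
    right_count = defaultdict(int)
    for key, _ in right:
        right_count[key] += 1

    # Join the summaries: scale each partial sum by the match count.
    return {
        key: left_sum[key] * right_count[key]
        for key in left_sum
        if key in right_count
    }


# Hypothetical sample data: (key, value) rows held at two different sites.
left_rows = [("a", 1), ("a", 2), ("b", 5), ("c", 7)]
right_rows = [("a", "x"), ("a", "y"), ("b", "z"), ("d", "w")]

baseline = join_then_aggregate(left_rows, right_rows)
transformed = aggregate_then_join(left_rows, right_rows)
```

In this sketch both functions return the same per-key totals, while the second one exchanges at most one summary row per key instead of the raw tables. Other aggregates and group-by columns need different transformed aggregations, which is the part the paper addresses.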
Key words
geo-distributed analytics, edge cloud infrastructure, data join, data aggregation