This paper presents epiC, a scalable and extensible system for processing Big Data. epiC addresses Big Data's volume challenge through parallelization and tackles the variety challenge by decoupling the concurrent programming model from the data processing model.
epiC: an extensible and scalable system for processing Big Data
VLDB JOURNAL, no. 1 (2016): 3-26
The Big Data problem is characterized by the so-called 3V features: volume (a huge amount of data), velocity (a high data ingestion rate), and variety (a mix of structured data, semi-structured data, and unstructured data). The state-of-the-art solutions to the Big Data problem are largely based on the MapReduce framework and its open-source implementation, Hadoop.
- Many of today’s enterprises are encountering Big Data problems. A Big Data problem has three distinct characteristics (the so-called 3V features): the data volume is huge; the data type is diverse (a mixture of structured, semi-structured, and unstructured data); and the data-producing velocity is very high.
- These 3V features pose a grand challenge to traditional data processing systems, since those systems either cannot scale to huge data volumes in a cost-effective way or fail to handle data with a variety of types.
- Systems like Dryad and Pregel are built to process those kinds of analytical tasks.
- The advantage of MapReduce is that it successfully tackles the data volume challenge and is resilient to machine failures.
- Since continuously querying server-side information is a common communication pattern in epiC, we develop a new remote procedure call (RPC) scheme to eliminate the pulling interval in successive RPC calls for low latency data processing
- We found that TTL-RPC significantly improves the performance of small jobs and reduces startup costs.
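The idea of eliminating the pulling interval between successive RPC calls can be sketched as a TTL-based (long-held) request: the server parks each call for up to a time-to-live and replies the moment an update arrives, instead of making the client re-poll on a fixed interval. This is a minimal illustrative sketch, not epiC's actual RPC implementation; all names here are hypothetical.

```python
import queue
import threading

class TTLRPCServer:
    """Toy TTL-style RPC endpoint: poll() blocks up to ttl seconds and
    returns as soon as an update is published, so clients need no fixed
    polling interval between successive calls."""

    def __init__(self):
        self._updates = queue.Queue()

    def publish(self, value):
        # Server-side state change; wakes up any parked poll() immediately.
        self._updates.put(value)

    def poll(self, ttl):
        try:
            return self._updates.get(timeout=ttl)  # answered early on update
        except queue.Empty:
            return None  # TTL expired with no news; client re-issues the call

server = TTLRPCServer()
# Simulate a server-side update arriving 50 ms after the client's call.
threading.Timer(0.05, server.publish, args=("state-changed",)).start()
result = server.poll(ttl=1.0)  # returns as soon as the update lands
```

Compared with fixed-interval polling, the latency of observing an update is bounded by delivery time rather than by the polling period, which is why such a scheme helps small, short-lived jobs most.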
- Instead of using a one-size-fits-all solution, we propose to use different data processing models to process different data, and to employ a common concurrent programming model to parallelize all of these data processing tasks.
- The authors evaluate the performance of epiC on different kinds of data processing tasks, including unstructured data processing, relational data processing and graph processing.
- The authors benchmark epiC against Hadoop, an open-source implementation of MapReduce, for processing unstructured and relational data, and against GPS, an open-source implementation of Pregel, for graph processing.
- The authors run additional experiments to benchmark epiC with two new in-memory data processing systems, namely Shark and Impala.
- The nodes within each rack are connected by a 1 Gbps switch.
- The two racks are connected by a 10 Gbps cluster switch.
- Due to JVM overheads, the tested Java program can only read local files at 70–80 MB/s.
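A quick back-of-envelope check (my own arithmetic, not from the paper) shows why the 70–80 MB/s local read rate matters relative to the cluster setup above: a 1 Gbps rack link delivers at most 125 MB/s, so within a rack each node is I/O-bound on local reads rather than network-bound.

```python
# 1 Gbps = 1e9 bits/s; divide by 8 for bytes, by 1e6 for MB.
rack_link_mb_s = 1e9 / 8 / 1e6   # 125.0 MB/s per 1 Gbps link
local_read_mb_s = (70 + 80) / 2  # midpoint of the measured 70-80 MB/s range

# Local file reads are slower than the rack link, so per-node throughput
# is limited by disk/JVM read speed, not by the 1 Gbps switch.
bottleneck_is_local_io = local_read_mb_s < rack_link_mb_s
```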
- [Figure residue: epiC's relational query decomposition. A master distributes partition information for the partial results (e.g., of Customer, JoinView1, JoinView2, and group-by aggregates). SingleTableUnit tasks scan partitioned tables (select * from S partitioned by S.key; select * from T partitioned by T.foreignkey); intermediate partitions such as JoinView2 (Customer join JoinView1) and Tmp2 (select * from T where T.foreignkey in ...) are created; JoinUnit tasks then compute select * from S, Tmp2 where S.key = Tmp2.foreignkey, and AggregateUnit tasks process the groups.]
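The two-step join decomposition sketched in the figure fragments above (filter T down to Tmp2 by the keys present in S, then join S with the smaller Tmp2) can be illustrated with a toy in-memory version. Table contents here are hypothetical, and this is plain Python rather than epiC's unit implementation.

```python
# Hypothetical rows for tables S and T.
S = [{"key": 1, "a": "x"}, {"key": 2, "a": "y"}]
T = [{"foreignkey": 1, "b": 10}, {"foreignkey": 3, "b": 30}]

# Step 1 (SingleTableUnit-style): build Tmp2 by keeping only T rows whose
# foreign key appears in S's partition -- "select * from T where
# T.foreignkey in (keys of S)".
s_keys = {row["key"] for row in S}
tmp2 = [row for row in T if row["foreignkey"] in s_keys]

# Step 2 (JoinUnit-style): join S with the much smaller Tmp2 on
# S.key = Tmp2.foreignkey.
joined = [
    {**s_row, **t_row}
    for s_row in S
    for t_row in tmp2
    if s_row["key"] == t_row["foreignkey"]
]
```

The point of the decomposition is that step 1 shrinks T before the join, so step 2 moves and matches far fewer rows.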
- To handle multi-structured data, users process each data type with the most appropriate data processing model and wrap those computations in a simple unit interface.
- Programs written in this way can be automatically executed in parallel by epiC’s concurrent runtime system.
- The benchmarking of epiC against Hadoop and GPS confirms its efficiency
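The unit interface and concurrent runtime described above might look like the following sketch. All names are hypothetical (epiC itself is a Java system), and the "runtime" here is just a thread pool; the sketch only shows the shape of the abstraction: one common run-per-partition interface, parallelized by a shared runtime, with model-specific logic inside each unit.

```python
from concurrent.futures import ThreadPoolExecutor

class Unit:
    """Hypothetical epiC-style unit: wraps one data processing model
    behind a common run(partition) interface."""

    def run(self, partition):
        raise NotImplementedError

class WordCountUnit(Unit):
    # A unit for unstructured data: count words within one partition.
    def run(self, partition):
        counts = {}
        for line in partition:
            for word in line.split():
                counts[word] = counts.get(word, 0) + 1
        return counts

def run_parallel(unit, partitions):
    # Toy stand-in for the concurrent runtime: execute the same unit on
    # every partition concurrently, then merge the partial results.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(unit.run, partitions))
    merged = {}
    for partial in partials:
        for key, value in partial.items():
            merged[key] = merged.get(key, 0) + value
    return merged

result = run_parallel(WordCountUnit(), [["big data", "big"], ["data data"]])
```

Because parallelism lives entirely in `run_parallel`, a relational or graph unit could replace `WordCountUnit` without touching the runtime, which is the decoupling the paper argues for.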
- Big Data processing systems can be classified into the following categories: (1) parallel databases, (2) MapReduce-based systems, (3) DAG-based data processing systems, (4) actor-like systems, and (5) hybrid systems. A comprehensive survey of these systems has been published, and a new benchmark, BigBench, was also recently proposed to evaluate and compare the performance of different Big Data processing systems.
Parallel databases are mainly designed for processing structured data sets, where each data item (called a record) strictly follows a relational schema [10, 9, 12]. These systems employ data partitioning and partitioned execution techniques for high-performance query processing. Recent parallel database systems also employ a column-oriented processing strategy to further improve the performance of analytical workloads such as OLAP queries. Parallel databases have been shown to scale to at least petabyte-sized datasets, but at a relatively high cost in hardware and software. The main drawback of parallel databases is that those systems cannot effectively process unstructured data; however, there are recent proposals that try to integrate Hadoop into database systems to mitigate the problem. epiC, on the other hand, has been designed and built from scratch to provide the scalability, efficiency, and flexibility found in both platforms.
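The data partitioning technique mentioned above can be illustrated with a minimal hash-partitioning sketch (my own toy example, not the paper's or any specific system's implementation): each record is routed to a node by hashing its partitioning key, so all records with the same key land on the same node and per-node operators can run in parallel.

```python
def hash_partition(records, key, n_nodes):
    # Horizontal hash partitioning: route each record to a node by hashing
    # its partitioning key. Co-locating equal keys lets later scans, joins,
    # and aggregates on that key run node-locally, in parallel.
    partitions = [[] for _ in range(n_nodes)]
    for record in records:
        partitions[hash(record[key]) % n_nodes].append(record)
    return partitions

# Hypothetical table: six order records keyed by customer id.
orders = [{"cust_id": i, "amount": 10 * i} for i in range(6)]
parts = hash_partition(orders, "cust_id", n_nodes=3)
```

A real system would hash on stable byte representations and handle skewed keys; the sketch only shows the routing idea.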
- The work in this paper was supported by the Singapore Ministry of Education Grant No R-252-000-454-112
- The Hadoop official website. http://hadoop.apache.org/.
- The Storm project official website. http://storm-project.net/.
- A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, and A. Rasin. HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. PVLDB, 2(1), Aug. 2009.
- D. Battre, S. Ewen, F. Hueske, O. Kao, V. Markl, and D. Warneke. Nephele/pacts: a programming model and execution framework for web-scale analytical processing. In SoCC, 2010.
- J. L. Bentley and R. Sedgewick. Fast algorithms for sorting and searching strings. In SODA, 1997.
- Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst. HaLoop: efficient iterative data processing on large clusters. PVLDB, 3(1-2), Sept. 2010.
- B. Chattopadhyay, L. Lin, W. Liu, S. Mittal, P. Aragonda, V. Lychagina, Y. Kwon, and M. Wong. Tenzing: a SQL implementation on the MapReduce framework. In VLDB, 2011.
- J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1), Jan. 2008.
- D. J. DeWitt, R. H. Gerber, G. Graefe, M. L. Heytens, K. B. Kumar, and M. Muralikrishna. Gamma - a high performance dataflow database machine. In VLDB, 1986.
- D. J. DeWitt and J. Gray. Parallel database systems: The future of high performance database systems. Commun. ACM, 35(6), 1992.
- D. J. DeWitt, A. Halverson, R. Nehme, S. Shankar, J. Aguilar-Saborit, A. Avanes, M. Flasza, and J. Gramling. Split query processing in polybase. In SIGMOD, 2013.
- S. Fushimi, M. Kitsuregawa, and H. Tanaka. An overview of the system software of a parallel relational database machine grace. In VLDB, 1986.
- A. Ghazal, T. Rabl, M. Hu, F. Raab, M. Poess, A. Crolotte, and H.-A. Jacobsen. BigBench: Towards an industry standard benchmark for big data analytics. In SIGMOD, 2013.
- C. Hewitt, P. Bishop, and R. Steiger. A universal modular actor formalism for artificial intelligence. In IJCAI, 1973.
- M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. SIGOPS Oper. Syst. Rev., 41(3), Mar. 2007.
- M. Isard and Y. Yu. Distributed data-parallel computing using a high-level programming language. In SIGMOD, 2009.
- D. Jiang, B. C. Ooi, L. Shi, and S. Wu. The performance of MapReduce: an in-depth study. PVLDB, 3(1-2), Sept. 2010.
- D. Jiang, A. K. H. Tung, and G. Chen. MAP-JOIN-REDUCE: Toward scalable and efficient data analysis on large clusters. IEEE TKDE, 23(9), Sept. 2011.
- F. Li, B. C. Ooi, M. T. Ozsu, and S. Wu. Distributed data management using MapReduce. ACM Comput. Surv. (to appear), 46(3), 2014.
- G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD, 2010.
- L. Neumeyer, B. Robbins, A. Nair, and A. Kesari. S4: Distributed stream computing platform. In ICDMW, 2010.
- L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, November 1999.
- A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. A comparison of approaches to large-scale data analysis. In SIGMOD, 2009.
- S. Salihoglu and J. Widom. GPS: A graph processing system. In SSDBM (Technical Report), 2013.
- R. Sinha and J. Zobel. Cache-conscious sorting of large sets of strings with dynamic tries. J. Exp. Algorithmics, 9, Dec. 2004.
- M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. O’Neil, P. O’Neil, A. Rasin, N. Tran, and S. Zdonik. C-Store: a column-oriented DBMS. In VLDB, 2005.
- X. Su and G. Swart. Oracle in-database Hadoop: when MapReduce meets RDBMS. In SIGMOD, 2012.
- A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive: a warehousing solution over a map-reduce framework. PVLDB, 2(2), Aug. 2009.
- S. Wu, F. Li, S. Mehrotra, and B. C. Ooi. Query optimization for massively parallel data processing. In SoCC, 2011.
- H. Yang, A. Dasdan, R. Hsiao, and D. S. Parker. Map-Reduce-Merge: simplified relational data processing on large clusters. In SIGMOD, 2007.
- M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI’12, 2012.