AI helps you reading Science

AI generates interpretation videos

AI extracts and analyses the key points of the paper to generate videos automatically


pub
Go Generating

AI Traceability

AI parses the academic lineage of this thesis


Master Reading Tree
Generate MRT

AI Insight

AI extracts a summary of this paper


Weibo:
This paper presents epiC, a scalable and extensible system for processing BigData. epiC solves BigData’s data volume challenge by parallelization and tackles the data variety challenge by decoupling the concurrent programming model and the data processing model

epiC: an extensible and scalable system for processing Big Data

VLDB JOURNAL, no. 1 (2016): 3-26

Cited by: 7|Views324
WOS

Abstract

The Big Data problem is characterized by the so-called 3V features: volume-a huge amount of data, velocity-a high data ingestion rate, and variety-a mix of structured data, semi-structured data, and unstructured data. The state-of-the-art solutions to the Big Data problem are largely based on the MapReduce framework (aka its open source i...More

Code:

Data:

0
Introduction
  • Many of today’s enterprises are encountering the Big Data problems. A Big Data problem has three distinct characteristics (so called 3V features): the data volume is huge; the data type is diverse (mixture of structured data, semi-structured data and unstructured data); and the data producing velocity is very high.
  • A Big Data problem has three distinct characteristics: the data volume is huge; the data type is diverse; and the data producing velocity is very high
  • These 3V features pose a grand challenge to traditional data processing systems since these systems either cannot scale to the huge data volume in a cost effective way or fail to handle data with variety of types [3][7].
  • Systems like Dryad [15] and Pregel [20] are built to process those kinds of analytical tasks
Highlights
  • Many of today’s enterprises are encountering the Big Data problems
  • The advantage of MapReduce is that the system tackles the data volume challenge successfully and is resilient to machine failures [8]
  • Since continuously querying server-side information is a common communication pattern in epiC, we develop a new remote procedure call (RPC) scheme to eliminate the pulling interval in successive RPC calls for low latency data processing
  • We found that TTLRPC significantly improves the performance of small jobs and reduces startup costs
  • Instead of using a one-size-fit-all solution, we propose to use different data processing models to process different data and employ a common concurrent programming model to parallelize all those data processing
  • This paper presents epiC, a scalable and extensible system for processing BigData. epiC solves BigData’s data volume challenge by parallelization and tackles the data variety challenge by decoupling the concurrent programming model and the data processing model
Methods
  • The authors evaluate the performance of epiC on different kinds of data processing tasks, including unstructured data processing, relational data processing and graph processing.
  • The authors benchmark epiC against Hadoop, an open source implementation of MapReduce for processing unstructured data and relational data and GPS [24], an open source implementation of Pregel [20] for graph processing, respectively.
  • The authors run additional experiments to benchmark epiC with two new in-memory data processing systems, namely Shark and Impala.
  • The nodes within each rack are connected by a 1 Gbps switch.
  • The two racks are connected by a 10 Gbps cluster switch.
  • Due to the JVM costs, the tested Java program can only read local files at 70 ∼ 80 MB/sec
Results
  • Partial Results of

    Customer/Lineitem

    Master network Partition info of the Partial Results of Customer and JoinView1

    Create Partition JoinView2 as (Customer join JoinView1).
  • Master network Partition info of the Partial Results of Customer and JoinView1.
  • Master network Partition info of the Partial Results of JoinView2.
  • Master network Partition info of Groups AggregateUnit.
  • Partial Results of Group By. SingleTableUnit select * from S partitioned by S.key select * from T partitioned by T.foreignkey Table S and T.
  • Create Partition Tmp2 as(select * from T where T.foreignkey in) Table T and Tmp. JoinUnit select * from S, Tmp2 where S.key =Tmp2.foreignkey Table S and Tmp2
Conclusion
  • This paper presents epiC, a scalable and extensible system for processing BigData. epiC solves BigData’s data volume challenge by parallelization and tackles the data variety challenge by decoupling the concurrent programming model and the data processing model.
  • To handle a multi-structured data, users process each data type with the most appropriate data processing model and wrap those computations in a simple unit interface.
  • Programs written in this way can be automatically executed in parallel by epiC’s concurrent runtime system.
  • The benchmarking of epiC against Hadoop and GPS confirms its efficiency
Related work
  • Big Data processing systems can be classified into the following categories: 1) Parallel Databases, 2) MapReduce based systems, 3) DAG based data processing systems, 4) Actor-like systems and 5) hybrid systems. A comprehensive survey could be found in [19], and a new benchmark called BigBench [13], was also recently proposed to evaluate and compare the performance of different big data processing systems.

    Parallel databases are mainly designed for processing structured data sets where each data (called a record) strictly forms a relational schema [10, 9, 12]. These systems employ data partitioning and partitioned execution techniques for high performance query processing. Recent parallel database systems also employ the columnoriented processing strategy to even improve the performance of analytical workloads such as OLAP queries [26]. Parallel databases have been shown to scale to at least peta-byte dataset but with a relatively high cost on hardware and software [3]. The main drawback of parallel databases is that those system cannot effectively process unstructured data. However, there are recent proposals trying to integrate Hadoop into database systems to mitigate the problem [27]. Our epiC, on the other hand, has been designed and built from scratch to provide the scalability, efficiency and flexibility found in both platforms.
Funding
  • The work in this paper was supported by the Singapore Ministry of Education Grant No R-252-000-454-112
Reference
  • The hadoop offical website. http://hadoop.apache.org/.
    Findings
  • The storm project offical website. http://storm-project.net/.
    Findings
  • A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, and A. Rasin. HadoopDB: an architectural hybrid of mapreduce and dbms technologies for analytical workloads. PVLDB, 2(1), Aug. 2009.
    Google ScholarLocate open access versionFindings
  • D. Battre, S. Ewen, F. Hueske, O. Kao, V. Markl, and D. Warneke. Nephele/pacts: a programming model and execution framework for web-scale analytical processing. In SoCC, 2010.
    Google ScholarLocate open access versionFindings
  • J. L. Bentley and R. Sedgewick. Fast algorithms for sorting and searching strings. In SODA, 1997.
    Google ScholarLocate open access versionFindings
  • Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst. HaLoop: efficient iterative data processing on large clusters. VLDB, 3(1-2), Sept. 2010.
    Google ScholarLocate open access versionFindings
  • B. Chattopadhyay, L. Lin, W. Liu, S. Mittal, P. Aragonda, V. Lychagina, Y. Kwon, and M. Wong. Tenzing a SQL implementation on the MapReduce framework. In VLDB, 2011.
    Google ScholarLocate open access versionFindings
  • J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1), Jan. 2008.
    Google ScholarLocate open access versionFindings
  • D. J. DeWitt, R. H. Gerber, G. Graefe, M. L. Heytens, K. B. Kumar, and M. Muralikrishna. Gamma - a high performance dataflow database machine. In VLDB, 1986.
    Google ScholarLocate open access versionFindings
  • D. J. DeWitt and J. Gray. Parallel database systems: The future of high performance database systems. Commun. ACM, 35(6), 1992.
    Google ScholarLocate open access versionFindings
  • D. J. DeWitt, A. Halverson, R. Nehme, S. Shankar, J. Aguilar-Saborit, A. Avanes, M. Flasza, and J. Gramling. Split query processing in polybase. In SIGMOD, 2013.
    Google ScholarLocate open access versionFindings
  • S. Fushimi, M. Kitsuregawa, and H. Tanaka. An overview of the system software of a parallel relational database machine grace. In VLDB, 1986.
    Google ScholarLocate open access versionFindings
  • A. Ghazal, T. Rabl, M. Hu, F. Raab, M. Poess, A. Crolotte, and H.-A. Jacobsen. BigBench: Towards an industry standard benchmark for big data analytics. In SIGMOD, 2013.
    Google ScholarFindings
  • C. Hewitt, P. Bishop, and R. Steiger. A universal modular actor formalism for artificial intelligence. In IJCAI, 1973.
    Google ScholarLocate open access versionFindings
  • M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. SIGOPS Oper. Syst. Rev., 41(3), Mar. 2007.
    Google ScholarLocate open access versionFindings
  • M. Isard and Y. Yu. Distributed data-parallel computing using a high-level programming language. In SIGMOD, 2009.
    Google ScholarLocate open access versionFindings
  • D. Jiang, B. C. Ooi, L. Shi, and S. Wu. The performance of MapReduce: an in-depth study. PVLDB, 3(1-2), Sept. 2010.
    Google ScholarLocate open access versionFindings
  • D. Jiang, A. K. H. Tung, and G. Chen. MAP-JOIN-REDUCE: Toward scalable and efficient data analysis on large clusters. IEEE TKDE, 23(9), Sept. 2011.
    Google ScholarLocate open access versionFindings
  • F. Li, B. C. Ooi, M. T. Ozsu, and S. Wu. Distributed data management using MapReduce. ACM Comput. Surv. (to appear), 46(3), 2014.
    Google ScholarLocate open access versionFindings
  • G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD, 2010.
    Google ScholarLocate open access versionFindings
  • L. Neumeyer, B. Robbins, A. Nair, and A. Kesari. S4: Distributed stream computing platform. In ICDMW, 2010.
    Google ScholarLocate open access versionFindings
  • L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, November 1999.
    Google ScholarFindings
  • A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. A comparison of approaches to large-scale data analysis. In SIGMOD, 2009.
    Google ScholarLocate open access versionFindings
  • S. Salihoglu and J. Widom. GPS: A graph processing system. In SSDBM (Technical Report), 2013.
    Google ScholarFindings
  • R. Sinha and J. Zobel. Cache-conscious sorting of large sets of strings with dynamic tries. J. Exp. Algorithmics, 9, Dec. 2004.
    Google ScholarLocate open access versionFindings
  • M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. O’Neil, P. O’Neil, A. Rasin, N. Tran, and S. Zdonik. C-store: a column-oriented dbms. In VLDB, 2005.
    Google ScholarLocate open access versionFindings
  • X. Su and G. Swart. Oracle in-database Hadoop: when MapReduce meets RDBMS. In SIGMOD, 2012.
    Google ScholarLocate open access versionFindings
  • A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive: a warehousing solution over a map-reduce framework. VLDB, 2(2), Aug. 2009.
    Google ScholarLocate open access versionFindings
  • S. Wu, F. Li, S. Mehrotra, and B. C. Ooi. Query optimization for massively parallel data processing. In SoCC, 2011.
    Google ScholarLocate open access versionFindings
  • H. Yang, A. Dasdan, R. Hsiao, and D. S. Parker. Map-Reduce-Merge: simplified relational data processing on large clusters. In SIGMOD, 2007.
    Google ScholarLocate open access versionFindings
  • M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI’12, 2012.
    Google ScholarLocate open access versionFindings
Your rating :
0

 

Tags
Comments
数据免责声明
页面数据均来自互联网公开来源、合作出版商和通过AI技术自动分析结果,我们不对页面数据的有效性、准确性、正确性、可靠性、完整性和及时性做出任何承诺和保证。若有疑问,可以通过电子邮件方式联系我们:report@aminer.cn
小科