Large Scale Distributed Data Science Using Apache Spark

KDD 2015

Cited by 224 | Views 143
Abstract
Apache Spark is an open-source cluster computing framework for big data processing. It has emerged as the next-generation big data processing engine, overtaking Hadoop MapReduce, which helped ignite the big data revolution. Spark maintains MapReduce's linear scalability and fault tolerance but extends it in several important ways: it is much faster (up to 100 times faster for certain applications); it is much easier to program, owing to its rich APIs in Python, Java, and Scala (and shortly R) and its core data abstraction, the distributed data frame; and it goes far beyond batch applications to support a variety of compute-intensive tasks, including interactive queries, streaming, machine learning, and graph processing.

This tutorial will provide an accessible introduction to Spark and its potential to revolutionize academic and commercial data science practices.
Keywords
Distributed Systems, Hadoop, HDFS, MapReduce, Spark, Large Scale Machine Learning, Data Science