A Cloud System for Machine Learning Exploiting a Parallel Array DBMS

28th International Workshop on Database and Expert Systems Applications (DEXA), 2017

Cited by 4 | Views 14
Abstract
Computing machine learning models in the cloud remains a central problem in big data analytics. In this work, we introduce a cloud analytic system exploiting a parallel array DBMS based on a classical shared-nothing architecture. Our approach combines in-DBMS data summarization with mathematical processing in an external program. We study how to summarize a data set in parallel assuming a large number of processing nodes and how to further accelerate this step with GPUs. In contrast to most big data analytic systems, we do not use Java, HDFS, MapReduce or Spark: our system is programmed in C++ and C on top of a traditional Unix system. In our system, models are efficiently computed using a suite of innovative parallel matrix operators, which compute comprehensive statistical summaries of a large input data set (matrix) in one pass, leaving the remaining mathematically complex computations, on matrices that fit in RAM, to R. To be competitive with the Hadoop ecosystem (i.e., HDFS and Spark RDDs), we also introduce a parallel load operator for large matrices and an automated, yet flexible, cluster configuration in the cloud. Experiments compare our system with Spark, showing time improvements of orders of magnitude; a GPU with many cores widens the gap further. In summary, our system is a competitive solution.
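To make the one-pass summarization idea concrete, the following minimal C++ sketch (not the paper's actual operator code; names and the exact summary layout are assumptions) accumulates the sufficient statistics n, L = sum of rows, and Q = sum of outer products x_i * x_i^T. Partial summaries computed by each worker node are additive, so a final reduction yields a small global summary that can be handed to R for the mathematically complex part of the model computation.

// Minimal sketch, assuming a (n, L, Q) summary; not the paper's actual operator code.
#include <cstddef>
#include <vector>

struct Summary {
    std::size_t n = 0;                 // row count
    std::vector<double> L;             // d linear sums
    std::vector<double> Q;             // d x d cross-product sums, row-major
    explicit Summary(std::size_t d) : L(d, 0.0), Q(d * d, 0.0) {}

    // Accumulate one row x of length d: a single pass over the data.
    void add(const std::vector<double>& x) {
        const std::size_t d = L.size();
        ++n;
        for (std::size_t i = 0; i < d; ++i) {
            L[i] += x[i];
            for (std::size_t j = 0; j < d; ++j)
                Q[i * d + j] += x[i] * x[j];
        }
    }

    // Merge a partial summary from another worker node (summaries are additive).
    void merge(const Summary& other) {
        n += other.n;
        for (std::size_t i = 0; i < L.size(); ++i) L[i] += other.L[i];
        for (std::size_t i = 0; i < Q.size(); ++i) Q[i] += other.Q[i];
    }
};

int main() {
    Summary s(2);
    s.add({1.0, 2.0});
    s.add({3.0, 4.0});
    // Now s.n == 2, s.L == {4, 6}, s.Q == {10, 14, 14, 20}.
    return 0;
}

From such a summary, quantities like the covariance matrix or least-squares coefficients can be derived in R using only d x d matrices, which fit comfortably in RAM.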
Keywords
machine learning,matrix,DBMS,parallel,cloud,GPU