Distributed Gibbs Sampling and LDA Modelling for Large Scale Big Data Management on PySpark

2022 7th South-East Europe Design Automation, Computer Engineering, Computer Networks and Social Media Conference (SEEDA-CECNSM)

Abstract
Big data management methods are paramount in the modern era, as applications tend to create massive amounts of data from various sources. There is therefore an urgent need for adaptive, fast and robust frameworks that can handle massive datasets effectively. Distributed environments such as Apache Spark are noteworthy, as they handle such data by creating clusters in which a portion of the data is stored locally and results are returned through Resilient Distributed Datasets (RDDs). In this paper, a method for distributed marginal Gibbs sampling for the widely used latent Dirichlet allocation (LDA) model is implemented on PySpark, together with a Metropolis-Hastings random walker. The Distributed LDA (DLDA) algorithm splits a given dataset into P partitions and performs local LDA on each partition, for each document independently. Every n-th iteration, the local LDA models trained on the distinct partitions are combined to ensure that the model can converge. Experimental results are promising: the proposed system demonstrates final model quality comparable to sequential LDA and achieves significant speedups on massive datasets.
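The partition-local-sampling-then-merge scheme described in the abstract can be illustrated with a short PySpark sketch. The code below is a hedged illustration under assumed details, not the authors' implementation: it uses a toy corpus of word-id arrays, runs a few collapsed Gibbs sweeps per partition against a broadcast copy of the global topic-word counts, and merges each partition's count increments on the driver at every synchronization. All identifiers (K, V, ALPHA, BETA, local_gibbs, sweeps_per_sync) are illustrative assumptions, and the Metropolis-Hastings random-walk proposal mentioned in the abstract is omitted.

```python
# Illustrative sketch of partitioned Gibbs sampling for LDA on PySpark.
# Not the paper's DLDA code; hyperparameters and helpers are assumed.
import numpy as np
from pyspark import SparkContext

K = 10            # number of topics (assumed)
V = 5000          # vocabulary size (assumed)
ALPHA, BETA = 0.1, 0.01

def local_gibbs(doc_iter, global_nwt, n_sweeps):
    """Run n_sweeps of collapsed Gibbs sampling over one partition,
    using a private copy of the global topic-word counts."""
    docs = list(doc_iter)
    nwt = global_nwt.copy()          # local topic-word counts
    nt = nwt.sum(axis=1)             # per-topic totals
    for _ in range(n_sweeps):
        for _, words, z in docs:
            ndt = np.bincount(z, minlength=K)          # doc-topic counts
            for i, w in enumerate(words):
                t = z[i]
                ndt[t] -= 1; nwt[t, w] -= 1; nt[t] -= 1
                # full conditional p(z_i = t | everything else)
                p = (ndt + ALPHA) * (nwt[:, w] + BETA) / (nt + V * BETA)
                z[i] = t = np.random.choice(K, p=p / p.sum())
                ndt[t] += 1; nwt[t, w] += 1; nt[t] += 1
    # emit the updated documents plus this partition's count increment
    yield (docs, nwt - global_nwt)

if __name__ == "__main__":
    sc = SparkContext(appName="dlda-sketch")
    rng = np.random.default_rng(0)

    # toy corpus: (doc_id, word ids, random initial topic assignments)
    docs = [(d, rng.integers(V, size=50), rng.integers(K, size=50))
            for d in range(1000)]

    # global topic-word count table built from the initial assignments
    global_nwt = np.zeros((K, V))
    for _, words, z in docs:
        np.add.at(global_nwt, (z, words), 1)

    n_syncs, sweeps_per_sync = 10, 2   # combine local models every 2nd sweep
    for _ in range(n_syncs):
        bc = sc.broadcast(global_nwt)
        parts = (sc.parallelize(docs, numSlices=8)
                   .mapPartitions(lambda it: local_gibbs(it, bc.value,
                                                         sweeps_per_sync))
                   .collect())
        # merge step: add every partition's count delta to the global table
        for _, delta in parts:
            global_nwt += delta
        docs = [doc for part_docs, _ in parts for doc in part_docs]
    sc.stop()
```

In this sketch the merge simply sums the partitions' count deltas into the shared topic-word table, which keeps the global counts consistent because documents are disjoint across partitions; the actual DLDA synchronization strategy may differ.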
Keywords
Distributed Gibbs Sampling, Random Walker, Metropolis-Hastings, LDA, Big Data Management, PySpark