Compressed data direct computing for Chinese dataset on DCU

CCF TRANSACTIONS ON HIGH PERFORMANCE COMPUTING (2023)

Abstract
In the era of big data, information is growing at an explosive rate and takes increasingly varied forms. Managing and analyzing massive amounts of data scientifically and efficiently has therefore become an urgent problem for technology enterprises and government departments. Among the modern techniques proposed for handling data at large scale, text analytics directly on compression (TADOC) stands out for its innovative idea of operating directly on compressed data, and it has substantial potential in various applications. Meanwhile, DCU (Deep Computing Unit), a new Chinese domestic accelerator with high acceleration performance, is well suited to porting TADOC. This paper therefore proposes D-TADOC, a compressed data direct computing technology for Chinese datasets on DCU, which can effectively process Chinese data without decompression and visualize the analytics results. D-TADOC has three key components. First, we integrate a word segmentation tool with TADOC in the data preprocessing module, enabling TADOC to analyze not only English but also Chinese texts. Second, we design the parallel processing module on the DCU architecture. Third, we develop the result visualization module, which supports user-friendly presentation of the text analytics outcomes. We run D-TADOC on Sugon’s cloud computing service platform with diverse public datasets and evaluate its performance. The experimental results show that D-TADOC achieves an average speedup of 40.5× over the TADOC baseline on the CPU, demonstrating both the adaptability of DCU to TADOC tasks and the efficiency of D-TADOC for compressed text analytics.
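The idea of analyzing text without decompressing it can be illustrated with a minimal sketch. It assumes a grammar-style compressed form in which repeated token sequences are factored into reusable rules; the abstract does not describe D-TADOC's actual data layout, so the toy `rules` table and the helper names `rule_weights`, `word_count` and `expand` are hypothetical. The tokens stand in for the output of a word segmentation step, and the sketch is sequential rather than parallel.

```python
from collections import Counter

# Hypothetical toy grammar: rule id -> list of symbols.
# An int refers to another rule; a str is a word token (terminal),
# standing in for the output of a word segmentation step.
rules = {
    0: [1, "c", 1, "d"],  # root rule, represents the whole text
    1: ["a", 2, 2],
    2: ["b", "a"],
}

def rule_weights(rules, root=0):
    """Count how often each rule occurs in the fully expanded text."""
    # Number of references to each rule across all rule bodies.
    pending = Counter(s for body in rules.values() for s in body
                      if isinstance(s, int))
    weights = Counter({root: 1})
    ready = [root]
    while ready:
        r = ready.pop()
        for s in rules[r]:
            if isinstance(s, int):
                weights[s] += weights[r]
                pending[s] -= 1
                # A rule is final once all its references are processed.
                if pending[s] == 0:
                    ready.append(s)
    return weights

def word_count(rules, root=0):
    """Word frequencies computed on the rules, without expanding the text."""
    weights = rule_weights(rules, root)
    counts = Counter()
    for r, body in rules.items():
        for s in body:
            if isinstance(s, str):
                counts[s] += weights[r]
    return counts

def expand(rules, r=0):
    """Full decompression, used here only to cross-check the result."""
    out = []
    for s in rules[r]:
        out.extend(expand(rules, s) if isinstance(s, int) else [s])
    return out

if __name__ == "__main__":
    print(word_count(rules))
```

Each rule body is scanned once regardless of how many times the rule recurs in the original text, which is where the savings over decompress-then-count come from in this sketch.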
Keywords
TADOC,Text analytics,DCU,Parallel computing