G-KNN: an efficient document classification algorithm for sparse datasets on GPUs using KNN

SAC 2015: Symposium on Applied Computing, Salamanca, Spain, April 2015

Abstract
Today there is more data than can be effectively analyzed. Organizing this data has become one of the biggest problems in Computer Science. Many algorithms have been proposed for this purpose, notably those from the Data Mining area, and specifically automatic document classification (ADC) algorithms. However, these algorithms remain computationally challenging because of the volume of data that needs to be processed. The literature contains some proposals that parallelize these algorithms on graphics processing units (GPUs) to make them feasible. Still, most of the available parallel solutions ignore specific ADC challenges, such as high dimensionality and heterogeneity in the representation of the documents. In this context, we present G-KNN, a GPU-based parallel version of the k-nearest neighbors algorithm (KNN), one of the most widely used ADC algorithms. In our evaluation using five different document collections, we show that G-KNN maintains the same classification effectiveness while running up to 12x faster than its sequential CPU version and up to 3x faster than a CPU-based parallel implementation running with 6 threads. Moreover, our algorithm has much lower memory consumption, enabling its use with large datasets.
Keywords
Parallel algorithms, Data Mining Applications, GPU
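
For readers unfamiliar with KNN on sparse document vectors, the following is a minimal sequential sketch of the kind of baseline the abstract compares against. It is not the G-KNN GPU implementation, whose details the abstract does not give. The dictionary-based sparse representation, the use of cosine similarity, and the names `cosine` and `knn_classify` are all assumptions made for illustration.

```python
import heapq
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {term: weight}.

    Assumption: documents are represented as term-weight dictionaries;
    the abstract does not specify the representation or the metric.
    """
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector only
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(train, query, k=3):
    """Return the majority label among the k training documents most similar to query.

    train is a list of (sparse_vector, label) pairs.
    """
    scored = ((cosine(vec, query), label) for vec, label in train)
    top = heapq.nlargest(k, scored, key=lambda s: s[0])
    votes = Counter(label for _, label in top)
    return votes.most_common(1)[0][0]


# Hypothetical toy collection, for illustration only.
train = [
    ({"gpu": 2.0, "cuda": 1.0}, "hardware"),
    ({"gpu": 1.0, "kernel": 1.0}, "hardware"),
    ({"recipe": 2.0, "oven": 1.0}, "cooking"),
    ({"oven": 1.0, "bake": 2.0}, "cooking"),
]
```

Each query is compared against every training document, so the cost grows with both the collection size and the number of non-zero terms per document. That per-query scan over independent document pairs is the part that lends itself to parallel execution.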