GPU Acceleration of Document Similarity Measures for Automated Bug Triaging

Tim Dunn, Natasha Kholgade Banerjee, Sean Banerjee

2016 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW) (2016)

Abstract

Large-scale open source software bug repositories from companies such as Mozilla, RedHat, Novell and Eclipse have enabled researchers to develop automated solutions to bug triaging problems such as bug classification, duplicate classification and developer assignment. However, despite the repositories containing millions of usable reports, researchers utilize only a small fraction of the data. A major reason for this is the polynomial time and cost associated with comparing each report to all prior reports. Graphics processing units (GPUs) with several thousand cores have been used to accelerate algorithms in several domains, such as computer graphics, computer vision and linguistics. However, they have remained unexplored in the area of bug triaging. In this paper, we demonstrate that comparing a bug report to all prior reports is an embarrassingly parallel problem that can be accelerated using GPUs. The similarity of two bug reports can be computed using frequency-based methods (e.g. cosine similarity and BM25F), sequence-based methods (e.g. longest common substring and longest common subsequence) or topic modeling. In this paper we focus on cosine similarity, longest common substring and longest common subsequence. Using an NVIDIA Tesla K40 GPU, we show that frequency-based and sequence-based similarity measures are accelerated by 89 and 85 times, respectively, when compared to a pure CPU-based implementation. This allows us to generate similarity scores for the entire Eclipse repository, consisting of 498,161 reports, in under a day, as opposed to 83.4 days using a CPU-based approach.

Keywords

GPU accelerated document similarity, open source repositories, cosine similarity, longest common subsequence, longest common substring
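
To illustrate why the abstract describes the comparison as embarrassingly parallel, the following is a minimal CPU-side sketch in Python of the three measures it names. It is not the authors' GPU implementation: the whitespace tokenization, the raw term-frequency weighting, and all function names (`cosine_similarity`, `longest_common_substring`, `longest_common_subsequence`, `score_against_all`) are assumptions made for illustration. The point is that each score in `score_against_all` depends only on the query and one prior report, so the comparisons could in principle be computed independently of one another, which is the property a GPU would exploit.

```python
import math
from collections import Counter


def cosine_similarity(a_tokens, b_tokens):
    """Cosine similarity between raw term-frequency vectors (assumed weighting)."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def longest_common_substring(a_tokens, b_tokens):
    """Length of the longest contiguous run of tokens shared by both lists."""
    best = 0
    prev = [0] * (len(b_tokens) + 1)
    for x in a_tokens:
        cur = [0] * (len(b_tokens) + 1)
        for j, y in enumerate(b_tokens, start=1):
            if x == y:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def longest_common_subsequence(a_tokens, b_tokens):
    """Length of the longest (not necessarily contiguous) shared token subsequence."""
    prev = [0] * (len(b_tokens) + 1)
    for x in a_tokens:
        cur = [0] * (len(b_tokens) + 1)
        for j, y in enumerate(b_tokens, start=1):
            if x == y:
                cur[j] = prev[j - 1] + 1
            else:
                cur[j] = max(prev[j], cur[j - 1])
        prev = cur
    return prev[-1]


def score_against_all(query, prior_reports, measure):
    """Score one report against every prior report.

    Each list element is computed without reference to any other element,
    so the loop body is the unit of work that could be run in parallel.
    """
    q = query.lower().split()  # naive whitespace tokenization (assumption)
    return [measure(q, report.lower().split()) for report in prior_reports]


if __name__ == "__main__":
    priors = ["editor crash when opening file", "update documentation"]
    for measure in (cosine_similarity, longest_common_substring,
                    longest_common_subsequence):
        print(measure.__name__,
              score_against_all("crash when opening editor", priors, measure))
```

The two sequence measures use a row-by-row dynamic program whose cost grows with the product of the two report lengths, which is one reason sequence-based comparison across a whole repository is expensive on a CPU.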
