Similarity Forests.
KDD '17: The 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, August 2017
Abstract
Random forests are among the most successful methods used in data mining because of their extraordinary accuracy and effectiveness. However, their use is primarily limited to multidimensional data because they sample features from the original data set. In this paper, we propose a method for extending random forests to work with an arbitrary set of data objects, as long as similarities can be computed among the data objects. Furthermore, because computing similarities between all O(n^2) pairs of n objects can be expensive, our method computes only a very small fraction of the O(n^2) pairwise similarities between objects to construct the forests. Our results show that the proposed similarity forest approach is very efficient and accurate on a wide variety of data sets. Therefore, this paper significantly extends the applicability of random forest methods to arbitrary data domains. Furthermore, the approach even outperforms traditional random forests on multidimensional data. We show that similarity forests are robust to the noisy similarity values that are ubiquitous in real-world applications. In many practical settings, the similarity values between objects are incompletely specified because of the difficulty in collecting such values. Similarity forests can be used in such cases with straightforward modifications.
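The abstract does not spell out the splitting rule, so the following is only a minimal sketch of how a single tree node could be split using similarities alone. It assumes that a split is defined by a randomly chosen pair of reference objects with different labels, and that every object x at the node is scored by similarity(x, a) - similarity(x, b), with the threshold chosen to minimize weighted Gini impurity. The names gini and best_similarity_split are hypothetical and chosen for illustration. Under this assumption a node with n objects needs only 2n similarity evaluations rather than all O(n^2) pairs.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_similarity_split(objects, labels, similarity, rng):
    """Choose one split for a node using 2 * len(objects) similarity calls.

    Assumption (not stated in the abstract): the split direction is given by
    a random pair (a, b) of objects with different labels, and each object x
    is scored by similarity(x, a) - similarity(x, b).

    Returns (impurity, threshold, a, b), or None if no split is possible.
    """
    idx = list(range(len(objects)))
    i = rng.choice(idx)
    others = [j for j in idx if labels[j] != labels[i]]
    if not others:
        return None  # the node is already pure
    j = rng.choice(others)
    a, b = objects[i], objects[j]

    # Only similarities to the two reference objects are ever computed.
    scores = [similarity(x, a) - similarity(x, b) for x in objects]
    order = sorted(idx, key=lambda k: scores[k])

    best = None
    for pos in range(1, len(order)):
        lo, hi = scores[order[pos - 1]], scores[order[pos]]
        if lo == hi:
            continue  # no threshold can separate equal scores
        left = [labels[k] for k in order[:pos]]
        right = [labels[k] for k in order[pos:]]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(order)
        if best is None or impurity < best[0]:
            best = (impurity, (lo + hi) / 2.0, a, b)
    return best


def char_jaccard(s, t):
    """Toy similarity on strings: Jaccard overlap of their character sets."""
    s, t = set(s), set(t)
    return len(s & t) / len(s | t)


words = ["aaa", "aab", "abb", "xyz", "xyy", "xxz"]
classes = [0, 0, 0, 1, 1, 1]
result = best_similarity_split(words, classes, char_jaccard, random.Random(0))
```

In this toy usage the objects are strings rather than feature vectors, which is the setting the abstract targets: the split is found without ever representing the objects as multidimensional points. A full forest would apply such a split recursively and average over many randomized trees.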