Pushing diversity into higher dimensions: The LID effect on diversified similarity searching.

Inf. Syst. (2023)

Abstract
Diversified similarity queries extend similarity searches by retrieving data elements that are simultaneously similar to a given query object (the reference) and dissimilar to the other elements in the result set. Although diversity-related querying routines are expected to be costlier than plain similarity searches (e.g., diversified k-NN vs. k-NN), the experimental evaluation of their quality and performance on datasets embedded in high-dimensional spaces remains an open issue. In this manuscript, we apply the concept of Local Intrinsic Dimensionality (LID) to examine the behavior of the diversity browsing method in the exploration of high-dimensional data. Diversity browsing extends distance browsing, an incremental k-NN search algorithm, towards diversified similarity queries by using both proximity and inner dissimilarity criteria (the Influence rule) to dismiss result set candidates. We empirically investigate the effect of the LID of real-world data on diversity browsing and find evidence indicating that: (i) the amount of diversity (measured by the total number of “retrievable” diversified neighbors) is constrained by both the LID of the data fold and the Influence-based partitioning of the search space. Counter-intuitively, this number increases with the LID because inner dissimilarities between diversified neighbors grow more slowly in intrinsically high-dimensional spaces, which generates an inverse correlation between Influence-based pruning and data LID; (ii) exploratory diversity searches (the retrieval of all diversified neighbors) over sets of varying LID produce query-based manifolds, as the recovered elements are also data samples with enhanced measures of Relative Contrast, Relative Variance, and Intrinsic Dimensionality in comparison to the original dataset. Since exploratory diversity searches preserve the viewpoint of the query object, their outputs also provide an alternative for data visualization with meaningful distance relationships. We illustrate this visualization potential with bubble-clustered scatter plots that use Influence rules to determine the size of each cluster; (iii) the tuning of metric indexes is still a relevant factor for speeding up diversified similarity queries over data folds of low and medium LID. In particular, our evaluation showed that metric pivots chosen with the Maximal Variance criterion outperform Random pivots for diversity browsing searches implemented over VP-Tree indexes for most LID ranges; and (iv) the performance gap between diversity browsing and distance browsing narrows as the LID increases, which suggests that querying high-dimensional datasets by diversity may provide richer results at costs proportional to those of similarity searches.
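
The abstract describes diversity browsing as dismissing candidates by proximity and inner dissimilarity, without stating the Influence rule itself. The Python sketch below is only an illustration under stated assumptions: it assumes a candidate is dismissed when some already selected element is at least as close to it as that element is to the query and as the candidate is to the query, and it replaces the index-based incremental search with a linear scan sorted by distance. The names `is_influenced` and `diversified_neighbors` are hypothetical and do not come from the paper.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.dist(a, b)


def is_influenced(candidate, selected, query, dist=euclidean):
    """Assumed Influence rule (not quoted from the paper).

    The candidate counts as influenced by `selected` when their mutual
    distance is no larger than the distance from `selected` to the query
    and no larger than the distance from the candidate to the query.
    """
    d_sc = dist(selected, candidate)
    return d_sc <= dist(selected, query) and d_sc <= dist(candidate, query)


def diversified_neighbors(data, query, k=None, dist=euclidean):
    """Greedy diversified search over a linear scan.

    Elements are visited in increasing distance from the query, which
    stands in for an incremental k-NN traversal of a metric index.
    A candidate is kept only if no already kept element influences it.
    With k=None, every retrievable diversified neighbor is returned
    (the exploratory case described in the abstract).
    """
    ranked = sorted(data, key=lambda x: dist(x, query))
    result = []
    for candidate in ranked:
        if k is not None and len(result) >= k:
            break
        if not any(is_influenced(candidate, s, query, dist) for s in result):
            result.append(candidate)
    return result


if __name__ == "__main__":
    points = [(1.0, 0.0), (1.1, 0.0), (0.0, 2.0), (-3.0, 0.0)]
    print(diversified_neighbors(points, (0.0, 0.0)))
```

In this toy input, the point `(1.1, 0.0)` lies next to the nearest neighbor `(1.0, 0.0)` and is dismissed, while the two points in other directions are kept. A linear scan costs a full sort per query; the paper's setting uses metric indexes such as the VP-Tree to avoid that, which this sketch does not attempt to reproduce.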
Keywords
Result diversification, Diversified similarity searching, k-NN, Local Intrinsic Dimensionality