# Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions

Communications of the ACM 51, no. 1 (2008): 117–122

Abstract

We present an algorithm for the c-approximate nearest neighbor problem in a d-dimensional Euclidean space, achieving query time of O(dn^{1/c^2 + o(1)}) and space O(dn + n^{1 + 1/c^2 + o(1)}). This almost matches the lower bound for hashing-based algorithms recently obtained in [27]. We also obtain a space-effi…

Introduction

- The nearest neighbor problem is defined as follows: given a collection of n points, build a data structure which, given any query point, reports the data point that is closest to the query.
- In Section 4, the authors describe a recently developed LSH family for the Euclidean distance, which achieves a near-optimal separation between the collision probabilities of close and far points.
- An interesting feature of this family is that it effectively enables the reduction of the approximate nearest neighbor problem for worst-case data to the exact nearest neighbor problem over random point configurations in low-dimensional spaces.
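To make the LSH idea concrete, here is a minimal Python sketch of one classic LSH family for Euclidean distance, based on a random Gaussian projection followed by quantization (in the spirit of the p-stable scheme of [17]). This is an illustrative assumption on our part, not the near-optimal family the paper itself develops:

```python
import random

def make_euclidean_hash(dim, w, rng):
    """One LSH function h(p) = floor((a . p + b) / w), with the projection
    vector a drawn from a Gaussian (a 2-stable distribution) and the offset
    b uniform in [0, w)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, w)

    def h(p):
        dot = sum(ai * pi for ai, pi in zip(a, p))
        return int((dot + b) // w)

    return h

rng = random.Random(0)
h = make_euclidean_hash(dim=3, w=4.0, rng=rng)
# Identical points always collide; nearby points land in the same
# width-w slab of the projection with higher probability than far points.
```

That gap between collision probabilities of close and far points is exactly the locality-sensitive property the algorithms below exploit.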

Highlights

- The nearest neighbor problem is defined as follows: given a collection of n points, build a data structure which, given any query point, reports the data point that is closest to the query
- An interesting and well-studied instance is where the data points live in a d-dimensional space under some (e.g., Euclidean) distance function. This problem is of major importance in several areas; some examples are data compression, databases and data mining, information retrieval, image and video databases, machine learning, pattern recognition, statistics, and data analysis
- The basic problem is to perform indexing or similarity searching for query objects
- We focus on one of the most popular algorithms for performing approximate search in high dimensions based on the concept of locality-sensitive hashing (LSH) [25]
- In the rest of this article, we focus on the approximate near neighbor problem

Results

- The authors' algorithm either returns one of the R-near neighbors or concludes that no such point exists for some parameter R.
- Given a set P of points in a d-dimensional space ℝ^d, and parameters R > 0, δ > 0, construct a data structure such that, given any query point q, if there exists an R-near neighbor of q in P, it reports some cR-near neighbor of q in P with probability 1 − δ.
- Given a set P of points in a d-dimensional space ℝ^d, and parameters R > 0, δ > 0, construct a data structure that, given any query point q, reports each R-near neighbor of q in P with probability 1 − δ.
- Unlike the case of the approximate near neighbor, here the data structure can return many points if a large fraction of the data points are located close to the query point.
- Setting L′ = 3L yields a solution to the randomized c-approximate R-near neighbor problem, with parameters R and δ, for some constant failure probability δ < 1.
- Choose L functions g_j, j = 1, ..., L, by setting g_j = (h_{1,j}, h_{2,j}, ..., h_{k,j}), where h_{1,j}, ..., h_{k,j} are chosen at random from the LSH family H.
- The blue function (k = 1) is the probability of collision of points p and q under a single random hash function h from the LSH family.
- There the data structure optimized the parameter k as a function of the dataset and a set of sample queries.
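The construction of the L composite functions g_j and their hash tables can be sketched as follows. This is a hypothetical minimal illustration; `make_hash` stands in for a draw from the LSH family H (here a toy Gaussian-projection hash), and function names are ours:

```python
import random

# A toy draw from an LSH family: random Gaussian projection, quantized.
_rng = random.Random(1)

def make_hash(dim=2, w=4.0):
    a = [_rng.gauss(0.0, 1.0) for _ in range(dim)]
    return lambda p: int(sum(x * y for x, y in zip(a, p)) // w)

def build_lsh_index(points, k, L, make_hash):
    """Build L hash tables. Table j is keyed by the composite hash
    g_j(p) = (h_1(p), ..., h_k(p)), each h_i drawn fresh from the family."""
    tables = []
    for _ in range(L):
        hs = [make_hash() for _ in range(k)]
        table = {}
        for i, p in enumerate(points):
            key = tuple(h(p) for h in hs)
            table.setdefault(key, []).append(i)
        tables.append((hs, table))
    return tables

def query_candidates(tables, q):
    """Indices of points colliding with q in at least one of the L tables."""
    candidates = set()
    for hs, table in tables:
        key = tuple(h(q) for h in hs)
        candidates.update(table.get(key, ()))
    return candidates
```

A query then measures the true distance to each candidate and returns one that is a cR-near neighbor, if any; concatenating k hashes drives down false collisions, while using L independent tables keeps the probability of missing a true near neighbor small.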

Conclusion

- The authors present a new LSH family, yielding an algorithm with query time exponent ρ(c) = 1/c^2 + o(1).
- The hash functions projected the vectors onto some subset of the coordinates {1, ..., d}, as in the example from an earlier section.
- To measure the similarity between two sets A and B, the authors of [9, 8] considered the Jaccard coefficient s(A, B), proposing a family of hash functions h(A) such that Pr[h(A) = h(B)] = s(A, B).
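The min-wise hashing scheme of [9, 8] can be sketched as follows: pick a random permutation π of the universe and let h(A) = min over x in A of π(x); then Pr[h(A) = h(B)] equals the Jaccard coefficient s(A, B) = |A ∩ B| / |A ∪ B|. A minimal Python illustration (helper names are ours, and explicit permutations are used for clarity, though practical implementations use cheaper hash functions):

```python
import random

def minhash_signature(s, num_hashes, universe, seed=0):
    """Signature of set s: for each of num_hashes random permutations of
    the universe, record the minimum rank attained by an element of s."""
    rng = random.Random(seed)
    sig = []
    for _ in range(num_hashes):
        perm = list(universe)
        rng.shuffle(perm)
        rank = {x: i for i, x in enumerate(perm)}
        sig.append(min(rank[x] for x in s))
    return sig

def estimate_jaccard(sig_a, sig_b):
    """The fraction of agreeing components estimates |A ∩ B| / |A ∪ B|."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

With many independent permutations the fraction of agreeing signature components concentrates around the true Jaccard coefficient, which is what makes this family locality-sensitive for set similarity.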

Related work

- In this section, we give a brief overview of prior work in the spirit of the algorithms considered in this article. We give only high-level simplified descriptions of the algorithms to avoid area-specific terminology. Some of the papers considered a closely related problem of finding all close pairs of points in a dataset. For simplicity, we translate them into the near neighbor framework since they can be solved by performing essentially n separate near neighbor queries.

Hamming distance. Several papers investigated multi-index hashing-based algorithms for retrieving similar pairs of vectors with respect to the Hamming distance. Typically, the hash functions projected the vectors onto some subset of the coordinates {1, ..., d}, as in the example from an earlier section. In some papers [33, 21], the authors considered the probabilistic model where the data points are chosen uniformly at random, and the query point is a random point close to one of the points in the dataset. A different approach [26] is to assume that the dataset is arbitrary, but almost all points are far from the query point. Finally, the paper [12] proposed an algorithm which did not make any assumptions about the input. The analysis of the algorithm was akin to the analysis sketched at the end of Section 2.4: the parameters k and L were chosen to achieve the desired level of sensitivity and accuracy.
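The coordinate-projection hash functions described above can be sketched in a few lines (a hypothetical illustration, with names of our choosing): each function samples k of the d coordinates, so two binary vectors at Hamming distance r agree on a single sampled coordinate with probability 1 − r/d.

```python
import random

def make_bit_sampler(d, k, rng):
    """Hash a length-d binary vector by projecting onto k random coordinates."""
    coords = rng.sample(range(d), k)
    return lambda v: tuple(v[i] for i in coords)

rng = random.Random(0)
g = make_bit_sampler(d=8, k=3, rng=rng)
# Vectors that agree on the sampled coordinates collide under g.
```

Raising k sharpens the gap: vectors at distance r collide with probability roughly (1 − r/d)^k, so close pairs survive while far pairs are filtered out.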

Funding

- This work was supported in part by NSF CAREER grant CCR-0133849 and a David and Lucile Packard Fellowship.

References

- Ailon, N. and Chazelle, B. 2006. Approximate nearest neighbors and the Fast Johnson-Lindenstrauss Transform. In Proceedings of the Symposium on Theory of Computing.
- Andoni, A. and Indyk, P. 2004. E2LSH: Exact Euclidean locality-sensitive hashing. http://web.mit.edu/andoni/www/LSH/.
- Andoni, A. and Indyk, P. 2006. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Proceedings of the Symposium on Foundations of Computer Science.
- Andoni, A. and Indyk, P. 2006. Efficient algorithms for substring near neighbor problem. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms. 1203–1212.
- Arya, S., Mount, D. M., Netanyahu, N. S., Silverman, R., and Wu, A. 1994. An optimal algorithm for approximate nearest neighbor searching. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms. 573–582.
- Bentley, J. L. 1975. Multidimensional binary search trees used for associative searching. Comm. ACM 18, 509–517.
- Broder, A., Charikar, M., Frieze, A., and Mitzenmacher, M. 1998. Min-wise independent permutations. J. Comput. Sys. Sci.
- Broder, A., Glassman, S., Manasse, M., and Zweig, G. 1997. Syntactic clustering of the web. In Proceedings of the 6th International World Wide Web Conference. 391–404.
- Broder, A. 1997. On the resemblance and containment of documents. In Proceedings of Compression and Complexity of Sequences. 21–29.
- Buhler, J. 2001. Efficient large-scale sequence comparison by locality-sensitive hashing. Bioinform. 17, 419–428.
- Buhler, J. and Tompa, M. 2001. Finding motifs using random projections. In Proceedings of the Annual International Conference on Computational Molecular Biology (RECOMB1).
- Califano, A. and Rigoutsos, I. 1993. FLASH: A fast look-up algorithm for string homology. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Chakrabarti, A. and Regev, O. 2004. An optimal randomised cell probe lower bound for approximate nearest neighbor searching. In Proceedings of the Symposium on Foundations of Computer Science.
- Charikar, M. 2002. Similarity estimation techniques from rounding. In Proceedings of the Symposium on Theory of Computing.
- Charikar, M., Chekuri, C., Goel, A., Guha, S., and Plotkin, S. 1998. Approximating a finite metric by a small number of tree metrics. In Proceedings of the Symposium on Foundations of Computer Science.
- Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. 2001. Introduction to Algorithms, 2nd ed. MIT Press.
- Datar, M., Immorlica, N., Indyk, P., and Mirrokni, V. 2004. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the ACM Symposium on Computational Geometry.
- Dutta, D., Guha, R., Jurs, C., and Chen, T. 2006. Scalable partitioning and exploration of chemical spaces using geometric hashing. J. Chem. Inf. Model. 46.
- Gionis, A., Indyk, P., and Motwani, R. 1999. Similarity search in high dimensions via hashing. In Proceedings of the International Conference on Very Large Databases.
- Goemans, M. and Williamson, D. 1995. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J. ACM 42. 1115–1145.
- Greene, D., Parnas, M., and Yao, F. 1994. Multi-index hashing for information retrieval. In Proceedings of the Symposium on Foundations of Computer Science. 722–731.
- Har-Peled, S. 2001. A replacement for voronoi diagrams of near linear size. In Proceedings of the Symposium on Foundations of Computer Science.
- Haveliwala, T., Gionis, A., and Indyk, P. 2000. Scalable techniques for clustering the web. WebDB Workshop.
- Indyk, P. 2003. Nearest neighbors in high-dimensional spaces. In Handbook of Discrete and Computational Geometry. CRC Press.
- Indyk, P. and Motwani, R. 1998. Approximate nearest neighbor: Towards removing the curse of dimensionality. In Proceedings of the Symposium on Theory of Computing.
- Karp, R. M., Waarts, O., and Zweig, G. 1995. The bit vector intersection problem. In Proceedings of the Symposium on Foundations of Computer Science. 621–630.
- Kleinberg, J. 1997. Two algorithms for nearest-neighbor search in high dimensions. In Proceedings of the Symposium on Theory of Computing.
- Krauthgamer, R. and Lee, J. R. 2004. Navigating nets: Simple algorithms for proximity search. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms.
- Kushilevitz, E., Ostrovsky, R., and Rabani, Y. 1998. Efficient search for approximate nearest neighbor in high dimensional spaces. In Proceedings of the Symposium on Theory of Computing. 614–623.
- Linial, N., London, E., and Rabinovich, Y. 1994. The geometry of graphs and some of its algorithmic applications. In Proceedings of the Symposium on Foundations of Computer Science. 577–591.
- Motwani, R., Naor, A., and Panigrahy, R. 2006. Lower bounds on locality sensitive hashing. In Proceedings of the ACM Symposium on Computational Geometry.
- Panigrahy, R. 2006. Entropy-based nearest neighbor algorithm in high dimensions. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms.
- Paturi, R., Rajasekaran, S., and Reif, J. 1995. The light bulb problem. Inform. Comput. 117, 2, 187–192.
- Ravichandran, D., Pantel, P., and Hovy, E. 2005. Randomized algorithms and NLP: Using locality sensitive hash functions for high speed noun clustering. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.
- Samet, H. 2006. Foundations of Multidimensional and Metric Data Structures. Elsevier.
- Shakhnarovich, G., Darrell, T., and Indyk, P. Eds. Nearest Neighbor Methods in Learning and Vision. Neural Processing Information Series, MIT Press.
- Terasawa, T. and Tanaka, Y. 2007. Spherical LSH for approximate nearest neighbor search on unit hypersphere. In Proceedings of the Workshop on Algorithms and Data Structures.
