Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions

Communications of The ACM, no. 1 (2008): 117-122

Abstract

We present an algorithm for the c-approximate nearest neighbor problem in a d-dimensional Euclidean space, achieving query time of $O(dn^{1/c^2 + o(1)})$ and space $O(dn + n^{1 + 1/c^2 + o(1)})$. This almost matches the lower bound for hashing-based algorithms recently obtained in [27]. We also obtain a space-effi...

Introduction
  • The nearest neighbor problem is defined as follows: given a collection of n points, build a data structure which, given any query point, reports the data point that is closest to the query.
  • In Section 4, the authors describe a recently developed LSH family for the Euclidean distance, which achieves a near-optimal separation between the collision probabilities of close and far points.
  • An interesting feature of this family is that it effectively enables the reduction of the approximate nearest neighbor problem for worst-case data to the exact nearest neighbor problem over random point configurations in low-dimensional spaces.
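
As a point of reference for the problem definition above, here is a minimal sketch of the exact baseline: a linear scan that compares the query against every data point. The names `euclidean` and `nearest_neighbor` are illustrative and do not come from the paper; the scan costs O(nd) per query, which is the cost that hashing-based methods try to beat.

```python
import math
from typing import Sequence

Point = Sequence[float]


def euclidean(p: Point, q: Point) -> float:
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def nearest_neighbor(points: Sequence[Point], query: Point) -> int:
    """Return the index of the data point closest to `query`.

    Exact linear scan: every point is examined, so one query costs O(n * d).
    """
    if not points:
        raise ValueError("points must be non-empty")
    return min(range(len(points)), key=lambda i: euclidean(points[i], query))


if __name__ == "__main__":
    data = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
    print(nearest_neighbor(data, (0.9, 1.2)))
```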
Highlights
  • The nearest neighbor problem is defined as follows: given a collection of n points, build a data structure which, given any query point, reports the data point that is closest to the query
  • An interesting and well-studied instance is where the data points live in a d-dimensional space under some (e.g., Euclidean) distance function. This problem is of major importance in several areas; some examples are data compression, databases and data mining, information retrieval, image and video databases, machine learning, pattern recognition, statistics and data analysis
  • The basic problem is to perform indexing or similarity searching for query objects
  • We focus on one of the most popular algorithms for performing approximate search in high dimensions based on the concept of locality-sensitive hashing (LSH) [25]
  • In the rest of this article, we focus on the approximate near neighbor problem
Results
  • The authors' algorithm either returns one of the R-near neighbors or concludes that no such point exists for some parameter R.
  • Given a set P of points in a d-dimensional space ℝ^d, and parameters R > 0, δ > 0, construct a data structure such that, given any query point q, if there exists an R-near neighbor of q in P, it reports some cR-near neighbor of q in P with probability 1 − δ.
  • Given a set P of points in a d-dimensional space ℝ^d, and parameters R > 0, δ > 0, construct a data structure that, given any query point q, reports each R-near neighbor of q in P with probability 1 − δ.
  • Unlike the case of the approximate near neighbor, here the data structure can return many points if a large fraction of the data points are located close to the query point.
  • L′ = 3L, yields a solution to the randomized c-approximate R-near neighbor problem, with parameters R and δ for some constant failure probability δ < 1.
  • Choose L functions g_j, j = 1, ..., L, by setting g_j = (h_{1,j}, h_{2,j}, ..., h_{k,j}), where h_{1,j}, ..., h_{k,j} are chosen at random from the LSH family H.
  • The blue function (k = 1) is the probability of collision of points p and q under a single random hash function h from the LSH family.
  • There the data structure optimized the parameter k as a function of the dataset and a set of sample queries.
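
The construction in the bullets above (L functions g_j, each a concatenation of k hash functions, with the search cut off after L′ = 3L retrieved points) can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes binary vectors under the Hamming distance and uses coordinate sampling as the LSH family, and the names `LSHIndex`, `hamming`, and `collision_probability` are hypothetical. The helper at the end uses the standard identity that, if a single h collides with probability p, then one g_j collides with probability p^k and at least one of L tables collides with probability 1 − (1 − p^k)^L.

```python
import random
from collections import defaultdict
from typing import List, Optional, Sequence, Tuple


def hamming(p: Sequence[int], q: Sequence[int]) -> int:
    """Number of coordinates in which p and q differ."""
    return sum(a != b for a, b in zip(p, q))


class LSHIndex:
    """L hash tables; table j is keyed by g_j = (h_{1,j}, ..., h_{k,j}).

    Assumption for this sketch: each h reads one random coordinate of a
    binary vector (coordinate sampling for the Hamming distance).
    """

    def __init__(self, dim: int, k: int, L: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.L = L
        # projections[j] holds the k coordinates that define g_j.
        self.projections: List[List[int]] = [
            [rng.randrange(dim) for _ in range(k)] for _ in range(L)
        ]
        self.tables = [defaultdict(list) for _ in range(L)]
        self.points: List[Tuple[int, ...]] = []

    def _key(self, j: int, p: Sequence[int]) -> Tuple[int, ...]:
        return tuple(p[i] for i in self.projections[j])

    def add(self, p: Sequence[int]) -> int:
        """Insert p into all L tables and return its index."""
        idx = len(self.points)
        self.points.append(tuple(p))
        for j, table in enumerate(self.tables):
            table[self._key(j, p)].append(idx)
        return idx

    def query(self, q: Sequence[int], radius: int) -> Optional[int]:
        """Return the index of some stored point within `radius` of q, or None.

        For the c-approximate R-near neighbor problem, pass radius = c * R.
        The scan stops after 3L retrieved points (duplicates included).
        """
        budget = 3 * self.L
        for j, table in enumerate(self.tables):
            for idx in table.get(self._key(j, q), ()):
                if hamming(self.points[idx], q) <= radius:
                    return idx
                budget -= 1
                if budget == 0:
                    return None
        return None


def collision_probability(p: float, k: int, L: int) -> float:
    """Probability that at least one of L tables collides: 1 - (1 - p^k)^L."""
    return 1.0 - (1.0 - p ** k) ** L
```

Increasing k makes each table more selective (fewer far points collide), while increasing L restores the chance that a near point collides in at least one table.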
Conclusion
  • The authors present a new LSH family, yielding an algorithm with query time exponent ρ(c) = 1/c^2 + o(1).
  • The hash functions were projecting the vectors on some subset of the coordinates {1...d} as in the example from an earlier section.
  • To measure the similarity between two sets A and B, the authors of [9, 8] considered the Jaccard coefficient s(A, B), proposing a family of hash functions h(A) such that Pr[h(A) = h(B)] = s(A, B).
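
The property Pr[h(A) = h(B)] = s(A, B) quoted above can be illustrated with a min-wise hashing sketch. This is an assumption-laden toy version: it draws an explicit random permutation of a small integer universe for each hash function, which is not how large-scale systems implement it, and the names `make_minhash`, `jaccard`, and `estimate_jaccard` are hypothetical. Sets are assumed to be non-empty subsets of `range(universe_size)`.

```python
import random
from typing import Callable, Iterable, Set


def jaccard(a: Iterable[int], b: Iterable[int]) -> float:
    """Jaccard coefficient s(A, B) = |A ∩ B| / |A ∪ B|."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


def make_minhash(universe_size: int, seed: int) -> Callable[[Set[int]], int]:
    """Build one hash function h(A) = min rank of A under a random permutation."""
    rng = random.Random(seed)
    rank = list(range(universe_size))
    rng.shuffle(rank)  # rank[x] is the position of element x in the permutation

    def h(s: Set[int]) -> int:
        return min(rank[x] for x in s)

    return h


def estimate_jaccard(a: Set[int], b: Set[int], universe_size: int,
                     num_hashes: int = 200, seed: int = 0) -> float:
    """Fraction of independent min-hashes on which A and B collide."""
    hits = 0
    for t in range(num_hashes):
        h = make_minhash(universe_size, seed + t)
        hits += h(a) == h(b)
    return hits / num_hashes


if __name__ == "__main__":
    A, B = {1, 2, 3}, {2, 3, 4}
    print(jaccard(A, B), estimate_jaccard(A, B, universe_size=10, num_hashes=2000))
```

The two sets collide under h exactly when the lowest-ranked element of A ∪ B lies in A ∩ B, which is why the collision frequency approaches s(A, B) as the number of hash functions grows.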
Related work
  • In this section, we give a brief overview of prior work in the spirit of the algorithms considered in this article. We give only high-level simplified descriptions of the algorithms to avoid area-specific terminology. Some of the papers considered a closely related problem of finding all close pairs of points in a dataset. For simplicity, we translate them into the near neighbor framework since they can be solved by performing essentially n separate near neighbor queries.

    Hamming distance. Several papers investigated multi-index hashing-based algorithms for retrieving similar pairs of vectors with respect to the Hamming distance. Typically, the hash functions were projecting the vectors on some subset of the coordinates {1...d} as in the example from an earlier section. In some papers [33, 21], the authors considered the probabilistic model where the data points are chosen uniformly at random, and the query point is a random point close to one of the points in the dataset. A different approach [26] is to assume that the dataset is arbitrary, but almost all points are far from the query point. Finally, the paper [12] proposed an algorithm which did not make any assumption on the input. The analysis of the algorithm was akin to the analysis sketched at the end of section 2.4: the parameters k and L were chosen to achieve the desired level of sensitivity and accuracy.
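
To make the choice of k and L concrete, the sketch below uses the textbook parameter setting for hashing-based near neighbor search; the summary above does not reproduce the analysis, so the formulas here are an assumption taken from the standard treatment. With P1 and P2 the collision probabilities of a single hash function for points at distance R and cR, it sets ρ = ln(1/P1) / ln(1/P2), k = ⌈ln n / ln(1/P2)⌉, and L = ⌈n^ρ / P1⌉. For coordinate sampling on d-bit vectors, P1 = 1 − R/d and P2 = 1 − cR/d. The function names are hypothetical.

```python
import math
from typing import Tuple


def bit_sampling_probabilities(d: int, R: float, c: float) -> Tuple[float, float]:
    """Collision probabilities (P1, P2) of one sampled coordinate at distance R and cR."""
    p1 = 1.0 - R / d
    p2 = 1.0 - c * R / d
    if not (0.0 < p2 < p1 < 1.0):
        raise ValueError("need R > 0, c > 1 and c * R < d")
    return p1, p2


def choose_parameters(n: int, p1: float, p2: float) -> Tuple[int, int, float]:
    """Return (k, L, rho) under the standard analysis (an assumption of this sketch).

    k makes the expected number of far-point collisions per table about 1;
    L compensates for the reduced near-point collision probability P1^k.
    """
    rho = math.log(1.0 / p1) / math.log(1.0 / p2)
    k = math.ceil(math.log(n) / math.log(1.0 / p2))
    L = math.ceil(n ** rho / p1)
    return k, L, rho


if __name__ == "__main__":
    p1, p2 = bit_sampling_probabilities(d=100, R=10, c=2)
    print(choose_parameters(1000, p1, p2))
```

For this family ρ stays below 1/c, so the number of tables grows sublinearly in n.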
Funding
  • This work was supported in part by NSF CAREER grant CCR-0133849 and a David and Lucile Packard Fellowship
References
  • Ailon, N. and Chazelle, B. 2006. Approximate nearest neighbors and the Fast Johnson-Lindenstrauss Transform. In Proceedings of the Symposium on Theory of Computing.
  • Andoni, A. and Indyk, P. 2004. E2lsh: Exact Euclidean locality-sensitive hashing. http://web.mit.edu/andoni/www/LSH/.
  • Andoni, A. and Indyk, P. 2006. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Proceedings of the Symposium on Foundations of Computer Science.
  • Andoni, A. and Indyk, P. 2006. Efficient algorithms for substring near neighbor problem. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms. 1203–1212.
  • Arya, S., Mount, D. M., Netanyahu, N. S., Silverman, R., and Wu, A. 1994. An optimal algorithm for approximate nearest neighbor searching. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms. 573–582.
  • Bentley, J. L. 1975. Multidimensional binary search trees used for associative searching. Comm. ACM 18, 509–517.
  • Broder, A., Charikar, M., Frieze, A., and Mitzenmacher, M. 1998. Min-wise independent permutations. J. Comput. Sys. Sci.
  • Broder, A., Glassman, S., Manasse, M., and Zweig, G. 1997. Syntactic clustering of the web. In Proceedings of the 6th International World Wide Web Conference. 391–404.
  • Broder, A. 1997. On the resemblance and containment of documents. In Proceedings of Compression and Complexity of Sequences. 21–29.
  • Buhler, J. 2001. Efficient large-scale sequence comparison by locality-sensitive hashing. Bioinform. 17, 419–428.
  • Buhler, J. and Tompa, M. 2001. Finding motifs using random projections. In Proceedings of the Annual International Conference on Computational Molecular Biology (RECOMB1).
  • Califano, A. and Rigoutsos, I. 1993. Flash: A fast look-up algorithm for string homology. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Chakrabarti, A. and Regev, O. 2004. An optimal randomised cell probe lower bound for approximate nearest neighbor searching. In Proceedings of the Symposium on Foundations of Computer Science.
  • Charikar, M. 2002. Similarity estimation techniques from rounding. In Proceedings of the Symposium on Theory of Computing.
  • Charikar, M., Chekuri, C., Goel, A., Guha, S., and Plotkin, S. 1998. Approximating a finite metric by a small number of tree metrics. In Proceedings of the Symposium on Foundations of Computer Science.
  • Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. 2001. Introduction to Algorithms. 2nd Ed. MIT Press.
  • Datar, M., Immorlica, N., Indyk, P., and Mirrokni, V. 2004. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the ACM Symposium on Computational Geometry.
  • Dutta, D., Guha, R., Jurs, C., and Chen, T. 2006. Scalable partitioning and exploration of chemical spaces using geometric hashing. J. Chem. Inf. Model. 46.
  • Gionis, A., Indyk, P., and Motwani, R. 1999. Similarity search in high dimensions via hashing. In Proceedings of the International Conference on Very Large Databases.
  • Goemans, M. and Williamson, D. 1995. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J. ACM 42. 1115–1145.
  • Greene, D., Parnas, M., and Yao, F. 1994. Multi-index hashing for information retrieval. In Proceedings of the Symposium on Foundations of Computer Science. 722–731.
  • Har-Peled, S. 2001. A replacement for Voronoi diagrams of near linear size. In Proceedings of the Symposium on Foundations of Computer Science.
  • Haveliwala, T., Gionis, A., and Indyk, P. 2000. Scalable techniques for clustering the web. WebDB Workshop.
  • Indyk, P. 2003. Nearest neighbors in high-dimensional spaces. In Handbook of Discrete and Computational Geometry. CRC Press.
  • Indyk, P. and Motwani, R. 1998. Approximate nearest neighbor: Towards removing the curse of dimensionality. In Proceedings of the Symposium on Theory of Computing.
  • Karp, R. M., Waarts, O., and Zweig, G. 1995. The bit vector intersection problem. In Proceedings of the Symposium on Foundations of Computer Science. 621–630.
  • Kleinberg, J. 1997. Two algorithms for nearest-neighbor search in high dimensions. In Proceedings of the Symposium on Theory of Computing.
  • Krauthgamer, R. and Lee, J. R. 2004. Navigating nets: Simple algorithms for proximity search. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms.
  • Kushilevitz, E., Ostrovsky, R., and Rabani, Y. 1998. Efficient search for approximate nearest neighbor in high dimensional spaces. In Proceedings of the Symposium on Theory of Computing. 614–623.
  • Linial, N., London, E., and Rabinovich, Y. 1994. The geometry of graphs and some of its algorithmic applications. In Proceedings of the Symposium on Foundations of Computer Science. 577–591.
  • Motwani, R., Naor, A., and Panigrahy, R. 2006. Lower bounds on locality sensitive hashing. In Proceedings of the ACM Symposium on Computational Geometry.
  • Panigrahy, R. 2006. Entropy-based nearest neighbor algorithm in high dimensions. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms.
  • Paturi, R., Rajasekaran, S., and Reif, J. The light bulb problem. Inform. Comput. 117, 2, 187–192.
  • Ravichandran, D., Pantel, P., and Hovy, E. 2005. Randomized algorithms and NLP: Using locality sensitive hash functions for high speed noun clustering. In Proceedings of the Annual Meeting of the Association of Computational Linguistics.
  • Samet, H. 2006. Foundations of Multidimensional and Metric Data Structures. Elsevier.
  • Shakhnarovich, G., Darrell, T., and Indyk, P. Eds. Nearest Neighbor Methods in Learning and Vision. Neural Processing Information Series, MIT Press.
  • Terasawa, T. and Tanaka, Y. 2007. Spherical LSH for approximate nearest neighbor search on unit hypersphere. In Proceedings of the Workshop on Algorithms and Data Structures.