BitFunnel: Revisiting Signatures for Search

SIGIR, pp.605-614, (2017)


Abstract

Since the mid-90s there has been a widely-held belief that signature files are inferior to inverted files for text indexing. In recent years the Bing search engine has developed and deployed an index based on bit-sliced signatures. This index, known as BitFunnel, replaced an existing production system based on an inverted index. The drivi...

Introduction
  • The authors show how to use signatures, or bit-strings based on Bloom filters [1], in a large-scale commercial search engine for better performance.
  • Query Q is said to match document D when every term t ∈ Q is an element of D.
  • This happens when Q ⊆ D or, equivalently, when Q = Q ∩ D.
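
The matching rule above, and the way a Bloom-filter signature approximates it, can be sketched as follows. This is a minimal illustration, not the BitFunnel implementation: the function names, the SHA-256 based hashing, and the defaults of 64 bits and 3 hash functions are all assumptions chosen for the example.

```python
import hashlib

def matches(query_terms, doc_terms):
    """Exact match: every query term is an element of the document (Q ⊆ D)."""
    return set(query_terms) <= set(doc_terms)

def bit_positions(term, num_bits, num_hashes):
    """Map a term to num_hashes bit positions (hypothetical hashing scheme)."""
    positions = []
    for i in range(num_hashes):
        digest = hashlib.sha256(f"{i}:{term}".encode("utf-8")).digest()
        positions.append(int.from_bytes(digest[:8], "big") % num_bits)
    return positions

def signature(terms, num_bits=64, num_hashes=3):
    """Bloom-filter signature: OR together the bits of every term."""
    sig = 0
    for term in terms:
        for pos in bit_positions(term, num_bits, num_hashes):
            sig |= 1 << pos
    return sig

def may_match(query_sig, doc_sig):
    """Signature test: all query bits must be set in the document signature.

    A True result can be a false positive; a False result is always correct.
    """
    return (query_sig & doc_sig) == query_sig
```

Because a document's signature contains the bits of each of its terms, any query with Q ⊆ D passes `may_match`; the converse does not hold, so signature matches still need verification or tolerate a false-positive rate.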
Highlights
  • Commercial search engines [2, 5, 19, 24] traditionally employ inverted indexes
  • Since the mid-90s there has been a widely-held belief that signature files are inferior to inverted files for text indexing
  • In recent years the Bing search engine has developed and deployed an index based on bit-sliced signatures
  • We show how to use signatures, or bit-strings based on Bloom filters [1], in a large-scale commercial search engine for better performance
  • Let corpus C be a set of documents, each of which consists of a set of text terms.
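
To make "bit-sliced" concrete, the sketch below stores signatures transposed: one row per bit position, with one bit per document in each row. A query then only has to AND together the rows for its own bit positions. The function names and the tiny 4-bit signatures are assumptions for illustration and are not taken from the paper.

```python
def build_bit_sliced_index(doc_signatures, num_bits):
    """Transpose per-document signatures into per-bit-position rows.

    rows[b] is an integer whose bit d is set when document d has
    bit b set in its signature.
    """
    rows = [0] * num_bits
    for doc_id, sig in enumerate(doc_signatures):
        for bit in range(num_bits):
            if (sig >> bit) & 1:
                rows[bit] |= 1 << doc_id
    return rows

def candidate_docs(rows, query_bits, num_docs):
    """AND the rows named by the query; surviving bits are candidate documents."""
    result = (1 << num_docs) - 1  # start with every document as a candidate
    for bit in query_bits:
        result &= rows[bit]
    return [d for d in range(num_docs) if (result >> d) & 1]

# Three hand-written 4-bit document signatures (illustrative values only).
doc_signatures = [0b1011, 0b0110, 0b1111]
rows = build_bit_sliced_index(doc_signatures, num_bits=4)
print(candidate_docs(rows, query_bits=[0, 1], num_docs=3))  # → [0, 2]
```

The point of the layout is that the work per query depends on the number of query bits rather than on the signature width, since rows for bit positions the query does not use are never read.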
Results
  • 5.1.1 Signal in a Higher Rank Row. Because each bit in a higher rank row corresponds to multiple documents, the bit density contributed by a single term will nearly always be greater in higher rank rows.
  • The authors can see this in Figure 4 where densities in the rank-0 and 8 , respectively.
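
The density effect can be illustrated with a small sketch. It assumes that a rank-r row has one bit per group of 2^r documents and that the bit is the OR of the group's rank-0 bits; grouping adjacent documents is a simplification chosen for this example, and the function names are hypothetical.

```python
def fold_to_rank(row_bits, rank):
    """Build a rank-`rank` row by ORing each group of 2**rank rank-0 bits."""
    group = 1 << rank
    return [int(any(row_bits[i:i + group]))
            for i in range(0, len(row_bits), group)]

def density(bits):
    """Fraction of bits that are set."""
    return sum(bits) / len(bits)

# A rank-0 row over 16 documents in which a single term sets 2 bits.
rank0 = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
for rank in (0, 1, 3):
    print(rank, density(fold_to_rank(rank0, rank)))
# prints:
# 0 0.125
# 1 0.25
# 3 1.0
```

The same two set bits occupy a growing fraction of a shrinking row, which is why a single term contributes more density at higher ranks. Density only fails to grow when the set bits fall into the same group.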
Conclusion
  • This work revisits bit-sliced signatures and describes their use in a commercial search engine, which previously used inverted files.
  • Signature-based approaches introduce several challenges and the authors develop a set of techniques to reduce the memory footprint and to process queries quickly.
  • The authors derive a performance model that allows expressing the system configuration as an optimization problem.
  • The authors evaluate the key techniques behind BitFunnel experimentally and provide the source code publicly to accelerate advances in this area.
Tables
  • Table 1: Corpora
  • Table 2: Impact of BitFunnel Innovations
  • Table 3: Query Processing Performance
Reference
  • [1] Burton H Bloom. 1970. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 7 (1970), 422–426.
  • [2] Sergey Brin and Lawrence Page. 1998. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks 30, 1-7 (1998), 107–117.
  • [3] Jehoshua Bruck, Jie Gao, and Anxiao Jiang. 2006. Weighted Bloom filter. In 2006 IEEE International Symposium on Information Theory. IEEE.
  • [4] Stefan Büttcher, Charles LA Clarke, and Gordon V Cormack. 2016. Information retrieval: Implementing and evaluating search engines. MIT Press.
  • [5] Berkant Barla Cambazoglu and Ricardo A. Baeza-Yates. 2015. Scalability Challenges in Web Search Engines. Morgan & Claypool Publishers.
  • [6] J Shane Culpepper and Alistair Moffat. 2010. Efficient set intersection for inverted indexing. ACM Transactions on Information Systems (TOIS) 29, 1 (2010), 1.
  • [7] Bolin Ding and Arnd Christian König. 2011. Fast set intersection in memory. Proceedings of the VLDB Endowment 4, 4 (2011), 255–266.
  • [8] Chris Faloutsos. 1985. Access methods for text. ACM Computing Surveys (CSUR) 17, 1 (1985), 49–74.
  • [9] Christos Faloutsos. 1992. Information retrieval: data structures and algorithms. Prentice Hall PTR, 44–65.
  • [10] Christos Faloutsos and Stavros Christodoulakis. 1985. Design of a Signature File Method that Accounts for Non-Uniform Occurrence and Query Frequencies. In VLDB. 165–170.
  • [11] Edward Fox, Donna Harman, W. Lee, and Ricardo Baeza-Yates. 1992. Information retrieval: data structures and algorithms. Prentice Hall PTR, 28–43.
  • [12] Shlomo Geva and Christopher M De Vries. 2011. Topsig: Topology preserving document signatures. In Proceedings of the 20th ACM international conference on Information and knowledge management. ACM, 333–338.
  • [13] Andrew Kane and Frank Wm Tompa. 2014. Skewed partial bitvectors for list intersection. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval. ACM, 263–272.
  • [14] A Kent, Ron Sacks-Davis, and Kotagiri Ramamohanarao. 1990. A signature file scheme based on multiple organizations for indexing very large text databases. Journal of the American Society for Information Science 41, 7 (1990), 508.
  • [15] Donald E Knuth. 1998. The Art of Computer Programming, Vol. 3, Sorting and Searching (2nd ed.). Addison-Wesley, 567–573.
  • [16] Roberto Konow, Gonzalo Navarro, Charles LA Clarke, and Alejandro López-Ortíz. 2013. Faster and smaller inverted indices with treaps. In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval. ACM, 193–202.
  • [17] Daniel Lemire and Leonid Boytsov. 2015. Decoding billions of integers per second through vectorization. Software: Practice and Experience 45, 1 (2015), 1–29.
  • [18] Jimmy Lin, Matt Crane, Andrew Trotman, Jamie Callan, Ishan Chattopadhyaya, John Foley, Grant Ingersoll, Craig Macdonald, and Sebastiano Vigna. 2016. Toward reproducible baselines: The open-source IR reproducibility challenge. In European Conference on Information Retrieval. Springer, 408–420.
  • [19] Sergey Melnik, Sriram Raghavan, Beverly Yang, and Hector Garcia-Molina. 2001. Building a distributed full-text index for the Web. In Proceedings of the Tenth International World Wide Web Conference, WWW 10, Hong Kong, China, May 1-5, 2001. 396–406.
  • [20] Alistair Moffat and Justin Zobel. 1996. Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems (TOIS) 14, 4 (1996), 349–379.
  • [21] Calvin N Mooers. 1948. Application of random codes to the gathering of statistical information. Ph.D. Dissertation. Massachusetts Institute of Technology.
  • [22] Calvin N Mooers. 1951. Zatocoding applied to mechanical organization of knowledge. American Documentation 2, 1 (1951), 20–32.
  • [23] Giuseppe Ottaviano and Rossano Venturini. 2014. Partitioned Elias-Fano indexes. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval. ACM, 273–282.
  • [24] Knut Magne Risvik, Trishul M. Chilimbi, Henry Tan, Karthik Kalyanaraman, and Chris Anderson. 2013. Maguro, a system for indexing and searching over very large text collections. In Sixth ACM International Conference on Web Search and Data Mining, WSDM 2013, Rome, Italy, February 4-8, 2013. 727–736.
  • [25] Charles S Roberts. 1979. Partial-match retrieval via the method of superimposed codes. Proc. IEEE 67, 12 (1979), 1624–1642.
  • [26] Ron Sacks-Davis, A Kent, and Kotagiri Ramamohanarao. 1987. Multikey access methods based on superimposed coding techniques. ACM Transactions on Database Systems (TODS) 12, 4 (1987), 655–696.
  • [27] Harry KT Wong, Hsiu-Fen Liu, Frank Olken, Doron Rotem, and Linda Wong. 1985. Bit Transposed Files. In VLDB, Vol. 85.
  • [28] Justin Zobel, Alistair Moffat, and Kotagiri Ramamohanarao. 1998. Inverted files versus signature files for text indexing. ACM Transactions on Database Systems (TODS) 23, 4 (1998), 453–490.
Author
Bob Goodwin
Dan Luu
Alex Clemmer
Mihaela Curmei