Three-level caching for efficient query processing in large Web search engines

    WWW, pp. 257-266, 2005.

    Keywords:
    web search, search engine architecture, search engine query processing, inverted index, caching

    Abstract:

    Large web search engines have to answer thousands of queries per second with interactive response times. Due to the sizes of the data sets involved, often in the range of multiple terabytes, a single query may require the processing of hundreds of megabytes or more of index data. To keep up with this immense workload, large search engines...

    Introduction
    • Due to the rapid growth of the Web from a few thousand pages in 1993 to its current size of several billion pages, users increasingly depend on web search engines for locating relevant information.
    • Even with the construction of optimized index structures, each user query requires a significant amount of data processing on average.
    • To deal with this workload, search engines are typically implemented on large clusters of hundreds or thousands of servers, and techniques such as index compression, caching, and result presorting and query pruning are used to increase throughput and decrease overall cost.
    Highlights
    • In the first part of our experimental evaluation, we report results in terms of “logical” disk block accesses, including disk reads in query processing and disk writes for adding projections to the cache, but ignoring the caching of lists in main memory.
    • The costs are stated as the average number of blocks scanned for each query that is not filtered out by result caching. They do not include list caching, which would further improve performance.
    • We have proposed a new three-level caching architecture for web search engines that can improve query throughput.
    • The architecture introduces a new intermediate caching level for search engines with AND query semantics. This level can exploit redundancies in the query stream that are not captured by result and list caching in two-level architectures.
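
    The lookup order implied by the three levels can be sketched as follows. This is a toy model under stated assumptions, not the authors' implementation: the intermediate level is modelled as cached doc-ID intersections of term pairs, the `index` dictionary stands in for the on-disk inverted index, and the class and method names are hypothetical. Replacement policies and ranking are left out.

```python
from itertools import combinations


class ThreeLevelCache:
    """Toy lookup path: result cache -> pair cache -> list cache -> index."""

    def __init__(self, index):
        self.index = index        # term -> sorted doc IDs (stand-in for disk)
        self.results = {}         # level 1: frozenset(terms) -> result doc IDs
        self.pairs = {}           # level 2: frozenset({t1, t2}) -> shared doc IDs
        self.lists = {}           # level 3: term -> inverted list kept in memory
        self.disk_reads = 0       # number of lists fetched from the index

    def add_intersection(self, t1, t2):
        # Assumed to be built offline, so it is not counted as a query-time read.
        docs = set(self.index.get(t1, [])) & set(self.index.get(t2, []))
        self.pairs[frozenset((t1, t2))] = sorted(docs)

    def _list(self, term):
        if term not in self.lists:
            self.disk_reads += 1
            self.lists[term] = self.index.get(term, [])
        return self.lists[term]

    def query(self, terms):
        key = frozenset(terms)
        if key in self.results:               # level 1 hit: nothing to compute
            return self.results[key]
        remaining = set(key)
        candidates = None
        # Level 2: cover as many terms as possible with cached pairs.
        for pair in combinations(sorted(key), 2):
            cached = self.pairs.get(frozenset(pair))
            if cached is not None and remaining & set(pair):
                docs = set(cached)
                candidates = docs if candidates is None else candidates & docs
                remaining -= set(pair)
        # Level 3 (or the index) for terms not covered by any cached pair.
        for term in remaining:
            docs = set(self._list(term))
            candidates = docs if candidates is None else candidates & docs
        result = sorted(candidates or [])     # AND semantics: full intersection
        self.results[key] = result
        return result
```

    With the pair ("a", "b") cached, a query for a, b, and c only needs to fetch the list for c; repeating the same query is then answered from the result cache.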
    Results
    • Results for the greedy algorithm: the authors present results for the basic versions of the greedy and Landlord algorithms.
    • There are two different ways in which this approach could be used: (1) After analyzing the queries in the training window, the authors could preload the projection cache with the projections selected by the greedy algorithm. This could be done, say, once a day during the night in a large bulk operation in order to improve performance during peak hours. (2) Alternatively, the selected projections could be created only when the corresponding pair is encountered in the evaluation window.
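
    A minimal sketch of how projections might be selected from a training window is given below. The summary does not spell out the greedy algorithm, so the scoring rule here (estimated saving per unit of cache space) is an assumption, as are the function name and the `list_len`, `pair_len`, and `budget` parameters.

```python
from collections import Counter
from itertools import combinations


def greedy_select(training_queries, list_len, pair_len, budget):
    """Choose term pairs to cache, best estimated saving per unit of space first.

    training_queries: iterable of term collections from the training window
    list_len: term -> length of its inverted list (hypothetical cost unit)
    pair_len: frozenset({t1, t2}) -> size of the cached projection for the pair
    budget: total cache space, in the same unit as the sizes above
    """
    # Count how often each term pair co-occurs in a training query.
    freq = Counter()
    for query in training_queries:
        for pair in combinations(sorted(set(query)), 2):
            freq[frozenset(pair)] += 1

    def score(pair):
        t1, t2 = tuple(pair)
        size = pair_len[pair]
        # Assumed saving: scan the projection instead of both full lists.
        saved = list_len[t1] + list_len[t2] - size
        return freq[pair] * saved / size

    # Only pairs with a known, non-empty projection are candidates.
    candidates = [p for p in freq if pair_len.get(p, 0) > 0]

    chosen, used = [], 0
    for pair in sorted(candidates, key=score, reverse=True):
        size = pair_len[pair]
        if used + size <= budget:
            chosen.append(pair)
            used += size
    return chosen
```

    In usage (1) above, the returned pairs would be materialized in one bulk operation; in usage (2), the same list would serve as a whitelist that is consulted when a pair first appears in the evaluation window.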
    Conclusion
    • For more background on indexing and query execution in IR and search engines, see [3, 5, 37].
    • One could study approximation results for the greedy heuristic or competitive ratios for the Landlord approach in this scenario, or look at the case where the cost of generating projections is included in the corresponding weighted caching problem.
    • Another interesting theoretical question concerns the performance of caching schemes on certain classes of input sequences, e.g., sequences that follow Zipf distributions on term frequencies.
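
    For readers unfamiliar with Landlord, the sketch below shows the general credit-based scheme for weighted caching with item sizes and costs. It is a textbook-style illustration, not the variant evaluated by the authors; the class name, the choice to reset credit to the full cost on a hit, and the small tolerance used for eviction are assumptions. Sizes are assumed to be positive.

```python
class LandlordCache:
    """Credit-based replacement for items with different sizes and costs."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.items = {}  # key -> [size, cost, credit]

    def access(self, key, size, cost):
        """Return True on a hit and False on a miss (inserting if it fits)."""
        if key in self.items:
            entry = self.items[key]
            entry[2] = entry[1]          # on a hit, restore credit to full cost
            return True
        if size > self.capacity:
            return False                 # item can never fit; do not cache it
        while self.used + size > self.capacity:
            # Charge every item "rent" proportional to its size, just enough
            # to drive the cheapest item (per unit of size) down to zero.
            delta = min(credit / sz for sz, _, credit in self.items.values())
            for k in list(self.items):
                entry = self.items[k]
                entry[2] -= delta * entry[0]
                if entry[2] <= 1e-9 * entry[1]:   # credit exhausted: evict
                    self.used -= entry[0]
                    del self.items[k]
        self.items[key] = [size, cost, cost]      # new item starts at full credit
        self.used += size
        return False
```

    In a projection cache, `size` would correspond to the space a projection occupies and `cost` to the work saved when it is present; both mappings are illustrative here.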
    Tables
    • Table 1: Cost of online projection creation in 4KB block writes per query, for various amounts of cache space in GB
    References
    • V. Anh, O. Kretser, and A. Moffat. Vector-space ranking with effective early termination. In Proc. of the 24th Annual SIGIR Conf. on Research and Development in Information Retrieval, pages 35–42, Sept. 2001.
    • V. Anh and A. Moffat. Compressed inverted files with reduced decoding overheads. In Proc. 21st Annual SIGIR Conf. on Research and Development in Information Retrieval, pages 290–297, 1998.
    • A. Arasu, J. Cho, H. Garcia-Molina, and S. Raghavan. Searching the web. ACM Transactions on Internet Technologies, 1(1), June 2001.
    • C. Badue, R. Baeza-Yates, B. Ribeiro-Neto, and N. Ziviani. Distributed query processing using partitioned inverted files. In Proc. of the 9th String Processing and Information Retrieval Symposium (SPIRE), Sept. 2002.
    • R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, 1999.
    • D. Bahle, H. Williams, and J. Zobel. Efficient phrase querying with an auxiliary index. In Proc. of the 25th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 215–221, 2002.
    • B. Bhattacharjee, S. Chawathe, V. Gopalakrishnan, P. Keleher, and B. Silaghi. Efficient peer-to-peer searches using result-caching. In Proc. of the 2nd Int. Workshop on Peer-to-Peer Systems, 2003.
    • E. Brewer. Lessons from giant scale services. IEEE Internet Computing, pages 46–55, August 2001.
    • S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Proc. of the Seventh World Wide Web Conference, 1998.
    • A. Broder. On the resemblance and containment of documents. In Compression and Complexity of Sequences, pages 21–29. IEEE Computer Society, 1997.
    • P. Cao and S. Irani. Cost-aware WWW proxy caching algorithms. In USENIX Symp. on Internet Technologies and Systems (USITS), 1997.
    • S. Chaudhuri and L. Gravano. Optimizing queries over multimedia repositories. Data Engineering Bulletin, 19(4):45–52, 1996.
    • E. Demaine, A. Lopez-Ortiz, and J. Munro. Adaptive set intersections, unions, and differences. In Proc. of the 11th Annual ACM-SIAM Symp. on Discrete Algorithms, pages 743–752, 2000.
    • R. Fagin. Combining fuzzy information from multiple systems. In Proc. of ACM Symp. on Principles of Database Systems, 1996.
    • R. Fagin, D. Carmel, D. Cohen, E. Farchi, M. Herscovici, Y. Maarek, and A. Soffer. Static index pruning for information retrieval systems. In Proc. of the 24th Annual SIGIR Conf. on Research and Development in Information Retrieval, pages 43–50, Sept. 2001.
    • R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. In Proc. of ACM Symp. on Principles of Database Systems, 2001.
    • M. Garey and D. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. WH Freeman and Company, 1979.
    • T. Haveliwala. Topic-sensitive pagerank. In Proc. of the 11th Int. World Wide Web Conference, May 2002.
    • B. T. Jonsson, M. J. Franklin, and D. Srivastava. Interaction of query evaluation and buffer management for information retrieval. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 118–129, June 1998.
    • M. Kaszkiel, J. Zobel, and R. Sacks-Davis. Efficient passage ranking for document databases. ACM Transactions on Information Systems (TOIS), 17(4):406–439, Oct. 1999.
    • R. Lempel and S. Moran. Optimizing result prefetching in web search engines with segmented indices. In Proc. of the 28th Int. Conf. on Very Large Data Bases, Aug. 2002.
    • R. Lempel and S. Moran. Predictive caching and prefetching of query results in search engines. In Proc. of the 12th Int. World-Wide Web Conference, 2003.
    • J. Li, B. Loo, J. Hellerstein, F. Kaashoek, D. Karger, and R. Morris. On the feasibility of peer-to-peer web indexing. In Proc. of the 2nd Int. Workshop on Peer-to-Peer Systems, 2003.
    • X. Long and T. Suel. Optimized query execution in large search engines with global page ordering. In Proc. of the 29th Int. Conf. on Very Large Data Bases, September 2003.
    • E. Markatos. On caching search engine query results. In 5th International Web Caching and Content Delivery Workshop, May 2000.
    • N. Megiddo and D. Modha. Outperforming LRU with an adaptive replacement cache. IEEE Computer, pages 58–65, April 2004.
    • S. Melnik, S. Raghavan, B. Yang, and H. Garcia-Molina. Building a distributed full-text index for the web. In Proc. of the 10th Int. World Wide Web Conference, May 2000.
    • M. Persin, J. Zobel, and R. Sacks-Davis. Filtered document retrieval with frequency-sorted indexes. Journal of the American Society for Information Science, 47(10):749–764, May 1996.
    • M. Richardson and P. Domingos. The intelligent surfer: Probabilistic combination of link and content information in pagerank. In Advances in Neural Information Processing Systems, 2002.
    • K. Risvik, Y. Aasheim, and M. Lidal. Multi-tier architecture for web search engines. In First Latin American Web Congress, pages 132–143, 2003.
    • K. Risvik and R. Michelsen. Search engines and web dynamics. Computer Networks, 39:289–302, 2002.
    • P. Saraiva, E. de Moura, N. Ziviani, W. Meira, R. Fonseca, and B. Ribeiro-Neto. Rank-preserving two-level caching for scalable search engines. In Proc. of the 24th Annual SIGIR Conf. on Research and Development in Information Retrieval, pages 51–58, Sept. 2001.
    • F. Scholer, H. Williams, J. Yiannis, and J. Zobel. Compression of inverted indexes for fast query evaluation. In Proc. of the 25th Annual SIGIR Conf. on Research and Development in Information Retrieval, pages 222–229, 2002.
    • V. Shkapenyuk and T. Suel. Design and implementation of a high-performance distributed web crawler. In Proc. of the Int. Conf. on Data Engineering, 2002.
    • T. Suel, C. Mathur, J. Wu, J. Zhang, A. Delis, M. Kharrazi, X. Long, and K. Shanmugasundaram. ODISSEA: A peer-to-peer architecture for scalable web search and information retrieval. In International Workshop on the Web and Databases (WebDB), June 2003.
    • A. Tomasic and H. Garcia-Molina. Performance of inverted indices in distributed text document retrieval systems. In Proc. of the 2nd Int. Conf. on Parallel and Distributed Information Systems (PDIS), 1993.
    • I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, second edition, 1999.
    • Y. Xie and D. O’Hallaron. Locality in search engine queries and its implications for caching. In IEEE Infocom 2002, pages 1238–1247, 2002.
    • N. Young. On-line file caching. In Proc. of the 9th Annual ACM-SIAM Symp. on Discrete Algorithms, pages 82–86, 1998.
    Best Paper
    Best Paper of WWW, 2005