SageDB - A Learned Database System

    Ani Kristo
    Guillaume Leclerc
    Vikram Nathan

    CIDR, 2019.

    Keywords:
    cumulative distribution function, approximate query processing, data distribution, cost model, general purpose

    Abstract:

    Modern data processing systems are designed to be general purpose, in that they can handle a wide variety of different schemas, data types, and data distributions, and aim to provide efficient access to that data via the use of optimizers and cost models. This general purpose nature results in systems that do not take advantage of the cha...

    Introduction
    • Database systems have a long history of automatically selecting efficient algorithms, e.g., a merge join vs. a hash join, based on data statistics.
    • A simple C program that loads 100M integers into an array and sums over a range runs in about 300 ms on a modern desktop, whereas the same operation in a modern database (Postgres 9.6) takes about 150 seconds.
    • This represents a 500x overhead for a general-purpose design that isn’t aware of the specific data distribution.
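    The range summation and the overhead figure above can be illustrated with a minimal sketch. The source describes a C program over 100M integers; the Python version below is a scaled-down stand-in, and the array size, the range bounds, and the name `range_sum` are assumptions made only for illustration.

```python
from array import array


def range_sum(values, lo, hi):
    """Sum values[lo:hi] with a plain sequential scan (no index, no optimizer)."""
    total = 0
    for i in range(lo, hi):
        total += values[i]
    return total


# Hypothetical, scaled-down stand-in for the 100M-integer array in the text.
data = array("q", range(1_000_000))
result = range_sum(data, 250_000, 750_000)

# Overhead implied by the reported timings: about 150 s vs. about 300 ms.
overhead = 150_000 / 300  # both values in milliseconds
```

    The ratio of the two reported timings is what yields the 500x figure; the scan itself is only meant to show how little work the specialized program does.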
    Highlights
    • Database systems have a long history of automatically selecting efficient algorithms, e.g., a merge join vs. a hash join, based on data statistics.
    • We argue that learned components can fully replace core components of a database system such as index structures, sorting algorithms, or even the query executor.
    • Assuming we can perform a lookup in the cumulative distribution function in constant time, this makes the lookup of any key an O(1) operation, while traditional tree structures require O(log n) operations.
    • There is one caveat: machine learning is usually concerned with generalizability, not with fitting the empirical cumulative distribution function. Still, this thought experiment using a cumulative distribution function model shows how deeply such a model could be embedded into a database system and what benefits it could provide.
    • SageDB presents a radical new approach to building database systems, using ML models combined with program synthesis to generate system components.
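    The CDF-based lookup described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a single least-squares line as the CDF model, and the names `fit_linear_cdf` and `lookup` are hypothetical. Because a real model is only approximate, the sketch records the worst-case prediction error and finishes with a bounded binary search, so the cost depends on the error bound rather than on the number of keys.

```python
import bisect


def fit_linear_cdf(keys):
    """Fit position ~ slope * key + intercept over sorted keys (least squares).

    Returns (slope, intercept, err), where err bounds how far the truncated
    prediction can be from the true position of any stored key.
    """
    n = len(keys)
    mean_k = sum(keys) / n
    mean_p = (n - 1) / 2
    var = sum((k - mean_k) ** 2 for k in keys)
    cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(keys))
    slope = cov / var if var else 0.0
    intercept = mean_p - slope * mean_k
    worst = max(abs(i - (slope * k + intercept)) for i, k in enumerate(keys))
    return slope, intercept, int(worst) + 1


def lookup(keys, model, key):
    """Return the index of key in keys, or -1 if it is absent."""
    slope, intercept, err = model
    n = len(keys)
    guess = int(slope * key + intercept)
    # Clamp the search window to valid positions around the prediction.
    lo = min(max(0, guess - err), n)
    hi = max(lo, min(n, guess + err + 1))
    i = bisect.bisect_left(keys, key, lo, hi)
    return i if i < n and keys[i] == key else -1
```

    If the model matched the empirical CDF exactly, the error bound would shrink to a constant and the final search would touch only a fixed number of positions, which is the O(1) case the text refers to.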
    Results
    • In [19] the authors showed that the RMI model can significantly outperform state-of-the-art index structures and is surprisingly easy to train.
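    To make the RMI idea concrete, here is a simplified two-stage sketch. It is an assumption-laden illustration rather than the design from [19]: both stages use plain least-squares lines, the root model only routes a key to one of a fixed number of leaf models, and the class name `TwoStageRMI` and the parameter `num_leaves` are hypothetical.

```python
import bisect


def fit_line(xs, ys):
    """Least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    if n == 0:
        return 0.0, 0.0
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    if var == 0:
        return 0.0, my
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    return slope, my - slope * mx


class TwoStageRMI:
    """Root model picks a leaf model; the leaf predicts a position in keys."""

    def __init__(self, keys, num_leaves=4):
        self.keys = keys  # must be sorted
        self.num_leaves = num_leaves
        n = len(keys)
        # Stage 1: map each key to a leaf id by scaling its position.
        self.root = fit_line(keys, [i * num_leaves / n for i in range(n)])
        buckets = [[] for _ in range(num_leaves)]
        for i, k in enumerate(keys):
            buckets[self._leaf(k)].append((k, i))
        # Stage 2: one line per leaf, plus its worst-case error bound.
        self.leaves = []
        for b in buckets:
            slope, icpt = fit_line([k for k, _ in b], [i for _, i in b])
            worst = max((abs(i - (slope * k + icpt)) for k, i in b), default=0.0)
            self.leaves.append((slope, icpt, int(worst) + 1))

    def _leaf(self, key):
        slope, icpt = self.root
        return min(self.num_leaves - 1, max(0, int(slope * key + icpt)))

    def lookup(self, key):
        """Return the index of key in keys, or -1 if it is absent."""
        slope, icpt, err = self.leaves[self._leaf(key)]
        n = len(self.keys)
        guess = int(slope * key + icpt)
        lo = min(max(0, guess - err), n)
        hi = max(lo, min(n, guess + err + 1))
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < n and self.keys[i] == key else -1
```

    Each leaf only has to model a small slice of the key distribution, so its error bound, and with it the final search window, tends to be smaller than what a single global model would give.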
    Conclusion
    • This thought experiment using a CDF model shows how deep such a model could be embedded into a database system and what benefits it could provide.
    • Even with perfect information about the input, designing a near-optimal, low-complexity scheduling algorithm is extremely hard.
    • In such cases, learning data/query-specific algorithms might provide an interesting alternative.
    • SageDB presents a radical new approach to building database systems, using ML models combined with program synthesis to generate system components.
    • The authors presented initial results and a preliminary design that show the promise of these ideas, as well as a collection of future directions that highlight the significant research opportunity presented by the approach.
    Funding
    • The paper presents a vision of a new type of data processing system, one that specializes heavily to an application through code synthesis and machine learning.
    • It presents the authors' vision of SageDB, a new class of data management system that specializes itself to exploit the distributions of the data it stores and the queries it serves.
    • It argues that learned components can fully replace core components of a database system such as index structures, sorting algorithms, or even the query executor.
    • It argues that customization through learning is the most powerful form of customization and outlines how SageDB deeply embeds models into all algorithms and data structures, making the models the brain of the database.
    • The authors showed that the RMI model can significantly outperform state-of-the-art index structures and is surprisingly easy to train.
    Reference
    • Moore Law is Dead but GPU will get 1000X faster by 2025. https://tinyurl.com/y9uec4w6.
    • S. Agrawal, V. R. Narasayya, and B. Yang. Integrating vertical and horizontal partitioning into automated physical database design. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Paris, France, June 13-18, 2004, pages 359–370, 2004.
    • P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
    • S. Chaudhuri and V. R. Narasayya. An efficient cost-driven index selection tool for Microsoft SQL Server. In VLDB, 1997.
    • S. Chaudhuri and V. R. Narasayya. Self-tuning database systems: A decade of progress. In VLDB, 2007.
    • C. Curino, Y. Zhang, E. P. C. Jones, and S. Madden. Schism: a workload-driven approach to database replication and partitioning. PVLDB, 3(1):48–57, 2010.
    • B. K. Debnath, D. J. Lilja, and M. F. Mokbel. Sard: A statistical approach for ranking database tuning parameters. In 2008 IEEE 24th International Conference on Data Engineering Workshop, pages 11–18, April 2008.
    • A. Deshpande, C. Guestrin, S. Madden, J. M. Hellerstein, and W. Hong. Model-driven data acquisition in sensor networks. In (e)Proceedings of the Thirtieth International Conference on Very Large Data Bases, Toronto, Canada, August 31 - September 3 2004, pages 588–599, 2004.
    • S. Duan, V. Thummala, and S. Babu. Tuning database configuration parameters with iTuned. PVLDB, 2(1):1246–1257, 2009.
    • R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. In Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’01, pages 102–113, New York, NY, USA, 2001. ACM.
    • A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB, pages 518–529, 1999.
    • H. Gupta, V. Harinarayan, A. Rajaraman, and J. D. Ullman. Index selection for OLAP. In Proceedings of the Thirteenth International Conference on Data Engineering, April 7-11, 1997 Birmingham U.K., pages 208–219, 1997.
    • IBM Knowledge Center. Table partitioning and multidimensional clustering tables. https://www.ibm.com/support/knowledgecenter/en/SSEPGG_9.5.0/com.ibm.db2.luw.admin.partition.doc/doc/c0021605.html, 2018.
    • S. Idreos, K. Zoumpatianos, M. Athanassoulis, N. Dayan, B. Hentschel, M. S. Kester, D. Guo, L. M. Maas, W. Qin, A. Wasay, and Y. Sun. The periodic table of data structures. IEEE Data Eng. Bull., 41(3):64–75, 2018.
    • S. Idreos, K. Zoumpatianos, B. Hentschel, M. S. Kester, and D. Guo. The data calculator: Data structure design and cost synthesis from first principles and learned cost models. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA, June 10-15, 2018, pages 535–550, 2018.
    • M. I. Jordan and R. A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Comput., 6(2):181–214, Mar. 1994.
    • A. Kipf, T. Kipf, B. Radke, V. Leis, P. A. Boncz, and A. Kemper. Learned cardinalities: Estimating correlated joins with deep learning. CoRR, abs/1809.00677, 2018.
    • T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. CoRR, abs/1609.02907, 2016.
    • T. Kraska, A. Beutel, E. H. Chi, J. Dean, and N. Polyzotis. The case for learned index structures. In SIGMOD, pages 489–504, 2018.
    • S. Krishnan, Z. Yang, K. Goldberg, J. Hellerstein, and I. Stoica. Learning to optimize join queries with deep reinforcement learning, 2018.
    • H. Mao, M. Schwarzkopf, S. B. Venkatakrishnan, Z. Meng, and M. Alizadeh. Learning scheduling algorithms for data processing clusters. arXiv preprint arXiv:1810.01963, 2018.
    • R. Marcus and O. Papaemmanouil. Deep reinforcement learning for join order enumeration. In Proceedings of the First International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, aiDM@SIGMOD 2018, Houston, TX, USA, June 10, 2018, pages 3:1–3:4, 2018.
    • Oracle Help Center. Database VLDB and partitioning guide: Partitioning concepts. https://docs.oracle.com/cd/B28359_01/server.111/b32024/partition.htm, 2018.
    • J. Ortiz, M. Balazinska, J. Gehrke, and S. S. Keerthi. Learning state representations for query optimization with deep reinforcement learning. In Proceedings of the Second Workshop on Data Management for End-To-End Machine Learning, DEEM’18, pages 4:1–4:4, New York, NY, USA, 2018. ACM.
    • R. Pagh and F. F. Rodler. Cuckoo hashing. Journal of Algorithms, 51(2):122–144, 2004.
    • Y. Park, A. S. Tajik, M. Cafarella, and B. Mozafari. Database learning: Toward a database that becomes smarter every time. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD ’17, pages 587–602, New York, NY, USA, 2017. ACM.
    • A. Pavlo, E. P. C. Jones, and S. B. Zdonik. On predictive modeling for optimizing transaction execution in parallel OLTP systems. PVLDB, 5(2):85–96, 2011.
    • J. Rao, C. Zhang, N. Megiddo, and G. M. Lohman. Automating physical database design in a parallel database. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, Madison, Wisconsin, June 3-6, 2002, pages 558–569, 2002.
    • P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path selection in a relational database management system. In Proceedings of the 1979 ACM SIGMOD international conference on Management of data, pages 23–34. ACM, 1979.
    • D. Spielman and S. Teng. Smoothed analysis: Motivation and discrete models. In Workshop on Algorithms and Data Structures (WADS), 2003.
    • D. Spielman and S. Teng. Smoothed analysis: An attempt to explain the behavior of algorithms in practice. Comm. ACM, 52(10):76–84, 2009.
    • M. Staib and S. Jegelka. Distributionally robust deep learning as a generalization of adversarial training. In NIPS workshop on Machine Learning and Computer Security, 2017.
    • M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. O’Neil, P. O’Neil, A. Rasin, N. Tran, and S. Zdonik. C-Store: A column-oriented DBMS. In Proceedings of the 31st International Conference on Very Large Data Bases, VLDB ’05, pages 553–564. VLDB Endowment, 2005.
    • D. G. Sullivan, M. I. Seltzer, and A. Pfeffer. Using probabilistic reasoning to automate software tuning. SIGMETRICS Perform. Eval. Rev., 32(1):404–405, June 2004.
    • V. Thummala and S. Babu. iTuned: a tool for configuring and visualizing database parameters. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2010, Indianapolis, Indiana, USA, June 6-10, 2010, pages 1231–1234, 2010.
    • G. Valentin, M. Zuliani, D. C. Zilio, G. M. Lohman, and A. Skelley. DB2 advisor: An optimizer smart enough to recommend its own indexes. In Proceedings of the 16th International Conference on Data Engineering, San Diego, California, USA, February 28 - March 3, 2000, pages 101–110, 2000.
    • D. Van Aken, A. Pavlo, G. J. Gordon, and B. Zhang. Automatic database management system tuning through large-scale machine learning. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD ’17, pages 1009–1024, New York, NY, USA, 2017. ACM.
    • M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, pages 2–2. USENIX Association, 2012.