Sketch-based geometric monitoring of distributed stream queries

    PVLDB, pp. 937-948, 2013.

    Cited by: 57|Bibtex|Views6|Links
    EI
    Keywords:
    sketch-based geometric monitoringcommunication costremote sitemassive data streamphysically-distributed data streamMore(7+)
    Wei bo:
    We utilized AMS sketches, to previous work, but in a novel way; we developed a novel geometric method of dynamic balancing of error between remote sites, improving summarization at the sites before stream data has to be transferred over the network

    Abstract:

    Emerging large-scale monitoring applications rely on continuous tracking of complex data-analysis queries over collections of massive, physically-distributed data streams. Thus, in addition to the space- and time-efficiency requirements of conventional stream processing (at each remote monitor site), effective solutions also need to guara...More

    Code:

    Data:

    0
    Introduction
    • Traditional data-management systems are typically built on a pull-based paradigm, where users issue one-shot queries to static data sets residing on disk, and the system processes these queries and returns their results.
    • The vast majority of these applications are inherently distributed, with several remote monitor sites observing their local, high-speed data streams and exchanging information through a communication network
    • This distribution of the data naturally implies critical communication constraints that typically prohibit centralizing all the streaming data, due to either the huge volume of the data, or power and bandwidth restrictions.
    • Monitoring the precise value of such holistic queries without continuously centralizing all the data seems hopeless; luckily, when tracking statistical behavior and patters in large scale systems, approximate answers are typically sufficient
    • This often allows algorithms to effectively tradeoff efficiency with approximation quality
    Highlights
    • Traditional data-management systems are typically built on a pull-based paradigm, where users issue one-shot queries to static data sets residing on disk, and the system processes these queries and returns their results
    • An important requirement of large-scale event monitoring is the effective support for tracking complex, holistic queries that provide a global view of the data by combining and correlating information across the collection of remote monitor sites
    • The problem addressed in this paper is monitoring of massive, distributed streaming data
    • The recently proposed geometric method has been combined with AMS sketches towards reducing the communication cost of tracking complex aggregate queries over distributed streams with strict error bounds
    • We utilized AMS sketches, to previous work, but in a novel way; we developed a novel geometric method of dynamic balancing of error between remote sites, improving summarization at the sites before stream data has to be transferred over the network
    • Our experimental study with real-life data sets demonstrates the practical benefits of our approach, showing consistent gains of up to 35% in terms of total communication cost compared to the current state-of-the-art method [7]; our techniques demonstrate even more impressive benefits when focusing on the communication costs of data shipping in the system
    • The techniques we developed exhibit much improved performance compared to previous techniques but fail to scale performance-wise when the number of remote sites increases
    Results
    • Communication Cost.
    • The results presented measure the communication cost incurred by the methods.
    • In order to contrast better with the techniques of [7], the authors do not present absolute cost, by rather the cost scaled relative to the cost of the CG method.
    • In these experiments, the number of remote sites is 4.
    Conclusion
    • The problem addressed in this paper is monitoring of massive, distributed streaming data.
    • The authors utilized AMS sketches, to previous work, but in a novel way; the authors developed a novel geometric method of dynamic balancing of error between remote sites, improving summarization at the sites before stream data has to be transferred over the network.
    • A fruitful problem of future research will be to enhance the standard geometric method, adapting it to the particularities of sketchbased monitoring, in order to improve scalability
    • Another promising direction for extension is the adoption of dynamic predictive error models.
    • The authors intend to combine the techniques with other types of sketches from the literature and extend their applicability to new types of queries
    Summary
    • Introduction:

      Traditional data-management systems are typically built on a pull-based paradigm, where users issue one-shot queries to static data sets residing on disk, and the system processes these queries and returns their results.
    • The vast majority of these applications are inherently distributed, with several remote monitor sites observing their local, high-speed data streams and exchanging information through a communication network
    • This distribution of the data naturally implies critical communication constraints that typically prohibit centralizing all the streaming data, due to either the huge volume of the data, or power and bandwidth restrictions.
    • Monitoring the precise value of such holistic queries without continuously centralizing all the data seems hopeless; luckily, when tracking statistical behavior and patters in large scale systems, approximate answers are typically sufficient
    • This often allows algorithms to effectively tradeoff efficiency with approximation quality
    • Results:

      Communication Cost.
    • The results presented measure the communication cost incurred by the methods.
    • In order to contrast better with the techniques of [7], the authors do not present absolute cost, by rather the cost scaled relative to the cost of the CG method.
    • In these experiments, the number of remote sites is 4.
    • Conclusion:

      The problem addressed in this paper is monitoring of massive, distributed streaming data.
    • The authors utilized AMS sketches, to previous work, but in a novel way; the authors developed a novel geometric method of dynamic balancing of error between remote sites, improving summarization at the sites before stream data has to be transferred over the network.
    • A fruitful problem of future research will be to enhance the standard geometric method, adapting it to the particularities of sketchbased monitoring, in order to improve scalability
    • Another promising direction for extension is the adoption of dynamic predictive error models.
    • The authors intend to combine the techniques with other types of sketches from the literature and extend their applicability to new types of queries
    Funding
    • This work was partially supported by the European Commission under ICT-FP7- LIFT-255951 (Local Inference in Massively Distributed Systems)
    Reference
    • N. Alon, P. B. Gibbons, Y. Matias, and M. Szegedy. “Tracking Join and Self-Join Sizes in Limited Storage”. In ACM PODS, 1999.
      Google ScholarLocate open access versionFindings
    • N. Alon, Y. Matias, and M. Szegedy. “The Space Complexity of Approximating the Frequency Moments”. In ACM STOC, 1996.
      Google ScholarLocate open access versionFindings
    • B. Babcock and C. Olston. “Distributed Top-K Monitoring”. In ACM SIGMOD, 2003.
      Google ScholarLocate open access versionFindings
    • M. Charikar, K. Chen, and M. Farach-Colton. “Finding Frequent Items in Data Streams”. In ICALP, 2002.
      Google ScholarLocate open access versionFindings
    • D. Chu, A. Deshpande, J. M. Hellerstein, and W. Hong. “Approximate Data Collection in Sensor Networks using Probabilistic Models”. In IEEE ICDE, 2006.
      Google ScholarLocate open access versionFindings
    • G. Cormode and M. Garofalakis. Streaming in a connected world: querying and tracking distributed data streams. In ACM SIGMOD, 2007.
      Google ScholarLocate open access versionFindings
    • G. Cormode and M. Garofalakis. “Approximate Continuous Querying of Distributed Streams”. ACM TODS, 33(2), 2008.
      Google ScholarLocate open access versionFindings
    • G. Cormode, M. Garofalakis, S. Muthukrishnan, and R. Rastogi. “Holistic Aggregates in a Networked World: Distributed Tracking of Approximate Quantiles”. In ACM SIGMOD, 2005.
      Google ScholarLocate open access versionFindings
    • G. Cormode, M. Garofalakis, and D. Sacharidis. “Fast Approximate Wavelet Tracking on Streams”. In EDBT, 2006.
      Google ScholarLocate open access versionFindings
    • G. Cormode and S. Muthukrishnan. “What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically”. In ACM PODS, 2003.
      Google ScholarLocate open access versionFindings
    • G. Cormode and S. Muthukrishnan. “An improved data stream summary: The count-min sketch and its applications”. In Jrnl. of Algorithms, 55(1), 2005.
      Google ScholarLocate open access versionFindings
    • C. Cranor, T. Johnson, O. Spatscheck, and V. Shkapenyuk. “Gigascope: A Stream Database for Network Applications”. In ACM SIGMOD, 2003.
      Google ScholarLocate open access versionFindings
    • A. Das, S. Ganguly, M. Garofalakis, and R. Rastogi. “Distributed Set-Expression Cardinality Estimation”. In VLDB, 2004.
      Google ScholarLocate open access versionFindings
    • A. Deshpande, C. Guestrin, S. R. Madden, J. M. Hellerstein, and W. Hong. “Model-Driven Data Acquisition in Sensor Networks”. In VLDB, 2004.
      Google ScholarLocate open access versionFindings
    • A. Dobra, M. Garofalakis, J. Gehrke, and R. Rastogi. “Processing Complex Aggregate Queries over Data Streams”. In ACM SIGMOD, 2002.
      Google ScholarLocate open access versionFindings
    • S. Ganguly, M. Garofalakis, and R. Rastogi. “Processing Set Expressions over Continuous Update Streams”. In ACM SIGMOD, 2003.
      Google ScholarLocate open access versionFindings
    • N. Giatrakos, A. Deligiannakis, M. Garofalakis, I. Sharfman, and A. Schuster. “Prediction-based Geometric Monitoring over Distributed Data Streams”. In ACM SIGMOD, 2012.
      Google ScholarLocate open access versionFindings
    • P. B. Gibbons. “Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports”. In VLDB, 2001.
      Google ScholarLocate open access versionFindings
    • A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. J. Strauss. “How to Summarize the Universe: Dynamic Maintenance of Quantiles”. In VLDB, 2002.
      Google ScholarLocate open access versionFindings
    • A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. J. Strauss. “One-pass wavelet decomposition of data streams”. IEEE TKDE, 15(3), 2003.
      Google ScholarLocate open access versionFindings
    • M. B. Greenwald and S. Khanna. “Space-Efficient Online Computation of Quantile Summaries”. In ACM SIGMOD, 2001.
      Google ScholarLocate open access versionFindings
    • M. B. Greenwald and S. Khanna. “Power-Conserving Computation of Order-Statistics over Sensor Networks”. In ACM PODS, 2004.
      Google ScholarLocate open access versionFindings
    • A. Jain, E. Y. Chang, and Y.-F. Wang. “Adaptive stream resource management using Kalman Filters”. In ACM SIGMOD, 2004.
      Google ScholarLocate open access versionFindings
    • R. Keralapura, G. Cormode, and J. Ramamirtham. “Communication-efficient distributed monitoring of thresholded counts”. In ACM SIGMOD, 2006.
      Google ScholarLocate open access versionFindings
    • D. Keren, I. Sharfman, A. Schuster, and A. Livne. “Shape-Sensitive Geometric Monitoring”. IEEE TKDE, 24(8), 2012.
      Google ScholarLocate open access versionFindings
    • S. R. Madden, M. J. Franklin, J. M. Hellerstein, and W. Hong. “The Design of an Acquisitional Query Processor for Sensor Networks”. In ACM SIGMOD, 2003.
      Google ScholarLocate open access versionFindings
    • A. Manjhi, V. Shkapenyuk, K. Dhamdhere, and C. Olston. “Finding (Recently) Frequent Items in Distributed Data Streams”. In IEEE ICDE, 2005.
      Google ScholarLocate open access versionFindings
    • G. S. Manku and R. Motwani. “Approximate Frequency Counts over Data Streams”. In VLDB, 2002.
      Google ScholarLocate open access versionFindings
    • NII Shonan Workshop on Large-Scale Distributed Computation, Shonan Village, Japan, January 2012. http://www.nii.ac.jp/shonan/seminar011/.
      Findings
    • C. Olston, J. Jiang, and J. Widom. “Adaptive Filters for Continuous Queries over Distributed Data Streams”. In ACM SIGMOD, 2003.
      Google ScholarLocate open access versionFindings
    • I. Sharfman, A. Schuster, and D. Keren. “A geometric approach to monitoring threshold functions over distributed data streams”. In ACM SIGMOD, 2006.
      Google ScholarLocate open access versionFindings
    • N. Thaper, S. Guha, P. Indyk, and N. Koudas. “Dynamic Multidimensional Histograms”. In ACM SIGMOD, 2002.
      Google ScholarLocate open access versionFindings
    Your rating :
    0

     

    Tags
    Comments