Monitoring distributed streams using convex decompositions

    PVLDB, 2015.

    Cited by: 38|Bibtex|Views22|Links
    EI
    Keywords:
    datum analysisgeometric approachsphere methoddatum streamnumerous applicationMore(6+)
    Wei bo:
    We presented a monitoring algorithm for a distributed stream system, which is based on a convex decomposition of a subset of the queried function’s domain, and proved that it always improves on the covering spheres method

    Abstract:

    Emerging large-scale monitoring applications rely on continuous tracking of complex data-analysis queries over collections of massive, physically-distributed data streams. Thus, in addition to the space- and time-efficiency requirements of conventional stream processing (at each remote monitor site), effective solutions also need to guara...More

    Code:

    Data:

    0
    Introduction
    • A task of increasing importance is the online monitoring of queries over continuous data streams.
    • A fundamental query task in this setting is that of monitoring “threshold crossings”: Given a function f () and a threshold T , the query is defined by “is f (v) ≤ T ?”, where v is a dynamic vector that captures the current state of the streaming data
    • This query model has been used in numerous applications [13], either directly or as the main building block for other queries such as top-k and “heavy-hitter” items, quantiles, and so on [26].
    • Given the streaming nature of the data, the authors are, interested in continuous queries; that is, the query is “always there” and an alert must be issued whenever f (v) > T
    Highlights
    • A task of increasing importance is the online monitoring of queries over continuous data streams
    • We describe some basic notions of the Covering Spheres method, which will be used as a baseline to compare with the convex decomposition method; along the way, we will introduce some definitions which will be used in Section 3
    • The question that naturally arises has to do with the performance of the Covering Spheres method and the Safe Zones defined by its convex admissible subset Covering Spheres: we demonstrate how Covering Spheres can be drastically improved in some cases, by intersecting much fewer half-spaces (Theorem 1) in order to obtain a larger convex admissible subset
    • We implemented and compared the convex decomposition and Covering Spheres methods on two data sets, which were used in [11, 8]: “Wcup”, which contains access logs to the soccer World Cup 1998 website7, and “Cdad”, which contains SNMP requests of network users, such as number of packets and bytes from/to each user’s machine
    • We presented a monitoring algorithm for a distributed stream system, which is based on a convex decomposition (CD) of a subset of the queried function’s domain, and proved that it always improves on the covering spheres (CS) method
    • Future research will concentrate on applying convex decomposition to other functions and scenarios
    Results
    • We implemented and compared the CD and CS methods on two data sets, which were used in [11, 8]: “Wcup”, which contains access logs to the soccer World Cup 1998 website7, and “Cdad”, which contains SNMP requests of network users, such as number of packets and bytes from/to each user’s machine.8 The data were collected from a corporate research center (IBM Watson) over several weeks.
    • The parameters defining each run were the number of nodes (4 to 20), and ǫ and δ, which determine the accuracy and probabilistic guarantee level of the sketch, as well as its size (Section 4.1).
    • The monitored function was fN , and a threshold crossing is defined by a relative deviation of no more than θ from the current value.
    • The authors fixed sketch accuracy to δ = 2−11 and ǫ = 0.05, and ran tests for different values of node numbers and θ
    Conclusion
    • The authors presented a monitoring algorithm for a distributed stream system, which is based on a convex decomposition (CD) of a subset of the queried function’s domain, and proved that it always improves on the covering spheres (CS) method.
    • To apply CD, a non-redundant convex decomposition should be first constructed.
    • The authors showed how to achieve this for an important family of queries, namely range and norm queries over distributed streams, for the general case in which both negative and positive updates of the frequency counts are allowed.
    • Future research will concentrate on applying CD to other functions and scenarios
    Summary
    • Introduction:

      A task of increasing importance is the online monitoring of queries over continuous data streams.
    • A fundamental query task in this setting is that of monitoring “threshold crossings”: Given a function f () and a threshold T , the query is defined by “is f (v) ≤ T ?”, where v is a dynamic vector that captures the current state of the streaming data
    • This query model has been used in numerous applications [13], either directly or as the main building block for other queries such as top-k and “heavy-hitter” items, quantiles, and so on [26].
    • Given the streaming nature of the data, the authors are, interested in continuous queries; that is, the query is “always there” and an alert must be issued whenever f (v) > T
    • Results:

      We implemented and compared the CD and CS methods on two data sets, which were used in [11, 8]: “Wcup”, which contains access logs to the soccer World Cup 1998 website7, and “Cdad”, which contains SNMP requests of network users, such as number of packets and bytes from/to each user’s machine.8 The data were collected from a corporate research center (IBM Watson) over several weeks.
    • The parameters defining each run were the number of nodes (4 to 20), and ǫ and δ, which determine the accuracy and probabilistic guarantee level of the sketch, as well as its size (Section 4.1).
    • The monitored function was fN , and a threshold crossing is defined by a relative deviation of no more than θ from the current value.
    • The authors fixed sketch accuracy to δ = 2−11 and ǫ = 0.05, and ran tests for different values of node numbers and θ
    • Conclusion:

      The authors presented a monitoring algorithm for a distributed stream system, which is based on a convex decomposition (CD) of a subset of the queried function’s domain, and proved that it always improves on the covering spheres (CS) method.
    • To apply CD, a non-redundant convex decomposition should be first constructed.
    • The authors showed how to achieve this for an important family of queries, namely range and norm queries over distributed streams, for the general case in which both negative and positive updates of the frequency counts are allowed.
    • Future research will concentrate on applying CD to other functions and scenarios
    Funding
    • This work was supported by the European Commission under ICT-FP7-FERARI-619491 (Flexible Event Processing for big Data Architectures)
    Reference
    • http://tinyurl.com/oh3xvp6. Technical report.
      Findings
    • N. Alon, P. B. Gibbons, Y. Matias, and M. Szegedy. Tracking join and self-join sizes in limited storage. In PODS, 1999.
      Google ScholarLocate open access versionFindings
    • C. Arackaparambil, J. Brody, and A. Chakrabarti. Functional monitoring without monotonicity. In ICALP (1), 2009.
      Google ScholarLocate open access versionFindings
    • B. Babcock and C. Olston. Distributed top-k monitoring. In SIGMOD ’03: Proceedings of the 2003 ACM SIGMOD international conference on Management of data, New York, NY, USA, 2003. ACM.
      Google ScholarLocate open access versionFindings
    • A. Bar-Or, D. Keren, A. Schuster, and R. Wolff. Hierarchical decision tree induction in distributed genomic databases. IEEE Trans. Knowl. Data Eng., 17(8):1138–1151, 2005.
      Google ScholarLocate open access versionFindings
    • S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
      Google ScholarFindings
    • S. Burdakis and A. Deligiannakis. Detecting outliers in sensor networks using the geometric approach. In ICDE, 2012.
      Google ScholarLocate open access versionFindings
    • G. Cormode and M. N. Garofalakis. Approximate continuous querying over distributed streams. ACM Trans. Database Syst., 33(2), 2008.
      Google ScholarLocate open access versionFindings
    • G. Cormode, S. Muthukrishnan, and K. Yi. Algorithms for distributed functional monitoring. In SODA, 2008.
      Google ScholarLocate open access versionFindings
    • M. Gabel, A. Schuster, and D. Keren. Communication-efficient distributed variance monitoring and outlier detection for multivariate time series. In IEEE IPDPS, pages 37–47, 2014.
      Google ScholarLocate open access versionFindings
    • M. N. Garofalakis, D. Keren, and V. Samoladas. Sketch-based geometric monitoring of distributed stream queries. PVLDB, 2013.
      Google ScholarLocate open access versionFindings
    • N. Giatrakos, A. Deligiannakis, M. N. Garofalakis, I. Sharfman, and A. Schuster. Prediction-based geometric monitoring over distributed data streams. In SIGMOD, 2012.
      Google ScholarLocate open access versionFindings
    • L. Golab and M. T. Ozsu. Issues in data stream management. SIGMOD Record, 32(2):5–14, 2003.
      Google ScholarLocate open access versionFindings
    • R. Gupta, K. Ramamritham, and M. K. Mohania. Ratio threshold queries over distributed data sources. In ICDE, 2010.
      Google ScholarLocate open access versionFindings
    • L. Huang, X. Nguyen, M. N. Garofalakis, J. M. Hellerstein, M. I. Jordan, A. D. Joseph, and N. Taft. Communication-efficient online detection of network-wide anomalies. In INFOCOM, 2007.
      Google ScholarFindings
    • S. R. Kashyap, J. Ramamirtham, R. Rastogi, and P. Shukla. Efficient constraint monitoring using adaptive thresholds. In ICDE, pages 526–535, 2008.
      Google ScholarLocate open access versionFindings
    • R. Keralapura, G. Cormode, and J. Ramamirtham. Communication-efficient distributed monitoring of thresholded counts. In SIGMOD, 2006.
      Google ScholarLocate open access versionFindings
    • D. Keren, I. Sharfman, A. Schuster, and A. Livne. Shape sensitive geometric monitoring. IEEE Trans. Knowl. Data Eng., 24(8), 2012.
      Google ScholarLocate open access versionFindings
    • S. Meng, T. Wang, and L. Liu. Monitoring continuous state violation in datacenters: Exploring the time dimension. In ICDE, pages 968–979, 2010.
      Google ScholarLocate open access versionFindings
    • S. Michel, P. Triantafillou, and G. Weikum. Klee: a framework for distributed top-k query algorithms. In VLDB ’05. VLDB Endowment, 2005.
      Google ScholarLocate open access versionFindings
    • S. Shah and K. Ramamritham. Handling non-linear polynomial queries over dynamic data. In ICDE, 2008.
      Google ScholarLocate open access versionFindings
    • I. Sharfman, A. Schuster, and D. Keren. A geometric approach to monitoring threshold functions over distributed data streams. In SIGMOD, 2006.
      Google ScholarLocate open access versionFindings
    • I. Sharfman, A. Schuster, and D. Keren. A geometric approach to monitoring threshold functions over distributed data streams. ACM Trans. Database Syst., 32(4), 2007.
      Google ScholarLocate open access versionFindings
    • Z. Wei and K. Yi. Beyond simple aggregates: indexing for summary queries. In PODS, pages 117–128, 2011.
      Google ScholarLocate open access versionFindings
    • R. Wolff, K. Bhaduri, and H. Kargupta. A generic local algorithm for mining data streams in large distributed systems. IEEE Trans. on Knowl. and Data Eng., 21(4), 2009.
      Google ScholarLocate open access versionFindings
    • K. Yi and Q. Zhang. Optimal tracking of distributed heavy hitters and quantiles. In PODS, 2009.
      Google ScholarLocate open access versionFindings
    Your rating :
    0

     

    Tags
    Comments