# Monitoring distributed streams using convex decompositions

PVLDB, 2015.

Abstract:

Emerging large-scale monitoring applications rely on continuous tracking of complex data-analysis queries over collections of massive, physically-distributed data streams. Thus, in addition to the space- and time-efficiency requirements of conventional stream processing (at each remote monitor site), effective solutions also need to guara...

Introduction

- A task of increasing importance is the online monitoring of queries over continuous data streams.
- A fundamental query task in this setting is that of monitoring “threshold crossings”: given a function f() and a threshold T, the query is defined by “is f(v) ≤ T?”, where v is a dynamic vector that captures the current state of the streaming data.
- This query model has been used in numerous applications [13], either directly or as the main building block for other queries such as top-k and “heavy-hitter” items, quantiles, and so on [26].
- Given the streaming nature of the data, the authors are interested in continuous queries; that is, the query is “always there” and an alert must be issued whenever f(v) > T.
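
The threshold-crossing query described above can be sketched in a few lines. This is a minimal, centralized illustration only: it assumes (the text above does not state this) that v is the average of per-site local vectors, and the names `global_vector`, `threshold_crossed`, and `squared_norm` are hypothetical.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def global_vector(local_vectors: Sequence[Vector]) -> List[float]:
    """Combine per-site vectors into v (assumption: v is their average)."""
    n = len(local_vectors)
    dim = len(local_vectors[0])
    return [sum(vec[i] for vec in local_vectors) / n for i in range(dim)]


def threshold_crossed(f: Callable[[Vector], float], v: Vector, T: float) -> bool:
    """Return True when an alert is due, i.e. when f(v) > T."""
    return f(v) > T


def squared_norm(v: Vector) -> float:
    """Example monitored function (a hypothetical choice): squared L2 norm."""
    return sum(x * x for x in v)


if __name__ == "__main__":
    sites = [[1.0, 2.0], [3.0, 4.0]]  # one local vector per site
    v = global_vector(sites)
    print(v, threshold_crossed(squared_norm, v, T=12.0))
```

Re-evaluating the query this way after every update would require each site to ship its vector continuously, which is the communication cost that distributed monitoring methods try to avoid.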

Highlights

- We describe some basic notions of the Covering Spheres method, which will be used as a baseline to compare with the convex decomposition method; along the way, we will introduce some definitions which will be used in Section 3
- The question that naturally arises concerns the performance of the Covering Spheres method and of the Safe Zones defined by its convex admissible subset: we demonstrate how Covering Spheres can be drastically improved in some cases by intersecting far fewer half-spaces (Theorem 1), which yields a larger convex admissible subset
- We implemented and compared the convex decomposition and Covering Spheres methods on two data sets, which were used in [11, 8]: “Wcup”, which contains access logs to the soccer World Cup 1998 website, and “Cdad”, which contains SNMP requests of network users, such as number of packets and bytes from/to each user’s machine
- We presented a monitoring algorithm for a distributed stream system, which is based on a convex decomposition (CD) of a subset of the queried function’s domain, and proved that it always improves on the covering spheres (CS) method
- Future research will concentrate on applying convex decomposition to other functions and scenarios
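
To make the half-space intersection mentioned above concrete, the sketch below represents a convex zone as an intersection of half-spaces and tests local vectors against it. It is an illustration under assumptions, not the paper's construction: the `HalfSpace` type, `in_safe_zone`, `average`, and the example half-spaces are all hypothetical, and the global vector is assumed to be the average of the local vectors. Because the zone is convex, if every local vector lies inside it, so does their average.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class HalfSpace:
    """The set {x : <normal, x> <= offset} (hypothetical helper type)."""
    normal: Tuple[float, ...]
    offset: float

    def contains(self, x: Sequence[float]) -> bool:
        return sum(a * b for a, b in zip(self.normal, x)) <= self.offset


def in_safe_zone(x: Sequence[float], half_spaces: Sequence[HalfSpace]) -> bool:
    """A point is in the zone when it satisfies every half-space constraint."""
    return all(h.contains(x) for h in half_spaces)


def average(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Coordinate-wise mean of the given vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


if __name__ == "__main__":
    # Illustrative zone: x0 <= 4 and x1 <= 4.
    zone = [HalfSpace((1.0, 0.0), 4.0), HalfSpace((0.0, 1.0), 4.0)]
    local_vectors = [[1.0, 3.0], [3.0, 1.0]]
    # Each site can run this check on its own, without communication.
    print(all(in_safe_zone(v, zone) for v in local_vectors))
    # Convexity implies the average is inside as well.
    print(in_safe_zone(average(local_vectors), zone))
```

Fewer half-spaces means fewer constraints for a point to satisfy, which is one way to read the claim that intersecting fewer half-spaces gives a larger zone.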

Results

- We implemented and compared the CD and CS methods on two data sets, which were used in [11, 8]: “Wcup”, which contains access logs to the soccer World Cup 1998 website, and “Cdad”, which contains SNMP requests of network users, such as number of packets and bytes from/to each user’s machine. The data were collected from a corporate research center (IBM Watson) over several weeks.
- The parameters defining each run were the number of nodes (4 to 20), and ε and δ, which determine the accuracy and probabilistic guarantee level of the sketch, as well as its size (Section 4.1).
- The monitored function was f_N, and a threshold crossing is defined by a relative deviation of no more than θ from the current value.
- The authors fixed the sketch parameters to δ = 2^(-11) and ε = 0.05, and ran tests for different numbers of nodes and values of θ.
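
The text says that ε and δ determine the sketch's size but does not give the formula here. As a rough sketch only, assuming the commonly used AMS-style sizing of about 1/ε² counters per row and log2(1/δ) rows with all constants set to 1 (an assumption, not taken from the paper), the dimensions for the reported parameters can be computed as follows. The function name `sketch_dimensions` is hypothetical.

```python
import math
from typing import Tuple


def sketch_dimensions(epsilon: float, delta: float) -> Tuple[int, int]:
    """Return (width, depth) under the assumed sizing rule.

    width = ceil(1 / epsilon^2), depth = ceil(log2(1 / delta)).
    Constant factors are taken to be 1, which is an assumption.
    """
    if not (0.0 < epsilon < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    width = math.ceil(1.0 / epsilon ** 2)
    depth = math.ceil(math.log2(1.0 / delta))
    return width, depth


if __name__ == "__main__":
    width, depth = sketch_dimensions(epsilon=0.05, delta=2 ** -11)
    print(width, depth, width * depth)
```

Under this assumed rule, tightening ε is far more expensive than tightening δ, since the width grows quadratically in 1/ε while the depth grows only logarithmically in 1/δ.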

Conclusion

- The authors presented a monitoring algorithm for a distributed stream system, which is based on a convex decomposition (CD) of a subset of the queried function’s domain, and proved that it always improves on the covering spheres (CS) method.
- To apply CD, a non-redundant convex decomposition must first be constructed.
- The authors showed how to achieve this for an important family of queries, namely range and norm queries over distributed streams, for the general case in which both negative and positive updates of the frequency counts are allowed.
- Future research will concentrate on applying CD to other functions and scenarios
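
As a small illustration of why norm queries suit this kind of treatment, consider a lower-bound norm query that is admissible when ||v|| ≥ a. Its inadmissible region, the open ball of radius a, is a single convex set, so one half-space tangent to that ball already gives a convex subset of the admissible region. The sketch below builds that half-space from a reference point. The function name, the choice of reference point, and the example values are hypothetical, and this is not claimed to be the paper's exact construction.

```python
import math
from typing import Callable, Sequence


def tangent_half_space(
    reference: Sequence[float], radius: float
) -> Callable[[Sequence[float]], bool]:
    """Build a membership test for {x : <u, x> >= radius}, u = reference / ||reference||.

    By Cauchy-Schwarz, <u, x> <= ||x||, so every point of this half-space
    satisfies ||x|| >= radius and is therefore admissible.
    """
    norm = math.sqrt(sum(x * x for x in reference))
    if norm <= radius:
        raise ValueError("reference point must satisfy ||reference|| > radius")
    unit = [x / norm for x in reference]

    def contains(x: Sequence[float]) -> bool:
        return sum(u * xi for u, xi in zip(unit, x)) >= radius

    return contains


if __name__ == "__main__":
    contains = tangent_half_space(reference=[3.0, 4.0], radius=2.0)
    for point in ([3.0, 4.0], [2.0, 2.0], [0.0, 1.0], [-3.0, -4.0]):
        print(point, contains(point))
```

The last point, [-3.0, -4.0], is admissible (its norm is 5) yet falls outside the half-space: the zone is a conservative convex subset of the admissible region, not the whole of it.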

Funding

- This work was supported by the European Commission under ICT-FP7-FERARI-619491 (Flexible Event Processing for big Data Architectures)

References

- http://tinyurl.com/oh3xvp6. Technical report.
- N. Alon, P. B. Gibbons, Y. Matias, and M. Szegedy. Tracking join and self-join sizes in limited storage. In PODS, 1999.
- C. Arackaparambil, J. Brody, and A. Chakrabarti. Functional monitoring without monotonicity. In ICALP (1), 2009.
- B. Babcock and C. Olston. Distributed top-k monitoring. In SIGMOD ’03: Proceedings of the 2003 ACM SIGMOD international conference on Management of data, New York, NY, USA, 2003. ACM.
- A. Bar-Or, D. Keren, A. Schuster, and R. Wolff. Hierarchical decision tree induction in distributed genomic databases. IEEE Trans. Knowl. Data Eng., 17(8):1138–1151, 2005.
- S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
- S. Burdakis and A. Deligiannakis. Detecting outliers in sensor networks using the geometric approach. In ICDE, 2012.
- G. Cormode and M. N. Garofalakis. Approximate continuous querying over distributed streams. ACM Trans. Database Syst., 33(2), 2008.
- G. Cormode, S. Muthukrishnan, and K. Yi. Algorithms for distributed functional monitoring. In SODA, 2008.
- M. Gabel, A. Schuster, and D. Keren. Communication-efficient distributed variance monitoring and outlier detection for multivariate time series. In IEEE IPDPS, pages 37–47, 2014.
- M. N. Garofalakis, D. Keren, and V. Samoladas. Sketch-based geometric monitoring of distributed stream queries. PVLDB, 2013.
- N. Giatrakos, A. Deligiannakis, M. N. Garofalakis, I. Sharfman, and A. Schuster. Prediction-based geometric monitoring over distributed data streams. In SIGMOD, 2012.
- L. Golab and M. T. Ozsu. Issues in data stream management. SIGMOD Record, 32(2):5–14, 2003.
- R. Gupta, K. Ramamritham, and M. K. Mohania. Ratio threshold queries over distributed data sources. In ICDE, 2010.
- L. Huang, X. Nguyen, M. N. Garofalakis, J. M. Hellerstein, M. I. Jordan, A. D. Joseph, and N. Taft. Communication-efficient online detection of network-wide anomalies. In INFOCOM, 2007.
- S. R. Kashyap, J. Ramamirtham, R. Rastogi, and P. Shukla. Efficient constraint monitoring using adaptive thresholds. In ICDE, pages 526–535, 2008.
- R. Keralapura, G. Cormode, and J. Ramamirtham. Communication-efficient distributed monitoring of thresholded counts. In SIGMOD, 2006.
- D. Keren, I. Sharfman, A. Schuster, and A. Livne. Shape sensitive geometric monitoring. IEEE Trans. Knowl. Data Eng., 24(8), 2012.
- S. Meng, T. Wang, and L. Liu. Monitoring continuous state violation in datacenters: Exploring the time dimension. In ICDE, pages 968–979, 2010.
- S. Michel, P. Triantafillou, and G. Weikum. Klee: a framework for distributed top-k query algorithms. In VLDB ’05. VLDB Endowment, 2005.
- S. Shah and K. Ramamritham. Handling non-linear polynomial queries over dynamic data. In ICDE, 2008.
- I. Sharfman, A. Schuster, and D. Keren. A geometric approach to monitoring threshold functions over distributed data streams. In SIGMOD, 2006.
- I. Sharfman, A. Schuster, and D. Keren. A geometric approach to monitoring threshold functions over distributed data streams. ACM Trans. Database Syst., 32(4), 2007.
- Z. Wei and K. Yi. Beyond simple aggregates: indexing for summary queries. In PODS, pages 117–128, 2011.
- R. Wolff, K. Bhaduri, and H. Kargupta. A generic local algorithm for mining data streams in large distributed systems. IEEE Trans. on Knowl. and Data Eng., 21(4), 2009.
- K. Yi and Q. Zhang. Optimal tracking of distributed heavy hitters and quantiles. In PODS, 2009.
