Algorithms for data stream systems

Algorithms for data stream systems(2004)

引用 27|浏览17
暂无评分
摘要
In a growing number of information-processing applications, data takes the form of “continuous data streams” rather than traditional stored databases. These applications share several distinguishing features like the need for real time analysis, huge volumes of data, and unpredictable and bursty data arrivals. Application areas include network-traffic monitoring, sensor networks, and many more. For such applications, it is not feasible to simply load the arriving data into a traditional database management system (DBMS) or load it into main memory, and operate on it there. The failure to use these solutions precludes random access to data elements that is assumed in the traditional main memory model of computation. This leads us to the data stream model of computation, in which some or all of the input data that is to be operated on, is not available for random access from disk or memory, but rather arrives as one or more continuous data streams . Using a small amount of main memory and small update time, we are required to give continuous answers to queries posed over data streams. This thesis considers two sets of issues in the area of data-stream computation: algorithms and theory, and techniques for runtime resource allocation in systems designed for stream computations. In the area of algorithms we present space- and time-efficient algorithms for a number of problems in the sliding-window model. In this model only the last N elements, where N is the window size, are relevant to gathering statistics or answering queries. We also propose the computation of Hamming norms over data streams, which is a more general problem than the well-studied distinct value estimation problem, and present efficient algorithms for Hamming-norm computation in the presence of inserts and deletes. In the area of stream-systems we study two runtime resource allocation problems: operator scheduling and load shedding. We develop an operator-scheduling strategy called Chain that minimizes the total memory requirement of the system. We then study load shedding for aggregations queries, and present optimal algorithms that minimize the inaccuracy in query answers given a load constraint. We present a thorough experimental evaluation for both techniques.
更多
查看译文
关键词
continuous data stream,load constraint,input data,bursty data arrival,data element,Hamming-norm computation,data stream model,data stream system,data-stream computation,data stream,main memory
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要