Approximation algorithms for clustering streams and large data sets

Approximation algorithms for clustering streams and large data sets (2003)

Abstract
As data collection and storage become easier and cheaper, corporate and research data warehouses have proliferated and grown. In exploring how these vast resources can be used, designers of data mining algorithms face new problems that arise from the size and nature of many data sets. The amount of data that can be stored on a set of disks has far outstripped the quantity that can be processed in the main memory of a computer, and in many cases it is desirable to process information as it is generated and then discard all but a short synopsis. New data models have been developed that reflect the importance of processing a large amount of data using a small amount of space and time; they allow algorithm designers to evaluate the relative benefits of various approaches. We will discuss three such models, each with different algorithmic requirements: the Very Large Data Set model, which applies to static data sets stored on distributed systems but larger than main memory; the Data Stream model, which applies to static data sets larger than main memory that are stored on slow, linearly accessible devices; and the Sliding Window model, which assumes a data set that may be in the process of being generated, without any foreseen end, of which, at any moment, only a recent segment (which is much larger than main memory) is of interest. Clustering, or grouping data into representative subsets, is a tool commonly used in data analysis, and it can often make large data sets simpler to manipulate. For example, a search engine could cluster the pages returned by a query and then return a summary of the categories of pages found, enabling a user to narrow or change the search without reading through a long list of results. We will present algorithms for variants of the clustering problems known as k-Median and k-Center, both NP-hard problems that admit good approximation algorithms.
For each of the above-mentioned data models, we will present constant-factor approximation algorithms for various versions of k-Median and k-Center.
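To make the notion of a constant-factor approximation for k-Center concrete, here is a minimal sketch of the classical greedy farthest-first traversal, which is known to give a 2-approximation in metric spaces. It is a standard offline algorithm that keeps all points in memory, so it is not one of the streaming or sliding-window algorithms the abstract refers to. The function names, the use of Euclidean distance, and the sample points are assumptions made only for illustration.

```python
import math


def farthest_first_centers(points, k):
    """Pick up to k centers by greedy farthest-first traversal.

    Illustrative offline 2-approximation for metric k-Center; it assumes
    all points fit in memory and uses Euclidean distance.
    """
    if not points or k <= 0:
        return []
    centers = [points[0]]  # the first center is arbitrary
    # dist[i] is the distance from points[i] to its nearest chosen center
    dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < min(k, len(points)):
        # the next center is the point farthest from all current centers
        i = max(range(len(points)), key=dist.__getitem__)
        centers.append(points[i])
        dist = [min(d, math.dist(p, points[i])) for d, p in zip(dist, points)]
    return centers


def clustering_radius(points, centers):
    """k-Center objective: the largest distance from a point to its nearest center."""
    return max(min(math.dist(p, c) for c in centers) for p in points)


if __name__ == "__main__":
    pts = [(0, 0), (1, 0), (10, 0), (11, 0)]
    cs = farthest_first_centers(pts, 2)
    print(cs, clustering_radius(pts, cs))
```

On the four sample points with k = 2, the sketch picks (0, 0) and (11, 0) and reports a radius of 1.0. The data models described in the abstract rule out this approach as written, because it revisits every point once per center and stores the whole input.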
Keywords
data collection, data analysis, large data, large data set, clustering stream, data set, grouping data, above-mentioned data model, research data warehouse, approximation algorithm, static data, new data model, main memory