Optimization problems in data mining (2004)

Abstract
One natural, yet unusual, source of data is the set of queries performed on a database. We consider such queries to be reflective of data access patterns, and we use them to create indices on the data that are likely to be useful in minimizing the cost of answering future queries. We formalize the problem of finding these optimal indices under a constraint on the total amount of space available for storing them, give strong negative and positive performance bounds, and quantify the error in performance introduced by running the algorithm on a sample drawn from an unknown query distribution.

We investigate the problem of finding optimized-support association rules for a single numerical attribute, where the optimized region is a union of k disjoint intervals from the range of the attribute. We give the first polynomial-time algorithm for the problem of finding such a region maximizing support and meeting a cumulative confidence threshold. Experiments demonstrate that the best algorithm for a more constrained version of the problem suffers performance degradation on both synthetic and real-world data. We prove theoretical bounds on sufficient sample size to achieve a given performance level, and we validate convergence on synthetic and real-world data experimentally. We propose a natural greedy algorithm and analyze its performance.

We introduce a novel type of rule, in which claims of the form “our object ranked r or better in x of the last t time units” are formalized, and maximal claims of this form are defined under two natural partial orders. For the first, we give an efficient and optimal algorithm for finding all such claims. For the second, we give an algorithm whose running time is significantly more efficient than that of a naïve one. Finally, we connect this boasting problem to that of finding a sequence of optimized-confidence association rules, and give an efficient algorithm for solving a simplification of the problem.
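To make the optimized-support problem from the abstract concrete, the following is a minimal brute-force sketch for the simplest case of a single interval (k = 1). It is not the polynomial-time algorithm the abstract refers to. The bucketed input layout and the names `best_interval`, `support`, `hits`, and `min_conf` are assumptions introduced here for illustration only.

```python
from typing import List, Optional, Tuple


def best_interval(
    support: List[int], hits: List[int], min_conf: float
) -> Optional[Tuple[int, int, int]]:
    """Brute-force illustration for k = 1 (hypothetical helper).

    support[i] = number of records whose attribute value falls in bucket i
    hits[i]    = number of those records that also satisfy the rule's consequent

    Returns (lo, hi, total_support) for the contiguous bucket range [lo, hi]
    with the largest support whose cumulative confidence is at least
    min_conf, or None if no range qualifies.
    """
    n = len(support)
    best: Optional[Tuple[int, int, int]] = None
    for lo in range(n):
        sup = 0
        hit = 0
        for hi in range(lo, n):
            sup += support[hi]
            hit += hits[hi]
            # Confidence is hit / sup; compare without dividing.
            if sup > 0 and hit >= min_conf * sup:
                # Keep the first range found with strictly larger support.
                if best is None or sup > best[2]:
                    best = (lo, hi, sup)
    return best


if __name__ == "__main__":
    # Four equal-sized buckets; the middle two satisfy the consequent most often.
    print(best_interval([10, 10, 10, 10], [2, 9, 8, 0], 0.6))
```

This enumerates all O(n²) bucket ranges, which is fine for seeing what "maximize support subject to a cumulative confidence threshold" means but does not extend efficiently to a union of k disjoint intervals.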
Keywords
best algorithm, efficient algorithm, natural greedy algorithm, optimal algorithm, polynomial time algorithm, boasting problem, data access pattern, performance degradation, performance level, positive performance bound, data mining, optimization problem