Identifying and characterising anomalies in data (2012)
Abstract
Increased storage and processing capacity has made collecting and storing data much easier. A whole process is still needed, however, to acquire knowledge from the data or to exploit it in an application. One of the cornerstones of this process is data mining, a domain in which algorithms are designed to summarise the data in models and to discover unexpected relationships, or patterns, in the data. To regain a grip on the ever-growing amounts of data, these models and patterns need to be both useful and understandable to the data owner. In this dissertation we develop data mining techniques that build models from the available data with limited human effort, with the aim of accurately identifying anomalies (observations deviating from the expected norm) in (new) data and characterising them in an understandable way. Since in practice we are confronted with a variety of potential applications and different types of data, it is unrealistic to develop a comprehensive approach in which all requirements are simultaneously met, optimised and validated at every step of the process. Throughout this thesis we therefore focus on selected aspects of this problem. Moreover, we assume that, for building the model of normal behaviour from the data, we mainly have examples of the expected situation. After a brief general introduction in Chapter 1, we explore two specific real-world problems in Chapters 2 and 3. In Chapter 2, we present an algorithm to detect anomalies during the monitoring of production processes in a chemical plant. In Chapter 3, we initiate the data-driven identification of vandalism in Wikipedia. Identifying anomalies alone is not enough, however: we also want a description that explains why an observation is regarded as unexpected. In Chapter 4, we provide the necessary insight by explicitly using a limited number of patterns that describe the normal expectations and highlighting how the current observation differs from them. Collecting such compact descriptions directly and efficiently from the data is the subject of Chapter 5.
Keywords
characterising anomaly, current observation, subject Chapter, data mining, available data, data-driven identification, production process, data owner, brief general introduction Chapter, whole process, data mining technique