Activity Discovery: Sparse Motifs from Multivariate Time Series

msra(2008)

Citations: 25 | Views: 44
Abstract
In a set of time series or other sequence data, a motif is a collection of relatively short subsequences that exhibit high self-similarity yet are distinguishable from other subsequences of the data. Typically, the occurrence of a motif corresponds to some meaningful aspect of the data, such as a particular structure or binding site in biological sequences, a spoken word in speech data, or a specific robot behavior or response pattern. We address the problem of activity discovery, which deals with locating and modeling motifs in multivariate time series such as those captured by on-body sensors or by a video camera observing people engaged in some activity. We extend previous work in motif discovery to derive an algorithm that handles non-linear time warping and variable-length motifs, and that remains efficient even when the motif occurrences are sparse relative to the full dataset.

In bioinformatics, systems such as MEME [1] were developed to discover motifs in DNA and protein sequences, while Jensen et al. [4] recently generalized motif discovery over both categorical and continuous data and across arbitrary similarity metrics. These algorithms were developed for sequences, however, and do not account for the dynamic nature of time series data. Within the data mining community, an efficient, probabilistic algorithm for motif discovery using locality-sensitive hashing was developed [2]. This approach, however, discovers only fixed-length motifs in univariate data. Tanaka and Uehara generalized the approach to work with multivariate time series and to allow variable-length motifs, but not time warping [6]. Their solution is simply to apply a univariate algorithm to the first principal component of the time series, a transformation that will often mask many motifs. Finally, the PERUSE algorithm discovers motifs in multivariate time series and allows non-linear time warping and variable-length motifs [5]. This approach, however, assumes that the motifs are densely distributed and is not efficient for sparse data.

Our approach to activity discovery proceeds in three main phases: (1) motif seed discovery via analysis of a quantized representation, (2) seed refinement in the continuous domain, and (3) occurrence detection using probabilistic models trained from the refined seeds. Fundamentally, activity discovery is difficult because little is known about the motifs ahead of time. Specifically, the discovery system does not know the number of motifs, the location or length of the occurrences, or the shape of each motif. To deal with this lack of knowledge and still discover the motifs efficiently, our approach transforms the continuous, multivariate time series into strings of discrete symbols and uses a generalized suffix tree for linear-time subsequence searches [3]. Each unique subsequence of a user-specified length is used as a query to retrieve all of its occurrences in the dataset while allowing for dynamic time warping. The motif representing the most information, accounting for both the motif complexity and the number of occurrences, is selected and removed. This process repeats until the amount of information represented by the best motif is too small.

Our algorithm then refines this set of initial seed motifs. Refinement consists of merging, splitting, and temporal extension. In the splitting phase, the occurrences of each seed motif are analyzed using agglomerative clustering to determine whether the motif is actually a combination of two different motifs. The merging phase
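The seed-discovery idea described above (quantize the series, look up repeated symbol substrings, and compare candidates under time warping) can be illustrated with a rough Python sketch. This is not the authors' implementation. The uniform binning in `quantize`, the function and parameter names, and the value range `lo`/`hi` are all assumptions made for illustration. A plain dictionary index stands in for the generalized suffix tree, and the textbook dynamic time warping recurrence is applied to continuous frames rather than inside the discrete search.

```python
import math
from collections import defaultdict


def quantize(series, n_bins=4, lo=-1.0, hi=1.0):
    """Map each multivariate frame to a discrete symbol (a tuple of bin ids).

    Uniform binning over [lo, hi] is an assumption of this sketch; the
    abstract does not say how the quantized representation is built.
    """
    width = (hi - lo) / n_bins
    symbols = []
    for frame in series:
        # Clamp out-of-range values into the first or last bin.
        symbols.append(tuple(
            min(n_bins - 1, max(0, int((x - lo) / width))) for x in frame
        ))
    return symbols


def substring_index(symbols, length):
    """Group start positions by the symbol substring of the given length.

    A dictionary replaces the generalized suffix tree here; it finds exact
    repeats only and does not allow for time warping.
    """
    index = defaultdict(list)
    for start in range(len(symbols) - length + 1):
        index[tuple(symbols[start:start + length])].append(start)
    return index


def dtw_distance(a, b):
    """Dynamic time warping cost between two multivariate sequences.

    The local cost is the Euclidean distance between frames.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


# Toy univariate series (one value per frame) with a repeated shape.
series = [(0.0,), (0.9,), (-0.9,), (0.1,), (0.1,), (0.0,), (0.9,), (-0.9,)]
index = substring_index(quantize(series), length=3)
repeats = {pattern: starts for pattern, starts in index.items() if len(starts) > 1}

# The same shape at two speeds aligns with zero warping cost.
fast = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
slow = [(0.0, 0.0), (1.0, 1.0), (1.0, 1.0), (2.0, 0.0)]
warp_cost = dtw_distance(fast, slow)
```

In the toy data, the only repeated length-3 pattern starts at positions 0 and 5, and the stretched copy of the two-dimensional shape has a warping cost of zero. The dictionary index is the simplest stand-in that still retrieves every start position of a pattern; retrieval that tolerates warping, as the abstract describes, would need more than exact substring matching.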