Generating Features Using Burrows Wheeler Transformation For Biological Sequence Classification

BIOSTEC 2014: Proceedings of the International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 3 (2014)

Abstract
Recent advances in the biological sciences have made large amounts of sequence data available (both DNA and protein sequences). The annotation of biological sequence data can be approached using machine learning techniques. Such techniques require the input data to be represented as a vector of features. In the absence of biologically known features, a common approach is to generate k-mers using a sliding window. A larger k value usually results in better features; however, the number of possible k-mer features grows exponentially with k, and many of the k-mers are not informative. Feature selection techniques can be used to identify the most informative features, but they are computationally expensive when applied to the set of all k-mers, especially over the space of variable-length k-mers (which presumably capture the information in the data better). Instead of working with all k-mers, we propose to generate features using an approach based on the Burrows Wheeler Transformation (BWT). Our approach generates variable-length k-mers that represent a small subset of all k-mers. Experimental results on both DNA (alternative splicing prediction) and protein (protein localization) sequences show that the BWT features, combined with feature selection, result in models that are better than models learned directly from k-mers. This shows that the BWT-based approach to feature generation can be used to obtain informative variable-length features for DNA and protein prediction problems.
Keywords
Burrows Wheeler Transformation, Machine Learning, Supervised Learning, Feature Selection, Dimensionality Reduction, Biological Sequence Classification
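
The two building blocks named in the abstract, sliding-window k-mer generation and the Burrows Wheeler Transformation, can be illustrated with a minimal Python sketch. This is not the authors' method: the abstract does not say how variable-length features are derived from the BWT, so the sketch stops at computing the transform itself. The function names `kmer_counts` and `bwt`, the `$` sentinel, and the example sequences are assumptions made for illustration.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every length-k substring seen by a sliding window with step 1."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def bwt(seq: str, sentinel: str = "$") -> str:
    """Naive Burrows-Wheeler transform: last column of the sorted rotations.

    Assumes the sentinel does not occur in seq and sorts before every
    other symbol (true for "$" versus ASCII letters). Sorting all
    rotations is fine for a demonstration; long sequences would
    normally go through a suffix array instead.
    """
    if sentinel in seq:
        raise ValueError("sentinel must not occur in the sequence")
    s = seq + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


if __name__ == "__main__":
    dna = "ACGTACGA"  # hypothetical example sequence
    k = 3
    counts = kmer_counts(dna, k)
    print(counts)

    # Over the 4-letter DNA alphabet there are 4**k possible k-mers,
    # which is the exponential growth in k that the abstract refers to.
    print(f"{len(counts)} distinct {k}-mers observed out of {4 ** k} possible")

    print(bwt("banana"))  # annb$aa
    print(bwt(dna))
```

A fixed-k feature vector has one dimension per possible k-mer, so its size grows as 4^k for DNA and 20^k for proteins. The BWT, by contrast, is a permutation of the input sequence and so has the same length as the input plus the sentinel.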