Distribution patterns of over-represented k-mers in non-coding yeast DNA.

Bioinformatics (2002)

Abstract
Motivation: Over-represented k-mers in genomic DNA regions are often of particular biological interest. For example, over-represented k-mers in co-regulated families of genes are associated with the DNA binding sites of transcription factors. To measure over-representation, we introduce a statistical background model based on single mismatches, and apply it to the pooled 500 bp ORF Upstream Regions (USRs) of yeast. More importantly, we investigate the context and spatial distribution of over-represented k-mers in yeast USRs.
Results: Single- and double-stranded spatial distributions of most over-represented k-mers are highly non-random, and predominantly cluster into a small number of classes that are robust with respect to over-representation measures. Specifically, we show that the three most common distribution patterns can be related to DNA structure, function, and evolution and correspond to: (a) homologous ORF clusters associated with sharply localized distributions; (b) regulatory elements associated with a symmetric broad hill-shaped distribution in the 50-200 bp USR; and (c) runs of As, Ts, and ATs associated with a broad hill-shaped distribution also in the 50-200 bp USR, with extreme structural properties. Analyses of over-representation, homology, localization, and DNA structure are essential components of a general data-mining approach to finding biologically important k-mers in raw genomic DNA and understanding the 'lexicon' of regulatory regions.
Contact: hampson@ics.uci.edu; kibler@ics.uci.edu; pfbaldi@ics.uci.edu.
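The abstract does not state the exact statistic, so the following Python sketch is only one plausible reading of a single-mismatch background and of a spatial distribution. It assumes that the background for a k-mer is the mean count of its 3k single-mismatch neighbours, that a pseudocount of 1 is added to avoid division by zero, and that position is the string offset within each upstream region binned at a fixed width. All function names and the `bin_size` parameter are hypothetical and are not taken from the paper.

```python
from collections import Counter

ALPHABET = "ACGT"


def count_kmers(seqs, k):
    """Count every overlapping k-mer across the pooled sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def single_mismatch_neighbors(kmer):
    """Return all k-mers that differ from `kmer` at exactly one position."""
    out = []
    for i, base in enumerate(kmer):
        for b in ALPHABET:
            if b != base:
                out.append(kmer[:i] + b + kmer[i + 1:])
    return out


def over_representation(kmer, counts):
    """Ratio of the observed count to the mean single-mismatch neighbour count.

    Assumption: the +1.0 pseudocount and the use of a plain ratio are
    illustrative choices, not the paper's published measure.
    """
    neighbors = single_mismatch_neighbors(kmer)
    background = sum(counts[n] for n in neighbors) / len(neighbors)
    return counts[kmer] / (background + 1.0)


def position_histogram(kmer, seqs, bin_size=50):
    """Histogram of occurrence start offsets (overlaps included), by bin index."""
    hist = Counter()
    for s in seqs:
        start = s.find(kmer)
        while start != -1:
            hist[start // bin_size] += 1
            start = s.find(kmer, start + 1)
    return dict(hist)
```

With real data, `seqs` would hold the 500 bp upstream regions, and the histogram for a high-scoring k-mer could then be inspected for the sharply localized or broad hill-shaped patterns described in the Results.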
Keywords
genomic DNA, data mining, transcription factor, DNA structure