Distribution patterns of over-represented k-mers in non-coding yeast DNA.

Steven Hampson,Dennis Kibler,Pierre Baldi

BIOINFORMATICS（2002）

引用 58|浏览9

暂无评分

摘要

Motivation: Over-represented k-mers in genomic DNA regions are often of particular biological interest. For example, over-represented k-mers in co-regulated families of genes are associated with the DNA binding sites of transcription factors. To measure over-representation, we introduce a statistical background model based on single-mismatches, and apply it to the pooled 500 bp ORF Upstream Regions (USRs) of yeast. More importantly, we investigate the context and spatial distribution of over-represented k-mers in yeast USRs. Results: Single and double-stranded spatial distributions of most over-rep resented k-mers are highly non-random, and predominantly cluster into a small number of classes that are robust with respect to over-representation measures. Specifically, we show that the three most common distribution patterns can be related to DNA structure, function, and evolution and correspond to: (a) homologous ORF clusters associated with sharply localized distributions; (b) regulatory elements associated with a symmetric broad hill-shaped distribution in the 50-200 bp USR; and (c) runs of As, Ts, and ATs associated with a broad hill-shaped distribution also in the 50-200 bp USR, with extreme structural properties. Analysis of over-representation, homology, localization, and DNA structure are essential components of a general data-mining approach to finding biologically important k-mers in raw genomic DNA and understanding the 'lexicon' of regulatory regions. Contact: hampson@ics.uci.edu; kibler@ics.uci.edu; pfbaldi@ics.uci.edu.

查看译文

关键词

genomic dna,data mining,transcription factor,dna structure

AI 理解论文

溯源树

样例

生成溯源树，研究论文发展脉络

Chat Paper

正在生成论文摘要