Contiguity: Contig adjacency graph construction and visualisation

mag(2015)

Abstract
Contiguity is interactive software for the visualization and manipulation of de novo genome assemblies. Contiguity creates and displays information on contig adjacency, which is contextualized by the simultaneous display of a comparison between assembled contigs and a reference sequence. Where scaffolders allow unambiguous connections between contigs to be resolved into a single scaffold, Contiguity allows the user to create all potential scaffolds in ambiguous regions of the genome. This enables the resolution of novel sequence or structural variants from the assembly. In addition, Contiguity provides a sequencing- and assembly-agnostic approach for the creation of contig adjacency graphs. To maximize the number of contig adjacencies determined, Contiguity combines information from read pair mappings, sequence overlap and De Bruijn graph exploration. We demonstrate how highly sensitive graphs can be achieved using this method. Contig adjacency graphs allow the user to visualize potential arrangements of contigs in unresolvable areas of the genome. By combining adjacency information with comparative genomics, Contiguity provides an intuitive approach for exploring and improving sequence assemblies. It is also useful in guiding manual closure of long read sequence assemblies. Contiguity is an open source application, implemented using Python and the Tkinter GUI package, that can run on Unix, OSX and Windows operating systems. It has been designed and optimized for bacterial assemblies. Contiguity is available at http://mjsull.github.io/Contiguity.

Introduction

The emergence of high-throughput sequencing technologies has led to a massive increase in the number of unassembled or draft bacterial genome sequence data sets [1].
De novo assembly of sequencing reads produced using high-throughput sequencing methods often results in highly fragmented assemblies containing hundreds of contiguous sequences (contigs). Although long reads, such as those produced by Pacific Biosciences' single molecule real time sequencing (SMRT), significantly reduce fragmentation in bacterial genome assemblies, they frequently do not assemble into a single contig [2]. Consequently, contig ordering, scaffolding, identification of spurious or misassembled contigs and comparative analysis of an assembly all remain time-limiting steps during the analysis of a de novo assembly.

PeerJ PrePrints | https://dx.doi.org/10.7287/peerj.preprints.1037v1 | CC-BY 4.0 Open Access | rec: 4 May 2015, publ: 4 May 2015

Several tools exist that allow easy visualization of pairwise or multiple alignments, including Easyfig [3], Artemis Comparison Tool [4], genoPlotR [5], Integrative Genomics Viewer [6] and Mauve [7]. These tools allow the rapid identification of structural variations between two sequences, such as rearrangements, insertions and deletions. Many of these events may be biologically important and can be a result of prophages, plasmids and other mobile genetic elements. Such events account for much of the variation in bacterial species such as Escherichia coli [8]. However, mobile genetic elements are relatively difficult to resolve in draft or metagenome assemblies, primarily due to an abundance of insertion sequences within these elements that result in collapsed repeats and a lack of specific information about contig adjacency. Mobile genetic elements often assemble into several contigs, making it unclear whether several contigs with novel sequence are part of the same mobile genetic element or belong to several distinct elements.
In theory, mobile genetic elements and other difficult-to-assemble genomic regions can be reconstructed by examining contig interconnectivity within an assembly. By determining which contigs are adjacent to one another in the underlying assembly graph, potential arrangements of those contigs in the context of the complete genome can be determined. This allows the use of synteny to contextualise sequence that is not present in complete reference genomes and can also help determine the sequence of genomic regions that span multiple contigs. Adjacency information can also be used to group contigs into distinct elements, such as chromosomal and extra-chromosomal DNA. This approach is used by PLACNET [9] to identify plasmid contigs in de novo assembled genomes. PLACNET creates an undirected graph of contig adjacencies that can be visualized with a tool such as Cytoscape [10]. Using such an approach, specific information about the order and orientation of contigs in the plasmid, relative to one another, cannot be inferred. Several methods exist for finding interconnectivity between contigs, such as looking at paired-end reads shared by contigs in a de novo assembly or using transcript data. This information can be leveraged by scaffolding algorithms, such as SOPRA [11] and SSPACE [12], to improve de novo assemblies by joining connected contigs where no ambiguity exists. However, scaffolding can introduce errors into assemblies and provides no information about potential adjacencies between contigs in regions that cannot be resolved, such as repetitive regions of the chromosome. Interconnectivity can also be visualized using programs such as Consed [13], Phrapview [14], Abyss-explorer [15], TGnet [16], ContigScape [17] and Bandage [18].
Consed and Phrapview display a linear relationship between contigs, with connections between contigs being inferred from paired reads. Abyss-explorer, TGnet and ContigScape display assemblies as a directed graph. Abyss-explorer infers connectivity from graph information and read pair information provided by the De Bruijn assembler Abyss [19]. TGnet finds adjacencies using transcript information, and ContigScape infers adjacencies by identifying reads shared between contigs assembled by the "Newbler Assembler" or connectivity using paired reads. Bandage can be used to visualize the LastGraph file produced by the Velvet assembler [20], FASTG files and Trinity.fasta files produced by the RNA-seq assembler Trinity [21].

The methods described above are limited to creating graphs from specific data types that are not always available to the end user. Alternatively, they require the use of a specific assembly program, which may result in a suboptimal assembly. Graphs based on the output of an assembler also prevent the user from performing additional optimization of their assemblies, such as scaffolding or misassembly correction. Assemblies often result in hundreds of contigs, with each contig typically having between 2 and 4 connections to other contigs. Although small assemblies can be displayed concisely, as assembly size grows, visual representations of the graph can quickly become cluttered, making it difficult to extract meaningful information.

Contiguity makes contig adjacency graphs more accessible to users unfamiliar with the concept. Contiguity adds sequence comparison to the visualization of contig adjacency graphs. This allows the user to contextualize contig adjacency information with similarity and the order inferred from a reference sequence. The user can quickly and easily identify genome rearrangements, insertions, deletions and potential misassemblies.
Contiguity includes a purpose-built contig adjacency graph creation algorithm that combines existing approaches and allows adjacency graphs to be built from any assembly, irrespective of sequencing or assembly method. In addition to a description of Contiguity's core functionality (the construction and representation of contig adjacency graphs), we also provide two case studies of existing projects in which Contiguity has been used to improve assembly and elucidate structural rearrangements. Contiguity is an open source project implemented in Python using the Tkinter graphical user interface library, and it is available on Windows, OSX and GNU/Linux.

Methods and results

Contiguity overview

Contiguity is designed to enable the visualization and organization of de novo assemblies. It allows both comparison information and contig adjacency graph information to be visualized simultaneously using the same BLAST comparison format used in tools such as Artemis Comparison Tool [4] and Easyfig [3] (Figure 1).
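
The idea of a contig adjacency graph that pools evidence from several sources can be illustrated with a small sketch. The Python code below is not taken from Contiguity; the function names (`flip`, `build_adjacency_graph`, `candidate_paths`), the `"c1+"` / `"c1-"` notation for oriented contigs and the evidence labels are all assumptions made for illustration. It records which evidence types support each directed adjacency, and then lists every candidate arrangement of contigs between two anchors in an ambiguous region.

```python
from collections import defaultdict

# Illustrative sketch only: names and data format are hypothetical,
# not Contiguity's actual implementation or file format.


def flip(oriented):
    """Return the opposite orientation of a contig, e.g. 'c1+' -> 'c1-'."""
    return oriented[:-1] + ("-" if oriented.endswith("+") else "+")


def build_adjacency_graph(evidence):
    """Merge (left, right, source) records into a directed graph.

    graph[left][right] is the set of evidence sources supporting the
    adjacency. Each adjacency is also stored for the opposite strand,
    because left+ -> right+ implies right- -> left-.
    """
    graph = defaultdict(lambda: defaultdict(set))
    for left, right, source in evidence:
        graph[left][right].add(source)
        graph[flip(right)][flip(left)].add(source)
    return graph


def candidate_paths(graph, start, end, max_contigs=10):
    """List simple paths from start to end, using each contig at most once.

    Using each contig once is a simplification: a real repeat contig may
    occur several times in the genome.
    """
    paths = []
    stack = [[start]]
    while stack:
        path = stack.pop()
        node = path[-1]
        if node == end and len(path) > 1:
            paths.append(path)
            continue
        if len(path) >= max_contigs:
            continue
        used = {p[:-1] for p in path}  # contig names without orientation
        for nxt in graph.get(node, {}):
            if nxt[:-1] not in used:
                stack.append(path + [nxt])
    return sorted(paths)


# Hypothetical evidence: two alternative routes from c1 to c4.
EVIDENCE = [
    ("c1+", "c2+", "read_pairs"),
    ("c1+", "c2+", "overlap"),
    ("c1+", "c3+", "de_bruijn"),
    ("c2+", "c4+", "overlap"),
    ("c3+", "c4+", "read_pairs"),
]

graph = build_adjacency_graph(EVIDENCE)
for path in candidate_paths(graph, "c1+", "c4+"):
    print(" -> ".join(path))
```

In this toy example both routes between c1 and c4 are reported rather than one being chosen, which mirrors the distinction drawn above between a scaffolder, which joins only unambiguous connections, and an approach that exposes all potential arrangements to the user.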
Keywords
comparative genomics,bioinformatics,computational biology,genetics,computer graphics