Straintables: An application that extracts sequences from genome assemblies and generates dissimilarity matrices

bioRxiv (Cold Spring Harbor Laboratory) (2021)

Abstract

Background and Objectives: The dissimilarity matrix (DM) is an important component of phylogenetic analysis, and many software packages exist to build and display DMs. However, because this type of software usually takes sequences in FASTA format as input, extracting and aligning each set of sequences to produce a large number of matrices can be laborious. Additionally, existing software does not facilitate the comparison of similarity clusters across several DMs built for the same group of individuals from different genomic regions. To meet the need for such a tool, we designed Straintables to extract the sequences of specific genomic regions from a group of intraspecies genomic assemblies and to build dissimilarity matrices from the extracted sequences.

Methods: A Python module with executable scripts was developed for a study on genetic diversity across strains of Toxoplasma gondii; it also serves as a general-purpose system for DM calculation and visualization in preliminary phylogenetic studies. For automatic extraction of region sequences from genomic assemblies, we built a system that designs virtual primers from reference sequences located through genomic annotations, then matches those primers against genome files using regex patterns. Extracted sequences are then aligned using Clustal Omega and compared to generate matrices.

Results: Using this software saves the user from manually preparing and aligning the sequences, a process that can be laborious when a large number of assemblies or regions are involved. The automatic sequence extraction can be checked against BLAST results using the extracted sequences as queries; correct results were observed for same-species pools of various organisms. The package also contains a matrix visualization tool focused on cluster visualization, capable of drawing matrices into image files with custom settings, and it provides methods for reordering matrices to facilitate the comparison of clustering patterns across two or more matrices.

Conclusion: Straintables may replace and extend the functionality of existing matrix-oriented phylogenetic software, featuring automatic region extraction from genomic assemblies and enhanced matrix visualization capabilities that emphasize cluster identification. This module is open source, available on GitHub ( https://github.com/Gab0/straintables ) under an MIT license and also as a PyPI package.

Highlights: Simple in-silico protocol for the generation, visualization and comparison of dissimilarity matrices. Accurate automatic sequence extraction from multiple genomic assemblies using virtual primers built from reference sequences in an annotation file. Draws matrices as images, with enhanced cluster visualization and customizable options. Supports reordering of matrix indices to better visualize the conservation of clustering patterns across multiple regions.
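The extraction and comparison steps described in the Methods can be illustrated with a minimal Python sketch. It is not the Straintables implementation: the function names (`make_virtual_primers`, `extract_region`, `p_distance`, `dissimilarity_matrix`), the primer size, the toy sequences and the choice of a simple mismatch proportion as the dissimilarity measure are all assumptions made for illustration. The sketch also assumes that both primers are read on the forward strand and that the extracted sequences are already aligned, whereas the actual pipeline aligns them with Clustal Omega.

```python
import re


def make_virtual_primers(reference, size=5):
    """Take the first and last `size` bases of a reference region as primers.

    Hypothetical helper: the real primer design strategy may differ.
    """
    if len(reference) < 2 * size:
        raise ValueError("reference is too short for the requested primer size")
    return reference[:size], reference[-size:]


def extract_region(genome, primers, max_gap=5000):
    """Return the first stretch of `genome` bounded by the two primers, or None."""
    leading, trailing = primers
    # Lazy quantifier: stop at the first trailing primer after the leading one.
    pattern = re.compile(
        re.escape(leading) + "[ACGTN]{0,%d}?" % max_gap + re.escape(trailing)
    )
    match = pattern.search(genome.upper())
    return match.group(0) if match else None


def p_distance(seq_a, seq_b):
    """Proportion of differing positions between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        return 0.0
    return sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def dissimilarity_matrix(aligned):
    """Build a symmetric matrix of pairwise p-distances."""
    size = len(aligned)
    matrix = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i + 1, size):
            matrix[i][j] = matrix[j][i] = p_distance(aligned[i], aligned[j])
    return matrix


# Toy data (invented for this example): one region with flanking bases per strain.
reference = "ATGGCTAGCTAGGATCC"
assemblies = {
    "strain_A": "TTGACC" + "ATGGCTAGCTAGGATCC" + "AAGT",
    "strain_B": "CCGTA" + "ATGGCTAGGTAGGATCC" + "TTAG",
    "strain_C": "GG" + "ATGGCTTGGTAGGATCC" + "ACGT",
}

primers = make_virtual_primers(reference)
regions = {name: extract_region(seq, primers) for name, seq in assemblies.items()}
names = sorted(regions)
matrix = dissimilarity_matrix([regions[name] for name in names])

for name, row in zip(names, matrix):
    print(name, " ".join(f"{value:.3f}" for value in row))
```

In this toy case the regions differ only by substitutions, so they already have equal length and can be compared directly; sequences containing insertions or deletions would need to be aligned first.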
Keywords
genome assemblies, sequences