Handling of Spurious Sequences Affects the Outcome of High-Throughput 16S rRNA Gene Amplicon Profiling

Research Square (2020)

Abstract

Background: 16S rRNA gene amplicon sequencing is a very popular approach for studying microbiomes. However, varying standards exist for sample and data processing, and some basic concepts, such as the occurrence of spurious sequences, have not been investigated in a comprehensive manner.

Methods: Using defined communities of bacteria in vitro and in vivo, we searched for sequences not matching the expected species (i.e., spurious taxa) and determined a minimum threshold of occurrence suitable for robust data analysis. The presence and origin of spurious taxa were investigated via large-scale amplicon queries and gut samples from germfree mice spiked with target mock DNA. We also assessed the effect of varying sequence-filtering stringency on diversity readouts in human fecal and peat soil communities. Our findings are based on data generated in three sequencing facilities and analyzed via both operational taxonomic unit (OTU) and amplicon sequence variant (ASV) approaches.

Results: 16S rRNA gene amplicon data processing based on OTU clustering and singleton removal, a commonly used approach that discards any taxa represented by only one sequence across all samples, yielded on average approximately 50% (mock communities) to 80% (gnotobiotic mice) spurious taxa. The fraction of spurious taxa was generally lower with ASV analysis, but varied depending on the gene region targeted and the barcoding system used. A relative abundance of 0.25% was found to be an effective threshold: excluding taxa below it largely prevented the analysis of spurious taxa in both OTU- and ASV-based approaches. Most spurious taxa (approx. 70%) detected in simplified communities occurred in samples multiplexed in the same sequencing run and were present in only one of ten runs. DNase treatment of gut content from germfree mice partly helped to exclude spurious taxa from the analysis of spiked mock DNA, but was not necessary when applying the 0.25% relative abundance threshold. Using this cut-off improved the reproducibility of analysis, specifically by reducing variation in richness estimates by 38% compared with singleton filtering in a benchmarking experiment using six human fecal samples across seven sequencing runs. Beta-diversity analyses of human fecal communities were markedly affected by both the filtering strategy and the type of phylogenetic distances used for comparing samples, highlighting the importance of carefully analyzing data before drawing conclusions.

Conclusions: Handling of artifact sequences during bioinformatic processing of 16S rRNA gene amplicon data requires careful attention to avoid generating misleading findings. Applying a minimum relative abundance threshold between 0.10% and 0.30% is superior to the singleton removal approach, although study-specific analysis strategies may be needed depending on, for instance, the type of samples analyzed and the sequencing depth achieved. Additionally, we propose the concept of effective richness to facilitate the comparison of results across studies.
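The two filtering strategies compared in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the count-table layout, the function names (`remove_singletons`, `filter_by_relative_abundance`, `richness`) and the toy read counts are all hypothetical. The sketch also assumes the 0.25% cut-off is applied within each sample; the abstract does not state whether the threshold is evaluated per sample or across the whole dataset.

```python
from typing import Dict

# Hypothetical layout: sample name -> taxon name -> read count.
CountTable = Dict[str, Dict[str, int]]


def remove_singletons(table: CountTable) -> CountTable:
    """Drop taxa represented by only one sequence across all samples."""
    totals: Dict[str, int] = {}
    for counts in table.values():
        for taxon, n in counts.items():
            totals[taxon] = totals.get(taxon, 0) + n
    return {
        sample: {t: n for t, n in counts.items() if n > 0 and totals[t] > 1}
        for sample, counts in table.items()
    }


def filter_by_relative_abundance(table: CountTable,
                                 threshold: float = 0.0025) -> CountTable:
    """Keep, within each sample, taxa at or above the relative abundance
    threshold (0.0025 = 0.25%). Per-sample application is an assumption."""
    filtered: CountTable = {}
    for sample, counts in table.items():
        depth = sum(counts.values())
        filtered[sample] = {
            t: n for t, n in counts.items()
            if depth > 0 and n / depth >= threshold
        }
    return filtered


def richness(table: CountTable) -> Dict[str, int]:
    """Number of taxa remaining in each sample."""
    return {sample: len(counts) for sample, counts in table.items()}


# Toy data (invented): two dominant taxa plus two rare ones.
toy_table: CountTable = {
    "A": {"taxon1": 600, "taxon2": 397, "taxon3": 2, "taxon4": 1},
    "B": {"taxon1": 500, "taxon2": 498, "taxon3": 2, "taxon4": 0},
}

singleton_richness = richness(remove_singletons(toy_table))
threshold_richness = richness(filter_by_relative_abundance(toy_table))
print(singleton_richness)   # {'A': 3, 'B': 3}
print(threshold_richness)   # {'A': 2, 'B': 2}
```

In this toy table, singleton removal discards only `taxon4` (one read in total), whereas the relative abundance filter also discards `taxon3` (0.2% of reads in each sample), so the two strategies give different richness values for the same data.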
Keywords
gene, spurious sequences, high-throughput