Scalable Sequence Clustering for Large-Scale Immune Repertoire Analysis.

IEEE BigData (2021)

Abstract
Next-generation sequencing technology has enabled systems immunology researchers to conduct detailed immune repertoire analysis at the molecular level, which helps them assess the health of a patient's immune system. Recent studies have shown that the single-linkage clustering algorithm can give the best results for B cell clonality analysis, a critical type of immune repertoire sequencing (IR-Seq) analysis. Large sequence datasets (e.g., millions of sequences) are being collected to comprehensively understand how a specific person's immune system evolves over different stages of disease development. However, the classical single-linkage clustering algorithm does not scale well to such large sequence datasets. Surprisingly, no study has addressed this scalability issue for immunology research and development. We study three strategies for scaling up the single-linkage algorithm for sequence data: (1) the approximate single-linkage algorithm enhanced with non-Euclidean indexing methods, (2) the Spark-based single-linkage algorithm (SparkMST), which was originally designed for vector data and is modified here for sequence data, and (3) a new tree-based sequence summarization approach, SCT, which aims to reduce the data for single-linkage clustering while preserving clustering quality well. We have implemented these approaches and evaluated them on real sequence datasets for B cell clonality analysis. Our findings are as follows. (1) The index-enhanced hierarchical clustering algorithm (e.g., VPT-HC, which uses the Vantage-Point tree for indexing) preserves the clustering quality very well while significantly reducing the time complexity. (2) The SCT approach, serving as a preprocessing step, can effectively reduce the data size for clustering. The overall pipeline, SCT followed by VPT-HC, is the fastest among the evaluated single-machine algorithms, although it slightly affects the clustering quality. (3) The SparkMST parallel algorithm scales out nicely and gives exact single-linkage clustering results. However, SparkMST is tied to the single-linkage algorithm and cannot be extended to general hierarchical clustering algorithms. Although this study focuses on one specific application area, B cell clonality analysis, we believe the developed scalable techniques may be useful for other sequence data analysis problems.
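To make the scalability problem concrete, the following is a minimal Python sketch of exact single-linkage clustering on sequences, cut at a fixed distance threshold. It is an illustration only, not the implementation evaluated in the paper: the use of Levenshtein distance, the threshold value, the toy sequences, and the function names `edit_distance` and `single_linkage_clusters` are all assumptions made for this example. The loop over all pairs is the quadratic cost that the indexing, summarization, and parallel strategies described above are meant to reduce.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def single_linkage_clusters(seqs, threshold):
    """Cut the single-linkage dendrogram at `threshold`.

    Two sequences end up in the same cluster if and only if they are
    connected by a chain of sequences whose consecutive distances are
    all <= threshold. Hypothetical helper, assumed for illustration.
    """
    parent = list(range(len(seqs)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Brute-force comparison of every pair: O(n^2) distance computations.
    for i, j in combinations(range(len(seqs)), 2):
        if edit_distance(seqs[i], seqs[j]) <= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, s in enumerate(seqs):
        groups.setdefault(find(i), []).append(s)
    return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    # Toy sequences, invented for this example.
    toy = ["CARDYW", "CARDFW", "CARDFY", "CTTTGGW"]
    print(single_linkage_clusters(toy, threshold=1))
```

In this toy input, "CARDYW" and "CARDFY" are two edits apart, yet they share a cluster because each is one edit from "CARDFW". That chaining behavior is what distinguishes single-linkage from other hierarchical linkage criteria.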
Keywords
clustering, sequence data, scalability, parallel processing, summarization, indexing