Skip to Main Content
 

Global Search Box

 
 
 
 

ETD Abstract Container

Abstract Header

Scalable Clustering for Immune Repertoire Sequence Analysis

Abstract Details

2019, Master of Science (MS), Wright State University, Computer Science.
The development of the next-generation sequencing technology has enabled systems immunology researchers to conduct detailed immune repertoire analysis at the molecule level. Large sequence datasets (e.g., millions of sequences) are being collected to com- prehensively understand how the immune system of a patient evolves over different stages of disease development. A recent study has shown that the hierarchical clustering (HC) algorithm gives the best results for B-cell clones analysis - an important type of immune repertoire sequencing (IR-Seq) analysis. However, due to the inherent complexity, the classical hierarchical clustering algorithm does not scale well to large sequence datasets. Surprisingly, no algorithms have been developed to address this scalability issue for im- munology research. In this thesis, we study two different strategies, aiming at finding the best scalable methods that can preserve the quality of hierarchical clustering structure. The two strategies include (1) non-Euclidean indexing methods for speeding up the clas- sical hierarchical clustering(HC), (2) a new tree-based sequence summarization approach - SCT that scans the large sequence dataset once and generates summaries for hierarchi- cal clusters(HC). And we also experimented with the Spark based minimum-spanning-tree algorithm (SparkMST) that generates the equivalent result of single linkage hierarchical clustering (SLINK) for comparative analysis. We have implemented all these algorithms and experimented with real sequence datasets for B-cell clones analysis. The result shows that (1) the indexing-enhanced HC (e.g., us- ing the Vantage-Point tree for indexing) preserves the clustering quality very well, while also significantly reducing the time complexity of the original HC; (2) SCT with HC is the fastest approximate HC method with slightly sacrificed quality; and (3) SparkMST scales out satisfactorily and gives significant performance gain with a large Spark cluster.
Keke Chen, Ph.D. (Advisor)
Krishnaprasad Thirunarayan, Ph.D. (Committee Member)
Tanvi Banerjee, Ph.D. (Committee Member)
50 p.

Recommended Citations

Citations

  • Bhusal, P. (2019). Scalable Clustering for Immune Repertoire Sequence Analysis [Master's thesis, Wright State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=wright1558631347622374

    APA Style (7th edition)

  • Bhusal, Prem. Scalable Clustering for Immune Repertoire Sequence Analysis. 2019. Wright State University, Master's thesis. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=wright1558631347622374.

    MLA Style (8th edition)

  • Bhusal, Prem. "Scalable Clustering for Immune Repertoire Sequence Analysis." Master's thesis, Wright State University, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=wright1558631347622374

    Chicago Manual of Style (17th edition)