Skip to Main Content
 

Global Search Box

 
 
 
 

Files

ETD Abstract Container

Abstract Header

Computing Topological Features for Data Analysis

Abstract Details

2017, Doctor of Philosophy, Ohio State University, Computer Science and Engineering.
Topological data analysis (TDA) provides a new methodology to data analysis problems. It captures intrinsic topological structures in data, which can then offer useful guidelines for other data analysis approaches. One main task in TDA is to extract topological features from data. And this is often done through one of the main tools of TDA, persistent homology, or simply, persistence. While the idea of persistent homology has been demonstrated valuable in a broad range of areas, its application to large scale real-world data is limited by the efficiency of existing persistence computation approaches. I will present a focused study during my PhD research on broadening applicability of the idea of persistence in data analysis in two fronts, to explore novel ways of applying persistent homology for qualitative data analysis and to study the computational issues related to persistent homology, so as to significantly broaden their applications to large-scale low or high dimensional data analysis. In the first direction, we applied persistent homology to a special kind of data, called \emph{metric graphs}. A metric graph offers one of the simplest yet still meaningful ways to represent the non-linear structure hidden behind the data. Thus, comparing two graphs has many important applications in data analysis. In this work, we proposed a new distance between two finite metric graphs, called the \spdist{} distance, which formulates the graph matching problem through persistent homology. Our \spdist{} distance has two properties not shared by previous methods: First, it is stable against the perturbations of the input graph metrics. Second, it is a \emph{continuous} distance measure, in the sense that it is defined on an alignment of the underlying spaces of input graphs, instead of merely their nodes. This makes our \spdist{} distance robust against, for example, different discretizations of the same underlying graph. Despite considering the input graphs as continuous spaces, we showed that we can still compute the \spdist{} distance in polynomial time. The time complexity for the discrete case where only graph nodes are considered is much faster. We also provided some preliminary experimental results to demonstrate the use of the new distance measure. The software is publicly available at \cite{comptogo}. Besides this graph comparison work, we also did another preliminary work of studying the concept of \emph{Multiscale Mapper}. We implemented a basic software and applied it to some volumetric data to get better understanding of the \emph{Multiscale Mapper}. In the second part, we consider the more general case, high-dimensional point cloud data. To extract topological features of a point cloud data sampled from a metric space, a sequence of Rips complexes built on $P$ indexed by a scale parameter is mostly used. Unfortunately, even for input with moderate size, the size of the Rips complex may become prohibitively large as the scale parameter increases. Starting with the \emph{Sparse Rips filtration} introduced by Sheehy \cite{S2012}, some existing methods aim to reduce the size of the complex so as to improve the time efficiency as well. However, existing approaches still fall short of scaling well, especially for high dimensional data. In this work, we investigated the advantages and limitations of existing approaches. Based on insights gained from the experiments, we proposed an efficient new algorithm, called \emph{SimBa}, for approximating the persistent homology of Rips filtrations with quality guarantees. Our new algorithm leverages a batch collapse strategy as well as a new sparse Rips-like filtration. We experimented with a variety of low and high dimensional data sets. We showed that our strategy presents a significant size reduction, and our algorithm for approximating Rips filtration persistence is an order of magnitude faster than existing methods in practice. In addition, we also implemented SimPers which is the only available software to compute persistence for simplicial maps.
Tamal Dey, Dr. (Advisor)
Wang Yusu, Dr. (Advisor)
Wenger Rephael, Dr. (Committee Member)
Bozanic Zahn, Dr. (Committee Member)
123 p.

Recommended Citations

Citations

  • Shi, D. (2017). Computing Topological Features for Data Analysis [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1512079255367232

    APA Style (7th edition)

  • Shi, Dayu. Computing Topological Features for Data Analysis. 2017. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1512079255367232.

    MLA Style (8th edition)

  • Shi, Dayu. "Computing Topological Features for Data Analysis." Doctoral dissertation, Ohio State University, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=osu1512079255367232

    Chicago Manual of Style (17th edition)