Graphs (or networks) provide a simple yet powerful model for
capturing the interactions or relationships between the entities
in many different domains. This dissertation focusses on a
fundamental analytical or data management primitive related to
graphs - the discovery of natural groups or clusters from a
graph, or graph clustering in short. Due to the representational
versatility of graphs, this problem has varied applications - the
discovery of protein
complexes from protein interaction networks, data partitioning
for distributed computing, community discovery in
online social networks, general data clustering (via
similarity-weighted graphs), image segmentation, optimizing
transistor layout for VLSI design and so on.
The increasing scale, complexity and diversity of modern (graph)
data has rendered classical solutions for this
task, such as spectral clustering, either very slow, incomplete
or inaccurate. I address this situation in this thesis
by approaching the graph clustering problem from two directions -
in one direction, I develop algorithms for the
core clustering task; in the second direction, I build
intelligent pre-processing strategies that transform the graph
to enable fast and accurate clustering subsequently. Such a
two-pronged strategy allows us to concentrate on the
novel aspects of new variants of the clustering problem, while
exploiting existing solutions where we can. In the
direction of algorithms for the core clustering task, my
contribution to the state-of-the-art is a novel algorithm,
MLR-MCL, based on the multi-level simulation of stochastic flows.
MLR-MCL is a significantly enhanced
version of a popular existing algorithm called MCL, and is fast,
accurate, noise-tolerant and allows easy adjustment
of the balance in the output cluster sizes. In the direction of
intelligent pre-processing strategies, I have made three
important contributions: (i) Graph sparsification algorithms that
clarify the cluster structure and thereby speed-up
subsequent clustering, by upto 50x; (ii) Algorithms that
symmetrize directed graphs into similarity-weighted,
undirected graphs suitable for subsequent clustering; (iii)
Novel hashing-based algorithms that efficiently
convert general non-graph data into similarity-weighted graphs
suitable for clustering.