CaltechTHESIS
  A Caltech Library Service

Graph Clustering: Algorithms, Analysis and Query Design

Citation

Korlakai Vinayak, Ramya (2018) Graph Clustering: Algorithms, Analysis and Query Design. Dissertation (Ph.D.), California Institute of Technology. doi:10.7907/Z9RR1WFK. https://resolver.caltech.edu/CaltechTHESIS:09222017-130217881

Abstract

A wide range of applications in engineering as well as the natural and social sciences have datasets that are unlabeled. Clustering plays a major role in exploring structure in such unlabeled datasets. Owing to the heterogeneity in the applications and the types of datasets available, there are plenty of clustering objectives and algorithms. In this thesis we focus on two such clustering problems: Graph Clustering and Crowdsourced Clustering.

In the first part, we consider the problem of graph clustering and study convex-optimization-based clustering algorithms. Datasets are often messy -- ridden with noise, outliers (items that do not belong to any clusters), and missing data. Therefore, we are interested in algorithms that are robust to such discrepancies. We present and analyze convex-optimization-based clustering algorithms which aim to recover the low-rank matrix that encodes the underlying cluster structure for two clustering objectives: clustering partially observed graphs and clustering similarity matrices with outliers. Using block models as generative models, we characterize the performance of these convex clustering algorithms. In particular, we provide explicit bounds, without any large unknown constants, on the problem parameters that determine the success and failure of these convex approaches.

In the second part, we consider the problem of crowdsourced clustering -- the task of clustering items using answers from non-expert crowd workers who can answer similarity comparison queries. Since the workers are not experts, they provide noisy answers. Further, due to budget constraints, we cannot make all possible comparisons between items in the dataset. Thus, it is important to design queries that can reduce the noise in the responses and design algorithms that can work with noisy and partial data. We demonstrate that random triangle queries (where three items are compared per query) provide less noisy data as well as greater quantity of data, for a fixed query budget, as compared to random edge queries (where two items are compared per query). We extend the analysis of convex clustering algorithms to show that the exact recovery guarantees hold for triangle queries despite involving dependent edges. In addition to random querying strategies, we also present a novel active querying algorithm that is guaranteed to find all the clusters regardless of their sizes and without the knowledge of any parameters as long as the workers are better than random guessers. We also provide a tight upper bound on the number of queries made by the proposed active querying algorithm. Apart from providing theoretical guarantees for the clustering algorithms we also apply our algorithms to real datasets.

Item Type:Thesis (Dissertation (Ph.D.))
Subject Keywords:Graph clustering, Crowdsourcing, Stochastic Block Model, Convex optimization, Community detection, Similarity clustering, Clustering with missing data, Active querying
Degree Grantor:California Institute of Technology
Division:Engineering and Applied Science
Major Option:Electrical Engineering
Thesis Availability:Public (worldwide access)
Research Advisor(s):
  • Hassibi, Babak
Thesis Committee:
  • Hassibi, Babak (chair)
  • Wierman, Adam C.
  • Perona, Pietro
  • Chandrasekaran, Venkat
  • Yue, Yisong
Defense Date:21 August 2017
Non-Caltech Author Email:ramya.kvinayak (AT) gmail.com
Record Number:CaltechTHESIS:09222017-130217881
Persistent URL:https://resolver.caltech.edu/CaltechTHESIS:09222017-130217881
DOI:10.7907/Z9RR1WFK
Related URLs:
URLURL TypeDescription
https://doi.org/10.1109/ICASSP.2014.6855219DOIAdapted to chapter 2.
http://papers.nips.cc/paper/5309-graph-clustering-with-missing-data-convex-algorithms-and-analysis.pdfPublisherAdapted to chapter 2.
https://doi.org/10.1109/ISIT.2016.7541267DOIAdapted to chapter 3.
http://papers.nips.cc/paper/6499-crowdsourced-clustering-querying-edges-vs-triangles.pdfPublisherAdapted to chapter 4.
https://doi.org/10.1109/ICASSP.2017.7952571DOIAdapted to chapter 5.
ORCID:
AuthorORCID
Korlakai Vinayak, Ramya0000-0003-0248-9551
Default Usage Policy:No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code:10447
Collection:CaltechTHESIS
Deposited By: Ramya Korlakai Vinayak
Deposited On:26 Sep 2017 21:27
Last Modified:04 Oct 2019 00:17

Thesis Files

[img]
Preview
PDF - Final Version
See Usage Policy.

4MB

Repository Staff Only: item control page