Unsupervised learning

Introduction to Clustering

In machine learning:

  • Assigning data to known categories is referred to as a classification problem
  • Partitioning data into groups of similar items is referred to as a clustering problem
  • Predicting a continuous value is referred to as a regression problem

An Example of Clustering: Image Quantization

Partition Definition

C_1, C_2, \ldots, C_K is a partition of \{1, 2, \ldots, n\} if and only if:

C_1 \cup C_2 \cup \ldots \cup C_K = \{1, 2, \ldots, n\} and C_i \cap C_j = \emptyset for any i \neq j in \{1, \ldots, K\}

Example:

{3},{1},{2}
{2,1},{3}
{2,3,1}

are all partitions of {1, 2, 3}.
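As a quick illustration, here is a minimal sketch that checks the two conditions above for a candidate clustering of the indices {1, ..., n}; the helper name `is_partition` is hypothetical:

```python
def is_partition(clusters, n):
    """Check whether `clusters` (a list of sets of indices) partitions {1, ..., n}."""
    # Condition 1: the union of all clusters covers {1, ..., n}.
    union = set().union(*clusters)
    if union != set(range(1, n + 1)):
        return False
    # Condition 2: the clusters are pairwise disjoint,
    # i.e. their sizes add up to the size of the union.
    return sum(len(c) for c in clusters) == len(union)

# The three examples above are all partitions of {1, 2, 3}:
print(is_partition([{3}, {1}, {2}], 3))   # True
print(is_partition([{2, 1}, {3}], 3))     # True
print(is_partition([{2, 3, 1}], 3))       # True
print(is_partition([{1, 2}, {2, 3}], 3))  # False (clusters overlap)
```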

Clustering is unsupervised learning, which means we do not have labels to start from. The goal is much less specific and more focused on the big picture: not to predict which class each data point falls into, but to make sense of data we know little about. As in the example of partitioning Google News articles, clustering is used in settings where we do not have much information about the data and would like to get a sense of how many groupings it consists of. Thus, clustering takes as input a set of feature vectors S_n = \{x^{(i)} \mid i = 1, \ldots, n\} and the number of clusters K.

Clustering outputs (1) a partition of the data indices into the K clusters and (2) a representative for each cluster. Remark: deciding on a good partition and on good representatives are two intertwined tasks. In the next sections, we discuss (1) how to measure how "good" a given assignment and representative selection are and (2) how to jointly find good assignments and good representatives.

In the image quantization example, similar colors are clustered together: the image is compressed by clustering its pixel colors into K = 2 clusters and replacing each pixel by its cluster representative.
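For concreteness, here is a minimal sketch of this kind of color quantization using scikit-learn's KMeans as a stand-in for the algorithm introduced later in this section; the file name "photo.png" is a placeholder:

```python
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

# Load an image and view its pixels as feature vectors in R^3 (RGB).
img = np.asarray(Image.open("photo.png").convert("RGB"), dtype=float)
pixels = img.reshape(-1, 3)

# Cluster all pixel colors into K = 2 clusters.
kmeans = KMeans(n_clusters=2, n_init=10).fit(pixels)

# Replace each pixel by its cluster representative (centroid) to compress the image.
quantized = kmeans.cluster_centers_[kmeans.labels_].reshape(img.shape)
Image.fromarray(quantized.astype(np.uint8)).save("photo_2_colors.png")
```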

Similarity Measures: Cost Functions

Goal: which clustering output is preferable?

To answer this question, we define a measure of homogeneity within each cluster assignment and compare this measure across the candidate clusterings.

Define Costs

\operatorname{Cost}\left(C_{1}, \ldots, C_{K}\right)=\sum_{j=1}^{K} \operatorname{Cost}\left(C_{j}\right)

\operatorname{Cost}\left(C_j\right) measures "how homogeneous" the assigned data are inside the j-th cluster C_j. It can be, for example:

  • The diameter of a cluster
  • The average distance between points inside a cluster
  • The sum of distances between the representative and all points inside a cluster

Example:

Let S_n = \{x_1, \ldots, x_5\} and the number of clusters K = 2:

\operatorname{Cost}\left(C_{1}, \ldots, C_{K}\right)=\sum_{j=1}^{K} \operatorname{Cost}\left(C_{j}\right)=\sum_{j=1}^{K} \sum_{i \in C_{j}}\left\|x_{i}-z_{j}\right\|^{2}
\operatorname{Cost}\left(C_{1}, C_{2}\right)=3+2=5
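A minimal sketch of this cost computation; the points, assignments, and representatives below are made up for illustration (not the ones from the figure, so the resulting cost differs from the 5 above):

```python
import numpy as np

def clustering_cost(X, assignments, Z):
    """Cost(C_1, ..., C_K) = sum_j sum_{i in C_j} ||x_i - z_j||^2.

    X: (n, d) array of points, assignments: length-n array of cluster indices,
    Z: (K, d) array of representatives.
    """
    diffs = X - Z[assignments]          # each point minus its own representative
    return float(np.sum(diffs ** 2))    # sum of squared Euclidean distances

# Hypothetical data: 5 points in 2D, K = 2 clusters.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
assignments = np.array([0, 0, 0, 1, 1])    # C_1 = {1, 2, 3}, C_2 = {4, 5}
Z = np.array([[0.5, 0.5], [5.5, 5.0]])     # one representative per cluster
print(clustering_cost(X, assignments, Z))  # -> 2.0
```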

The K-Means Algorithm: The Big Picture

Given a set of feature vectors:

S_{n}=\left\{x^{(i)} \mid i=1, \ldots, n\right\}

The K-Means algorithm proceeds as follows:

1. Randomly select the initial representatives z_1, \ldots, z_K.

2. Iterate until there is no further change in cost:

2.1. Given z_1, \ldots, z_K, assign each x^{(i)} to the closest z_j, so that

\operatorname{Cost}\left(z_{1}, \ldots, z_{K}\right)=\sum_{i=1}^{n} \min_{j=1, \ldots, K}\left\|x^{(i)}-z_{j}\right\|^{2}

2.2. Given C_1, \ldots, C_K, find the best representatives z_1, \ldots, z_K such that

z_{j}=\operatorname{argmin}_{z} \sum_{i \in C_{j}}\left\|x^{(i)}-z\right\|^{2}

To find the best representative z_j for the cluster \left\{x^{(i)}\right\}_{i \in C_j}, we compute the gradient:

\nabla_{z_{j}}\left(\sum_{i \in C_{j}}\left\|x^{(i)}-z_{j}\right\|^{2}\right)=\sum_{i \in C_{j}}-2\left(x^{(i)}-z_{j}\right)

Setting this gradient to zero, the z_j that minimizes \sum_{i \in C_j}\left\|x^{(i)}-z_j\right\|^{2} is:

z_{j}=\frac{\sum_{i \in C_{j}} x^{(i)}}{\left|C_{j}\right|}

The value z_j (the centroid) is only affected by \left\{x^{(i)} : i \in C_j\right\}.
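Putting the two steps together, here is a minimal NumPy sketch of the algorithm, assuming random initialization from the data points and stopping once the assignments no longer change:

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Plain K-Means: X is an (n, d) array of points, K the number of clusters."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Step 1: randomly pick K data points as the initial representatives z_1, ..., z_K.
    Z = X[rng.choice(n, size=K, replace=False)].astype(float)
    assignments = None
    for _ in range(max_iters):
        # Step 2.1: assign each x^(i) to its closest representative z_j.
        dists = np.linalg.norm(X[:, None, :] - Z[None, :, :], axis=2)  # (n, K) distances
        new_assignments = dists.argmin(axis=1)
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break  # no assignment changed, so the cost can no longer decrease
        assignments = new_assignments
        # Step 2.2: move each z_j to the centroid (mean) of its cluster C_j.
        for j in range(K):
            members = X[assignments == j]
            if len(members) > 0:          # keep the old z_j if its cluster is empty
                Z[j] = members.mean(axis=0)
    return assignments, Z
```

Running this on the pixel array from the earlier quantization sketch, e.g. kmeans(pixels, 2), would reproduce the two-color compression.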

K-Means Drawbacks

  • Manual choice of K
  • Not robust to outliers
  • Does not scale well with increasing number of dimensions
  • It is not guaranteed that z_{1}, \ldots, z_{K} \in\left\{x_{1}, \ldots, x_{n}\right\}

Introduction to the K-Medoids Algorithm

The K-Medoids algorithm is a variation of K-Means. The difference lies in step 2.2:

Step 2.2:

Given C_j \in \{C_1, \ldots, C_K\}, we find the best representative z_j \in \{x_1, \ldots, x_n\} that minimizes \sum_{x^{(i)} \in C_j} \operatorname{dist}\left(x^{(i)}, z_j\right)

It is always guaranteed that z_1, \ldots, z_K \in \{x_1, \ldots, x_n\}.
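A minimal sketch of this modified step 2.2; the function name `update_medoids` is illustrative, and Euclidean distance is assumed for dist, although any distance could be plugged in:

```python
import numpy as np

def dist(a, b):
    """Distance between two points; Euclidean distance is just one possible choice."""
    return np.linalg.norm(a - b)

def update_medoids(X, assignments, K):
    """Step 2.2 of K-Medoids: each representative z_j is chosen from the data points."""
    Z = []
    for j in range(K):
        members = X[assignments == j]
        # Try every data point as a candidate z_j and keep the one that
        # minimizes the sum of distances to the points assigned to cluster j.
        costs = [sum(dist(x, z) for x in members) for z in X]
        Z.append(X[int(np.argmin(costs))])
    return np.array(Z)
```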

Computational complexity of K-means and K-medoids

Step 2.1 of K-Medoids is the same as that of K-Means, so its time complexity is \mathcal{O}(ndK). Note that step 2.2 of K-Medoids has an additional loop iterating through the n candidate points z_j \in \{x_1, \ldots, x_n\}, which takes \mathcal{O}(n). Thus step 2.2 takes \mathcal{O}\left(n^2 dK\right).
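Following this counting, per iteration:

  • Step 2.1 (both algorithms): for each of the n points, compute the distance to each of the K representatives, at \mathcal{O}(d) per distance, giving \mathcal{O}(ndK).
  • Step 2.2 of K-Medoids: for each of the K clusters, score each of the n candidate points against the points assigned to that cluster; bounding the cluster size by n gives the extra factor of n, hence \mathcal{O}\left(n^2 dK\right).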
