Introduction to Clustering
In machine learning:
- Assigning data to known categories is referred to as a classification problem
- Partitioning data into groups of similar items is referred to as a clustering problem
- Predicting a continuous value is referred to as a regression problem
Figure: an example of clustering used for image quantization.
Partition Definition
A partition of the set $\{1, \dots, n\}$ into $K$ clusters is a collection of sets $C_1, \dots, C_K$ such that $C_1 \cup C_2 \cup \dots \cup C_K = \{1, \dots, n\}$ and $C_j \cap C_k = \emptyset$ for all $j \neq k$.
Example:
$\{1\}, \{2\}, \{3\}$; $\quad \{1, 2\}, \{3\}$; $\quad \{1, 3\}, \{2\}$; $\quad \{2, 3\}, \{1\}$; $\quad \{1, 2, 3\}$
are all partitions of $\{1, 2, 3\}$.
Clustering is unsupervised learning, which means we do not have labels to start from. The goal is now much less specific and more focused on the big picture: not to predict which class each data point falls into, but to get a sense of data we know little about. As in the example of grouping Google News articles, clustering is used in settings where we have little prior information about the data and would like to see how many natural groupings it contains. Clustering therefore takes as input a set of feature vectors $S_n = \{ x^{(i)} \mid i = 1, \dots, n \}$ and the number of clusters $K$.
Clustering outputs (1) a partition of the data indices $\{1, \dots, n\}$ into the $K$ clusters $C_1, \dots, C_K$ and (2) a representative $z_j$ for each cluster $C_j$. Remark: choosing a good partition and choosing good representatives are two intertwined tasks. In the next sections, we discuss (1) how to measure the "goodness" of a given assignment and choice of representatives and (2) how to jointly find good assignments and good representatives.
For example, we can cluster similar colors in an image: the image below was compressed by clustering its pixel colors into K = 2 clusters and replacing each pixel with its cluster's representative color.
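As a hedged sketch (not from the notes), the quantization step might look as follows in Python; the use of scikit-learn's `KMeans` and the random stand-in image are assumptions made for this example:

```python
# Sketch of image quantization with K = 2 color clusters.
# Assumes scikit-learn; the notes do not prescribe a library.
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for an H x W x 3 RGB image (random pixels for illustration).
image = np.random.rand(64, 64, 3)
pixels = image.reshape(-1, 3)              # one feature vector per pixel

kmeans = KMeans(n_clusters=2, n_init=10).fit(pixels)

# Replace every pixel by its cluster's representative color.
quantized = kmeans.cluster_centers_[kmeans.labels_].reshape(image.shape)
```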
Similarity Measures: Cost Functions
Goal: decide which clustering output is preferable.
To answer this question, we define a measure of how homogeneous each cluster's assigned points are, and compare this measure across candidate clusterings.
Define Costs
The cost of a cluster, $\mathrm{cost}(C_j)$, measures how homogeneous the assigned data are inside the $j$-th cluster $C_j$. It can be, for example:
- the diameter of the cluster, i.e. the largest distance between two points inside it
- the average distance between points inside the cluster
- the sum of the distances between the representative $z_j$ and all points inside the cluster: $\mathrm{cost}(C_j, z_j) = \sum_{i \in C_j} \mathrm{dist}(x^{(i)}, z_j)$
The total cost of a clustering is the sum over clusters, $\mathrm{cost}(C_1, \dots, C_K) = \sum_{j=1}^{K} \mathrm{cost}(C_j)$. A small numerical sketch of these measures follows.
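The sketch below (assumed, not from the notes) computes the three measures for a single cluster of hypothetical 2-D points:

```python
# Three cluster-cost measures for one cluster C_j with hypothetical points.
import numpy as np

cluster = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])  # points in C_j
z = cluster.mean(axis=0)                                   # a representative

# Pairwise distances between all points in the cluster.
diffs = cluster[:, None, :] - cluster[None, :, :]
pairwise = np.linalg.norm(diffs, axis=-1)

diameter = pairwise.max()                       # largest pairwise distance
i, j = np.triu_indices(len(cluster), k=1)
average_distance = pairwise[i, j].mean()        # mean over distinct pairs
sum_to_rep = np.linalg.norm(cluster - z, axis=1).sum()  # sum of dists to z_j
```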
Example:
Let $x^{(1)}, \dots, x^{(n)}$ be a given set of points and let the number of clusters be $K = 2$. For each candidate assignment of the points into $C_1$ and $C_2$, we compute the total cost and prefer the assignment with the lower value; a sketch of this comparison follows.
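As an illustration (the concrete points and assignments here are invented for the example), two candidate partitions can be compared by their representative-based costs:

```python
# Compare two candidate partitions of hypothetical 1-D points with K = 2,
# using cost(C_j, z_j) = sum of squared distances to the cluster centroid.
import numpy as np

x = np.array([-2.0, -1.0, 1.0, 2.0])

def total_cost(partition):
    # partition: list of index arrays, one per cluster
    cost = 0.0
    for idx in partition:
        z = x[idx].mean()                 # centroid as representative
        cost += ((x[idx] - z) ** 2).sum()
    return cost

a = [np.array([0, 1]), np.array([2, 3])]  # {-2, -1} and {1, 2}
b = [np.array([0, 2]), np.array([1, 3])]  # {-2, 1} and {-1, 2}
print(total_cost(a), total_cost(b))       # a is preferable: 1.0 vs 9.0
```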
The K-Means Algorithm: The Big Picture
Given a set of feature vectors $x^{(1)}, \dots, x^{(n)}$ and the number of clusters $K$, the K-means algorithm proceeds as follows:
1. Randomly select the initial representatives $z_1, \dots, z_K$.
2. Iterate until there is no further change in cost:
   2.1 Given $z_1, \dots, z_K$, assign each point to its closest representative, i.e. find the assignment $C_1, \dots, C_K$ that minimizes $\sum_{j=1}^{K} \sum_{i \in C_j} \| x^{(i)} - z_j \|^2$ (assign point $i$ to the cluster $j$ with the smallest $\| x^{(i)} - z_j \|^2$).
   2.2 Given $C_1, \dots, C_K$, find the best representatives $z_1, \dots, z_K$ such that $z_j = \arg\min_{z} \sum_{i \in C_j} \| x^{(i)} - z \|^2$.
To find the best representative $z_j$, we set the gradient of the cluster cost to zero: $\nabla_z \sum_{i \in C_j} \| x^{(i)} - z \|^2 = -2 \sum_{i \in C_j} (x^{(i)} - z) = 0$.
Solving for $z$ gives the minimizing value $z_j = \frac{1}{|C_j|} \sum_{i \in C_j} x^{(i)}$, the centroid of the cluster. Note that $z_j$ (the centroid) is affected only by the points assigned to cluster $C_j$.
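The following is a minimal sketch of the algorithm above in Python with NumPy; initializing the representatives by sampling data points and stopping when the assignments stabilize are implementation assumptions, not prescribed by the notes:

```python
import numpy as np

def kmeans(x, K, rng=np.random.default_rng(0), max_iter=100):
    """Minimal K-means sketch: x is an (n, d) array of feature vectors."""
    n, d = x.shape
    # Step 1: randomly select initial representatives (here: K data points).
    z = x[rng.choice(n, size=K, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Step 2.1: assign each point to its closest representative z_j.
        dists = np.linalg.norm(x[:, None, :] - z[None, :, :], axis=-1)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments are stable, so the cost can no longer drop
        labels = new_labels
        # Step 2.2: move each representative to its cluster's centroid.
        for j in range(K):
            if np.any(labels == j):  # keep z_j unchanged if cluster is empty
                z[j] = x[labels == j].mean(axis=0)
    return labels, z
```

For example, `labels, z = kmeans(pixels, 2)` would reproduce the quantization step from the earlier sketch. Keeping a representative unchanged when its cluster becomes empty is one common convention; the notes do not specify how to handle that case.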
K-Means Drawbacks
- Manual choice of K
- Not robust to outliers
- Does not scale well with increasing number of dimensions
- It is not guaranteed that the found representatives $z_1, \dots, z_K$ are members of the original data set $\{x^{(1)}, \dots, x^{(n)}\}$ (see the sketch after this list)
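A tiny illustration (hypothetical numbers) of the outlier and membership issues: with one extreme outlier, the $K = 1$ centroid is dragged far from the bulk of the data and is not itself a data point.

```python
# The centroid is pulled toward an outlier and need not be a data point.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 100.0])    # hypothetical data with one outlier
z = x.mean()                            # K = 1 centroid
print(z)             # 25.75 -- far from {0, 1, 2} and not in the data set
print(np.median(x))  # 1.5   -- a more outlier-robust summary, for contrast
```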
Introduction to the K-Medoids Algorithm
The K-Medoids algorithm is a variation of K-Means. The initial representatives are now randomly selected from among the data points themselves, $\{z_1, \dots, z_K\} \subseteq \{x^{(1)}, \dots, x^{(n)}\}$, and step 2.2 changes as follows:
Step 2.2: Given $C_1, \dots, C_K$, find the best representative $z_j \in \{x^{(1)}, \dots, x^{(n)}\}$ such that $\sum_{i \in C_j} \mathrm{dist}(x^{(i)}, z_j)$ is minimal.
Because the representatives are always chosen among the data points, it is always guaranteed that $z_j \in \{x^{(1)}, \dots, x^{(n)}\}$. Note also that K-Medoids works with any distance function $\mathrm{dist}(\cdot, \cdot)$, not only the squared Euclidean distance.
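A minimal NumPy sketch of the modified step 2.2 (the function name and the choice to let candidate medoids range over all data points are assumptions for this example):

```python
import numpy as np

def update_medoids(x, labels, K):
    """Step 2.2 of K-Medoids: pick each z_j among the data points so that
    the sum of distances to the points in C_j is minimal."""
    # Precompute all pairwise Euclidean distances between data points.
    dists = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    z_idx = np.empty(K, dtype=int)
    for j in range(K):
        members = np.flatnonzero(labels == j)       # indices of points in C_j
        # For every candidate point, total distance to the cluster members.
        totals = dists[:, members].sum(axis=1)
        z_idx[j] = totals.argmin()                  # best medoid's index
    return x[z_idx], z_idx
```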
Computational Complexity of K-Means and K-Medoids
Step 2.1 of K-Medoids is the same as that of K-Means, so its time complexity is $O(nKd)$, where $d$ is the dimension of the feature vectors: for each of the $n$ points we compute the distance to each of the $K$ representatives. Step 2.2 of K-Medoids has an additional loop that iterates through the $n$ candidate points $\{x^{(1)}, \dots, x^{(n)}\}$ for each representative, so step 2.2 takes $O(n^2 Kd)$. A full K-Medoids iteration therefore costs $O(nKd) + O(n^2 Kd) = O(n^2 Kd)$, compared with $O(nKd)$ for a K-Means iteration, whose step 2.2 (averaging each cluster's points into a centroid) takes only $O(nd)$.