Clustering and K-Means Algorithm

Clustering is a type of unsupervised leraning. Unsupervised learning is contrasted from supervised learning because it uses an unlabeled training set rather than a labeled one.

Clustering is good for market segmentation, social network analysis, organizing computer clusters, astronomical data analysis, etc…

K-Means Algorithm

The K-Means Algorithm is the most popular and widely used algorithm for automatically grouping data into coherent subsets.

Notations

$c^{(i)}$ = index of cluster to which example $x^{(i)}$ is currently assigned

$\mu_k$ = cluster centroid (k=1,2,…,K)

$\mu_{c^{(i)}}$ = cluster centroid of cluster to which example $x^{(i)}$ has been assigned

Optimization Objective

Our optimization objective is to minimize the following cost function:

Algorithm

k-means algorithnm

In order to minimize this cost function, we follow these steps:

  1. Random Initialization
    From the training examples, randomly pick $K$ points called the cluster centroids($c^{(1)}…c^{(K)}$). Make sure the number of your clusters($K$) is less than the number of your training examples($m$).

  2. Cluster assignment
    Assign all examples into one of $K$ groups based on which cluster centroid the example is closest to.
    The goal is to minimize cost $J$ with $c$ while holding $\mu$ fixed.

  3. Move centroid
    Compute the averages for all the points inside each of the $K$ cluster centroid groups, then move the cluster centroid points to those averages.
    The goal is to minimize cost $J$ with $\mu$ while holding $c$ fixed.

  4. Re-run (2) and (3) until we have found our clusters.

  5. Try out different random initialization and choose the result that has minimal cost. We do this because depending on the initialization, K-means can get stuck in local optima.

Note that with k-means, it is not possible for the cost function to sometimes increase. It should always descend.

Leave a Comment