# Clustering and K-Means Algorithm

Categories:

Updated:

Clustering is a type of unsupervised leraning. Unsupervised learning is contrasted from supervised learning because it uses an unlabeled training set rather than a labeled one.

Clustering is good for market segmentation, social network analysis, organizing computer clusters, astronomical data analysis, etc…

# K-Means Algorithm

The K-Means Algorithm is the most popular and widely used algorithm for automatically grouping data into coherent subsets.

## Notations

$c^{(i)}$ = index of cluster to which example $x^{(i)}$ is currently assigned

$\mu_k$ = cluster centroid (k=1,2,…,K)

$\mu_{c^{(i)}}$ = cluster centroid of cluster to which example $x^{(i)}$ has been assigned

## Optimization Objective

Our optimization objective is to minimize the following cost function:

## Algorithm

In order to minimize this cost function, we follow these steps:

1. Random Initialization
From the training examples, randomly pick $K$ points called the cluster centroids($c^{(1)}…c^{(K)}$). Make sure the number of your clusters($K$) is less than the number of your training examples($m$).

2. Cluster assignment
Assign all examples into one of $K$ groups based on which cluster centroid the example is closest to.
The goal is to minimize cost $J$ with $c$ while holding $\mu$ fixed.

3. Move centroid
Compute the averages for all the points inside each of the $K$ cluster centroid groups, then move the cluster centroid points to those averages.
The goal is to minimize cost $J$ with $\mu$ while holding $c$ fixed.

4. Re-run (2) and (3) until we have found our clusters.

5. Try out different random initialization and choose the result that has minimal cost. We do this because depending on the initialization, K-means can get stuck in local optima.

Note that with k-means, it is not possible for the cost function to sometimes increase. It should always descend.

Categories: