## Saturday, May 11, 2013

### Time & Space Complexity of Basic K-means Algorithm

The basic k-means clustering algorithm is a simple algorithm that separates the given data space into different clusters based on centroids calculation using some proximity function. Using this algorithm, we first choose the k- points as initial centroids and then each point is assigned to a cluster with the closest centroid. The algorithm is formally described as follows:
Input: A data set D containing m objects (points) with n attributes in an              Euclidean space
Output: Partitioning of m objects into k-clusters C1, C2, C3, …, Cki.e. Ci  D and Ci ∩ Cj = ᶲ  (for 1 ≤ i, j ≤ k)

#### Procedure:

1. Select k points as initial centroid
2. Repeat
3. Form k clusters by assigning each point to its closest centroid
4.  Re-compute the centroid of each cluster
5. Until the centroids don’t change

#### Time Complexity of the algorithm:

To analyze the complexity of the algorithm in terms of time required, we need to identify the operations required in each step and in each iteration of the algorithm. Here, for k-means algorithm, the operations for each iteration are:
• Calculation of distances: To calculate the distance from a point to the centroid, we can use the squared Euclidean proximity function. For this, we need two subtractions, one summation, two multiplications and one square - root operations, i.e., 6-operations.
• Comparisons between distances
• Calculation of centroids

So, the total number of operations for an iteration
= distance calculation operations + comparison operations + centroids calculation operations
= 6*[k*m*n] operations + [(k-1)*m*n] operations + [k*((m-1) + 1)*n] operations

Assume that the algorithm converges after I iterations, then total number of operations for the algorithm
= 6*[I*k*m*n] operations + [I*(k-1)*m*n] operations + [I*k*((m-1) + 1)*n] operations
= [6Ikmn + I(k-1)mn + Ikmn] operations
≈ O (I*k*m*n)

Here, we’ve applied the basic assumption of each operation requiring the same amount of time. Since, normally, k << m & n << m, the analysis shows that the k-means is linear in m, the number of points, and is scalable and efficient in processing large data sets.

#### Space Complexity of the Algorithm:

The space requirement for k-means are modest because only the data points and centroids are needed to be stored. Specifically, the storage required is O((m+k)n), where m is the number of objects (points) & n is the number of attributes considering n-dimensional objects.