Time & Space Complexity of Basic K-means Algorithm

Saturday, May 11, 2013

Time & Space Complexity of Basic K-means Algorithm

The basic k-means clustering algorithm is a simple algorithm that separates the given data space into different clusters based on centroids calculation using some proximity function. Using this algorithm, we first choose the k- points as initial centroids and then each point is assigned to a cluster with the closest centroid. The algorithm is formally described as follows:

Input: A data set D containing m objects (points) with n attributes in an Euclidean space

Output: Partitioning of m objects into k-clusters C_1, C₂, C_3, …, C_k, i.e. C_i ⊂ D and C_i ∩ C_j = ᶲ (for 1 ≤ i, j ≤ k)

Procedure:

Select k points as initial centroid
Repeat
Form k clusters by assigning each point to its closest centroid
Re-compute the centroid of each cluster
Until the centroids don’t change

Time Complexity of the algorithm:

To analyze the complexity of the algorithm in terms of time required, we need to identify the operations required in each step and in each iteration of the algorithm. Here, for k-means algorithm, the operations for each iteration are:

Calculation of distances: To calculate the distance from a point to the centroid, we can use the squared Euclidean proximity function. For this, we need two subtractions, one summation, two multiplications and one square - root operations, i.e., 6-operations.
Comparisons between distances
Calculation of centroids

So, the total number of operations for an iteration

= distance calculation operations + comparison operations + centroids calculation operations

= 6*[k*m*n] operations + [(k-1)*m*n] operations + [k*((m-1) + 1)*n] operations

Assume that the algorithm converges after I iterations, then total number of operations for the algorithm

= 6*[I*k*m*n] operations + [I*(k-1)*m*n] operations + [I*k*((m-1) + 1)*n] operations

= [6Ikmn + I(k-1)mn + Ikmn] operations

≈ O (I*k*m*n)

Here, we’ve applied the basic assumption of each operation requiring the same amount of time. Since, normally, k << m & n << m, the analysis shows that the k-means is linear in m, the number of points, and is scalable and efficient in processing large data sets.

Space Complexity of the Algorithm:

The space requirement for k-means are modest because only the data points and centroids are needed to be stored. Specifically, the storage required is O((m+k)n), where m is the number of objects (points) & n is the number of attributes considering n-dimensional objects.

From: Algorithm Analysis: The Basic K-means Clustering Algorithm

Saturday, May 11, 2013