The basic k-means clustering algorithm is a simple algorithm that separates the given data space into different clusters based on centroids calculation using some proximity function. Using this algorithm, we first choose the k- points as initial centroids and then each point is assigned to a cluster with the closest centroid. The algorithm is formally described as follows:

**Input:**A data set D containing m objects (points) with n attributes in an Euclidean space

**Output:**Partitioning of m objects into k-clusters C

_{1,}C

_{2}, C

_{3,}…, C

_{k},

*i.e.*C

_{i}⊂ D and C

_{i}∩ C

_{j}= ᶲ (for 1 ≤ i, j ≤ k)

#### Procedure:

- Select k points as initial centroid
**Repeat**- Form k clusters by assigning each point to its closest centroid
- Re-compute the centroid of each cluster
**Until**the centroids don’t change

#### Time Complexity of the algorithm:

To analyze the complexity of the algorithm in terms of time required, we need to identify the operations required in each step and in each iteration of the algorithm. Here, for k-means algorithm, the operations for each iteration are:

- Calculation of distances: To calculate the distance from a point to the centroid, we can use the squared Euclidean proximity function. For this, we need two subtractions, one summation, two multiplications and one square - root operations, i.e., 6-operations.
- Comparisons between distances
- Calculation of centroids

So, the total number of operations for an iteration

= distance calculation operations + comparison operations + centroids calculation operations

= 6*[k*m*n] operations + [(k-1)*m*n] operations + [k*((m-1) + 1)*n] operations

Assume that the algorithm converges after I iterations, then total number of operations for the algorithm

= 6*[I*k*m*n] operations + [I*(k-1)*m*n] operations + [I*k*((m-1) + 1)*n] operations

= [6Ikmn + I(k-1)mn + Ikmn] operations

≈ O (I*k*m*n)

Here, we’ve applied the basic assumption of each operation requiring the same amount of time. Since, normally, k << m & n << m, the analysis shows that the k-means is linear in m, the number of points, and is scalable and efficient in processing large data sets.

#### Space Complexity of the Algorithm:

The space requirement for k-means are modest because only the data points and centroids are needed to be stored. Specifically, the storage required is O((m+k)n), where m is the number of objects (points) & n is the number of attributes considering n-dimensional objects.

## No comments :

## Post a Comment