k-means clustering is an unsupervised learning algorithm that partitions a dataset into k groups (clusters). It works by alternately assigning each data point to its nearest "centroid" (the center point of a cluster) and recomputing each centroid from the points assigned to it.
Features:
- Simple and Fast: Relatively easy to compute, making it applicable to large datasets.
- Requires Parameters: The number of clusters (k) needs to be specified in advance.
- Initial Value Dependency: Results can vary depending on the initial selection of centroids.
- Works Best on Convex Clusters: Good at finding roughly spherical (circular in 2D) clusters, but weak on elongated or irregularly shaped ones.
Steps of k-Means Clustering
- Initialization: Randomly select k centroids from the dataset as initial values.
- Assignment: Assign each data point to the cluster with the nearest centroid. Euclidean distance is often used to measure proximity.
- Update: Move each cluster’s centroid to the average position of the data points belonging to that cluster.
- Iteration: Repeat steps 2 and 3 until the cluster assignments no longer change, or a pre-defined maximum number of iterations is reached.
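To make these four steps concrete, here is a minimal from-scratch sketch in NumPy (the function and variable names are my own, for illustration; real implementations such as scikit-learn's add refinements like k-means++ initialization and empty-cluster handling):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Assignment: label each point with the index of its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update: move each centroid to the mean of its assigned points
        # (this sketch assumes no cluster ever becomes empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Iteration: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids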
Let’s consider the following 2D dataset as an example:
(1, 2), (1, 4), (1, 0), (10, 2), (10, 4), (10, 0)
We want to cluster this data with k=2.
- Initialization: Suppose the random initialization places the centroids at (1, 3) and (9, 3).
- Assignment: Calculate the distance from each data point to the centroids and assign it to the nearest cluster:
- (1, 2) -> (1, 3)
- (1, 4) -> (1, 3)
- (1, 0) -> (1, 3)
- (10, 2) -> (9, 3)
- (10, 4) -> (9, 3)
- (10, 0) -> (9, 3)
- Update: Recompute each cluster's centroid as the mean of its points:
- Cluster 1: ((1+1+1)/3, (2+4+0)/3) = (1, 2)
- Cluster 2: ((10+10+10)/3, (2+4+0)/3) = (10, 2)
- Iteration: Repeat the assignment and update steps. In this example, a second assignment pass leaves every point in its current cluster, so the algorithm has already converged.
Finally, we obtain the following clusters:
- Cluster 1: (1, 2), (1, 4), (1, 0)
- Cluster 2: (10, 2), (10, 4), (10, 0)
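To double-check the arithmetic above, a few lines of NumPy (the variable names here are just for illustration) reproduce the first assignment and update:

import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
init = np.array([[1, 3], [9, 3]])  # the initial centroids from the example
# Assignment: distance from every point to each centroid, then nearest index
d = np.linalg.norm(X[:, None, :] - init[None, :, :], axis=2)
print(d.argmin(axis=1))                        # Output: [0 0 0 1 1 1]
# Update: the mean of each cluster's points gives the new centroids
print(X[:3].mean(axis=0), X[3:].mean(axis=0))  # Output: [1. 2.] [10. 2.]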
Python Code Example (using scikit-learn)
from sklearn.cluster import KMeans
import numpy as np
# Prepare the dataset
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
# Create a k-means model (k=2)
kmeans = KMeans(n_clusters=2, random_state=0, n_init='auto')  # random_state fixes the initialization; n_init is specified to avoid version warnings
# Train the model
kmeans.fit(X)
# Get cluster labels
labels = kmeans.labels_
print("Cluster Labels:", labels) # Output: [1 1 1 0 0 0]
# Get centroids
centroids = kmeans.cluster_centers_
print("Centroids:", centroids) #Output: [[ 1. 2.] [10. 2.]]
# Get the points in each cluster
# Note: scikit-learn's cluster numbering is arbitrary; with this fit,
# label 0 is the (10, *) group and label 1 is the (1, *) group.
print("Cluster 0:", X[labels == 0]) # Output: [[10  2] [10  4] [10  0]]
print("Cluster 1:", X[labels == 1]) # Output: [[1 2] [1 4] [1 0]]
Important Notes
- Choosing the value of k: The number of clusters must be chosen in advance; metrics such as the Elbow Method or Silhouette Analysis help determine an appropriate value (see the sketch after this list).
- Data Standardization: If the features are on different scales, standardizing them first usually yields better results, since Euclidean distance is dominated by features with large ranges.
- Addressing Initial Value Dependency: It's recommended to train with multiple random initializations and keep the best result; scikit-learn's n_init parameter does exactly this, and its default k-means++ initialization also helps.
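As a rough sketch of these checks on the toy dataset above (purely illustrative; with only six points the usable range of k is tiny), scikit-learn's inertia_ attribute and silhouette_score function can be compared across values of k:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
for k in range(2, 5):
    km = KMeans(n_clusters=k, random_state=0, n_init='auto').fit(X)
    # inertia_: sum of squared distances to the nearest centroid; for the
    # Elbow Method, look for the k where it stops dropping sharply
    # silhouette_score: cohesion vs. separation in [-1, 1]; higher is better
    print(k, km.inertia_, silhouette_score(X, km.labels_))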
k-means clustering is simple and fast, but it's important to choose k carefully and to be mindful of the dependence on initial centroids. Libraries like scikit-learn make it easy to apply.
Euclidean Distance Explained
It’s a way to measure the “straight-line distance” between two points.
- In a plane (2 dimensions): It’s calculated using the Pythagorean theorem. For example, the distance between two points (x1, y1) and (x2, y2) is √((x2-x1)² + (y2-y1)²).
- In space (3 or more dimensions): You sum the squares of the differences in each coordinate axis and then take the square root.
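In code this is a one-liner: NumPy's np.linalg.norm computes exactly this square root of summed squared differences, in any number of dimensions:

import numpy as np

p = np.array([1, 2])
q = np.array([10, 2])
# sqrt((10 - 1)^2 + (2 - 2)^2) = 9.0
print(np.linalg.norm(p - q)) # Output: 9.0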
Everyday analogy: It's like measuring the straight-line ("as the crow flies") distance between two locations on a map.
Applications: Used in various fields such as data analysis in machine learning (clustering, etc.), image processing, robotics, and more.