Clustering Techniques With K-Means And Hierarchical Clustering

Clustering Techniques in Python: A Deep Dive into K-Means and Hierarchical Clustering

Clustering is a crucial unsupervised learning task that involves partitioning a dataset into different groups, or clusters. The goal is to place data points with similar traits in the same cluster. In Python, two common techniques used for clustering are K-Means and Hierarchical Clustering.


This comprehensive tutorial delves into these techniques, explaining how they work and illustrating their use with detailed examples. The tutorial targets both beginners and experienced Python enthusiasts, aiming for a clear explanation with a professional yet accessible tone.

Table of Contents

  1. K-Means Clustering
  2. Hierarchical Clustering
  3. Differences Between K-Means and Hierarchical Clustering
  4. Conclusion

1. K-Means Clustering

The K-Means clustering algorithm partitions a set of N data points into K non-overlapping clusters, where each point is assigned to the cluster whose centroid it is nearest to. It is called K-Means because each of the K clusters is represented by the mean (centroid) of the points it contains, and K is always less than N.

The K-Means algorithm starts with an initial set of randomly selected centroids, one per cluster, and then iterates between two steps: assigning each point to its nearest centroid and recomputing each centroid as the mean of the points assigned to it, until the centroids stop moving. A minimal sketch of this loop is shown below.
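Before turning to scikit-learn, here is a minimal from-scratch sketch of that loop using NumPy. The function name kmeans_sketch and the arguments X (a 2-D array of points) and k are illustrative, not part of any library.

import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: attach each point to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop early once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids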

Python Implementation of K-Means Clustering

We will use the scikit-learn (sklearn) library to implement K-Means clustering.

Firstly, let’s import the required libraries:

import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

Next, load the data and define the feature matrix X:

data = pd.read_csv('data.csv')       # load the dataset used in this tutorial
X = data.iloc[:, [1, 2]].values      # keep the two feature columns to cluster on

Proceed to apply the KMeans algorithm:

kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10, random_state=0)
y_kmeans = kmeans.fit_predict(X)
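As a quick sanity check (not part of the original walkthrough), y_kmeans holds one cluster label per row of X, so you can inspect the cluster sizes and the fitted centroids directly:

import numpy as np

print(np.bincount(y_kmeans))    # number of points assigned to each of the 3 clusters
print(kmeans.cluster_centers_)  # coordinates of the fitted centroids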

Finally, let’s visualize the clusters:

plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s=100, c='red', label='Cluster 1')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s=100, c='blue', label='Cluster 2')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s=100, c='green', label='Cluster 3')

plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label='Centroids')
plt.title('Clusters of customers')
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.legend()
plt.show()

2. Hierarchical Clustering

Like K-Means, Hierarchical Clustering also groups similar objects into clusters. However, unlike K-Means, it doesn’t require the number of clusters to be specified beforehand. It creates a hierarchy of clusters, hence the name.

Hierarchical Clustering comes in two types – Agglomerative and Divisive.

Agglomerative Hierarchical Clustering starts with each object as a separate cluster and then merges them into successively larger clusters.

On the other hand, Divisive Hierarchical Clustering begins with the whole set and proceeds to divide it into successively smaller clusters.

Python Implementation of Hierarchical Clustering

For the implementation of hierarchical clustering, we will make use of the SciPy and scikit-learn libraries.

Let’s first import the required libraries:

from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

Next, create the linkage matrix:

linked = linkage(X, 'ward')

Then plot the dendrogram:

dendrogram(linked, orientation='top', distance_sort='descending', show_leaf_counts=True)
plt.show()
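Because hierarchical clustering does not fix the number of clusters up front, you can also cut the dendrogram at a chosen distance to obtain flat cluster labels. The sketch below uses SciPy's fcluster; the threshold value 5 is purely illustrative and would normally be read off the dendrogram.

from scipy.cluster.hierarchy import fcluster

labels = fcluster(linked, t=5, criterion='distance')  # cut the tree at distance 5
print(labels[:10])  # cluster id assigned to the first ten points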

Finally, apply the clustering and carry out the visualization:

cluster = AgglomerativeClustering(n_clusters=2, linkage='ward')  # ward linkage always uses Euclidean distance
cluster.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=cluster.labels_, cmap='rainbow')
plt.show()

3. Differences Between K-Means and Hierarchical Clustering

Although both techniques aim to cluster data points, there are a few fundamental differences:

  • Number of Clusters: K-Means requires the number of clusters ‘K’ to be chosen upfront (the elbow-method sketch after this list is one common way to pick it). Hierarchical clustering does not require ‘K’; it produces a dendrogram that can be cut at different levels to yield different numbers of clusters.

  • Algorithm: K-Means starts with ‘K’ randomly initialized centroids and iteratively reassigns points to the nearest centroid. In agglomerative hierarchical clustering, every point starts as its own cluster and the closest clusters are merged step by step.

  • Application: K-Means scales well to large datasets because each iteration is roughly linear in the number of points. Agglomerative hierarchical clustering is usually impractical for large datasets because of its at least quadratic time and memory requirements.
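
Since K-Means needs ‘K’ in advance, a common heuristic is the elbow method: fit K-Means for a range of K values and plot the within-cluster sum of squares (inertia_); the “elbow” of the curve suggests a reasonable K. This sketch reuses the KMeans import, matplotlib, and the feature matrix X from the K-Means section above.

inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0)
    km.fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squares for this K

plt.plot(range(1, 11), inertias, marker='o')
plt.title('Elbow method')
plt.xlabel('Number of clusters K')
plt.ylabel('Within-cluster sum of squares')
plt.show()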

Conclusion

The choice of clustering technique primarily depends on the problem at hand, the dataset size, and the number of clusters. K-Means is a good choice for large datasets when the number of clusters is known or can be estimated. Hierarchical clustering, on the other hand, lends itself to data interpretation through dendrograms and does not require a predetermined number of clusters.

It is crucial to comprehend that these are not the only clustering methods. Other techniques, such as DBSCAN, Mean-Shift, Spectral Clustering, and Expectation-Maximization (EM) Clustering, also exist. Therefore, Python enthusiasts and data scientists should strive to perfect their skills with these clustering methods as well. Happy coding!
