K-Means Clustering with Python: A Practical Guide

Understanding K-Means Clustering

K-means clustering is a type of unsupervised learning algorithm used to divide a given dataset into k clusters, or subgroups. We first choose the number of clusters, and the algorithm then groups similar data points into those clusters, iteratively refining the clusters' centroids until the assignments stabilize.

How K-Means Clustering Works

The approach of K-means clustering is similar to optimizing a cost function in linear regression to find the best-fit line: the algorithm minimizes the within-cluster sum of squares. We start by selecting the number of clusters that we want to divide the data into. The key steps in the algorithm are:

  • Select the number of clusters to divide the data into.
  • Randomly choose an initial center (centroid) for each cluster.
  • Calculate the distance from each data point to each centroid.
  • Assign each data point to the closest centroid.
  • Recalculate each centroid as the mean of the data points assigned to it.
  • Repeat steps 3-5 until the centroids no longer move, i.e. the algorithm converges (a minimal sketch of this loop follows the list).
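
To make these steps concrete, here is a minimal NumPy sketch of the assignment/update loop described above. It is a simplified illustration rather than the scikit-learn implementation used later; the function and parameter names (kmeans_sketch, n_iters) are our own, and edge cases such as empty clusters are not handled:

```python
import numpy as np

def kmeans_sketch(x, k, n_iters=100, seed=0):
    # Step 2: randomly pick k data points as the initial centroids
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iters):
        # Steps 3-4: compute point-to-centroid distances and assign each
        # point to its nearest centroid
        distances = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 5: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([x[labels == j].mean(axis=0) for j in range(k)])
        # Step 6: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```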

Advantages of K-Means Clustering

K-means clustering is easy to interpret and implement, and it works well with large datasets because the algorithm converges quickly.

Applications of K-Means Clustering

K-means clustering is used in various applications such as:

  • Image Segmentation
  • Anomaly Detection
  • Market Segmentation
  • Recommender Systems
K-Means Clustering with Python

We will be using the Python scikit-learn library, a popular library for machine learning in Python. The following libraries need to be imported for the implementation of K-means clustering:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
```

Example Implementation of K-Means Clustering

Let us see how we can implement K-means clustering on a given dataset. We will work with a sample dataset that contains the heights and weights of individuals, where the aim is to segment individuals into different groups based on their height and weight.

```python
# Import the dataset and extract the columns used for clustering
dataset = pd.read_csv('sample_data.csv')
x = dataset.iloc[:, [2, 3]].values
```

The above code imports the dataset and extracts the values of the required columns. We then determine the optimal number of clusters for our dataset using the elbow method, which plots the number of clusters against the WCSS (Within-Cluster Sum of Squares). The optimal number of clusters is the point where the curve starts to flatten out, forming the "elbow" of the plot; the silhouette score offers a complementary check, as sketched after the elbow code below.

```python
# Use the elbow method to choose the number of clusters
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(x)
    wcss.append(kmeans.inertia_)

plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
```
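
As a complementary check on the elbow plot, the silhouette score can be computed for a range of cluster counts. This is a minimal sketch that assumes the same x array and uses scikit-learn's silhouette_score; it is not part of the original walkthrough:

```python
from sklearn.metrics import silhouette_score

# The silhouette score requires at least 2 clusters, so start the range at 2
for i in range(2, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    labels = kmeans.fit_predict(x)
    print(f'k = {i}: silhouette score = {silhouette_score(x, labels):.3f}')
```

A higher silhouette score indicates better-separated clusters, so it should peak near the value of k suggested by the elbow.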

The elbow method gives us an optimal value of 3 clusters. We can now perform the clustering using the K-means algorithm.

```python
# Training the K-means model on the dataset
kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10, random_state=0)
y_kmeans = kmeans.fit_predict(x)
```
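
Once fitted, the model exposes the learned centroids and per-point labels, which are useful for a quick sanity check before plotting. For example, assuming the model trained above:

```python
# Coordinates of the 3 learned centroids, one row per cluster
print(kmeans.cluster_centers_)

# Cluster label assigned to each data point (same values as y_kmeans)
print(kmeans.labels_[:10])

# Final within-cluster sum of squares (WCSS) for the fitted model
print(kmeans.inertia_)
```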

Finally, we plot the clusters:

```python
# Visualizing the clusters in the dataset
plt.scatter(x[y_kmeans == 0, 0], x[y_kmeans == 0, 1], s=100, c='red', label='Cluster 1')
plt.scatter(x[y_kmeans == 1, 0], x[y_kmeans == 1, 1], s=100, c='blue', label='Cluster 2')
plt.scatter(x[y_kmeans == 2, 0], x[y_kmeans == 2, 1], s=100, c='green', label='Cluster 3')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label='Centroids')
plt.title('Clusters of individuals')
plt.xlabel('Height (in cms)')
plt.ylabel('Weight (in pounds)')
plt.legend()
plt.show()
```

Conclusion

K-means clustering is an efficient algorithm that divides a given dataset into k clusters. It is easy to implement and interpret, and it works well with large datasets. In this article, we learned how to implement K-means clustering, determine the optimal number of clusters using the elbow method, and visualize the clusters on a sample dataset using the Python scikit-learn library.
