Clustering algorithms

  • Many clustering algorithms have a complexity of \(O(n^2)\), making them impractical for large datasets, while the k-means algorithm scales linearly with a complexity of \(O(n)\).

  • Clustering approaches include centroid-based, density-based, distribution-based, and hierarchical clustering, each suited for different data distributions and structures.

  • Centroid-based clustering, particularly k-means, is efficient for grouping data into non-hierarchical clusters based on the mean of data points, but is sensitive to initial conditions and outliers.

  • Density-based clustering connects areas of high data density, effectively discovering clusters of varying shapes, but struggles with clusters of differing densities and high-dimensional data.

  • Distribution-based clustering assumes data follows specific distributions (e.g., Gaussian), assigning points based on probability, while hierarchical clustering creates a tree of clusters, suitable for hierarchical data.

Machine learning datasets can have millions of examples, but not all clustering algorithms scale efficiently. Many clustering algorithms compute the similarity between all pairs of examples, which means their runtime increases as the square of the number of examples \(n\), denoted as \(O(n^2)\) in complexity notation. \(O(n^2)\) algorithms are not practical for datasets with millions of examples.

The k-means algorithm has a complexity of \(O(n)\), meaning that the algorithm scales linearly with \(n\). This algorithm will be the focus of this course.
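
To make the difference concrete, here is a minimal sketch using NumPy and scikit-learn (illustrative choices; the dataset is synthetic). It materializes the full pairwise-distance matrix that an \(O(n^2)\) algorithm would need:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Hypothetical dataset: n examples with 3 features each.
n = 2_000
X = np.random.rand(n, 3)

# A pairwise-similarity algorithm needs all n x n distances.
# At n = 2,000 that is 4 million entries; at n = 1,000,000 it
# would be 10^12 entries -- the O(n^2) growth that makes such
# algorithms impractical for large datasets.
D = pairwise_distances(X)
print(D.shape)  # (2000, 2000)
```

k-means avoids this blowup: each iteration compares every example only to the \(k\) centroids, so the work grows linearly in \(n\).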

Types of clustering

For an exhaustive list of different approaches to clustering, see A Comprehensive Survey of Clustering Algorithms by Xu, D. & Tian, Y., Ann. Data. Sci. (2015) 2: 165. Each approach is best suited to a particular data distribution. This course briefly discusses four common approaches.

Centroid-based clustering

The centroid of a cluster is the arithmetic mean of all the points in the cluster. Centroid-based clustering organizes the data into non-hierarchical clusters. Centroid-based clustering algorithms are efficient but sensitive to initial conditions and outliers. Of these, k-means is the most widely used. It requires users to define the number of centroids, k, and works well with clusters of roughly equal size.

Figure 1: Example of centroid-based clustering. The lines show borders between clusters.
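
As a rough sketch of centroid-based clustering, the snippet below runs scikit-learn's KMeans on made-up 2-D blobs (the data and parameter values are illustrative assumptions, not part of this course):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D examples forming three rough blobs.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(100, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(100, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(100, 2)),
])

# The user must choose k up front. n_init reruns the algorithm with
# several random initializations, which mitigates (but does not
# remove) the sensitivity to initial conditions.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Each centroid is (approximately) the arithmetic mean of the
# points assigned to it.
members = X[kmeans.labels_ == 0]
print(members.mean(axis=0))
print(kmeans.cluster_centers_[0])
```

Because each centroid is an arithmetic mean, a single far-away outlier can drag a centroid toward it, which is one reason for the sensitivity to outliers noted above.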

Density-based clustering

Density-based clustering connects contiguous areas of high example density into clusters. This allows for the discovery of any number of clusters of any shape. Outliers are not assigned to clusters. These algorithms have difficulty with clusters of differing densities and with high-dimensional data.

Figure 2: Example of density-based clustering. The two clusters are not linearly separable.
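
A minimal sketch of the idea, assuming scikit-learn's DBSCAN (one density-based algorithm among several) and a toy two-moons dataset that no linear boundary can separate:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: k-means would split them incorrectly,
# but a density-based algorithm follows the contiguous dense regions.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps sets the neighborhood radius; min_samples sets the density
# threshold for a point to count as part of a dense region.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# Outliers receive the label -1 rather than a cluster assignment.
print(set(db.labels_))  # e.g. {0, 1}, plus -1 if noise points exist
```

Note that a single global eps must fit all clusters, which is why these algorithms struggle when clusters have very different densities.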

Distribution-based clustering

This clustering approach assumes data is composed of probabilistic distributions, such as Gaussian distributions. In Figure 3, the distribution-based algorithm clusters data into three Gaussian distributions. As distance from the distribution's center increases, the probability that a point belongs to the distribution decreases. The bands show that decrease in probability. When you're not comfortable assuming a particular underlying distribution of the data, you should use a different algorithm.

Figure 3: Example of distribution-based clustering. The shading shows how the density of examples in each cluster maps to the distributions.
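
As an illustrative sketch, assuming scikit-learn's GaussianMixture and made-up data drawn from three Gaussians:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical data sampled from three Gaussian distributions.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=1.0, size=(200, 2)),
    rng.normal(loc=(6, 0), scale=0.5, size=(200, 2)),
    rng.normal(loc=(3, 5), scale=1.5, size=(200, 2)),
])

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

# predict_proba gives the probability that each point belongs to
# each distribution; it falls off with distance from each center.
probs = gmm.predict_proba(X[:1])
print(probs.round(3))  # e.g. [[0.998 0.    0.002]]
```

If the data doesn't plausibly come from such distributions, these probabilities are misleading, which is the caveat in the paragraph above.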

Hierarchical clustering

Hierarchical clustering creates a tree of clusters. Hierarchical clustering, not surprisingly, is well suited to hierarchical data, such as taxonomies. See Comparison of 61 Sequenced Escherichia coli Genomes by Oksana Lukjancenko, Trudy Wassenaar & Dave Ussery for an example. Any number of clusters can be chosen by cutting the tree at the right level.

Figure 4: Example of a hierarchical tree clustering animals.
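
The tree-cutting idea can be sketched with SciPy's agglomerative linkage functions (an assumed choice of library; the data below is synthetic):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical 2-D examples forming two tight blobs.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(20, 2)),
    rng.normal(loc=(4, 4), scale=0.3, size=(20, 2)),
])

# Build the full tree of clusters. Ward linkage repeatedly merges
# the pair of clusters that least increases within-cluster variance.
tree = linkage(X, method="ward")

# "Cutting" the tree at different levels yields different numbers
# of clusters from the same hierarchy, with no re-clustering needed.
labels_2 = fcluster(tree, t=2, criterion="maxclust")
labels_4 = fcluster(tree, t=4, criterion="maxclust")
print(len(set(labels_2)), len(set(labels_4)))  # 2 4
```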
