What is clustering?

  • Clustering is an unsupervised machine learning technique used to group similar unlabeled data points into clusters based on defined similarity measures.

  • Cluster analysis can be applied to various domains like market segmentation, social network analysis, and medical imaging to identify patterns and simplify complex datasets.

  • Clustering enables data compression by replacing numerous features with a single cluster ID, reducing storage and processing needs.

  • It facilitates data imputation by inferring missing feature data from other examples within the same cluster.

  • Clustering offers a degree of privacy preservation by associating user data with cluster IDs instead of individual identifiers.

Suppose you are working with a dataset that includes patient information from a healthcare system. The dataset is complex and includes both categorical and numeric features. You want to find patterns and similarities in the dataset. How might you approach this task?

Clustering is an unsupervised machine learning technique designed to group unlabeled examples based on their similarity to each other. (If the examples are labeled, this kind of grouping is called classification.)

Consider a hypothetical patient study designed to evaluate a new treatment protocol. During the study, patients report how many times per week they experience symptoms and the severity of the symptoms. Researchers can use clustering analysis to group patients with similar treatment responses into clusters. Figure 1 demonstrates one possible grouping of simulated data into three clusters.

On the left, a graph of symptom severity vs. symptom count displaying data points that suggest three clusters. On the right, the same graph but with each of the three clusters colored.
Figure 1: Unlabeled examples grouped into three clusters (simulated data).

Looking at the unlabeled data on the left of Figure 1, you could guess that the data forms three clusters, even without a formal definition of similarity between data points. In real-world applications, however, you need to explicitly define a similarity measure, or the metric used to compare samples, in terms of the dataset's features. When examples have only a couple of features, visualizing and measuring similarity is straightforward. But as the number of features increases, combining and comparing features becomes less intuitive and more complex. Different similarity measures may be more or less appropriate for different clustering scenarios, and this course will address choosing an appropriate similarity measure in later sections: Manual similarity measures and Similarity measure from embeddings.
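As a rough illustration of a similarity measure for purely numeric features, the sketch below assumes Euclidean distance on min-max scaled features, converted to a score in (0, 1]. This is only one possible choice. The function names and the patient values are hypothetical and are not taken from the course.

```python
import math

def min_max_scale(rows):
    """Scale each column of `rows` to [0, 1] so no single feature dominates."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def similarity(a, b):
    """Map Euclidean distance to a score in (0, 1]; 1.0 means identical."""
    return 1.0 / (1.0 + math.dist(a, b))

# Hypothetical patients: (symptoms per week, severity on a 1-10 scale).
patients = [(2, 3), (3, 3), (12, 9)]
scaled = min_max_scale(patients)

print(similarity(scaled[0], scaled[1]))  # close pair, higher score
print(similarity(scaled[0], scaled[2]))  # distant pair, lower score
```

Scaling first matters here because symptom count and severity use different ranges. Without it, the feature with the larger range would dominate the distance.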

After clustering, each group is assigned a unique label called a cluster ID. Clustering is powerful because it can simplify large, complex datasets with many features to a single cluster ID.
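To make the idea of a cluster ID concrete, here is a minimal sketch that assigns one ID per example. It assumes k-means, which is one common clustering algorithm; the text above does not prescribe a specific algorithm. The farthest-point initialization, the `k_means` function name, and the patient values are illustrative assumptions.

```python
import math

def k_means(points, k, iterations=10):
    """Return one cluster ID per point using a basic k-means loop."""
    # Farthest-point initialization: deterministic, and spreads the
    # starting centroids out (an assumption; random init is also common).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )

    ids = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point gets the ID of its nearest centroid.
        ids = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, i in zip(points, ids) if i == j]
            if members:
                centroids[j] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return ids

# Hypothetical patients: (symptoms per week, severity on a 1-10 scale).
patients = [(1, 2), (2, 1), (2, 2), (6, 5), (7, 6), (6, 6),
            (12, 9), (13, 8), (12, 8)]
cluster_ids = k_means(patients, k=3)
print(cluster_ids)  # one cluster ID per patient
```

The ID values themselves are arbitrary labels. Only the fact that two examples share an ID carries meaning.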

Clustering use cases

Clustering is useful in a variety of industries. Some common applications for clustering:

  • Market segmentation
  • Social network analysis
  • Search result grouping
  • Medical imaging
  • Image segmentation
  • Anomaly detection

Some specific examples of clustering:

  • The Hertzsprung-Russell diagram shows clusters of stars when plotted by luminosity and temperature.
  • Gene sequencing that shows previously unknown genetic similarities and dissimilarities between species has led to the revision of taxonomies previously based on appearances.
  • The Big 5 model of personality traits was developed by clustering words that describe personality into 5 groups. The HEXACO model uses 6 clusters instead of 5.

Imputation

When some examples in a cluster have missing feature data, you can infer the missing data from other examples in the cluster. This is called imputation. For example, less popular videos can be clustered with more popular videos to improve video recommendations.
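One simple way to carry this out is to fill each missing value with the mean of the known values in the same cluster. The sketch below assumes that strategy and assumes cluster IDs are already available. The `impute_by_cluster` helper and the watch-time numbers are hypothetical.

```python
from statistics import mean

def impute_by_cluster(values, cluster_ids):
    """Replace None with the mean of known values in the same cluster."""
    known = {}
    for value, cid in zip(values, cluster_ids):
        if value is not None:
            known.setdefault(cid, []).append(value)
    # Leave a gap as None when its cluster has no known values to borrow.
    return [
        value if value is not None
        else (mean(known[cid]) if cid in known else None)
        for value, cid in zip(values, cluster_ids)
    ]

# Hypothetical average watch time (minutes) per video; None marks missing data.
watch_minutes = [4.0, 6.0, None, 20.0, None, 22.0]
cluster_ids = [0, 0, 0, 1, 1, 1]
print(impute_by_cluster(watch_minutes, cluster_ids))
```

Other per-cluster statistics, such as the median or the most frequent category, may suit some features better than the mean.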

Data compression

As discussed, the relevant cluster ID can replace other features for all examples in that cluster. This substitution reduces the number of features and therefore also reduces the resources needed to store, process, and train models on that data. For very large datasets, these savings become significant.

To give an example, a single YouTube video can have feature data including:

  • viewer location, time, and demographics
  • comment timestamps, text, and user IDs
  • video tags

Clustering YouTube videos replaces this set of features with a single cluster ID, thus compressing the data.
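The substitution can be sketched as follows, assuming an earlier clustering step has already produced a cluster ID for each video. The feature names, values, and cluster assignments are made up for illustration.

```python
import json

# Hypothetical per-video feature data (names and values are illustrative).
videos = {
    "video_a": {"tags": ["cats", "funny"], "viewer_regions": ["US", "CA"], "comments": 120},
    "video_b": {"tags": ["cats", "pets"], "viewer_regions": ["US"], "comments": 45},
    "video_c": {"tags": ["chess"], "viewer_regions": ["IN", "DE"], "comments": 300},
}

# Assumed output of an earlier clustering step: one cluster ID per video.
cluster_of = {"video_a": 0, "video_b": 0, "video_c": 1}

# Lossy compression: keep only the cluster ID for each video.
compressed = {video_id: cluster_of[video_id] for video_id in videos}

print(len(json.dumps(videos)), "characters before")
print(len(json.dumps(compressed)), "characters after")
```

This compression is lossy: the original features cannot be recovered from the cluster ID, so it fits cases where cluster membership is all that downstream steps need.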

Privacy preservation

You can preserve privacy somewhat by clustering users and associating user data with cluster IDs instead of user IDs. To give one possible example, say you want to train a model on YouTube users' watch history. Instead of passing user IDs to the model, you could cluster users and pass only the cluster ID. This keeps individual watch histories from being attached to individual users. Note that the cluster must contain a sufficiently large number of users in order to preserve privacy.
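A minimal sketch of this idea follows. It swaps user IDs for cluster IDs and drops rows from clusters below a minimum size. The `anonymize` helper, the `min_users` threshold, and the sample data are assumptions for illustration. The text above does not say how large "sufficiently large" is, and this sketch alone does not guarantee privacy.

```python
from collections import Counter

def anonymize(watch_rows, cluster_of, min_users):
    """Swap user IDs for cluster IDs; drop rows from clusters that are too small."""
    sizes = Counter(cluster_of.values())
    return [
        (cluster_of[user], video)
        for user, video in watch_rows
        if sizes[cluster_of[user]] >= min_users
    ]

# Hypothetical user-to-cluster assignments and (user ID, video ID) watch rows.
cluster_of = {"u1": 0, "u2": 0, "u3": 0, "u4": 1}
watch_rows = [("u1", "v9"), ("u2", "v9"), ("u3", "v4"), ("u4", "v7")]

# Cluster 1 has a single user, so its row is dropped rather than released.
print(anonymize(watch_rows, cluster_of, min_users=3))
```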


Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-08-25 UTC.