Data preparation
Page Summary
This document reviews data preparation for clustering, focusing on scaling features to the same range.
Normalization via Z-scores is suitable for Gaussian distributions, while log transforms are applied to power-law distributions.
For datasets that don't conform to a standard distribution, bucketing the data into quantiles is recommended before measuring similarity between data points.
Handling missing data involves either removing affected examples or the feature, or predicting missing values using a machine learning model.
This section reviews the data preparation steps most relevant to clustering from the Working with numerical data module in Machine Learning Crash Course.
In clustering, you calculate the similarity between two examples by combining all the feature data for those examples into a numeric value. This requires the features to have the same scale, which can be accomplished by normalizing, transforming, or creating quantiles. If you want to transform your data without inspecting its distribution, you can default to quantiles.
Normalizing data
You can transform data for multiple features to the same scale by normalizing the data.
Z-scores
Whenever you see a dataset roughly shaped like a Gaussian distribution, you should calculate z-scores for the data. Z-scores are the number of standard deviations a value is from the mean. You can also use z-scores when the dataset isn't large enough for quantiles.
See Z-score scaling to review the steps.
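For example, here's a minimal sketch of z-score scaling using NumPy; the `feature` values are made up for illustration:

```python
import numpy as np

# Made-up raw feature values; any 1-D numeric array works the same way.
feature = np.array([12.0, 15.0, 14.0, 10.0, 18.0, 20.0, 11.0])

# A z-score is the number of standard deviations a value is from the mean.
z_scores = (feature - feature.mean()) / feature.std()

print(z_scores)
```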
Here is a visualization of two features of a dataset before and after z-score scaling:

In the unnormalized dataset on the left, Feature 1 and Feature 2, graphed on the x and y axes respectively, don't have the same scale, and the red example appears closer, or more similar, to blue than to yellow. On the right, after z-score scaling, Feature 1 and Feature 2 have the same scale, and the red example appears closer to the yellow example. The normalized dataset gives a more accurate measure of similarity between points.
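To make the effect on similarity concrete, the following sketch compares Euclidean distances between three made-up points before and after z-score scaling. The values are illustrative assumptions, not the data behind the figure:

```python
import numpy as np

# Three made-up examples with two features on very different scales.
points = np.array([
    [1.0, 500.0],   # "red"
    [1.2, 900.0],   # "yellow"
    [3.0, 520.0],   # "blue"
])

def euclidean(a, b):
    return np.linalg.norm(a - b)

# Before scaling, Feature 2 dominates the distance calculation.
print(euclidean(points[0], points[1]), euclidean(points[0], points[2]))

# After z-score scaling each feature (column), both features contribute
# comparably, so the relative distances between examples change.
scaled = (points - points.mean(axis=0)) / points.std(axis=0)
print(euclidean(scaled[0], scaled[1]), euclidean(scaled[0], scaled[2]))
```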
Log transforms
When a dataset conforms to a power law distribution, where data is heavily clumped at the lowest values, use a log transform. See Log scaling to review the steps.
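Here's a minimal sketch of a log transform with NumPy. The values and the use of `np.log1p` (which keeps zeros finite) are illustrative assumptions:

```python
import numpy as np

# Made-up power-law-like data: most values are small, a few are very large.
feature = np.array([1, 2, 2, 3, 5, 8, 40, 350, 9000], dtype=float)

# Log scaling compresses the long tail so the largest values no longer dominate.
log_feature = np.log1p(feature)  # log(1 + x)

print(log_feature)
```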
Here is a visualization of a power-law dataset before and after a log transform:


Before log scaling (Figure 2), the red example appears more similar to yellow. After log scaling (Figure 3), red appears more similar to blue.
Quantiles
Binning the data into quantiles works well when the dataset does not conform to a known distribution. Take this dataset, for example:

Intuitively, two examples are more similar if only a few examples fall between them, irrespective of their values, and more dissimilar if many examples fall between them. The visualization above makes it difficult to see the total number of examples that fall between red and yellow, or between red and blue.
This understanding of similarity can be brought out by dividing the dataset into quantiles, or intervals that each contain equal numbers of examples, and assigning the quantile index to each example. See Quantile bucketing to review the steps.
Here is the previous distribution divided into quantiles, showing that red is one quantile away from yellow and three quantiles away from blue:
![A graph showing the data after conversion into quantiles. The lines represent 20 intervals.](https://developers.google.com/static/machine-learning/clustering/images/Quantize.png)
You can choose any number \(n\) of quantiles. However, for the quantiles to meaningfully represent the underlying data, your dataset should have at least \(10n\) examples. If you don't have enough data, normalize instead.
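As an illustration, the following sketch buckets a made-up dataset into 20 quantiles with pandas; the data, the quantile count, and the use of `pd.qcut` are assumptions for the example:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Made-up data that doesn't follow a standard distribution.
feature = rng.gamma(shape=0.5, scale=100.0, size=300)

n_quantiles = 20
# Rule of thumb from above: the dataset should have at least 10n examples.
assert len(feature) >= 10 * n_quantiles

# pd.qcut splits the data into intervals containing roughly equal numbers of
# examples; labels=False returns each example's quantile index.
quantile_index = pd.qcut(feature, q=n_quantiles, labels=False, duplicates="drop")

print(quantile_index[:10])
```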
Missing data
If your dataset has examples with missing values for a certain feature, but those examples occur rarely, you can remove these examples. If those examples occur frequently, you can either remove that feature altogether, or you can predict the missing values from other examples using a machine learning model. For example, you can impute missing numerical data by using a regression model trained on existing feature data.
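As one way to do this, the sketch below uses scikit-learn's IterativeImputer, which predicts each missing value from the other features with a regression model; the tiny feature matrix is a made-up example:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Made-up feature matrix; np.nan marks missing values in the second feature.
X = np.array([
    [1.0, 2.1],
    [2.0, np.nan],
    [3.0, 6.2],
    [4.0, np.nan],
    [5.0, 10.1],
])

# IterativeImputer fits a regression model (BayesianRidge by default) on the
# observed values and uses it to fill in the missing ones.
imputer = IterativeImputer(random_state=0)
X_imputed = imputer.fit_transform(X)

print(X_imputed)
```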
Note: The problem of missing data is not specific to clustering. In supervised learning, you can assign the feature an "unknown" value. However, you cannot use an "unknown" value when designing a similarity measure, because it's not possible to quantify the similarity between "unknown" and any known value.