Data preparation
Page Summary
This document reviews data preparation for clustering, focusing on scaling features to the same range.
Normalization via Z-scores is suitable for Gaussian distributions, while log transforms are applied to power-law distributions.
For datasets that don't conform to a standard distribution, bucketing the data into quantiles is recommended before measuring similarity between data points.
Handling missing data involves either removing affected examples or the feature, or predicting missing values using a machine learning model.
This section reviews the data preparation steps most relevant to clustering from the Working with numerical data module in Machine Learning Crash Course.
In clustering, you calculate the similarity between two examples by combining all the feature data for those examples into a numeric value. This requires the features to have the same scale, which can be accomplished by normalizing, transforming, or creating quantiles. If you want to transform your data without inspecting its distribution, you can default to quantiles.
Normalizing data
You can transform data for multiple features to the same scale by normalizing the data.
Z-scores
Whenever you see a dataset roughly shaped like a Gaussian distribution, you should calculate z-scores for the data. Z-scores are the number of standard deviations a value is from the mean. You can also use z-scores when the dataset isn't large enough for quantiles.
See Z-score scaling to review the steps.
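For example, here's a minimal sketch of z-score scaling using NumPy; the `feature` values are made up for illustration:

```python
import numpy as np

# Made-up raw feature values; any 1-D numeric array works the same way.
feature = np.array([12.0, 15.0, 14.0, 10.0, 18.0, 20.0, 11.0])

# A z-score is the number of standard deviations a value is from the mean.
z_scores = (feature - feature.mean()) / feature.std()

print(z_scores)
```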
Here is a visualization of two features of a dataset before and after z-score scaling:

In the unnormalized dataset on the left, Feature 1 and Feature 2, graphed on the x and y axes respectively, don't have the same scale, and the red example appears closer, or more similar, to blue than to yellow. On the right, after z-score scaling, Feature 1 and Feature 2 have the same scale, and the red example appears closer to the yellow example. The normalized dataset gives a more accurate measure of similarity between points.
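To make the effect on similarity concrete, the following sketch compares Euclidean distances between three made-up points before and after z-score scaling. The values are illustrative assumptions, not the data behind the figure:

```python
import numpy as np

# Three made-up examples with two features on very different scales.
points = np.array([
    [1.0, 500.0],   # "red"
    [1.2, 900.0],   # "yellow"
    [3.0, 520.0],   # "blue"
])

def euclidean(a, b):
    return np.linalg.norm(a - b)

# Before scaling, Feature 2 dominates the distance calculation.
print(euclidean(points[0], points[1]), euclidean(points[0], points[2]))

# After z-score scaling each feature (column), both features contribute
# comparably, so the relative distances between examples change.
scaled = (points - points.mean(axis=0)) / points.std(axis=0)
print(euclidean(scaled[0], scaled[1]), euclidean(scaled[0], scaled[2]))
```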
Log transforms
When a dataset conforms to a power law distribution, where data is heavily clumped at the lowest values, use a log transform. See Log scaling to review the steps.
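Here's a minimal sketch of a log transform with NumPy. The values and the use of `np.log1p` (which keeps zeros finite) are illustrative assumptions:

```python
import numpy as np

# Made-up power-law-like data: most values are small, a few are very large.
feature = np.array([1, 2, 2, 3, 5, 8, 40, 350, 9000], dtype=float)

# Log scaling compresses the long tail so the largest values no longer dominate.
log_feature = np.log1p(feature)  # log(1 + x)

print(log_feature)
```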
Here is a visualization of a power-law dataset before and after a log transform:


Before log scaling (Figure 2), the red example appears more similar to yellow. After log scaling (Figure 3), red appears more similar to blue.
Quantiles
Binning the data into quantiles works well when the dataset does not conform to a known distribution. Take this dataset, for example:

Intuitively, two examples are more similar if only a few examples fall between them, irrespective of their values, and more dissimilar if many examples fall between them. The visualization above makes it difficult to see the total number of examples that fall between red and yellow, or between red and blue.
This understanding of similarity can be brought out by dividing the dataset into quantiles, or intervals that each contain equal numbers of examples, and assigning the quantile index to each example. See Quantile bucketing to review the steps.
Here is the previous distribution divided into quantiles, showing that red is one quantile away from yellow and three quantiles away from blue:
![A graph showing the data after conversion into quantiles. The lines represent 20 intervals.](https://developers.google.com/static/machine-learning/clustering/images/Quantize.png)
You can choose any number \(n\) of quantiles. However, for the quantiles to meaningfully represent the underlying data, your dataset should have at least \(10n\) examples. If you don't have enough data, normalize instead.
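As an illustration, the following sketch buckets a made-up dataset into 20 quantiles with pandas; the data, the quantile count, and the use of `pd.qcut` are assumptions for the example:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Made-up data that doesn't follow a standard distribution.
feature = rng.gamma(shape=0.5, scale=100.0, size=300)

n_quantiles = 20
# Rule of thumb from above: the dataset should have at least 10n examples.
assert len(feature) >= 10 * n_quantiles

# pd.qcut splits the data into intervals containing roughly equal numbers of
# examples; labels=False returns each example's quantile index.
quantile_index = pd.qcut(feature, q=n_quantiles, labels=False, duplicates="drop")

print(quantile_index[:10])
```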
Missing data
If your dataset has examples with missing values for a certain feature, but those examples occur rarely, you can remove these examples. If those examples occur frequently, you can either remove that feature altogether, or you can predict the missing values from other examples using a machine learning model. For example, you can impute missing numerical data by using a regression model trained on existing feature data.
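As one way to do this, the sketch below uses scikit-learn's IterativeImputer, which predicts each missing value from the other features with a regression model; the tiny feature matrix is a made-up example:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Made-up feature matrix; np.nan marks missing values in the second feature.
X = np.array([
    [1.0, 2.1],
    [2.0, np.nan],
    [3.0, 6.2],
    [4.0, np.nan],
    [5.0, 10.1],
])

# IterativeImputer fits a regression model (BayesianRidge by default) on the
# observed values and uses it to fill in the missing ones.
imputer = IterativeImputer(random_state=0)
X_imputed = imputer.fit_transform(X)

print(X_imputed)
```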
Note: The problem of missing data is not specific to clustering. In supervised learning, you can assign the feature an "unknown" value. However, you cannot use an "unknown" value when designing a similarity measure, because it's not possible to quantify the similarity between "unknown" and any known value.