Numerical data: Scrubbing

  • Like sorting good apples from bad, ML engineers spend significant time cleaning data by removing or fixing bad examples to improve dataset quality.

  • Common data problems include omitted values, duplicate examples, out-of-range values, and incorrect labels, which can negatively impact model performance.

  • You can use programs or scripts to identify and handle data issues such as omitted values, duplicates, and out-of-range feature values by removing or correcting them.

  • When multiple individuals label data, it's important to check for consistency and identify potential biases to ensure label quality.

  • Addressing data quality issues before training a model leads to better model accuracy and overall performance.

Apple trees produce a mixture of great fruit and wormy messes. Yet the apples in high-end grocery stores display 100% perfect fruit. Between orchard and grocery, someone spends significant time removing the bad apples or spraying a little wax on the salvageable ones. As an ML engineer, you'll spend enormous amounts of your time tossing out bad examples and cleaning up the salvageable ones. Even a few bad apples can spoil a large dataset.

Many examples in datasets are unreliable due to one or more of the following problems:

Problem category              Example
Omitted values                A census taker fails to record a resident's age.
Duplicate examples            A server uploads the same logs twice.
Out-of-range feature values   A human accidentally types an extra digit.
Bad labels                    A human evaluator mislabels a picture of an oak tree as a maple.

You can write a program or script to detect any of the following problems:

  • Omitted values
  • Duplicate examples
  • Out-of-range feature values
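As a minimal sketch of the first check, the following Python snippet scans a small dataset for omitted values. It assumes each example is a dictionary and that an omitted value shows up as a missing key, `None`, or NaN. The function name `find_omitted` and the `census` records are hypothetical and used only for illustration.

```python
import math


def find_omitted(examples, required_features):
    """Return (row_index, feature_name) pairs whose value is missing."""
    omitted = []
    for i, example in enumerate(examples):
        for name in required_features:
            value = example.get(name)  # None when the key is absent
            is_nan = isinstance(value, float) and math.isnan(value)
            if value is None or is_nan:
                omitted.append((i, name))
    return omitted


# Hypothetical census records, invented for this sketch.
census = [
    {"resident_id": 1, "age": 34},
    {"resident_id": 2, "age": None},  # the age was never recorded
    {"resident_id": 3},               # the age field is absent altogether
]

missing = find_omitted(census, ["resident_id", "age"])
print(missing)  # [(1, 'age'), (2, 'age')]
```

Reporting the row index together with the feature name makes it easier to decide later whether to drop the whole example or repair a single value.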

For example, the following dataset contains six repeated values:

Figure 15. The first six values are repeated. The final eight values are not.
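One possible way to detect exact duplicates is sketched below. It assumes each example can be converted to a hashable tuple and that "duplicate" means every field matches. The `logs` rows are made up for illustration and are not the data shown in Figure 15; `find_duplicates` is a hypothetical helper name.

```python
def find_duplicates(examples):
    """Return indices of examples that exactly repeat an earlier example."""
    seen = set()
    duplicate_indices = []
    for i, example in enumerate(examples):
        key = tuple(example)  # rows must be hashable to be stored in a set
        if key in seen:
            duplicate_indices.append(i)
        else:
            seen.add(key)
    return duplicate_indices


# Hypothetical server logs, invented for this sketch.
logs = [
    ("2024-01-01T00:00", "GET /index"),
    ("2024-01-01T00:01", "GET /about"),
    ("2024-01-01T00:00", "GET /index"),  # the same log uploaded a second time
]

dupes = find_duplicates(logs)
dupe_set = set(dupes)
deduplicated = [row for i, row in enumerate(logs) if i not in dupe_set]
print(dupes)  # [2]
```

This keeps the first occurrence of each example and flags only the later copies. Near-duplicates (for example, rows that differ only in whitespace) would need a normalization step that this sketch does not include.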

As another example, suppose the temperature range for a certain feature must be between 10 and 30 degrees, inclusive. But accidents happen; perhaps a thermometer is temporarily exposed to the sun, which causes a bad outlier. Your program or script must identify temperature values less than 10 or greater than 30:

Figure 16. Nineteen in-range values and one out-of-range value.
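The range check described above might look like the following sketch. The bounds of 10 and 30 come from the text; the `readings` list and the helper name `find_out_of_range` are assumptions made for illustration and do not reproduce the values in Figure 16.

```python
MIN_TEMP = 10  # lowest valid temperature, inclusive
MAX_TEMP = 30  # highest valid temperature, inclusive


def find_out_of_range(values, low=MIN_TEMP, high=MAX_TEMP):
    """Return (index, value) pairs for values outside [low, high]."""
    return [(i, v) for i, v in enumerate(values) if v < low or v > high]


# Hypothetical thermometer readings, invented for this sketch.
readings = [18.5, 21.0, 10.0, 30.0, 47.2, 22.3]

outliers = find_out_of_range(readings)
print(outliers)  # [(4, 47.2)]
```

Because the range is inclusive, the boundary readings 10.0 and 30.0 are treated as valid and only 47.2 is flagged.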

When labels are generated by multiple people, we recommend statistically determining whether each rater generated equivalent sets of labels. Perhaps one rater was a harsher grader than the other raters or used a different set of grading criteria.
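The text does not name a specific statistic for comparing raters. One common choice, used here purely as an assumption, is Cohen's kappa, which measures agreement between two raters after correcting for the agreement expected by chance. The sketch below assumes both raters labeled the same examples in the same order; the label lists are invented for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same examples."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("raters must label the same non-empty set of examples")
    n = len(labels_a)
    # Fraction of examples on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two human evaluators.
rater_1 = ["oak", "oak", "maple", "maple", "oak", "maple"]
rater_2 = ["oak", "maple", "maple", "maple", "oak", "maple"]

kappa = cohens_kappa(rater_1, rater_2)
print(round(kappa, 2))  # 0.67
```

A kappa near 1 suggests the raters label consistently, while a value near 0 suggests agreement no better than chance. Comparing each rater's label frequencies (here, `Counter(rater_1)` versus `Counter(rater_2)`) can additionally hint at whether one rater is systematically harsher.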

Once detected, you typically "fix" examples that contain bad features or bad labels by removing them from the dataset or imputing their values. For details, see the Data characteristics section of the Datasets, generalization, and overfitting module.
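The two fixes can be sketched as follows, assuming bad or missing values have already been marked as `None`. Filling with the mean of the observed values is only one possible imputation strategy and is chosen here as an assumption; the `ages` list and both helper names are hypothetical.

```python
from statistics import mean


def drop_missing(values):
    """Fix by removal: keep only the examples that have a value."""
    return [v for v in values if v is not None]


def impute_mean(values):
    """Fix by imputation: replace each missing value with the observed mean."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical ages with one omitted value.
ages = [30, None, 40, 50]

dropped = drop_missing(ages)
imputed = impute_mean(ages)
print(dropped)  # [30, 40, 50]
print(imputed)  # [30, 40, 40, 50]
```

Removal shrinks the dataset but introduces no invented values, whereas imputation keeps every example at the cost of adding an estimate, so the better choice depends on how much data you can afford to lose.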


Last updated 2025-08-25 UTC.