Datasets: Transforming data

  • Machine learning models require all data, including features like street names, to be transformed into numerical (floating-point) representations for training.

  • Normalization improves model training by converting existing floating-point features to a constrained range.

  • When dealing with large datasets, selecting a relevant subset of data for training is essential for model performance.

  • Protecting user privacy by excluding Personally Identifiable Information (PII) from datasets is a critical consideration.

Machine learning models can only train on floating-point values. However, many dataset features are not naturally floating-point values. Therefore, one important part of machine learning is transforming non-floating-point features to floating-point representations.

For example, suppose street names is a feature. Most street names are strings, such as "Broadway" or "Vilakazi". Your model can't train on "Broadway", so you must transform "Broadway" to a floating-point number. The Categorical Data module explains how to do this.
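As a rough illustration of turning strings into floating-point values, the sketch below uses one-hot encoding, which is one common approach and not necessarily the only one the Categorical Data module covers. The street list, the vocabulary ordering, and the helper names build_vocabulary and one_hot are assumptions made for this example.

```python
# Hypothetical street-name feature; a real dataset would supply its own values.
street_names = ["Broadway", "Vilakazi", "Broadway", "Main"]


def build_vocabulary(values):
    """Map each distinct string to an integer index, in sorted order."""
    return {name: index for index, name in enumerate(sorted(set(values)))}


def one_hot(value, vocabulary):
    """Return a list of floats with 1.0 at the value's index and 0.0 elsewhere."""
    vector = [0.0] * len(vocabulary)
    vector[vocabulary[value]] = 1.0
    return vector


vocabulary = build_vocabulary(street_names)
encoded = [one_hot(name, vocabulary) for name in street_names]

for name, vector in zip(street_names, encoded):
    print(name, vector)
```

Each street name becomes a vector of floats whose length equals the number of distinct names, so this sketch only suits features with a manageable number of categories.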

Additionally, you should even transform most floating-point features. This transformation process, called normalization, converts floating-point numbers to a constrained range that improves model training. The Numerical Data module explains how to do this.
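To make the idea of a constrained range concrete, here is a minimal sketch of min-max scaling, which maps values into the range 0.0 to 1.0. It is only one of several normalization techniques, and the feature values and the function name min_max_scale are assumptions for illustration.

```python
def min_max_scale(values):
    """Linearly rescale values so the minimum maps to 0.0 and the maximum to 1.0."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Every value is identical, so there is no range to scale over.
        return [0.0 for _ in values]
    return [(value - lowest) / (highest - lowest) for value in values]


# Hypothetical floating-point feature with a wide raw range.
house_ages = [10.0, 20.0, 30.0, 50.0]
scaled = min_max_scale(house_ages)
print(scaled)
```

In practice the minimum and maximum would typically be computed on the training set and reused for any later data, so that all examples are scaled the same way.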

Sample data when you have too much of it

Some organizations are blessed with an abundance of data. When the dataset contains too many examples, you must select a subset of examples for training. When possible, select the subset that is most relevant to your model's predictions.
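One way to picture subset selection is to keep only the examples judged relevant and then draw a fixed-size random sample from them. The sketch below assumes a hypothetical relevance test (a city field) and a hypothetical helper named sample_examples; what counts as relevant depends entirely on the model being trained.

```python
import random


def sample_examples(examples, is_relevant, max_examples, seed=0):
    """Keep relevant examples, then randomly sample at most max_examples of them."""
    relevant = [example for example in examples if is_relevant(example)]
    if len(relevant) <= max_examples:
        return relevant
    # A seeded generator makes the sample reproducible between runs.
    rng = random.Random(seed)
    return rng.sample(relevant, max_examples)


# Hypothetical dataset: 1000 examples, half of them from the city of interest.
examples = [
    {"id": i, "city": "NYC" if i % 2 == 0 else "Other"} for i in range(1000)
]
subset = sample_examples(
    examples, lambda example: example["city"] == "NYC", max_examples=100
)
print(len(subset))
```

Fixing the seed is a design choice that keeps experiments repeatable; a different seed yields a different but equally sized subset.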

Filter examples containing PII

Good datasets omit examples containing Personally Identifiable Information (PII). This policy helps safeguard privacy but can influence the model.
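As a very simplified illustration of filtering, the sketch below drops any text example that matches a pattern resembling an email address or a US-style phone number. The patterns, the sample texts, and the helper name contains_pii are assumptions for this example; real PII detection covers far more categories and usually needs dedicated tooling and review.

```python
import re

# Simplified, assumed patterns; they will miss many real forms of PII.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_PATTERN = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def contains_pii(text):
    """Return True if the text matches either simplified PII pattern."""
    return bool(EMAIL_PATTERN.search(text) or PHONE_PATTERN.search(text))


# Hypothetical free-text examples.
raw_examples = [
    "Great service on Broadway",
    "Contact me at jane@example.com",
    "Call 555-123-4567 for details",
    "Delivery took 3 days",
]
clean_examples = [text for text in raw_examples if not contains_pii(text)]
print(clean_examples)
```

Comparing the sizes of raw_examples and clean_examples gives a first hint of how much filtering shrinks the dataset, which is one way the policy can influence the model.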

See the Safety and Privacy module later in the course for more on these topics.

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-08-25 UTC.