Working with categorical data
Page Summary
This module focuses on differentiating between categorical and numerical data within machine learning.
You will learn how to represent categorical data using one-hot vectors and address common issues associated with it.
The module covers encoding techniques for converting categorical data into numerical vectors suitable for model training.
Feature crosses, a method for combining categorical features to capture interactions, are also discussed.
It is assumed you have prior knowledge of introductory machine learning and working with numerical data.
- Distinguish categorical data from numerical data.
- Represent categorical data with one-hot vectors.
- Address common issues with categorical data.
- Create feature crosses.
This module assumes you are familiar with the concepts covered in the introductory machine learning and Working with Numerical Data modules.
Categorical data has a specific set of possible values. For example:
- The different species of animals in a national park
- The names of streets in a particular city
- Whether or not an email is spam
- The colors that house exteriors are painted
- Binned numbers, which are described in the Working with Numerical Data module
Numbers can also be categorical data
True numerical data can be meaningfully multiplied. For example, consider a model that predicts the value of a house based on its area. Note that a useful model for evaluating house prices typically relies on hundreds of features. That said, all else being equal, a house of 200 square meters should be roughly twice as valuable as an identical house of 100 square meters.
Oftentimes, you should represent features that contain integer values as categorical data instead of numerical data. For example, consider a postal code feature in which the values are integers. If you represent this feature numerically rather than categorically, you're asking the model to find a numeric relationship between different postal codes. That is, you're telling the model to treat postal code 20004 as twice (or half) as large a signal as postal code 10002. Representing postal codes as categorical data lets the model weight each individual postal code separately.
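The difference between the two representations of a postal code can be sketched in a few lines of Python. This is only an illustration, not a trained model: the third postal code, the shared weight, and the per-code weights are made-up values.

```python
# Hypothetical postal codes; 20004 and 10002 come from the text above.
postal_codes = [20004, 10002, 94043]

# Numerical representation: a single weight is shared by all postal codes,
# so the feature's contribution scales with the magnitude of the code.
shared_weight = 0.001  # made-up value
numeric_contribution = {code: shared_weight * code for code in postal_codes}

# Categorical representation: each postal code has its own weight,
# so its contribution is independent of how large the integer is.
per_code_weight = {20004: 1.5, 10002: 4.0, 94043: 2.5}  # made-up values
categorical_contribution = {code: per_code_weight[code] for code in postal_codes}

# With the numerical representation, 20004 is forced to count about
# twice as much as 10002, whatever weight is chosen.
ratio = numeric_contribution[20004] / numeric_contribution[10002]
print(ratio)
print(categorical_contribution)
```

In the numerical case, the ratio between the two contributions is fixed by the codes themselves. In the categorical case, each weight can take any value.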
Encoding
Encoding means converting categorical or other data to numerical vectors that a model can train on. This conversion is necessary because models can only train on floating-point values; models can't train on strings such as "dog" or "maple". This module explains different encoding methods for categorical data.
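As a minimal sketch of what such a conversion can look like, the following Python maps each string to an integer index and then to a one-hot vector of floating-point values. The helper names `build_vocabulary` and `one_hot` and the tree species feature are hypothetical, chosen for illustration. A real pipeline would typically rely on a library encoder instead.

```python
def build_vocabulary(values):
    """Map each distinct category to an integer index, in sorted order."""
    return {value: index for index, value in enumerate(sorted(set(values)))}


def one_hot(value, vocabulary):
    """Return a float vector with 1.0 at the category's index and 0.0 elsewhere."""
    vector = [0.0] * len(vocabulary)
    vector[vocabulary[value]] = 1.0
    return vector


# Hypothetical categorical feature: the species of a tree.
tree_species = ["maple", "oak", "elm", "maple"]

vocabulary = build_vocabulary(tree_species)
encoded = [one_hot(species, vocabulary) for species in tree_species]

for species, vector in zip(tree_species, encoded):
    print(species, vector)
```

Each vector has one element per category in the vocabulary, and exactly one element is set to 1.0. This sketch raises a `KeyError` for a category that is not in the vocabulary, so unseen values need separate handling.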
Last updated 2025-08-25 UTC.