Categorical data: Vocabulary and one-hot encoding
Page Summary
- Dimension refers to the number of elements in a feature vector, and some categorical features have low dimensionality.
- Machine learning models require numerical input; therefore, categorical data like strings must be converted to numerical representations.
- One-hot encoding transforms categorical values into numerical vectors where each category is represented by a unique element with a value of 1.
- For high-dimensional categorical features with numerous categories, one-hot encoding can be inefficient; embeddings or hashing are recommended instead.
- Sparse representation efficiently stores one-hot encoded data by recording only the position of the 1 value, reducing memory usage.
The term dimension is a synonym for the number of elements in a feature vector. Some categorical features are low dimensional. For example:
| Feature name | # of categories | Sample categories |
|---|---|---|
| snowed_today | 2 | True, False |
| skill_level | 3 | Beginner, Practitioner, Expert |
| season | 4 | Winter, Spring, Summer, Autumn |
| day_of_week | 7 | Monday, Tuesday, Wednesday |
| planet | 8 | Mercury, Venus, Earth |
When a categorical feature has a low number of possible categories, you can encode it as a vocabulary. With a vocabulary encoding, the model treats each possible categorical value as a separate feature. During training, the model learns different weights for each category.
For example, suppose you are creating a model to predict a car's price based, in part, on a categorical feature named car_color. Perhaps red cars are worth more than green cars. Since manufacturers offer a limited number of exterior colors, car_color is a low-dimensional categorical feature. The following illustration suggests a vocabulary (possible values) for car_color:
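As a minimal sketch, such a vocabulary can be written out as a plain Python list (the variable name is illustrative; the colors are the eight that appear in the one-hot table later on this page):

```python
# A vocabulary (possible values) for the car_color feature.
car_color_vocabulary = [
    "Red", "Orange", "Blue", "Yellow", "Green", "Black", "Purple", "Brown",
]
```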
Index numbers
Machine learning models can only manipulate floating-point numbers. Therefore, you must convert each string to a unique index number, as in the following illustration:
After converting strings to unique index numbers, you'll need to process the data further to represent it in ways that help the model learn meaningful relationships between the values. If the categorical feature data is left as indexed integers and loaded into a model, the model would treat the indexed values as continuous floating-point numbers. It would then consider "Purple" (index 6) to be six times greater than "Orange" (index 1).
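As a minimal sketch (plain Python, with illustrative variable names), one way to assign each string a unique index number is a dictionary built from the vocabulary:

```python
car_color_vocabulary = [
    "Red", "Orange", "Blue", "Yellow", "Green", "Black", "Purple", "Brown",
]

# Assign each category string a unique index number: Red -> 0, Orange -> 1, ...
color_to_index = {color: i for i, color in enumerate(car_color_vocabulary)}

print(color_to_index["Orange"])  # 1
print(color_to_index["Purple"])  # 6
```

These integers only identify categories; as noted above, they shouldn't be fed to the model directly as if they were magnitudes.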
One-hot encoding
The next step in building a vocabulary is to convert each index number to its one-hot encoding. In a one-hot encoding:
- Each category is represented by a vector (array) of N elements, where N is the number of categories. For example, if car_color has eight possible categories, then the one-hot vector representing it will have eight elements.
- Exactly one of the elements in a one-hot vector has the value 1.0; all the remaining elements have the value 0.0.
For example, the following table shows the one-hot encoding for each color in car_color:
| Feature | Red | Orange | Blue | Yellow | Green | Black | Purple | Brown |
|---|---|---|---|---|---|---|---|---|
| "Red" | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| "Orange" | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| "Blue" | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| "Yellow" | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| "Green" | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| "Black" | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| "Purple" | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| "Brown" | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
It is the one-hot vector, not the string or the index number, that gets passed to the feature vector. The model learns a separate weight for each element of the feature vector.
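A minimal sketch of this transformation in plain Python (no ML framework assumed; the helper name is illustrative):

```python
car_color_vocabulary = [
    "Red", "Orange", "Blue", "Yellow", "Green", "Black", "Purple", "Brown",
]

def one_hot(category, vocabulary):
    """Returns an N-element vector with 1.0 at the category's index and 0.0 elsewhere."""
    vector = [0.0] * len(vocabulary)
    vector[vocabulary.index(category)] = 1.0
    return vector

print(one_hot("Blue", car_color_vocabulary))
# [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```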
Note: In a true one-hot encoding, only one element has the value 1.0. In a variant known as multi-hot encoding, multiple values can be 1.0.
The following illustration suggests the various transformations in the vocabulary representation:
Sparse representation
A feature whose values are predominantly zero (or empty) is termed a sparse feature. Many categorical features, such as car_color, tend to be sparse features. Sparse representation means storing the position of the 1.0 in a sparse vector. For example, the one-hot vector for "Blue" is:
[0, 0, 1, 0, 0, 0, 0, 0]
Since the 1 is in position 2 (when starting the count at 0), the sparse representation for the preceding one-hot vector is:
2
Notice that the sparse representation consumes far less memory than the eight-element one-hot vector. Importantly, the model must train on the one-hot vector, not the sparse representation.
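A minimal sketch of the round trip between the two representations (the helper names are hypothetical):

```python
def to_sparse(one_hot_vector):
    """Sparse representation: store only the position of the 1.0."""
    return one_hot_vector.index(1.0)

def to_dense(sparse_index, num_categories):
    """Rebuild the full one-hot vector, which is what the model trains on."""
    vector = [0.0] * num_categories
    vector[sparse_index] = 1.0
    return vector

blue_one_hot = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
blue_sparse = to_sparse(blue_one_hot)            # 2
assert to_dense(blue_sparse, 8) == blue_one_hot  # round-trips back to the one-hot vector
```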
Note: The sparse representation of a multi-hot encoding stores the positions of all the nonzero elements. For example, the sparse representation of a car that is both "Blue" and "Black" is 2, 5.
Outliers in categorical data
Like numerical data, categorical data also contains outliers. Suppose car_color contains not only the popular colors, but also some rarely used outlier colors, such as "Mauve" or "Avocado". Rather than giving each of these outlier colors a separate category, you can lump them into a single "catch-all" category called out-of-vocabulary (OOV). In other words, all the outlier colors are binned into a single outlier bucket. The system learns a single weight for that outlier bucket.
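A minimal sketch of an OOV bucket, assuming a dictionary lookup that reserves one extra index as the catch-all (the names are illustrative):

```python
car_color_vocabulary = [
    "Red", "Orange", "Blue", "Yellow", "Green", "Black", "Purple", "Brown",
]
color_to_index = {color: i for i, color in enumerate(car_color_vocabulary)}

# Reserve one extra index as the out-of-vocabulary (OOV) bucket.
OOV_INDEX = len(car_color_vocabulary)  # 8

def lookup(color):
    """Maps known colors to their own index and every outlier color to the OOV bucket."""
    return color_to_index.get(color, OOV_INDEX)

print(lookup("Blue"))     # 2
print(lookup("Mauve"))    # 8 -- binned into the single outlier bucket
print(lookup("Avocado"))  # 8 -- same bucket, so the model learns one shared weight
```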
Encoding high-dimensional categorical features
Some categorical features have a high number of dimensions, such as those in the following table:
| Feature name | # of categories | Sample categories |
|---|---|---|
| words_in_english | ~500,000 | "happy", "walking" |
| US_postal_codes | ~42,000 | "02114", "90301" |
| last_names_in_Germany | ~850,000 | "Schmidt", "Schneider" |
When the number of categories is high, one-hot encoding is usually a bad choice. Embeddings, detailed in a separate Embeddings module, are usually a much better choice. Embeddings substantially reduce the number of dimensions (see the sketch after the following list), which benefits models in two important ways:
- The model typically trains faster.
- The built model typically infers predictions more quickly. That is, the model has lower latency.
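As a rough sketch of that dimension reduction, an embedding replaces the N-element one-hot vector with a much shorter learned vector looked up by index. The sizes below are illustrative, and real embedding values are learned during training, which the Embeddings module covers:

```python
import numpy as np

vocabulary_size = 500_000  # e.g., roughly the size of words_in_english
embedding_dim = 64         # far fewer dimensions than a one-hot vector

# Placeholder values; in a real model this table is learned during training.
embedding_table = np.random.normal(size=(vocabulary_size, embedding_dim))

word_index = 12_345                           # index of some word in the vocabulary
word_embedding = embedding_table[word_index]  # 64 numbers instead of 500,000
print(word_embedding.shape)                   # (64,)
```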
Hashing (also called the hashing trick) is a less common way to reduce the number of dimensions.
In brief, hashing maps a category (for example, a color) to a small integer: the number of the "bucket" that will hold that category.
In detail, you implement a hashing algorithm as follows (a code sketch appears after these steps):
- Set the number of bins in the vector of categories to N, where N is less than the total number of remaining categories. As an arbitrary example, say N = 100.
- Choose a hash function. (Often, you will choose the range of hash values as well.)
- Pass each category (for example, a particular color) through that hash function, generating a hash value, say 89237.
- Assign each category to the bin whose index is the hash value modulo N. In this example, where N is 100 and the hash value is 89237, the result is 37 because 89237 % 100 is 37.
- Create a one-hot encoding for each bin with these new index numbers.
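A minimal sketch of those steps in Python, using hashlib's MD5 as the hash function and N = 100, both of which are illustrative choices:

```python
import hashlib

NUM_BINS = 100  # N: fewer bins than the number of remaining categories

def hash_bucket(category, num_bins=NUM_BINS):
    """Maps a category string to a bin index in [0, num_bins)."""
    # Pass the category through the hash function to get a large integer.
    hash_value = int(hashlib.md5(category.encode("utf-8")).hexdigest(), 16)
    # The bin index is the hash value modulo N.
    return hash_value % num_bins

def one_hot_bucket(category, num_bins=NUM_BINS):
    """One-hot encodes the bin index rather than the raw category."""
    vector = [0.0] * num_bins
    vector[hash_bucket(category, num_bins)] = 1.0
    return vector

print(hash_bucket("Mauve"))          # some bin in [0, 100); different colors can collide
print(sum(one_hot_bucket("Mauve")))  # 1.0
```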
For more details about hashing data, see the Randomization section of the Production machine learning systems module.