Numerical data: How a model ingests data using feature vectors

  • Models ingest data through floating-point arrays called feature vectors, which are derived from dataset features.

  • Feature vectors often utilize processed or transformed values instead of raw dataset values to enhance model learning.

  • Feature engineering is the crucial process of converting raw data into suitable representations for the model, encompassing techniques like normalization and binning.

  • Non-numerical data like strings must be converted into numerical values for use in feature vectors, a key aspect of feature engineering.

Until now, we've given you the impression that a model acts directly on the rows of a dataset; however, models actually ingest data somewhat differently.

For example, suppose a dataset provides five columns, but only two of those columns (b and d) are features in the model. When processing the example in row 3, does the model simply grab the contents of the two highlighted cells (3b and 3d) as follows?

Figure 1. A model ingesting an example directly from a dataset, with columns b and d of row 3 highlighted. Not exactly how a model gets its examples.

In fact, the model actually ingests an array of floating-point values called a feature vector. You can think of a feature vector as the floating-point values comprising one example.

Figure 2. The feature vector as an intermediary between the dataset and the model. Closer to the truth, but not realistic.
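To make this concrete, here's a minimal sketch of what Figure 2 depicts: copying raw feature values straight into a feature vector. The column names and values are hypothetical, and, as the next paragraph explains, real pipelines rarely work this way:

    import numpy as np
    import pandas as pd

    # A hypothetical five-column dataset; only columns b and d are features.
    dataset = pd.DataFrame({
        "a": [101, 102, 103, 104],
        "b": [5.0, 3.2, 7.1, 4.4],
        "c": ["x", "y", "z", "w"],
        "d": [0.4, 0.9, 0.1, 0.6],
        "e": [12, 7, 19, 3],
    })

    # Naively copy the raw values of columns b and d in row 3
    # into a feature vector.
    feature_vector = dataset.loc[3, ["b", "d"]].to_numpy(dtype=np.float32)
    print(feature_vector)  # [4.4 0.6]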

However, feature vectors seldom use the dataset's raw values. Instead, you must typically process the dataset's values into representations that your model can better learn from. So, a more realistic feature vector might look something like this:

Figure 3. A more realistic feature vector, containing two floating-point values: 0.13 and 0.47.

Wouldn't a model produce better predictions by training from the actual values in the dataset than from altered values? Surprisingly, the answer is no.

You must determine the best way to represent raw dataset values as trainable values in the feature vector. This process is called feature engineering, and it is a vital part of machine learning. The most common feature engineering techniques are:

  • Normalization: Converting numerical values into a standard range.
  • Binning (also referred to as bucketing): Converting numerical values into buckets of ranges. Both techniques are sketched in the code after this list.
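As a concrete illustration of both techniques, here's a minimal NumPy sketch using hypothetical raw values; production pipelines would more likely rely on library utilities (for example, scikit-learn scalers or Keras preprocessing layers):

    import numpy as np

    # Hypothetical raw feature values (say, house sizes in square feet).
    raw = np.array([800.0, 1200.0, 1500.0, 2400.0, 3000.0])

    # Normalization (min-max scaling): map values into the range [0, 1].
    normalized = (raw - raw.min()) / (raw.max() - raw.min())
    print(normalized)  # [0.0, ~0.18, ~0.32, ~0.73, 1.0]

    # Binning (bucketing): replace each value with its bucket index.
    # Hypothetical bucket boundaries: <1000, 1000-1999, >=2000.
    boundaries = np.array([1000.0, 2000.0])
    binned = np.digitize(raw, boundaries)
    print(binned)  # [0 1 1 2 2]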

This unit covers normalizing and binning. The next unit, Working with categorical data, covers other forms of preprocessing, such as converting non-numerical data, like strings, to floating-point values.

Every value in a feature vector must be a floating-point value. However, manyfeatures are naturally strings or other non-numerical values. Consequently,a large part of feature engineering is representing non-numerical values asnumerical values. You'll see a lot of this in later modules.
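As a small preview, one common representation is one-hot encoding, in which each string in a fixed vocabulary maps to a floating-point vector with a single 1.0. Here's a minimal sketch with a hypothetical three-value vocabulary:

    import numpy as np

    # Hypothetical string feature with a small, fixed vocabulary.
    vocabulary = ["red", "green", "blue"]

    def one_hot(value: str) -> np.ndarray:
        """Represent a string as a one-hot floating-point vector."""
        vector = np.zeros(len(vocabulary), dtype=np.float32)
        vector[vocabulary.index(value)] = 1.0
        return vector

    print(one_hot("green"))  # [0. 1. 0.]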
