Datasets: Data characteristics

  • A machine learning model's performance is heavily reliant on the quality and quantity of the dataset it's trained on, with larger, high-quality datasets generally leading to better results.

  • Datasets can contain various data types, including numerical, categorical, text, multimedia, and embedding vectors, each requiring specific handling for optimal model training.

  • Maintaining data quality involves addressing issues like label errors, noisy features, and proper filtering to ensure the reliability of the dataset for accurate predictions.

  • Incomplete examples with missing feature values should be handled by either deletion or imputation to avoid negatively impacting model training.

  • When imputing missing values, use reliable methods like mean/median imputation and consider adding an indicator column to signal imputed values to the model.

A dataset is a collection of examples.

Many datasets store data in tables (grids), for example, as comma-separated values (CSV) or directly from spreadsheets or database tables. Tables are an intuitive input format for machine learning models. You can imagine each row of the table as an example and each column as a potential feature or label. That said, datasets may also be derived from other formats, including log files and protocol buffers.
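
For concreteness, here's a minimal pandas sketch of that mental model; the file name and column names are hypothetical, not part of any real dataset:

    import pandas as pd

    # Each row of the table is one example; each column is a potential
    # feature or label. ("housing.csv" and its columns are hypothetical.)
    examples = pd.read_csv("housing.csv")

    features = examples[["square_feet", "num_bedrooms"]]  # feature columns
    labels = examples["sale_price"]                       # label column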

Regardless of the format, your ML model is only as good as the data it trains on. This section examines key data characteristics.

Types of data

A dataset could contain many kinds of data types, including but certainly not limited to:

  • numerical data, which is covered in a separate unit
  • categorical data, which is covered in a separate unit
  • human language, including individual words and sentences, all the way up to entire text documents
  • multimedia (such as images, videos, and audio files)
  • outputs from other ML systems
  • embedding vectors, which are covered in a later unit

Quantity of data

As a rough rule of thumb, your model should train on at least an order of magnitude (or two) more examples than trainable parameters. However, good models generally train on substantially more examples than that.
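
As a worked instance of that heuristic (the parameter count is illustrative, not a recommendation):

    # Rule of thumb: at least 10x (better: 100x or more) as many
    # examples as trainable parameters. Numbers here are illustrative.
    trainable_parameters = 1_000                       # e.g., a small linear model
    minimum_examples = 10 * trainable_parameters       # ~10,000 examples
    comfortable_examples = 100 * trainable_parameters  # ~100,000 examples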

Models trained on large datasets with few features generally outperform models trained on small datasets with a lot of features. Google has historically had great success training simple models on large datasets.

Different datasets for different machine learning programs may require wildly different amounts of examples to build a useful model. For some relatively simple problems, a few dozen examples might be sufficient. For other problems, a trillion examples might be insufficient.

It's possible to get good results from a small dataset if you are adapting an existing model already trained on large quantities of data from the same schema.

Quality and reliability of data

Everyone prefers high quality to low quality, but quality is such a vague concept that it could be defined many different ways. This course defines quality pragmatically:

A high-quality dataset helps your model accomplish its goal. A low-quality dataset inhibits your model from accomplishing its goal.

A high-quality dataset is usually also reliable. Reliability refers to the degree to which you can trust your data. A model trained on a reliable dataset is more likely to yield useful predictions than a model trained on unreliable data.

In measuring reliability, you must determine:

  • How common are label errors? For example, if your data is labeled by humans, how often did your human raters make mistakes?
  • Are your features noisy? That is, do the values in your features contain errors? Be realistic: you can't purge your dataset of all noise. Some noise is normal; for example, GPS measurements of any location always fluctuate a little, week to week.
  • Is the data properly filtered for your problem? For example, should your dataset include search queries from bots? If you're building a spam-detection system, then likely the answer is yes. However, if you're trying to improve search results for humans, then no. (See the sketch after this list.)
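
Here is a minimal pandas sketch of both filtering choices, assuming a hypothetical query log that already carries an is_bot flag:

    import pandas as pd

    # Hypothetical query log with a precomputed is_bot flag.
    queries = pd.DataFrame({
        "query":  ["cheap flights", "buy now!!!", "weather today"],
        "is_bot": [False, True, False],
    })

    # Improving search results for humans: filter bot traffic out.
    human_queries = queries[~queries["is_bot"]]

    # Building a spam detector: keep bot traffic; it's the signal you need.
    spam_training_data = queries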

The following are common causes of unreliable data in datasets:

  • Omitted values. For example, a person forgot to enter a value for a house's age.
  • Duplicate examples. For example, a server mistakenly uploaded the same log entries twice.
  • Bad feature values. For example, someone typed an extra digit, or a thermometer was left out in the sun.
  • Bad labels. For example, a person mistakenly labeled a picture of an oak tree as a maple tree.
  • Bad sections of data. For example, a certain feature is very reliable, except for that one day when the network kept crashing.

We recommend using automation to flag unreliable data. For example, unit tests that define or rely on an external formal data schema can flag values that fall outside of a defined range.
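
For instance, here's a sketch of such a range check in pandas; the temperature bounds and column name are assumptions for illustration, not a real schema:

    import pandas as pd

    # Hypothetical schema rule: outdoor temperatures fall in [-60, 60] °C.
    TEMP_MIN, TEMP_MAX = -60.0, 60.0

    def flag_out_of_range(df: pd.DataFrame) -> pd.DataFrame:
        """Returns the rows whose temperature violates the schema's range."""
        bad = (df["temperature"] < TEMP_MIN) | (df["temperature"] > TEMP_MAX)
        return df[bad]

    readings = pd.DataFrame({"temperature": [21.5, 19.0, 999.0, -3.2]})
    suspect_rows = flag_out_of_range(readings)  # flags the 999.0 reading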

Note: Any sufficiently large or diverse dataset almost certainly contains outliers that fall outside your data schema or unit test bounds. Determining how to handle outliers is an important part of machine learning. The Numerical data unit details how to handle numeric outliers.

Complete vs. incomplete examples

In a perfect world, each example is complete; that is, each example contains a value for each feature.

Figure 1. A complete example, containing values for all five of its features.

Unfortunately, real-world examples are often incomplete, meaning that at least one feature value is missing.

Figure 2. An incomplete example, containing values for only four of its five features.

Don't train a model on incomplete examples. Instead, fix or eliminate incomplete examples by doing one of the following:

  • Delete incomplete examples.
  • Impute missing values; that is, convert the incomplete example to a complete example by providing well-reasoned guesses for the missing values.

Figure 3. Deleting incomplete examples from the dataset.

Figure 4. Imputing missing values for incomplete examples.

If the dataset contains enough complete examples to train a useful model, then consider deleting the incomplete examples. Similarly, if only one feature is missing a significant amount of data and that one feature probably can't help the model much, then consider deleting that feature from the model inputs and seeing how much quality is lost by its removal. If the model works as well or almost as well without it, that's great. Conversely, if you don't have enough complete examples to train a useful model, then you might consider imputing missing values.
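
As an illustration, here's a small pandas sketch of both options; the feature names and values are made up:

    import pandas as pd

    df = pd.DataFrame({
        "sq_feet": [1200.0, 1500.0, None, 900.0],
        "age":     [10.0, None, None, None],      # mostly missing
        "price":   [300_000, 350_000, 280_000, 210_000],
    })

    # Delete incomplete examples (rows with any missing feature value).
    complete_only = df.dropna()

    # Or drop a mostly-missing, probably unhelpful feature entirely, then
    # compare model quality with and without it.
    without_age = df.drop(columns=["age"])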

It's fine to delete useless or redundant examples, but it's bad to delete important examples. Unfortunately, it can be difficult to differentiate between useless and useful examples. If you can't decide whether to delete or impute, consider building two datasets: one formed by deleting incomplete examples and the other by imputing. Then, determine which dataset trains the better model.
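
A minimal sketch of that comparison workflow; train_and_evaluate is a hypothetical helper standing in for your actual training and validation code:

    import pandas as pd

    df = pd.DataFrame({"temperature": [12.0, 18.0, None, 24.0, 38.0],
                       "humidity":    [0.40, None, 0.50, 0.55, 0.60]})

    # Candidate 1: delete incomplete examples.
    deleted = df.dropna()

    # Candidate 2: impute missing values with per-column medians.
    imputed = df.fillna(df.median(numeric_only=True))

    # Train the same model on each candidate and keep whichever
    # validates better. (train_and_evaluate is hypothetical.)
    # score_deleted = train_and_evaluate(deleted)
    # score_imputed = train_and_evaluate(imputed)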


Clever algorithms can impute some pretty good missing values; however, imputed values are rarely as good as the actual values. Therefore, a good dataset tells the model which values are imputed and which are actual. One way to do this is to add an extra Boolean column to the dataset that indicates whether a particular feature's value is imputed. For example, given a feature named temperature, you could add an extra Boolean feature named something like temperature_is_imputed. Then, during training, the model will probably gradually learn to trust examples containing imputed values for feature temperature less than examples containing actual (non-imputed) values.
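
Here's a short pandas sketch of that indicator-column technique, reusing the temperature example from above; median imputation is just one reasonable choice of fill value:

    import pandas as pd

    df = pd.DataFrame({"temperature": [12.0, 18.0, None, 24.0, 38.0]})

    # Record which values are imputed *before* filling them in, so the
    # model can learn to trust imputed temperatures less than measured ones.
    df["temperature_is_imputed"] = df["temperature"].isna()
    df["temperature"] = df["temperature"].fillna(df["temperature"].median())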


Imputation is the process of generating well-reasoned data, not random or deceptive data. Be careful: good imputation can improve your model; bad imputation can hurt your model.

One common algorithm is to use the mean or median as the imputed value. Consequently, when you represent a numerical feature with Z-scores, the imputed value is typically 0 (because 0 is the mean Z-score).
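
A small numeric sketch of why that works, using made-up temperatures:

    import pandas as pd

    raw = pd.DataFrame({"temperature": [12.0, 18.0, None, 24.0, 38.0]})

    # Convert to Z-scores; by construction the mean maps to 0.
    mean = raw["temperature"].mean()
    std = raw["temperature"].std()
    z_scores = (raw["temperature"] - mean) / std

    # Mean imputation in Z-score space is therefore just filling with 0.
    z_imputed = z_scores.fillna(0.0)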

Exercise: Check your understanding

A sorted dataset, like the one in the following exercise, can sometimes simplify imputation. However, it is a bad idea to train on a sorted dataset. So, after imputation, randomize the order of examples in the training set.
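
For example, a one-line pandas shuffle, assuming a small already-imputed DataFrame named df:

    import pandas as pd

    df = pd.DataFrame({"timestamp": range(5),
                       "temperature": [12, 18, 23, 24, 38]})

    # Shuffle the (now complete) examples so the model doesn't train on
    # timestamp-sorted data.
    shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)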

Here are two columns of a dataset sorted by Timestamp.

Timestamp            Temperature
June 8, 2023 09:00   12
June 8, 2023 10:00   18
June 8, 2023 11:00   missing
June 8, 2023 12:00   24
June 8, 2023 13:00   38

Which of the following would be a reasonable value to impute for the missing value of Temperature?

  • 23: Probably. 23 is the mean of the adjacent values (12, 18, 24, and 38). However, we aren't seeing the rest of the dataset, so it is possible that 23 would be an outlier for 11:00 on other days.
  • 31: Unlikely. The limited part of the dataset that we can see suggests that 31 is much too high for the 11:00 Temperature. However, we can't be sure without basing the imputation on a larger number of examples.
  • 51: Very unlikely. 51 is much higher than any of the displayed values (and, therefore, much higher than the mean).