Numerical data: Conclusion

  • A machine learning model's predictive ability is directly dependent on the quality of data it's trained on.

  • Numerical features often benefit from normalization or binning to improve model performance.

  • Data validation through verification tests and visualizations is crucial for identifying and addressing potential issues.

  • Understanding data distribution through statistics on both the entire dataset and its subsets is essential for identifying hidden problems.

  • Maintaining thorough documentation of all data transformations ensures reproducibility and facilitates model understanding.
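
The normalization and binning mentioned in the points above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the course's own implementation. The function names `z_score_normalize` and `bin_index` are hypothetical, Z-score scaling is just one of several normalization strategies, and the bucket boundaries are made-up values.

```python
import bisect
import statistics


def z_score_normalize(values):
    """Rescale values so they have mean 0 and population standard deviation 1."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        # Every value is identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - mean) / stdev for v in values]


def bin_index(value, boundaries):
    """Return the index of the bucket that value falls into.

    boundaries must be sorted in ascending order. Bucket 0 holds values
    below boundaries[0]; bucket i holds values v with
    boundaries[i - 1] <= v < boundaries[i]; the last bucket holds values
    at or above boundaries[-1].
    """
    return bisect.bisect_right(boundaries, value)


# Hypothetical example values, chosen only for illustration.
normalized = z_score_normalize([2.0, 4.0, 6.0])
bucket = bin_index(26.3, [25, 27, 29])  # falls in bucket 1: 25 <= 26.3 < 27
```

If Z-score scaling does not suit a feature's distribution, the same structure can hold a different strategy, such as linear scaling or log scaling.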

A machine learning (ML) model's health is determined by its data. Feed your model healthy data and it will thrive; feed your model junk and its predictions will be worthless.

Best practices for working with numerical data:

  • Remember that your ML model interacts with the data in the feature vector, not the data in the dataset.
  • Normalize most numerical features.
  • If your first normalization strategy doesn't succeed, consider a different way to normalize your data.
  • Binning, also referred to as bucketing, is sometimes better than normalizing.
  • Considering what your data should look like, write verification tests to validate those expectations. For example:
    • The absolute value of latitude should never exceed 90. You can write a test to check whether any latitude in your data has an absolute value greater than 90.
    • If your data is restricted to the state of Florida, you can write tests to check that the latitudes fall between 24 and 31, inclusive.
  • Visualize your data with scatter plots and histograms. Look for anomalies.
  • Gather statistics not only on the entire dataset but also on smaller subsets of the dataset. That's because aggregate statistics sometimes obscure problems in smaller sections of a dataset.
  • Document all your data transformations.
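
The verification tests and per-subset statistics described in the list above might look something like the following. This is only a sketch under stated assumptions: the function names `find_invalid_latitudes` and `mean_by_subset`, the `region` and `price` fields, and all sample values are hypothetical, and the code uses only the Python standard library.

```python
import statistics


def find_invalid_latitudes(latitudes, min_lat=-90.0, max_lat=90.0):
    """Return (index, value) pairs for latitudes outside [min_lat, max_lat].

    NaN values are also reported, because every comparison with NaN is False.
    """
    return [
        (i, lat)
        for i, lat in enumerate(latitudes)
        if not (min_lat <= lat <= max_lat)
    ]


def mean_by_subset(rows, key, field):
    """Return the mean of `field` for each distinct value of `key`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row[field])
    return {name: statistics.fmean(vals) for name, vals in groups.items()}


# Hypothetical latitude values for illustration.
latitudes = [25.8, 30.4, 91.2, 40.7]

# General check: the absolute value of latitude should never exceed 90.
bad_anywhere = find_invalid_latitudes(latitudes)

# Florida-only data: latitudes should fall between 24 and 31, inclusive.
bad_for_florida = find_invalid_latitudes(latitudes, min_lat=24.0, max_lat=31.0)

# Per-subset means can expose a problem that the overall mean obscures.
rows = [
    {"region": "north", "price": 10.0},
    {"region": "north", "price": 11.0},
    {"region": "south", "price": 200.0},
]
means = mean_by_subset(rows, key="region", field="price")
```

Returning the offending indices, rather than a single pass/fail flag, makes it easier to inspect the bad rows before deciding how to handle them.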

Data is your most valuable resource, so treat it with care.

What's next

Congratulations on finishing this module!

We encourage you to explore the various MLCC modules at your own pace and interest. If you'd like to follow a recommended order, we suggest that you move to the following module next: Representing categorical data.


Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-08-25 UTC.