Datasets, generalization, and overfitting
Page Summary
This module emphasizes the critical role of data quality in machine learning projects, noting that data quality often affects model performance more than the choice of algorithm.
Machine learning practitioners typically dedicate a substantial portion of their project time (around 80%) to data preparation and transformation, including tasks like dataset construction and feature engineering.
The module covers key concepts in data preparation, such as identifying data characteristics, handling unreliable data, understanding data labels, and splitting datasets for training and evaluation.
Learners will gain insights into techniques for improving data quality, mitigating issues like overfitting, and interpreting loss curves to assess model performance.
This module builds upon foundational machine learning concepts, assuming familiarity with topics like linear regression, numerical and categorical data handling, and basic machine learning principles.
Learning objectives
- Identify four different characteristics of data and datasets.
- Identify at least four different causes of data unreliability.
- Determine when to discard missing data and when to impute it.
- Differentiate between direct and derived labels.
- Identify two different ways to improve the quality of human-rated labels.
- Explain why to subdivide a dataset into a training set, validation set, and test set; identify a potential problem in data splits.
- Explain overfitting and identify three possible causes for it.
- Explain the concept of regularization. In particular, explain the following:
  - Bias versus variance (adaptation to outliers…)
  - L2 regularization, including lambda (the regularization rate)
  - Early stopping
- Interpret different kinds of loss curves; detect convergence and overfitting in loss curves.
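As a preview of the regularization objective above, L2 regularization is commonly written as a penalty on the squared weights that is added to the loss, scaled by the regularization rate. The symbols below ($w_i$ for the model weights, $n$ for the number of weights, $\lambda$ for the regularization rate) are notation assumed here for illustration, not taken from this page:

```latex
\text{L2 penalty} = w_1^2 + w_2^2 + \dots + w_n^2 = \sum_{i=1}^{n} w_i^2
\qquad
\text{minimize}\left(\text{loss} + \lambda \sum_{i=1}^{n} w_i^2\right)
```

With $\lambda = 0$ the penalty disappears and training minimizes the loss alone. Larger values of $\lambda$ push the weights toward zero, which generally reduces model complexity.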
This module assumes you are familiar with the concepts covered in the following modules:
- Introduction to Machine Learning
- Linear regression
- Working with numerical data
- Working with categorical data
Introduction
This module begins with a leading question. Choose one of the following answers:
And here's an even more leading question:
In this module, you'll learn more about the characteristics of machine learning datasets, and how to prepare your data to ensure high-quality results when training and evaluating your model.
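One of the preparation steps named in the learning objectives is subdividing a dataset into a training set, a validation set, and a test set. The following is a minimal sketch of that idea using only the Python standard library. The function name `split_dataset`, the 80/10/10 proportions, and the fixed seed are assumptions chosen for illustration, not values prescribed by this module.

```python
import random


def split_dataset(examples, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle a copy of `examples` and split it into train/validation/test lists.

    The fractions are illustrative assumptions; the test set receives
    whatever remains after the training and validation sets are taken.
    """
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave a non-empty share for the test set")

    shuffled = list(examples)              # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for reproducibility

    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Toy dataset: 100 distinct example IDs.
examples = list(range(100))
train_set, val_set, test_set = split_dataset(examples)
print(len(train_set), len(val_set), len(test_set))  # 80 10 10
```

Because the toy examples are distinct, no example lands in more than one split. Real datasets can contain duplicates, so it is worth checking that the same example does not appear in both the training set and the test set before trusting evaluation results.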
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-12-03 UTC.