Numerical data: First steps
Page Summary
Before creating feature vectors, it is crucial to analyze numerical data by visualizing it through plots and graphs and calculating basic statistics like mean, median, and standard deviation.
Visualizations, such as scatter plots and histograms, can reveal anomalies and patterns in the data, aiding in identifying potential issues early in the data analysis process.
Outliers, values distant from most others, should be identified and handled appropriately: delete examples that contain mistakes, keep legitimate outliers the model must predict well, or apply techniques like clipping.
Statistical evaluation helps in understanding the distribution and characteristics of data, providing insights into potential feature and label relationships.
While basic statistics and visualizations provide valuable insights, it's essential to remain vigilant as anomalies can still exist in seemingly well-balanced data.
Before creating feature vectors, we recommend studying numerical data in two ways:
- Visualize your data in plots or graphs.
- Get statistics about your data.
Visualize your data
Graphs can help you find anomalies or patterns hiding in the data. Therefore, before getting too far into analysis, look at your data graphically, either as scatter plots or histograms. View graphs not only at the beginning of the data pipeline, but also throughout data transformations. Visualizations help you continually check your assumptions.
We recommend working with pandas for visualization.
Note that certain visualization tools are optimized for certain data formats. A visualization tool that helps you evaluate protocol buffers may or may not be able to help you evaluate CSV data.
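As a rough illustration of a first visual check with pandas, the sketch below draws a histogram and a scatter plot. The column names (`rooms`, `price`) and the synthetic data are assumptions made up for this example, and pandas plotting relies on matplotlib being installed.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: one numerical feature and one numerical label.
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({"rooms": rng.normal(loc=5.0, scale=1.0, size=200)})
df["price"] = 50_000 * df["rooms"] + rng.normal(0, 20_000, size=200)

fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: shows the distribution of a single column.
df["rooms"].plot.hist(bins=20, ax=ax_hist, title="Histogram of rooms")

# Scatter plot: shows how a feature relates to the label.
df.plot.scatter(x="rooms", y="price", ax=ax_scatter, title="price vs. rooms")

fig.tight_layout()
```

In a notebook the figure renders inline; in a script, `fig.savefig(...)` or `plt.show()` (with an interactive backend) would display it. Re-running the same plots after each transformation is one way to keep checking assumptions.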
Statistically evaluate your data
Beyond visual analysis, we also recommend evaluating potential features and labels mathematically, gathering basic statistics such as:
- mean and median
- standard deviation
- the values at the quartile divisions: the 0th, 25th, 50th, 75th, and 100th percentiles. The 0th percentile is the minimum value of this column; the 100th percentile is the maximum value of this column. (The 50th percentile is the median.)
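The statistics listed above can be gathered in one call with pandas. This is a minimal sketch; the `rooms_per_house` column and its values are hypothetical.

```python
import pandas as pd

# Hypothetical feature column with one suspiciously large value.
df = pd.DataFrame({"rooms_per_house": [4.0, 5.0, 5.5, 6.0, 6.5, 7.0, 55.0]})

# describe() reports count, mean, std, min (0th percentile), 25%, 50% (median),
# 75%, and max (100th percentile).
stats = df["rooms_per_house"].describe()
print(stats)

# The quartile divisions can also be requested explicitly.
quartiles = df["rooms_per_house"].quantile([0.0, 0.25, 0.5, 0.75, 1.0])
print(quartiles)
```

Here the mean (about 12.7) sits far above the median (6.0), which is a first hint that a few large values are pulling the distribution.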
Find outliers
An outlier is a value distant from most other values in a feature or label. Outliers often cause problems in model training, so finding outliers is important.
When the delta between the 0th and 25th percentiles differs significantly from the delta between the 75th and 100th percentiles, the dataset probably contains outliers.
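The percentile comparison described above can be sketched as follows. The helper name `quartile_tail_deltas` and the sample values are hypothetical, and what counts as "significantly" different is a judgment call that this sketch leaves to the reader.

```python
import pandas as pd


def quartile_tail_deltas(values: pd.Series) -> tuple[float, float]:
    """Return (25th - 0th percentile, 100th - 75th percentile)."""
    q0, q25, q75, q100 = values.quantile([0.0, 0.25, 0.75, 1.0])
    return float(q25 - q0), float(q100 - q75)


# Hypothetical feature column with one suspiciously large value.
rooms = pd.Series([4.0, 5.0, 5.5, 6.0, 6.5, 7.0, 55.0])

lower_delta, upper_delta = quartile_tail_deltas(rooms)
print(lower_delta)  # 1.25
print(upper_delta)  # 48.25
```

The upper delta is many times larger than the lower delta, so the top of this column deserves a closer look.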
Note: Don't over-rely on basic statistics. Anomalies can also hide in seemingly well-balanced data.
Outliers can fall into any of the following categories:
- The outlier is due to a mistake. For example, perhaps an experimenter mistakenly entered an extra zero, or perhaps an instrument that gathered data malfunctioned. You'll generally delete examples containing mistake outliers.
- The outlier is a legitimate data point, not a mistake. In this case, will your trained model ultimately need to infer good predictions on these outliers?
  - If yes, keep these outliers in your training set. After all, outliers in certain features sometimes mirror outliers in the label, so the outliers could actually help your model make better predictions. Be careful, though: extreme outliers can still hurt your model.
  - If no, delete the outliers or apply more invasive feature engineering techniques, such as clipping.
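The two remedies named above, deleting outliers and clipping them, might look like this in pandas. The cap of 10 rooms is a hypothetical threshold chosen only for this example; a real threshold would come from domain knowledge or from the statistics of the column.

```python
import pandas as pd

# Hypothetical feature column with one extreme value.
rooms = pd.Series([4.0, 5.0, 5.5, 6.0, 6.5, 7.0, 55.0], name="rooms_per_house")

ROOMS_CAP = 10.0  # hypothetical upper bound, not a recommended value

# Option 1: delete examples whose value exceeds the cap.
kept = rooms[rooms <= ROOMS_CAP]

# Option 2: clip, which keeps every example but caps extreme values.
clipped = rooms.clip(upper=ROOMS_CAP)

print(len(kept))       # 6 examples remain
print(clipped.max())   # 10.0
```

Deleting shrinks the training set, while clipping preserves the example and only limits how extreme the value can be.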
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies.
Last updated 2025-08-25 UTC.