
Python Pandas - Missing Data
Missing data is a common problem in real-world scenarios, particularly in areas such as machine learning and data analysis. Missing values can significantly affect the accuracy of models and analyses, so it is crucial to handle them properly. This tutorial explains how to identify and handle missing data in Python Pandas.
When and Why Is Data Missing?
Consider an online survey conducted for a product. People often do not share all the information asked of them and may skip some questions, which leads to incomplete data. For example, some respondents might describe their experience with the product but not how long they have been using it, or vice versa. Missing data is a frequent occurrence in such real-world scenarios, and handling it effectively is essential.
Representing Missing Data in Pandas
Pandas uses different sentinel values to represent missing data (NA or NaN), depending on the data type.
- numpy.nan: Used for NumPy data types. When missing values are introduced in an integer or boolean array, the array is upcast to np.float64 or object, as NaN is a floating-point value.
- NaT: Used for missing dates and times in np.datetime64, np.timedelta64, and PeriodDtype. NaT stands for "Not a Time".
- <NA>: A more flexible missing value representation for StringDtype, Int64Dtype, Float64Dtype, BooleanDtype, and ArrowDtype. This type preserves the original data type when missing values are introduced.
Example
Let us now see how Pandas represents missing data for different data types.

    import pandas as pd
    import numpy as np

    # Reindexing to a label that does not exist (2) introduces a missing value
    ser1 = pd.Series([1, 2], dtype=np.int64).reindex([0, 1, 2])
    ser2 = pd.Series([1, 2], dtype=np.dtype("datetime64[ns]")).reindex([0, 1, 2])
    ser3 = pd.Series([1, 2], dtype="Int64").reindex([0, 1, 2])

    df = pd.DataFrame({'NumPy': ser1, 'Dates': ser2, 'Others': ser3})
    print(df)

Its output is as follows:
| NumPy | Dates | Others |
|---|---|---|
| 1.0 | 1970-01-01 00:00:00.000000001 | 1 |
| 2.0 | 1970-01-01 00:00:00.000000002 | 2 |
| NaN | NaT | <NA> |
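The upcasting described above can be checked by inspecting the dtype after a missing value is introduced. The following is a minimal sketch; the variable names are illustrative and not part of the original example.

```python
import numpy as np
import pandas as pd

# Reindexing to a label that does not exist (2) introduces one missing value.
new_index = [0, 1, 2]

int_numpy = pd.Series([1, 2], dtype=np.int64).reindex(new_index)
bool_numpy = pd.Series([True, False], dtype=bool).reindex(new_index)
int_nullable = pd.Series([1, 2], dtype="Int64").reindex(new_index)
bool_nullable = pd.Series([True, False], dtype="boolean").reindex(new_index)

# NumPy-backed dtypes are upcast so that they can hold NaN.
print(int_numpy.dtype)      # float64
print(bool_numpy.dtype)     # object

# Nullable extension dtypes keep their type and store <NA> instead.
print(int_nullable.dtype)   # Int64
print(bool_nullable.dtype)  # boolean
```

If integer or boolean columns must keep their type when values go missing, the nullable dtypes ("Int64", "boolean") are usually the simpler choice.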
Checking for Missing Values
Pandas provides the isna() and notna() functions to detect missing values, which work across different data types. They return a Boolean object of the same shape as the input (a Boolean Series for a Series) indicating whether each value is missing.
Example
The following example detects the missing values using the isna() function.

    import pandas as pd

    # A datetime Series whose second entry is missing (NaT)
    ser = pd.Series([pd.Timestamp("2020-01-01"), pd.NaT])
    print(pd.isna(ser))

On executing the above code we will get the following output:

    0    False
    1     True
    dtype: bool

It is important to note that None is also treated as a missing value when using isna() and notna().
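As a small illustration of the point about None, the sketch below applies isna() and notna() to a DataFrame and counts the missing values per column. The column names and values are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data: both NaN and None mark skipped answers.
df = pd.DataFrame(
    {
        "rating": [4.0, np.nan, 5.0, None],
        "comment": ["good", None, "great", "ok"],
    }
)

# Boolean mask with the same shape as df: True where a value is missing.
missing_mask = df.isna()

# Summing the mask counts the missing values in each column.
missing_per_column = df.isna().sum()

# Rows in which every column has a value.
complete_rows = df.notna().all(axis=1)

print(missing_mask)
print(missing_per_column)
print(df[complete_rows])
```

Counting with isna().sum() is a quick way to see which columns need attention before choosing between filling and dropping.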
Calculations with Missing Data
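When performing calculations with missing data, Pandas skips NA values by default, so a sum effectively treats NA as zero. If all the values are NA, sum() returns 0 and prod() returns 1 (the behavior since pandas 0.22); pass min_count=1 to get NA instead. Other reductions such as mean() return NaN when every value is missing.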
Example
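This example calculates the sum of the values in the "one" column of a DataFrame that contains missing data.

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(5, 3),
                      index=['a', 'c', 'e', 'f', 'h'],
                      columns=['one', 'two', 'three'])

    # Labels 'b', 'd' and 'g' do not exist, so their rows are filled with NaN
    df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

    # NaN values are skipped when summing
    print(df['one'].sum())

Its output is as follows (the exact value varies between runs because the data is random):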
2.02357685917
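Because the example above uses random data, the handling of missing values is easier to see with fixed numbers. The sketch below uses made-up values and the skipna and min_count parameters of sum() to show how NaN is skipped or propagated.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 2.5])
all_missing = pd.Series([np.nan, np.nan])

# Default: NaN is skipped, so only 1.0 and 2.5 are added.
default_sum = values.sum()

# skipna=False lets the NaN propagate into the result.
strict_sum = values.sum(skipna=False)

# mean() also skips NaN and averages the two remaining values.
mean_value = values.mean()

# The sum of an all-NaN Series is 0.0 by default.
empty_sum = all_missing.sum()

# min_count=1 requires at least one valid value, otherwise the result is NaN.
guarded_sum = all_missing.sum(min_count=1)

print(default_sum, strict_sum, mean_value, empty_sum, guarded_sum)
```

Using min_count=1 is one way to tell "no data at all" apart from a genuine total of zero.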
Replacing/Filling Missing Data
Pandas provides several methods to handle missing data. One common approach is to replace missing values with a specific value using the fillna() method.
Example
The following program shows how you can replace NaN with a scalar value (here, 0) using the fillna() method.

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(3, 3),
                      index=['a', 'c', 'e'],
                      columns=['one', 'two', 'three'])

    # Label 'b' does not exist, so its row is filled with NaN
    df = df.reindex(['a', 'b', 'c'])

    print("Input DataFrame:\n", df)
    print("Resultant DataFrame after NaN replaced with '0':")
    print(df.fillna(0))

Its output is as follows:
Input DataFrame:
| | one | two | three |
|---|---|---|---|
| a | 0.188006 | -0.685489 | -2.088354 |
| b | NaN | NaN | NaN |
| c | -0.446296 | 2.298046 | 0.346000 |
Resultant DataFrame after NaN replaced with '0':
| | one | two | three |
|---|---|---|---|
| a | 0.188006 | -0.685489 | -2.088354 |
| b | 0.000000 | 0.000000 | 0.000000 |
| c | -0.446296 | 2.298046 | 0.346000 |
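A constant is not the only possible replacement. The following sketch, using a small made-up DataFrame, shows a few other filling strategies: carrying neighboring values forward or backward with ffill() and bfill(), filling with each column's mean, and passing a dictionary of per-column values to fillna().

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0], "two": [np.nan, 5.0, 7.0]},
    index=["a", "b", "c"],
)

# Carry the last valid value forward; a leading NaN has nothing to copy from.
forward_filled = df.ffill()

# Carry the next valid value backward.
backward_filled = df.bfill()

# Fill each column with its own mean (computed from the non-missing values).
mean_filled = df.fillna(df.mean())

# Use a different replacement value for each column.
per_column = df.fillna({"one": 0.0, "two": -1.0})

print(forward_filled)
print(backward_filled)
print(mean_filled)
print(per_column)
```

Which strategy is appropriate depends on the data: forward filling suits ordered observations, while a mean or constant is a rougher default.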
Drop Missing Values
If you simply want to exclude the missing values instead of replacing them, use the dropna() function to drop them.
Example
This example removes the missing values using the dropna() function.

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(5, 3),
                      index=['a', 'c', 'e', 'f', 'h'],
                      columns=['one', 'two', 'three'])

    # Labels 'b', 'd' and 'g' do not exist, so their rows are filled with NaN
    df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

    # Drop every row that contains at least one NaN
    print(df.dropna())

Its output is as follows:
| | one | two | three |
|---|---|---|---|
| a | 0.170497 | -0.118334 | -1.078715 |
| c | 0.326345 | -0.180102 | 0.700032 |
| e | 1.972619 | -0.322132 | -1.405863 |
| f | 1.760503 | -1.179294 | 0.043965 |
| h | 0.747430 | 0.235682 | 0.973310 |
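By default dropna() removes every row that contains at least one missing value. The sketch below, on a made-up DataFrame, shows how the how, subset, thresh, and axis parameters make the rule less strict.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "one": [1.0, np.nan, 3.0, np.nan],
        "two": [4.0, np.nan, np.nan, 7.0],
        "three": [8.0, np.nan, 9.0, 10.0],
    },
    index=["a", "b", "c", "d"],
)

# Default: keep only rows without any NaN.
no_missing = df.dropna()

# how="all": drop a row only if every value in it is NaN.
not_all_missing = df.dropna(how="all")

# subset: look for NaN only in the listed columns.
one_present = df.dropna(subset=["one"])

# thresh: keep rows that have at least this many non-missing values.
at_least_two = df.dropna(thresh=2)

# axis="columns": drop columns instead of rows (after removing the empty row).
complete_columns = df.dropna(how="all").dropna(axis="columns")

print(no_missing)
print(not_all_missing)
print(one_present)
print(at_least_two)
print(complete_columns)
```

Each call returns a new DataFrame and leaves df unchanged, so the variants can be compared side by side.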