
Python Pandas Tutorial

Python Pandas - Missing Data



Missing data is a persistent problem in real-life scenarios, particularly in areas like machine learning and data analysis. Missing values can significantly impact the accuracy of models and analyses, making it crucial to address them properly. This tutorial explains how to identify and handle missing data in Python Pandas.

When and Why Is Data Missing?

Consider a scenario where an online survey is conducted for a product. People often do not share all the information asked of them and skip some questions, which leads to incomplete data. For example, some might share their experience with the product but not how long they have been using it, or vice versa. Missing data is a frequent occurrence in such real-world scenarios, and handling it effectively is essential.

Representing Missing Data in Pandas

Pandas uses different sentinel values to represent missing data (NA or NaN), depending on the data type.

  • numpy.nan: Used for NumPy data types. When missing values are introduced in an integer or boolean array, the array is upcast to np.float64 or object, because NaN is a floating-point value.

  • NaT: Used for missing dates and times in np.datetime64, np.timedelta64, and PeriodDtype. NaT stands for "Not a Time".

  • <NA>: A more flexible missing value representation for StringDtype, Int64Dtype, Float64Dtype, BooleanDtype, and ArrowDtype. This type preserves the original data type when missing values are introduced.

Example

Let us now see how Pandas represents missing data for different data types.

import pandas as pd
import numpy as np

# Reindexing adds label 2, which has no value in any of the Series
ser1 = pd.Series([1, 2], dtype=np.int64).reindex([0, 1, 2])
ser2 = pd.Series([1, 2], dtype=np.dtype("datetime64[ns]")).reindex([0, 1, 2])
ser3 = pd.Series([1, 2], dtype="Int64").reindex([0, 1, 2])

df = pd.DataFrame({'NumPy': ser1, 'Dates': ser2, 'Others': ser3})
print(df)

Its output is as follows:

   NumPy                         Dates  Others
0    1.0 1970-01-01 00:00:00.000000001       1
1    2.0 1970-01-01 00:00:00.000000002       2
2    NaN                           NaT    <NA>
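
The upcasting described above can also be checked directly by inspecting dtypes. The following is a minimal sketch, assuming a recent pandas release; the variable names are chosen for illustration only.

```python
import numpy as np
import pandas as pd

# Reindexing adds label 2, which has no value, so a missing entry appears.
numpy_ints = pd.Series([1, 2], dtype=np.int64).reindex([0, 1, 2])
nullable_ints = pd.Series([1, 2], dtype="Int64").reindex([0, 1, 2])

print(numpy_ints.dtype)     # NumPy-backed integers are upcast to float64
print(nullable_ints.dtype)  # the nullable Int64 dtype is preserved

# Each Series marks the missing entry with its own sentinel.
print(numpy_ints[2])     # nan
print(nullable_ints[2])  # <NA>
```

If integer columns must stay integers after missing values are introduced, the nullable Int64 dtype is usually the simpler choice.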

Checking for Missing Values

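Pandas provides the isna() and notna() functions to detect missing values, which work across different data types. They return a Boolean object of the same shape as the input (a Boolean Series for a Series, a Boolean DataFrame for a DataFrame), indicating whether each value is missing (isna()) or present (notna()).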

Example

The following example detects missing values using the isna() function.

import pandas as pd

# A datetime Series whose second entry is missing (NaT)
ser = pd.Series([pd.Timestamp("2020-01-01"), pd.NaT])
print(pd.isna(ser))

On executing the above code we will get the following output −

0    False
1     True
dtype: bool

It is important to note that None is also treated as a missing value when using isna() and notna().
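
As a rough illustration of how None and NaN are both detected, the sketch below builds a small DataFrame and counts the missing values per column. The survey data and column names are hypothetical and chosen only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical survey answers; column names are made up for illustration.
survey = pd.DataFrame({
    "rating": [4.0, np.nan, 5.0, 3.0],
    "comment": ["good", None, None, "ok"],
})

# isna() on a DataFrame returns a Boolean DataFrame of the same shape.
mask = survey.isna()
print(mask)

# Summing the Boolean mask counts the missing values in each column.
missing_per_column = mask.sum()
print(missing_per_column)

# notna() is the element-wise opposite of isna().
comment_present = survey["comment"].notna().tolist()
print(comment_present)
```

Counting missing values per column this way is a common first step before deciding whether to fill or drop them.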

Calculations with Missing Data

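When performing calculations with missing data, reductions such as sum() and mean() skip NA values by default. For sum(), this is equivalent to treating NA as zero. In current pandas versions, if all data in a sum are NA, the result is 0; pass min_count=1 to get NaN instead.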

Example

This example calculates the sum of the values in the "one" column of a DataFrame that contains missing data.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])

# Reindexing adds rows b, d and g, which contain only NaN
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

# NaN values are skipped when summing
print(df['one'].sum())

Its output will be similar to the following (the exact value differs between runs because the data is random):

2.02357685917
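
Because the example above uses random data, its result cannot be checked by hand. The following sketch uses fixed values instead to show how reductions handle missing data. It assumes a recent pandas release, where the sum of an all-NA Series is 0 unless min_count is given.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.0])
all_missing = pd.Series([np.nan, np.nan])

total = values.sum()                     # NaN is skipped: 1.0 + 3.0
average = values.mean()                  # mean of the two present values
strict_total = values.sum(skipna=False)  # NaN propagates when not skipped

empty_total = all_missing.sum()                # 0.0 when every value is NA
guarded_total = all_missing.sum(min_count=1)   # NaN: no valid value present

print(total, average, strict_total)
print(empty_total, guarded_total)
```

Passing min_count=1 is one way to tell "the sum is zero" apart from "there was nothing to sum".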

Replacing/Filling Missing Data

Pandas provides several methods to handle missing data. One common approach is to replace missing values with a specific value using the fillna() method.

Example

The following program shows how to replace NaN with a scalar value (here, 0) using the fillna() method.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],
                  columns=['one', 'two', 'three'])

# Reindexing adds row b, which contains only NaN, and drops row e
df = df.reindex(['a', 'b', 'c'])

print("Input DataFrame:\n", df)
print("Resultant DataFrame after NaN replaced with '0':")
print(df.fillna(0))

Its output will be similar to the following (the values differ between runs because the data is random):

Input DataFrame:
        one       two     three
a  0.188006 -0.685489 -2.088354
b       NaN       NaN       NaN
c -0.446296  2.298046  0.346000
Resultant DataFrame after NaN replaced with '0':
        one       two     three
a  0.188006 -0.685489 -2.088354
b  0.000000  0.000000  0.000000
c -0.446296  2.298046  0.346000
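
A constant is not the only possible replacement. The sketch below shows two other common choices that go beyond the example above: filling with each column's mean and forward-filling from the previous row. The data is fixed and made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0], "two": [10.0, np.nan, np.nan]},
    index=["a", "b", "c"],
)

# Replace each missing value with the mean of its own column.
# df.mean() skips NaN, giving 2.0 for "one" and 10.0 for "two".
filled_with_mean = df.fillna(df.mean())

# Propagate the last valid observation forward down each column.
filled_forward = df.ffill()

print(filled_with_mean)
print(filled_forward)
```

Which strategy is appropriate depends on the data: forward filling suits ordered data such as time series, while a mean fill ignores row order.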

Drop Missing Values

If you want to simply exclude the missing values instead of replacing them, use the dropna() function. By default, it drops every row that contains at least one missing value.

Example

This example removes the rows with missing values using the dropna() function.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])

# Reindexing adds rows b, d and g, which contain only NaN
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

# Rows b, d and g are dropped
print(df.dropna())

Its output will be similar to the following (the values differ between runs because the data is random):

        one       two     three
a  0.170497 -0.118334 -1.078715
c  0.326345 -0.180102  0.700032
e  1.972619 -0.322132 -1.405863
f  1.760503 -1.179294  0.043965
h  0.747430  0.235682  0.973310
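
In the example above every incomplete row is entirely empty, so the default behavior is all that is needed. When rows are only partly missing, the how and subset parameters of dropna() give finer control. The following is a small sketch with made-up, fixed data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "one": [1.0, np.nan, 3.0, np.nan],
        "two": [4.0, np.nan, np.nan, 7.0],
        "three": [8.0, np.nan, 9.0, 10.0],
    },
    index=["a", "b", "c", "d"],
)

# Default (how="any"): drop rows with at least one missing value.
rows_complete = df.dropna()

# how="all": drop only the rows in which every value is missing.
rows_not_empty = df.dropna(how="all")

# subset: consider only the listed columns when looking for missing values.
rows_with_one = df.dropna(subset=["one"])

print(rows_complete.index.tolist())   # ['a']
print(rows_not_empty.index.tolist())  # ['a', 'c', 'd']
print(rows_with_one.index.tolist())   # ['a', 'c']
```

Note that dropna() returns a new DataFrame and leaves the original unchanged.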