# Working with missing data
## Values considered “missing”
pandas uses different sentinel values to represent a missing value (also referred to as NA) depending on the data type.

`numpy.nan` for NumPy data types. The disadvantage of using NumPy data types is that the original data type will be coerced to `np.float64` or `object`.
```python
In [1]: pd.Series([1, 2], dtype=np.int64).reindex([0, 1, 2])
Out[1]:
0    1.0
1    2.0
2    NaN
dtype: float64

In [2]: pd.Series([True, False], dtype=np.bool_).reindex([0, 1, 2])
Out[2]:
0     True
1    False
2      NaN
dtype: object
```
`NaT` for NumPy `np.datetime64`, `np.timedelta64`, and `PeriodDtype`. For typing applications, use `api.types.NaTType`.
```python
In [3]: pd.Series([1, 2], dtype=np.dtype("timedelta64[ns]")).reindex([0, 1, 2])
Out[3]:
0    0 days 00:00:00.000000001
1    0 days 00:00:00.000000002
2                          NaT
dtype: timedelta64[ns]

In [4]: pd.Series([1, 2], dtype=np.dtype("datetime64[ns]")).reindex([0, 1, 2])
Out[4]:
0   1970-01-01 00:00:00.000000001
1   1970-01-01 00:00:00.000000002
2                             NaT
dtype: datetime64[ns]

In [5]: pd.Series(["2020", "2020"], dtype=pd.PeriodDtype("D")).reindex([0, 1, 2])
Out[5]:
0    2020-01-01
1    2020-01-01
2           NaT
dtype: period[D]
```
`NA` for `StringDtype`, `Int64Dtype` (and other bit widths), `Float64Dtype` (and other bit widths), `BooleanDtype` and `ArrowDtype`. These types will maintain the original data type of the data. For typing applications, use `api.types.NAType`.
```python
In [6]: pd.Series([1, 2], dtype="Int64").reindex([0, 1, 2])
Out[6]:
0       1
1       2
2    <NA>
dtype: Int64

In [7]: pd.Series([True, False], dtype="boolean[pyarrow]").reindex([0, 1, 2])
Out[7]:
0     True
1    False
2     <NA>
dtype: bool[pyarrow]
```
To detect these missing values, use the `isna()` or `notna()` methods.
```python
In [8]: ser = pd.Series([pd.Timestamp("2020-01-01"), pd.NaT])

In [9]: ser
Out[9]:
0   2020-01-01
1          NaT
dtype: datetime64[ns]

In [10]: pd.isna(ser)
Out[10]:
0    False
1     True
dtype: bool
```
Note

`isna()` or `notna()` will also consider `None` a missing value.
```python
In [11]: ser = pd.Series([1, None], dtype=object)

In [12]: ser
Out[12]:
0       1
1    None
dtype: object

In [13]: pd.isna(ser)
Out[13]:
0    False
1     True
dtype: bool
```
Warning

Equality comparisons between `np.nan`, `NaT`, and `NA` do not act like `None`:
```python
In [14]: None == None  # noqa: E711
Out[14]: True

In [15]: np.nan == np.nan
Out[15]: False

In [16]: pd.NaT == pd.NaT
Out[16]: False

In [17]: pd.NA == pd.NA
Out[17]: <NA>
```
Therefore, an equality comparison between a `DataFrame` or `Series` with one of these missing values does not provide the same information as `isna()` or `notna()`.
```python
In [18]: ser = pd.Series([True, None], dtype="boolean[pyarrow]")

In [19]: ser == pd.NA
Out[19]:
0    <NA>
1    <NA>
dtype: bool[pyarrow]

In [20]: pd.isna(ser)
Out[20]:
0    False
1     True
dtype: bool
```
## NA semantics
Warning

Experimental: the behaviour of `NA` can still change without warning.
Starting from pandas 1.0, an experimental `NA` value (singleton) is available to represent scalar missing values. The goal of `NA` is to provide a “missing” indicator that can be used consistently across data types (instead of `np.nan`, `None` or `pd.NaT` depending on the data type).
For example, when having missing values in a `Series` with the nullable integer dtype, it will use `NA`:
```python
In [21]: s = pd.Series([1, 2, None], dtype="Int64")

In [22]: s
Out[22]:
0       1
1       2
2    <NA>
dtype: Int64

In [23]: s[2]
Out[23]: <NA>

In [24]: s[2] is pd.NA
Out[24]: True
```
Currently, pandas does not yet use these `NA`-based data types by default in a `DataFrame` or `Series`, so you need to specify the dtype explicitly. An easy way to convert to these dtypes is explained in the conversion section.
### Propagation in arithmetic and comparison operations
In general, missing values propagate in operations involving `NA`. When one of the operands is unknown, the outcome of the operation is also unknown.
For example, `NA` propagates in arithmetic operations, similarly to `np.nan`:
```python
In [25]: pd.NA + 1
Out[25]: <NA>

In [26]: "a" * pd.NA
Out[26]: <NA>
```
There are a few special cases when the result is known, even when one of the operands is `NA`.
```python
In [27]: pd.NA ** 0
Out[27]: 1

In [28]: 1 ** pd.NA
Out[28]: 1
```
In equality and comparison operations, `NA` also propagates. This deviates from the behaviour of `np.nan`, where comparisons with `np.nan` always return `False`.
```python
In [29]: pd.NA == 1
Out[29]: <NA>

In [30]: pd.NA == pd.NA
Out[30]: <NA>

In [31]: pd.NA < 2.5
Out[31]: <NA>
```
To check if a value is equal to `NA`, use `isna()`:
```python
In [32]: pd.isna(pd.NA)
Out[32]: True
```
Note

One exception to this basic propagation rule is reductions (such as the mean or the minimum), where pandas defaults to skipping missing values. See the calculation section for more.
### Logical operations
For logical operations, `NA` follows the rules of three-valued logic (or Kleene logic, similarly to R, SQL and Julia). Under this logic, missing values only propagate when it is logically required.
For example, for the logical “or” operation (`|`), if one of the operands is `True`, we already know the result will be `True`, regardless of the other value (so regardless of whether the missing value would be `True` or `False`). In this case, `NA` does not propagate:
```python
In [33]: True | False
Out[33]: True

In [34]: True | pd.NA
Out[34]: True

In [35]: pd.NA | True
Out[35]: True
```
On the other hand, if one of the operands is `False`, the result depends on the value of the other operand. Therefore, in this case `NA` propagates:
```python
In [36]: False | True
Out[36]: True

In [37]: False | False
Out[37]: False

In [38]: False | pd.NA
Out[38]: <NA>
```
The behaviour of the logical “and” operation (`&`) can be derived using similar logic (where now `NA` will not propagate if one of the operands is already `False`):
```python
In [39]: False & True
Out[39]: False

In [40]: False & False
Out[40]: False

In [41]: False & pd.NA
Out[41]: False
```
```python
In [42]: True & True
Out[42]: True

In [43]: True & False
Out[43]: False

In [44]: True & pd.NA
Out[44]: <NA>
```
### NA in a boolean context
Since the actual value of an NA is unknown, it is ambiguous to convert NA to a boolean value.
```python
In [45]: bool(pd.NA)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[45], line 1
----> 1 bool(pd.NA)

File missing.pyx:392, in pandas._libs.missing.NAType.__bool__()

TypeError: boolean value of NA is ambiguous
```
This also means that `NA` cannot be used in a context where it is evaluated to a boolean, such as `if condition: ...` where `condition` can potentially be `NA`. In such cases, `isna()` can be used to check for `NA`, or `condition` being `NA` can be avoided, for example by filling missing values beforehand.
A similar situation occurs when using `Series` or `DataFrame` objects in `if` statements; see Using if/truth statements with pandas.
### NumPy ufuncs
`pandas.NA` implements NumPy’s `__array_ufunc__` protocol. Most ufuncs work with `NA`, and generally return `NA`:
```python
In [46]: np.log(pd.NA)
Out[46]: <NA>

In [47]: np.add(pd.NA, 1)
Out[47]: <NA>
```
Warning

Currently, ufuncs involving an ndarray and `NA` will return an object-dtype array filled with NA values.
```python
In [48]: a = np.array([1, 2, 3])

In [49]: np.greater(a, pd.NA)
Out[49]: array([<NA>, <NA>, <NA>], dtype=object)
```
The return type here may change to return a different array type in the future.

See DataFrame interoperability with NumPy functions for more on ufuncs.
## Conversion
If you have a `DataFrame` or `Series` using `np.nan`, `Series.convert_dtypes()` and `DataFrame.convert_dtypes()` can convert the data to the data types that use `NA`, such as `Int64Dtype` or `ArrowDtype`. This is especially helpful after reading in data sets from IO methods where data types were inferred.
In this example, the data types of both columns are converted.
```python
In [50]: import io

In [51]: data = io.StringIO("a,b\n,True\n2,")

In [52]: df = pd.read_csv(data)

In [53]: df.dtypes
Out[53]:
a    float64
b     object
dtype: object

In [54]: df_conv = df.convert_dtypes()

In [55]: df_conv
Out[55]:
      a     b
0  <NA>  True
1     2  <NA>

In [56]: df_conv.dtypes
Out[56]:
a      Int64
b    boolean
dtype: object
```
## Inserting missing data
You can insert missing values by simply assigning to a `Series` or `DataFrame`. The missing value sentinel used will be chosen based on the dtype.
```python
In [57]: ser = pd.Series([1.0, 2.0, 3.0])

In [58]: ser.loc[0] = None

In [59]: ser
Out[59]:
0    NaN
1    2.0
2    3.0
dtype: float64

In [60]: ser = pd.Series([pd.Timestamp("2021"), pd.Timestamp("2021")])

In [61]: ser.iloc[0] = np.nan

In [62]: ser
Out[62]:
0          NaT
1   2021-01-01
dtype: datetime64[ns]

In [63]: ser = pd.Series([True, False], dtype="boolean[pyarrow]")

In [64]: ser.iloc[0] = None

In [65]: ser
Out[65]:
0     <NA>
1    False
dtype: bool[pyarrow]
```
For `object` types, pandas will use the value given:
```python
In [66]: s = pd.Series(["a", "b", "c"], dtype=object)

In [67]: s.loc[0] = None

In [68]: s.loc[1] = np.nan

In [69]: s
Out[69]:
0    None
1     NaN
2       c
dtype: object
```
## Calculations with missing data
Missing values propagate through arithmetic operations between pandas objects.
```python
In [70]: ser1 = pd.Series([np.nan, np.nan, 2, 3])

In [71]: ser2 = pd.Series([np.nan, 1, np.nan, 4])

In [72]: ser1
Out[72]:
0    NaN
1    NaN
2    2.0
3    3.0
dtype: float64

In [73]: ser2
Out[73]:
0    NaN
1    1.0
2    NaN
3    4.0
dtype: float64

In [74]: ser1 + ser2
Out[74]:
0    NaN
1    NaN
2    NaN
3    7.0
dtype: float64
```
The descriptive statistics and computational methods discussed in the data structure overview (and listed here and here) all account for missing data.
When summing data, NA values or empty data will be treated as zero.
```python
In [75]: pd.Series([np.nan]).sum()
Out[75]: 0.0

In [76]: pd.Series([], dtype="float64").sum()
Out[76]: 0.0
```
When taking the product, NA values or empty data will be treated as 1.
```python
In [77]: pd.Series([np.nan]).prod()
Out[77]: 1.0

In [78]: pd.Series([], dtype="float64").prod()
Out[78]: 1.0
```
Cumulative methods like `cumsum()` and `cumprod()` ignore NA values by default, but preserve them in the resulting arrays. To override this behaviour and include NA values in the calculation, use `skipna=False`.
```python
In [79]: ser = pd.Series([1, np.nan, 3, np.nan])

In [80]: ser
Out[80]:
0    1.0
1    NaN
2    3.0
3    NaN
dtype: float64

In [81]: ser.cumsum()
Out[81]:
0    1.0
1    NaN
2    4.0
3    NaN
dtype: float64

In [82]: ser.cumsum(skipna=False)
Out[82]:
0    1.0
1    NaN
2    NaN
3    NaN
dtype: float64
```
## Dropping missing data
`dropna()` drops rows or columns with missing data.
```python
In [83]: df = pd.DataFrame([[np.nan, 1, 2], [1, 2, np.nan], [1, 2, 3]])

In [84]: df
Out[84]:
     0  1    2
0  NaN  1  2.0
1  1.0  2  NaN
2  1.0  2  3.0

In [85]: df.dropna()
Out[85]:
     0  1    2
2  1.0  2  3.0

In [86]: df.dropna(axis=1)
Out[86]:
   1
0  1
1  2
2  2

In [87]: ser = pd.Series([1, pd.NA], dtype="int64[pyarrow]")

In [88]: ser.dropna()
Out[88]:
0    1
dtype: int64[pyarrow]
```
## Filling missing data
### Filling by value
`fillna()` replaces NA values with non-NA data.
Replace NA with a scalar value
```python
In [89]: data = {"np": [1.0, np.nan, np.nan, 2], "arrow": pd.array([1.0, pd.NA, pd.NA, 2], dtype="float64[pyarrow]")}

In [90]: df = pd.DataFrame(data)

In [91]: df
Out[91]:
    np  arrow
0  1.0    1.0
1  NaN   <NA>
2  NaN   <NA>
3  2.0    2.0

In [92]: df.fillna(0)
Out[92]:
    np  arrow
0  1.0    1.0
1  0.0    0.0
2  0.0    0.0
3  2.0    2.0
```
Fill gaps forward or backward
```python
In [93]: df.ffill()
Out[93]:
    np  arrow
0  1.0    1.0
1  1.0    1.0
2  1.0    1.0
3  2.0    2.0

In [94]: df.bfill()
Out[94]:
    np  arrow
0  1.0    1.0
1  2.0    2.0
2  2.0    2.0
3  2.0    2.0
```
Limit the number of NA values filled
```python
In [95]: df.ffill(limit=1)
Out[95]:
    np  arrow
0  1.0    1.0
1  1.0    1.0
2  NaN   <NA>
3  2.0    2.0
```
NA values can be replaced with the corresponding value from a `Series` or `DataFrame` where the index and columns align between the original object and the filled object.
```python
In [96]: dff = pd.DataFrame(np.arange(30, dtype=np.float64).reshape(10, 3), columns=list("ABC"))

In [97]: dff.iloc[3:5, 0] = np.nan

In [98]: dff.iloc[4:6, 1] = np.nan

In [99]: dff.iloc[5:8, 2] = np.nan

In [100]: dff
Out[100]:
      A     B     C
0   0.0   1.0   2.0
1   3.0   4.0   5.0
2   6.0   7.0   8.0
3   NaN  10.0  11.0
4   NaN   NaN  14.0
5  15.0   NaN   NaN
6  18.0  19.0   NaN
7  21.0  22.0   NaN
8  24.0  25.0  26.0
9  27.0  28.0  29.0

In [101]: dff.fillna(dff.mean())
Out[101]:
       A     B          C
0   0.00   1.0   2.000000
1   3.00   4.0   5.000000
2   6.00   7.0   8.000000
3  14.25  10.0  11.000000
4  14.25  14.5  14.000000
5  15.00  14.5  13.571429
6  18.00  19.0  13.571429
7  21.00  22.0  13.571429
8  24.00  25.0  26.000000
9  27.00  28.0  29.000000
```
Note

`DataFrame.where()` can also be used to fill NA values, with the same result as above.
```python
In [102]: dff.where(pd.notna(dff), dff.mean(), axis="columns")
Out[102]:
       A     B          C
0   0.00   1.0   2.000000
1   3.00   4.0   5.000000
2   6.00   7.0   8.000000
3  14.25  10.0  11.000000
4  14.25  14.5  14.000000
5  15.00  14.5  13.571429
6  18.00  19.0  13.571429
7  21.00  22.0  13.571429
8  24.00  25.0  26.000000
9  27.00  28.0  29.000000
```
## Interpolation
`DataFrame.interpolate()` and `Series.interpolate()` fill NA values using various interpolation methods.
```python
In [103]: df = pd.DataFrame(
   .....:     {
   .....:         "A": [1, 2.1, np.nan, 4.7, 5.6, 6.8],
   .....:         "B": [0.25, np.nan, np.nan, 4, 12.2, 14.4],
   .....:     }
   .....: )

In [104]: df
Out[104]:
     A      B
0  1.0   0.25
1  2.1    NaN
2  NaN    NaN
3  4.7   4.00
4  5.6  12.20
5  6.8  14.40

In [105]: df.interpolate()
Out[105]:
     A      B
0  1.0   0.25
1  2.1   1.50
2  3.4   2.75
3  4.7   4.00
4  5.6  12.20
5  6.8  14.40

In [106]: idx = pd.date_range("2020-01-01", periods=10, freq="D")

In [107]: data = np.random.default_rng(2).integers(0, 10, 10).astype(np.float64)

In [108]: ts = pd.Series(data, index=idx)

In [109]: ts.iloc[[1, 2, 5, 6, 9]] = np.nan

In [110]: ts
Out[110]:
2020-01-01    8.0
2020-01-02    NaN
2020-01-03    NaN
2020-01-04    2.0
2020-01-05    4.0
2020-01-06    NaN
2020-01-07    NaN
2020-01-08    0.0
2020-01-09    3.0
2020-01-10    NaN
Freq: D, dtype: float64

In [111]: ts.plot()
Out[111]: <Axes: >
```
```python
In [112]: ts.interpolate()
Out[112]:
2020-01-01    8.000000
2020-01-02    6.000000
2020-01-03    4.000000
2020-01-04    2.000000
2020-01-05    4.000000
2020-01-06    2.666667
2020-01-07    1.333333
2020-01-08    0.000000
2020-01-09    3.000000
2020-01-10    3.000000
Freq: D, dtype: float64

In [113]: ts.interpolate().plot()
Out[113]: <Axes: >
```
Interpolation relative to a `Timestamp` in the `DatetimeIndex` is available by setting `method="time"`:
```python
In [114]: ts2 = ts.iloc[[0, 1, 3, 7, 9]]

In [115]: ts2
Out[115]:
2020-01-01    8.0
2020-01-02    NaN
2020-01-04    2.0
2020-01-08    0.0
2020-01-10    NaN
dtype: float64

In [116]: ts2.interpolate()
Out[116]:
2020-01-01    8.0
2020-01-02    5.0
2020-01-04    2.0
2020-01-08    0.0
2020-01-10    0.0
dtype: float64

In [117]: ts2.interpolate(method="time")
Out[117]:
2020-01-01    8.0
2020-01-02    6.0
2020-01-04    2.0
2020-01-08    0.0
2020-01-10    0.0
dtype: float64
```
For a floating-point index, use `method='values'`:
```python
In [118]: idx = [0.0, 1.0, 10.0]

In [119]: ser = pd.Series([0.0, np.nan, 10.0], idx)

In [120]: ser
Out[120]:
0.0      0.0
1.0      NaN
10.0    10.0
dtype: float64

In [121]: ser.interpolate()
Out[121]:
0.0      0.0
1.0      5.0
10.0    10.0
dtype: float64

In [122]: ser.interpolate(method="values")
Out[122]:
0.0      0.0
1.0      1.0
10.0    10.0
dtype: float64
```
If you have scipy installed, you can pass the name of a 1-d interpolation routine to `method`, as specified in the scipy interpolation documentation and reference guide. The appropriate interpolation method will depend on the data type.
Tip

If you are dealing with a time series that is growing at an increasing rate, use `method='barycentric'`.

If you have values approximating a cumulative distribution function, use `method='pchip'`.

To fill missing values with the goal of smooth plotting, use `method='akima'`.
```python
In [123]: df = pd.DataFrame(
   .....:     {
   .....:         "A": [1, 2.1, np.nan, 4.7, 5.6, 6.8],
   .....:         "B": [0.25, np.nan, np.nan, 4, 12.2, 14.4],
   .....:     }
   .....: )

In [124]: df
Out[124]:
     A      B
0  1.0   0.25
1  2.1    NaN
2  NaN    NaN
3  4.7   4.00
4  5.6  12.20
5  6.8  14.40

In [125]: df.interpolate(method="barycentric")
Out[125]:
      A       B
0  1.00   0.250
1  2.10  -7.660
2  3.53  -4.515
3  4.70   4.000
4  5.60  12.200
5  6.80  14.400

In [126]: df.interpolate(method="pchip")
Out[126]:
         A          B
0  1.00000   0.250000
1  2.10000   0.672808
2  3.43454   1.928950
3  4.70000   4.000000
4  5.60000  12.200000
5  6.80000  14.400000

In [127]: df.interpolate(method="akima")
Out[127]:
          A          B
0  1.000000   0.250000
1  2.100000  -0.873316
2  3.406667   0.320034
3  4.700000   4.000000
4  5.600000  12.200000
5  6.800000  14.400000
```
When interpolating via a polynomial or spline approximation, you must also specifythe degree or order of the approximation:
```python
In [128]: df.interpolate(method="spline", order=2)
Out[128]:
          A          B
0  1.000000   0.250000
1  2.100000  -0.428598
2  3.404545   1.206900
3  4.700000   4.000000
4  5.600000  12.200000
5  6.800000  14.400000

In [129]: df.interpolate(method="polynomial", order=2)
Out[129]:
          A          B
0  1.000000   0.250000
1  2.100000  -2.703846
2  3.451351  -1.453846
3  4.700000   4.000000
4  5.600000  12.200000
5  6.800000  14.400000
```
Comparing several methods.
```python
In [130]: np.random.seed(2)

In [131]: ser = pd.Series(np.arange(1, 10.1, 0.25) ** 2 + np.random.randn(37))

In [132]: missing = np.array([4, 13, 14, 15, 16, 17, 18, 20, 29])

In [133]: ser.iloc[missing] = np.nan

In [134]: methods = ["linear", "quadratic", "cubic"]

In [135]: df = pd.DataFrame({m: ser.interpolate(method=m) for m in methods})

In [136]: df.plot()
Out[136]: <Axes: >
```
Interpolating new observations from expanding data with `Series.reindex()`:
```python
In [137]: ser = pd.Series(np.sort(np.random.uniform(size=100)))

# interpolate at new_index
In [138]: new_index = ser.index.union(pd.Index([49.25, 49.5, 49.75, 50.25, 50.5, 50.75]))

In [139]: interp_s = ser.reindex(new_index).interpolate(method="pchip")

In [140]: interp_s.loc[49:51]
Out[140]:
49.00    0.471410
49.25    0.476841
49.50    0.481780
49.75    0.485998
50.00    0.489266
50.25    0.491814
50.50    0.493995
50.75    0.495763
51.00    0.497074
dtype: float64
```
### Interpolation limits
`interpolate()` accepts a `limit` keyword argument to limit the number of consecutive `NaN` values filled since the last valid observation:
```python
In [141]: ser = pd.Series([np.nan, np.nan, 5, np.nan, np.nan, np.nan, 13, np.nan, np.nan])

In [142]: ser
Out[142]:
0     NaN
1     NaN
2     5.0
3     NaN
4     NaN
5     NaN
6    13.0
7     NaN
8     NaN
dtype: float64

In [143]: ser.interpolate()
Out[143]:
0     NaN
1     NaN
2     5.0
3     7.0
4     9.0
5    11.0
6    13.0
7    13.0
8    13.0
dtype: float64

In [144]: ser.interpolate(limit=1)
Out[144]:
0     NaN
1     NaN
2     5.0
3     7.0
4     NaN
5     NaN
6    13.0
7    13.0
8     NaN
dtype: float64
```
By default, `NaN` values are filled in a forward direction. Use the `limit_direction` parameter to fill backward or from both directions.
```python
In [145]: ser.interpolate(limit=1, limit_direction="backward")
Out[145]:
0     NaN
1     5.0
2     5.0
3     NaN
4     NaN
5    11.0
6    13.0
7     NaN
8     NaN
dtype: float64

In [146]: ser.interpolate(limit=1, limit_direction="both")
Out[146]:
0     NaN
1     5.0
2     5.0
3     7.0
4     NaN
5    11.0
6    13.0
7    13.0
8     NaN
dtype: float64

In [147]: ser.interpolate(limit_direction="both")
Out[147]:
0     5.0
1     5.0
2     5.0
3     7.0
4     9.0
5    11.0
6    13.0
7    13.0
8    13.0
dtype: float64
```
By default, `NaN` values are filled whether they are surrounded by existing valid values or outside existing valid values. The `limit_area` parameter restricts filling to either inside or outside values.
```python
# fill one consecutive inside value in both directions
In [148]: ser.interpolate(limit_direction="both", limit_area="inside", limit=1)
Out[148]:
0     NaN
1     NaN
2     5.0
3     7.0
4     NaN
5    11.0
6    13.0
7     NaN
8     NaN
dtype: float64

# fill all consecutive outside values backward
In [149]: ser.interpolate(limit_direction="backward", limit_area="outside")
Out[149]:
0     5.0
1     5.0
2     5.0
3     NaN
4     NaN
5     NaN
6    13.0
7     NaN
8     NaN
dtype: float64

# fill all consecutive outside values in both directions
In [150]: ser.interpolate(limit_direction="both", limit_area="outside")
Out[150]:
0     5.0
1     5.0
2     5.0
3     NaN
4     NaN
5     NaN
6    13.0
7    13.0
8    13.0
dtype: float64
```
## Replacing values
`Series.replace()` and `DataFrame.replace()` can be used similarly to `Series.fillna()` and `DataFrame.fillna()` to replace or insert missing values.
```python
In [151]: df = pd.DataFrame(np.eye(3))

In [152]: df
Out[152]:
     0    1    2
0  1.0  0.0  0.0
1  0.0  1.0  0.0
2  0.0  0.0  1.0

In [153]: df_missing = df.replace(0, np.nan)

In [154]: df_missing
Out[154]:
     0    1    2
0  1.0  NaN  NaN
1  NaN  1.0  NaN
2  NaN  NaN  1.0

In [155]: df_filled = df_missing.replace(np.nan, 2)

In [156]: df_filled
Out[156]:
     0    1    2
0  1.0  2.0  2.0
1  2.0  1.0  2.0
2  2.0  2.0  1.0
```
Replacing more than one value is possible by passing a list.
```python
In [157]: df_filled.replace([1, 44], [2, 28])
Out[157]:
     0    1    2
0  2.0  2.0  2.0
1  2.0  2.0  2.0
2  2.0  2.0  2.0
```
Replacing using a mapping dict.
```python
In [158]: df_filled.replace({1: 44, 2: 28})
Out[158]:
      0     1     2
0  44.0  28.0  28.0
1  28.0  44.0  28.0
2  28.0  28.0  44.0
```
### Regular expression replacement
Note

Python strings prefixed with the `r` character, such as `r'hello world'`, are “raw” strings. They have different semantics regarding backslashes than strings without this prefix. Backslashes in raw strings will be interpreted as an escaped backslash, e.g., `r'\' == '\\'`.
Replace the ‘.’ with `NaN`:
```python
In [159]: d = {"a": list(range(4)), "b": list("ab.."), "c": ["a", "b", np.nan, "d"]}

In [160]: df = pd.DataFrame(d)

In [161]: df.replace(".", np.nan)
Out[161]:
   a    b    c
0  0    a    a
1  1    b    b
2  2  NaN  NaN
3  3  NaN    d
```
Replace the ‘.’ with `NaN`, using a regular expression that removes surrounding whitespace:
```python
In [162]: df.replace(r"\s*\.\s*", np.nan, regex=True)
Out[162]:
   a    b    c
0  0    a    a
1  1    b    b
2  2  NaN  NaN
3  3  NaN    d
```
Replace with a list of regexes.
```python
In [163]: df.replace([r"\.", r"(a)"], ["dot", r"\1stuff"], regex=True)
Out[163]:
   a       b       c
0  0  astuff  astuff
1  1       b       b
2  2     dot     NaN
3  3     dot       d
```
Replace with a regex in a mapping dict.
```python
In [164]: df.replace({"b": r"\s*\.\s*"}, {"b": np.nan}, regex=True)
Out[164]:
   a    b    c
0  0    a    a
1  1    b    b
2  2  NaN  NaN
3  3  NaN    d
```
Pass nested dictionaries of regular expressions that use the `regex` keyword:
```python
In [165]: df.replace({"b": {"b": r""}}, regex=True)
Out[165]:
   a  b    c
0  0  a    a
1  1       b
2  2  .  NaN
3  3  .    d

In [166]: df.replace(regex={"b": {r"\s*\.\s*": np.nan}})
Out[166]:
   a    b    c
0  0    a    a
1  1    b    b
2  2  NaN  NaN
3  3  NaN    d

In [167]: df.replace({"b": r"\s*(\.)\s*"}, {"b": r"\1ty"}, regex=True)
Out[167]:
   a    b    c
0  0    a    a
1  1    b    b
2  2  .ty  NaN
3  3  .ty    d
```
Pass a list of regular expressions that will replace matches with a scalar.
```python
In [168]: df.replace([r"\s*\.\s*", r"a|b"], "placeholder", regex=True)
Out[168]:
   a            b            c
0  0  placeholder  placeholder
1  1  placeholder  placeholder
2  2  placeholder          NaN
3  3  placeholder            d
```
All of the regular expression examples can also be passed with the `to_replace` argument as the `regex` argument. In this case the `value` argument must be passed explicitly by name, or `regex` must be a nested dictionary.
```python
In [169]: df.replace(regex=[r"\s*\.\s*", r"a|b"], value="placeholder")
Out[169]:
   a            b            c
0  0  placeholder  placeholder
1  1  placeholder  placeholder
2  2  placeholder          NaN
3  3  placeholder            d
```
Note

A regular expression object from `re.compile` is a valid input as well.