
Working with missing data#

Values considered “missing”#

pandas uses different sentinel values to represent a missing value (also referred to as NA) depending on the data type.

numpy.nan for NumPy data types. The disadvantage of using NumPy data types is that the original data type will be coerced to np.float64 or object.

In [1]: pd.Series([1, 2], dtype=np.int64).reindex([0, 1, 2])
Out[1]:
0    1.0
1    2.0
2    NaN
dtype: float64

In [2]: pd.Series([True, False], dtype=np.bool_).reindex([0, 1, 2])
Out[2]:
0     True
1    False
2      NaN
dtype: object

NaT for NumPy np.datetime64, np.timedelta64, and PeriodDtype. For typing applications, use api.types.NaTType.

In [3]: pd.Series([1, 2], dtype=np.dtype("timedelta64[ns]")).reindex([0, 1, 2])
Out[3]:
0   0 days 00:00:00.000000001
1   0 days 00:00:00.000000002
2                         NaT
dtype: timedelta64[ns]

In [4]: pd.Series([1, 2], dtype=np.dtype("datetime64[ns]")).reindex([0, 1, 2])
Out[4]:
0   1970-01-01 00:00:00.000000001
1   1970-01-01 00:00:00.000000002
2                             NaT
dtype: datetime64[ns]

In [5]: pd.Series(["2020", "2020"], dtype=pd.PeriodDtype("D")).reindex([0, 1, 2])
Out[5]:
0    2020-01-01
1    2020-01-01
2           NaT
dtype: period[D]

NA for StringDtype, Int64Dtype (and other bit widths), Float64Dtype (and other bit widths), BooleanDtype and ArrowDtype. These types will maintain the original data type of the data. For typing applications, use api.types.NAType.

In [6]: pd.Series([1, 2], dtype="Int64").reindex([0, 1, 2])
Out[6]:
0       1
1       2
2    <NA>
dtype: Int64

In [7]: pd.Series([True, False], dtype="boolean[pyarrow]").reindex([0, 1, 2])
Out[7]:
0     True
1    False
2     <NA>
dtype: bool[pyarrow]

To detect these missing values, use the isna() or notna() methods.

In [8]: ser = pd.Series([pd.Timestamp("2020-01-01"), pd.NaT])

In [9]: ser
Out[9]:
0   2020-01-01
1          NaT
dtype: datetime64[ns]

In [10]: pd.isna(ser)
Out[10]:
0    False
1     True
dtype: bool

Note

isna() or notna() will also consider None a missing value.

In [11]: ser = pd.Series([1, None], dtype=object)

In [12]: ser
Out[12]:
0       1
1    None
dtype: object

In [13]: pd.isna(ser)
Out[13]:
0    False
1     True
dtype: bool

Warning

Equality comparisons between np.nan, NaT, and NA do not act like None:

In [14]: None == None  # noqa: E711
Out[14]: True

In [15]: np.nan == np.nan
Out[15]: False

In [16]: pd.NaT == pd.NaT
Out[16]: False

In [17]: pd.NA == pd.NA
Out[17]: <NA>

Therefore, an equality comparison between a DataFrame or Series with one of these missing values does not provide the same information as isna() or notna().

In [18]: ser = pd.Series([True, None], dtype="boolean[pyarrow]")

In [19]: ser == pd.NA
Out[19]:
0    <NA>
1    <NA>
dtype: bool[pyarrow]

In [20]: pd.isna(ser)
Out[20]:
0    False
1     True
dtype: bool

NA semantics#

Warning

Experimental: the behaviour of NA can still change without warning.

Starting from pandas 1.0, an experimental NA value (singleton) is available to represent scalar missing values. The goal of NA is to provide a "missing" indicator that can be used consistently across data types (instead of np.nan, None or pd.NaT depending on the data type).

For example, when having missing values in a Series with the nullable integer dtype, it will use NA:

In [21]: s = pd.Series([1, 2, None], dtype="Int64")

In [22]: s
Out[22]:
0       1
1       2
2    <NA>
dtype: Int64

In [23]: s[2]
Out[23]: <NA>

In [24]: s[2] is pd.NA
Out[24]: True

Currently, pandas does not use these NA-based data types by default in a DataFrame or Series, so you need to specify the dtype explicitly. An easy way to convert to these dtypes is explained in the conversion section.
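As a minimal sketch (the values are illustrative), one way to opt in explicitly is astype with the nullable "Int64" dtype; note the capital "I", which distinguishes it from NumPy's int64:

```python
import numpy as np
import pandas as pd

s_np = pd.Series([1, 2, np.nan])  # NumPy-backed: coerced to float64, NaN sentinel
s_na = s_np.astype("Int64")       # nullable extension dtype: integers kept, pd.NA sentinel

print(s_np.dtype)       # float64
print(s_na.dtype)       # Int64
print(s_na[2] is pd.NA)
```

The same spelling ("Int64", "Float64", "boolean") also works as the dtype argument when constructing the Series directly.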

Propagation in arithmetic and comparison operations#

In general, missing values propagate in operations involving NA. When one of the operands is unknown, the outcome of the operation is also unknown.

For example, NA propagates in arithmetic operations, similarly to np.nan:

In [25]: pd.NA + 1
Out[25]: <NA>

In [26]: "a" * pd.NA
Out[26]: <NA>

There are a few special cases when the result is known, even when one of the operands is NA.

In [27]: pd.NA ** 0
Out[27]: 1

In [28]: 1 ** pd.NA
Out[28]: 1

In equality and comparison operations, NA also propagates. This deviates from the behaviour of np.nan, where comparisons with np.nan always return False.

In [29]: pd.NA == 1
Out[29]: <NA>

In [30]: pd.NA == pd.NA
Out[30]: <NA>

In [31]: pd.NA < 2.5
Out[31]: <NA>

To check if a value is equal to NA, use isna():

In [32]: pd.isna(pd.NA)
Out[32]: True

Note

An exception to this basic propagation rule are reductions (such as the mean or the minimum), where pandas defaults to skipping missing values. See the calculation section for more.
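For instance (a small sketch with made-up values), reductions on a nullable-dtype Series skip NA by default, while passing skipna=False makes the NA propagate into the result:

```python
import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")

print(s.sum())               # NA skipped: 1 + 2 = 3
print(s.sum(skipna=False))   # NA propagates: <NA>
```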

Logical operations#

For logical operations, NA follows the rules of three-valued logic (or Kleene logic, similarly to R, SQL and Julia). Under this logic, missing values propagate only when it is logically required.

For example, for the logical "or" operation (|), if one of the operands is True, we already know the result will be True, regardless of the other value (so regardless of whether the missing value would be True or False). In this case, NA does not propagate:

In [33]: True | False
Out[33]: True

In [34]: True | pd.NA
Out[34]: True

In [35]: pd.NA | True
Out[35]: True

On the other hand, if one of the operands is False, the result depends on the value of the other operand. Therefore, in this case NA propagates:

In [36]: False | True
Out[36]: True

In [37]: False | False
Out[37]: False

In [38]: False | pd.NA
Out[38]: <NA>

The behaviour of the logical "and" operation (&) can be derived using similar logic (where now NA will not propagate if one of the operands is already False):

In [39]: False & True
Out[39]: False

In [40]: False & False
Out[40]: False

In [41]: False & pd.NA
Out[41]: False

In [42]: True & True
Out[42]: True

In [43]: True & False
Out[43]: False

In [44]: True & pd.NA
Out[44]: <NA>

NA in a boolean context#

Since the actual value of an NA is unknown, it is ambiguous to convert NA to a boolean value.

In [45]: bool(pd.NA)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[45], line 1
----> 1 bool(pd.NA)

File missing.pyx:392, in pandas._libs.missing.NAType.__bool__()

TypeError: boolean value of NA is ambiguous

This also means that NA cannot be used in a context where it is evaluated to a boolean, such as if condition: ... where condition can potentially be NA. In such cases, isna() can be used to check for NA, or condition being NA can be avoided, for example by filling missing values beforehand.

A similar situation occurs when using Series or DataFrame objects in if statements; see Using if/truth statements with pandas.
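A small sketch (the condition variable is hypothetical) of guarding such a condition with isna() instead of letting NA reach bool():

```python
import pandas as pd

value = pd.NA  # could equally be the result of some computation

# bool(pd.NA) raises TypeError, so test for missingness first:
if pd.isna(value):
    result = "missing"
else:
    result = "present" if value else "absent"

print(result)  # missing
```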

NumPy ufuncs#

pandas.NA implements NumPy's __array_ufunc__ protocol. Most ufuncs work with NA, and generally return NA:

In [46]: np.log(pd.NA)
Out[46]: <NA>

In [47]: np.add(pd.NA, 1)
Out[47]: <NA>

Warning

Currently, ufuncs involving an ndarray and NA will return an object-dtype array filled with NA values.

In [48]: a = np.array([1, 2, 3])

In [49]: np.greater(a, pd.NA)
Out[49]: array([<NA>, <NA>, <NA>], dtype=object)

The return type here may change to return a different array type in the future.

See DataFrame interoperability with NumPy functions for more on ufuncs.

Conversion#

If you have a DataFrame or Series using np.nan, Series.convert_dtypes() and DataFrame.convert_dtypes() can convert the data to use data types that use NA, such as Int64Dtype or ArrowDtype. This is especially helpful after reading in data sets from IO methods where data types were inferred.

In this example, the dtypes of both columns are converted to NA-aware types.

In [50]: import io

In [51]: data = io.StringIO("a,b\n,True\n2,")

In [52]: df = pd.read_csv(data)

In [53]: df.dtypes
Out[53]:
a    float64
b     object
dtype: object

In [54]: df_conv = df.convert_dtypes()

In [55]: df_conv
Out[55]:
      a     b
0  <NA>  True
1     2  <NA>

In [56]: df_conv.dtypes
Out[56]:
a      Int64
b    boolean
dtype: object

Inserting missing data#

You can insert missing values by simply assigning to a Series or DataFrame. The missing value sentinel used will be chosen based on the dtype.

In [57]: ser = pd.Series([1.0, 2.0, 3.0])

In [58]: ser.loc[0] = None

In [59]: ser
Out[59]:
0    NaN
1    2.0
2    3.0
dtype: float64

In [60]: ser = pd.Series([pd.Timestamp("2021"), pd.Timestamp("2021")])

In [61]: ser.iloc[0] = np.nan

In [62]: ser
Out[62]:
0          NaT
1   2021-01-01
dtype: datetime64[ns]

In [63]: ser = pd.Series([True, False], dtype="boolean[pyarrow]")

In [64]: ser.iloc[0] = None

In [65]: ser
Out[65]:
0     <NA>
1    False
dtype: bool[pyarrow]

For object types, pandas will use the value given:

In [66]: s = pd.Series(["a", "b", "c"], dtype=object)

In [67]: s.loc[0] = None

In [68]: s.loc[1] = np.nan

In [69]: s
Out[69]:
0    None
1     NaN
2       c
dtype: object

Calculations with missing data#

Missing values propagate through arithmetic operations between pandas objects.

In [70]: ser1 = pd.Series([np.nan, np.nan, 2, 3])

In [71]: ser2 = pd.Series([np.nan, 1, np.nan, 4])

In [72]: ser1
Out[72]:
0    NaN
1    NaN
2    2.0
3    3.0
dtype: float64

In [73]: ser2
Out[73]:
0    NaN
1    1.0
2    NaN
3    4.0
dtype: float64

In [74]: ser1 + ser2
Out[74]:
0    NaN
1    NaN
2    NaN
3    7.0
dtype: float64

The descriptive statistics and computational methods discussed in the data structure overview (and listed here and here) all account for missing data.

When summing data, NA values or empty data will be treated as zero.

In [75]: pd.Series([np.nan]).sum()
Out[75]: 0.0

In [76]: pd.Series([], dtype="float64").sum()
Out[76]: 0.0

When taking the product, NA values or empty data will be treated as 1.

In [77]: pd.Series([np.nan]).prod()
Out[77]: 1.0

In [78]: pd.Series([], dtype="float64").prod()
Out[78]: 1.0
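If those defaults of 0 and 1 for all-NA or empty data are not wanted, both sum() and prod() accept a min_count parameter: with fewer than min_count valid values, the result is NA instead (a small sketch, values chosen for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan])

print(s.sum())              # 0.0 -- all-NA sum defaults to zero
print(s.sum(min_count=1))   # nan -- at least one valid value required
print(s.prod(min_count=1))  # nan
```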

Cumulative methods like cumsum() and cumprod() ignore NA values by default, but preserve them in the resulting array. To override this behaviour and include NA values in the calculation, use skipna=False.

In [79]: ser = pd.Series([1, np.nan, 3, np.nan])

In [80]: ser
Out[80]:
0    1.0
1    NaN
2    3.0
3    NaN
dtype: float64

In [81]: ser.cumsum()
Out[81]:
0    1.0
1    NaN
2    4.0
3    NaN
dtype: float64

In [82]: ser.cumsum(skipna=False)
Out[82]:
0    1.0
1    NaN
2    NaN
3    NaN
dtype: float64

Dropping missing data#

dropna() drops rows or columns with missing data.

In [83]: df = pd.DataFrame([[np.nan, 1, 2], [1, 2, np.nan], [1, 2, 3]])

In [84]: df
Out[84]:
     0  1    2
0  NaN  1  2.0
1  1.0  2  NaN
2  1.0  2  3.0

In [85]: df.dropna()
Out[85]:
     0  1    2
2  1.0  2  3.0

In [86]: df.dropna(axis=1)
Out[86]:
   1
0  1
1  2
2  2

In [87]: ser = pd.Series([1, pd.NA], dtype="int64[pyarrow]")

In [88]: ser.dropna()
Out[88]:
0    1
dtype: int64[pyarrow]

Filling missing data#

Filling by value#

fillna() replaces NA values with non-NA data.

Replace NA with a scalar value

In [89]: data = {"np": [1.0, np.nan, np.nan, 2], "arrow": pd.array([1.0, pd.NA, pd.NA, 2], dtype="float64[pyarrow]")}

In [90]: df = pd.DataFrame(data)

In [91]: df
Out[91]:
    np  arrow
0  1.0    1.0
1  NaN   <NA>
2  NaN   <NA>
3  2.0    2.0

In [92]: df.fillna(0)
Out[92]:
    np  arrow
0  1.0    1.0
1  0.0    0.0
2  0.0    0.0
3  2.0    2.0

Fill gaps forward or backward

In [93]: df.ffill()
Out[93]:
    np  arrow
0  1.0    1.0
1  1.0    1.0
2  1.0    1.0
3  2.0    2.0

In [94]: df.bfill()
Out[94]:
    np  arrow
0  1.0    1.0
1  2.0    2.0
2  2.0    2.0
3  2.0    2.0

Limit the number of NA values filled

In [95]: df.ffill(limit=1)
Out[95]:
    np  arrow
0  1.0    1.0
1  1.0    1.0
2  NaN   <NA>
3  2.0    2.0

NA values can be replaced with a corresponding value from a Series or DataFrame where the index and column align between the original object and the filled object.

In [96]: dff = pd.DataFrame(np.arange(30, dtype=np.float64).reshape(10, 3), columns=list("ABC"))

In [97]: dff.iloc[3:5, 0] = np.nan

In [98]: dff.iloc[4:6, 1] = np.nan

In [99]: dff.iloc[5:8, 2] = np.nan

In [100]: dff
Out[100]:
      A     B     C
0   0.0   1.0   2.0
1   3.0   4.0   5.0
2   6.0   7.0   8.0
3   NaN  10.0  11.0
4   NaN   NaN  14.0
5  15.0   NaN   NaN
6  18.0  19.0   NaN
7  21.0  22.0   NaN
8  24.0  25.0  26.0
9  27.0  28.0  29.0

In [101]: dff.fillna(dff.mean())
Out[101]:
       A     B          C
0   0.00   1.0   2.000000
1   3.00   4.0   5.000000
2   6.00   7.0   8.000000
3  14.25  10.0  11.000000
4  14.25  14.5  14.000000
5  15.00  14.5  13.571429
6  18.00  19.0  13.571429
7  21.00  22.0  13.571429
8  24.00  25.0  26.000000
9  27.00  28.0  29.000000

Note

DataFrame.where() can also be used to fill NA values, with the same result as above.

In [102]: dff.where(pd.notna(dff), dff.mean(), axis="columns")
Out[102]:
       A     B          C
0   0.00   1.0   2.000000
1   3.00   4.0   5.000000
2   6.00   7.0   8.000000
3  14.25  10.0  11.000000
4  14.25  14.5  14.000000
5  15.00  14.5  13.571429
6  18.00  19.0  13.571429
7  21.00  22.0  13.571429
8  24.00  25.0  26.000000
9  27.00  28.0  29.000000

Interpolation#

DataFrame.interpolate() and Series.interpolate() fill NA values using various interpolation methods.

In [103]: df = pd.DataFrame(
   .....:     {
   .....:         "A": [1, 2.1, np.nan, 4.7, 5.6, 6.8],
   .....:         "B": [0.25, np.nan, np.nan, 4, 12.2, 14.4],
   .....:     }
   .....: )
   .....:

In [104]: df
Out[104]:
     A      B
0  1.0   0.25
1  2.1    NaN
2  NaN    NaN
3  4.7   4.00
4  5.6  12.20
5  6.8  14.40

In [105]: df.interpolate()
Out[105]:
     A      B
0  1.0   0.25
1  2.1   1.50
2  3.4   2.75
3  4.7   4.00
4  5.6  12.20
5  6.8  14.40

In [106]: idx = pd.date_range("2020-01-01", periods=10, freq="D")

In [107]: data = np.random.default_rng(2).integers(0, 10, 10).astype(np.float64)

In [108]: ts = pd.Series(data, index=idx)

In [109]: ts.iloc[[1, 2, 5, 6, 9]] = np.nan

In [110]: ts
Out[110]:
2020-01-01    8.0
2020-01-02    NaN
2020-01-03    NaN
2020-01-04    2.0
2020-01-05    4.0
2020-01-06    NaN
2020-01-07    NaN
2020-01-08    0.0
2020-01-09    3.0
2020-01-10    NaN
Freq: D, dtype: float64

In [111]: ts.plot()
Out[111]: <Axes: >
../_images/series_before_interpolate.png
In [112]: ts.interpolate()
Out[112]:
2020-01-01    8.000000
2020-01-02    6.000000
2020-01-03    4.000000
2020-01-04    2.000000
2020-01-05    4.000000
2020-01-06    2.666667
2020-01-07    1.333333
2020-01-08    0.000000
2020-01-09    3.000000
2020-01-10    3.000000
Freq: D, dtype: float64

In [113]: ts.interpolate().plot()
Out[113]: <Axes: >
../_images/series_interpolate.png

Interpolation relative to a Timestamp in the DatetimeIndex is available by setting method="time":

In [114]: ts2 = ts.iloc[[0, 1, 3, 7, 9]]

In [115]: ts2
Out[115]:
2020-01-01    8.0
2020-01-02    NaN
2020-01-04    2.0
2020-01-08    0.0
2020-01-10    NaN
dtype: float64

In [116]: ts2.interpolate()
Out[116]:
2020-01-01    8.0
2020-01-02    5.0
2020-01-04    2.0
2020-01-08    0.0
2020-01-10    0.0
dtype: float64

In [117]: ts2.interpolate(method="time")
Out[117]:
2020-01-01    8.0
2020-01-02    6.0
2020-01-04    2.0
2020-01-08    0.0
2020-01-10    0.0
dtype: float64

For a floating-point index, usemethod='values':

In [118]: idx = [0.0, 1.0, 10.0]

In [119]: ser = pd.Series([0.0, np.nan, 10.0], idx)

In [120]: ser
Out[120]:
0.0      0.0
1.0      NaN
10.0    10.0
dtype: float64

In [121]: ser.interpolate()
Out[121]:
0.0      0.0
1.0      5.0
10.0    10.0
dtype: float64

In [122]: ser.interpolate(method="values")
Out[122]:
0.0      0.0
1.0      1.0
10.0    10.0
dtype: float64

If you have scipy installed, you can pass the name of a 1-d interpolation routine to method, as specified in the scipy interpolation documentation and reference guide. The appropriate interpolation method will depend on the data type.

Tip

If you are dealing with a time series that is growing at an increasing rate, use method='barycentric'.

If you have values approximating a cumulative distribution function, use method='pchip'.

To fill missing values with the goal of smooth plotting, use method='akima'.

In [123]: df = pd.DataFrame(
   .....:     {
   .....:         "A": [1, 2.1, np.nan, 4.7, 5.6, 6.8],
   .....:         "B": [0.25, np.nan, np.nan, 4, 12.2, 14.4],
   .....:     }
   .....: )
   .....:

In [124]: df
Out[124]:
     A      B
0  1.0   0.25
1  2.1    NaN
2  NaN    NaN
3  4.7   4.00
4  5.6  12.20
5  6.8  14.40

In [125]: df.interpolate(method="barycentric")
Out[125]:
      A       B
0  1.00   0.250
1  2.10  -7.660
2  3.53  -4.515
3  4.70   4.000
4  5.60  12.200
5  6.80  14.400

In [126]: df.interpolate(method="pchip")
Out[126]:
         A          B
0  1.00000   0.250000
1  2.10000   0.672808
2  3.43454   1.928950
3  4.70000   4.000000
4  5.60000  12.200000
5  6.80000  14.400000

In [127]: df.interpolate(method="akima")
Out[127]:
          A          B
0  1.000000   0.250000
1  2.100000  -0.873316
2  3.406667   0.320034
3  4.700000   4.000000
4  5.600000  12.200000
5  6.800000  14.400000

When interpolating via a polynomial or spline approximation, you must also specify the degree or order of the approximation:

In [128]: df.interpolate(method="spline", order=2)
Out[128]:
          A          B
0  1.000000   0.250000
1  2.100000  -0.428598
2  3.404545   1.206900
3  4.700000   4.000000
4  5.600000  12.200000
5  6.800000  14.400000

In [129]: df.interpolate(method="polynomial", order=2)
Out[129]:
          A          B
0  1.000000   0.250000
1  2.100000  -2.703846
2  3.451351  -1.453846
3  4.700000   4.000000
4  5.600000  12.200000
5  6.800000  14.400000

Comparing several methods.

In [130]: np.random.seed(2)

In [131]: ser = pd.Series(np.arange(1, 10.1, 0.25) ** 2 + np.random.randn(37))

In [132]: missing = np.array([4, 13, 14, 15, 16, 17, 18, 20, 29])

In [133]: ser.iloc[missing] = np.nan

In [134]: methods = ["linear", "quadratic", "cubic"]

In [135]: df = pd.DataFrame({m: ser.interpolate(method=m) for m in methods})

In [136]: df.plot()
Out[136]: <Axes: >
../_images/compare_interpolations.png

Interpolating new observations from expanding data with Series.reindex().

In [137]: ser = pd.Series(np.sort(np.random.uniform(size=100)))

# interpolate at new_index
In [138]: new_index = ser.index.union(pd.Index([49.25, 49.5, 49.75, 50.25, 50.5, 50.75]))

In [139]: interp_s = ser.reindex(new_index).interpolate(method="pchip")

In [140]: interp_s.loc[49:51]
Out[140]:
49.00    0.471410
49.25    0.476841
49.50    0.481780
49.75    0.485998
50.00    0.489266
50.25    0.491814
50.50    0.493995
50.75    0.495763
51.00    0.497074
dtype: float64

Interpolation limits#

interpolate() accepts a limit keyword argument to limit the number of consecutive NaN values filled since the last valid observation:

In [141]: ser = pd.Series([np.nan, np.nan, 5, np.nan, np.nan, np.nan, 13, np.nan, np.nan])

In [142]: ser
Out[142]:
0     NaN
1     NaN
2     5.0
3     NaN
4     NaN
5     NaN
6    13.0
7     NaN
8     NaN
dtype: float64

In [143]: ser.interpolate()
Out[143]:
0     NaN
1     NaN
2     5.0
3     7.0
4     9.0
5    11.0
6    13.0
7    13.0
8    13.0
dtype: float64

In [144]: ser.interpolate(limit=1)
Out[144]:
0     NaN
1     NaN
2     5.0
3     7.0
4     NaN
5     NaN
6    13.0
7    13.0
8     NaN
dtype: float64

By default, NaN values are filled in a forward direction. Use the limit_direction parameter to fill backward or from both directions.

In [145]: ser.interpolate(limit=1, limit_direction="backward")
Out[145]:
0     NaN
1     5.0
2     5.0
3     NaN
4     NaN
5    11.0
6    13.0
7     NaN
8     NaN
dtype: float64

In [146]: ser.interpolate(limit=1, limit_direction="both")
Out[146]:
0     NaN
1     5.0
2     5.0
3     7.0
4     NaN
5    11.0
6    13.0
7    13.0
8     NaN
dtype: float64

In [147]: ser.interpolate(limit_direction="both")
Out[147]:
0     5.0
1     5.0
2     5.0
3     7.0
4     9.0
5    11.0
6    13.0
7    13.0
8    13.0
dtype: float64

By default, NaN values are filled whether they are surrounded by existing valid values or outside existing valid values. The limit_area parameter restricts filling to either inside or outside values.

# fill one consecutive inside value in both directions
In [148]: ser.interpolate(limit_direction="both", limit_area="inside", limit=1)
Out[148]:
0     NaN
1     NaN
2     5.0
3     7.0
4     NaN
5    11.0
6    13.0
7     NaN
8     NaN
dtype: float64

# fill all consecutive outside values backward
In [149]: ser.interpolate(limit_direction="backward", limit_area="outside")
Out[149]:
0     5.0
1     5.0
2     5.0
3     NaN
4     NaN
5     NaN
6    13.0
7     NaN
8     NaN
dtype: float64

# fill all consecutive outside values in both directions
In [150]: ser.interpolate(limit_direction="both", limit_area="outside")
Out[150]:
0     5.0
1     5.0
2     5.0
3     NaN
4     NaN
5     NaN
6    13.0
7    13.0
8    13.0
dtype: float64

Replacing values#

Series.replace() and DataFrame.replace() can be used similarly to Series.fillna() and DataFrame.fillna() to replace or insert missing values.

In [151]: df = pd.DataFrame(np.eye(3))

In [152]: df
Out[152]:
     0    1    2
0  1.0  0.0  0.0
1  0.0  1.0  0.0
2  0.0  0.0  1.0

In [153]: df_missing = df.replace(0, np.nan)

In [154]: df_missing
Out[154]:
     0    1    2
0  1.0  NaN  NaN
1  NaN  1.0  NaN
2  NaN  NaN  1.0

In [155]: df_filled = df_missing.replace(np.nan, 2)

In [156]: df_filled
Out[156]:
     0    1    2
0  1.0  2.0  2.0
1  2.0  1.0  2.0
2  2.0  2.0  1.0

Replacing more than one value is possible by passing a list.

In [157]: df_filled.replace([1, 44], [2, 28])
Out[157]:
     0    1    2
0  2.0  2.0  2.0
1  2.0  2.0  2.0
2  2.0  2.0  2.0

Replacing using a mapping dict.

In [158]: df_filled.replace({1: 44, 2: 28})
Out[158]:
      0     1     2
0  44.0  28.0  28.0
1  28.0  44.0  28.0
2  28.0  28.0  44.0

Regular expression replacement#

Note

Python strings prefixed with the r character such as r'hello world' are "raw" strings. They have different semantics regarding backslashes than strings without this prefix. Backslashes in raw strings will be interpreted as an escaped backslash, e.g., r'\' == '\\'.

Replace the '.' with NaN:

In [159]: d = {"a": list(range(4)), "b": list("ab.."), "c": ["a", "b", np.nan, "d"]}

In [160]: df = pd.DataFrame(d)

In [161]: df.replace(".", np.nan)
Out[161]:
   a    b    c
0  0    a    a
1  1    b    b
2  2  NaN  NaN
3  3  NaN    d

Replace the '.' with NaN using a regular expression that removes surrounding whitespace:

In [162]: df.replace(r"\s*\.\s*", np.nan, regex=True)
Out[162]:
   a    b    c
0  0    a    a
1  1    b    b
2  2  NaN  NaN
3  3  NaN    d

Replace with a list of regexes.

In [163]: df.replace([r"\.", r"(a)"], ["dot", r"\1stuff"], regex=True)
Out[163]:
   a       b       c
0  0  astuff  astuff
1  1       b       b
2  2     dot     NaN
3  3     dot       d

Replace with a regex in a mapping dict.

In [164]: df.replace({"b": r"\s*\.\s*"}, {"b": np.nan}, regex=True)
Out[164]:
   a    b    c
0  0    a    a
1  1    b    b
2  2  NaN  NaN
3  3  NaN    d

Pass nested dictionaries of regular expressions that use the regex keyword.

In [165]: df.replace({"b": {"b": r""}}, regex=True)
Out[165]:
   a  b    c
0  0  a    a
1  1       b
2  2  .  NaN
3  3  .    d

In [166]: df.replace(regex={"b": {r"\s*\.\s*": np.nan}})
Out[166]:
   a    b    c
0  0    a    a
1  1    b    b
2  2  NaN  NaN
3  3  NaN    d

In [167]: df.replace({"b": r"\s*(\.)\s*"}, {"b": r"\1ty"}, regex=True)
Out[167]:
   a    b    c
0  0    a    a
1  1    b    b
2  2  .ty  NaN
3  3  .ty    d

Pass a list of regular expressions that will replace matches with a scalar.

In [168]: df.replace([r"\s*\.\s*", r"a|b"], "placeholder", regex=True)
Out[168]:
   a            b            c
0  0  placeholder  placeholder
1  1  placeholder  placeholder
2  2  placeholder          NaN
3  3  placeholder            d

All of the regular expression examples can also be passed with the to_replace argument as the regex argument. In this case the value argument must be passed explicitly by name, or regex must be a nested dictionary.

In [169]: df.replace(regex=[r"\s*\.\s*", r"a|b"], value="placeholder")
Out[169]:
   a            b            c
0  0  placeholder  placeholder
1  1  placeholder  placeholder
2  2  placeholder          NaN
3  3  placeholder            d

Note

A regular expression object from re.compile is a valid input as well.
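A short sketch (the frame here is hypothetical) passing a pre-compiled pattern through the regex keyword:

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame({"b": ["a", "b", " . ", "."]})
pattern = re.compile(r"\s*\.\s*")  # the same whitespace-padded dot used above

# The compiled object can be used wherever a regex string is accepted:
out = df.replace(regex=pattern, value=np.nan)
print(out)
```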

