
Working with text data#

Text data types#

There are two ways to store text data in pandas:

  1. object-dtype NumPy array.

  2. StringDtype extension type.

We recommend using StringDtype to store text data.

Prior to pandas 1.0, object dtype was the only option. This was unfortunate for many reasons:

  1. You can accidentally store a mixture of strings and non-strings in an object dtype array. It’s better to have a dedicated dtype.

  2. object dtype breaks dtype-specific operations like DataFrame.select_dtypes(). There isn’t a clear way to select just text while excluding non-text but still object-dtype columns.

  3. When reading code, the contents of an object dtype array are less clear than 'string'.
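As a rough illustration of the first two points, the sketch below builds a small frame and picks out the dedicated string column by inspecting each column's dtype. The column names mixed and text are made up for this example, and the dtype check is only one possible way to do the selection, not an idiom prescribed by this guide.

```python
import pandas as pd

# Hypothetical frame: one object column holding mixed values, one string column.
df = pd.DataFrame(
    {
        "mixed": pd.Series(["a", 1, None], dtype="object"),
        "text": pd.Series(["x", "y", None], dtype="string"),
    }
)

# An object column silently accepts non-strings alongside strings.
kinds = [type(v).__name__ for v in df["mixed"]]

# With a dedicated dtype, "just the text columns" is easy to express.
text_columns = [
    name for name, dtype in df.dtypes.items() if isinstance(dtype, pd.StringDtype)
]

print(kinds)
print(text_columns)
```

The object column cannot be told apart from a genuine text column by dtype alone, which is the problem the second point describes.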

Currently, the performance of object dtype arrays of strings and arrays.StringArray are about the same. We expect future enhancements to significantly increase the performance and lower the memory overhead of StringArray.

Warning

StringArray is currently considered experimental. The implementation and parts of the API may change without warning.

For backwards compatibility, object dtype remains the default type inferred for a list of strings:

In [1]: pd.Series(["a", "b", "c"])
Out[1]:
0    a
1    b
2    c
dtype: object

To explicitly request string dtype, specify the dtype:

In [2]: pd.Series(["a", "b", "c"], dtype="string")
Out[2]:
0    a
1    b
2    c
dtype: string

In [3]: pd.Series(["a", "b", "c"], dtype=pd.StringDtype())
Out[3]:
0    a
1    b
2    c
dtype: string

Or use astype after the Series or DataFrame is created:

In [4]: s = pd.Series(["a", "b", "c"])

In [5]: s
Out[5]:
0    a
1    b
2    c
dtype: object

In [6]: s.astype("string")
Out[6]:
0    a
1    b
2    c
dtype: string

You can also use StringDtype/"string" as the dtype on non-string data and it will be converted to string dtype:

In [7]: s = pd.Series(["a", 2, np.nan], dtype="string")

In [8]: s
Out[8]:
0       a
1       2
2    <NA>
dtype: string

In [9]: type(s[1])
Out[9]: str

or convert from existing pandas data:

In [10]: s1 = pd.Series([1, 2, np.nan], dtype="Int64")

In [11]: s1
Out[11]:
0       1
1       2
2    <NA>
dtype: Int64

In [12]: s2 = s1.astype("string")

In [13]: s2
Out[13]:
0       1
1       2
2    <NA>
dtype: string

In [14]: type(s2[0])
Out[14]: str

Behavior differences#

These are places where the behavior of StringDtype objects differs from object dtype:

  1. For StringDtype, string accessor methods that return numeric output will always return a nullable integer dtype, rather than either int or float dtype, depending on the presence of NA values. Methods returning boolean output will return a nullable boolean dtype.

    In [15]: s = pd.Series(["a", None, "b"], dtype="string")

    In [16]: s
    Out[16]:
    0       a
    1    <NA>
    2       b
    dtype: string

    In [17]: s.str.count("a")
    Out[17]:
    0       1
    1    <NA>
    2       0
    dtype: Int64

    In [18]: s.dropna().str.count("a")
    Out[18]:
    0    1
    2    0
    dtype: Int64

    Both outputs are Int64 dtype. Compare that with object-dtype:

    In [19]: s2 = pd.Series(["a", None, "b"], dtype="object")

    In [20]: s2.str.count("a")
    Out[20]:
    0    1.0
    1    NaN
    2    0.0
    dtype: float64

    In [21]: s2.dropna().str.count("a")
    Out[21]:
    0    1
    2    0
    dtype: int64

    When NA values are present, the output dtype is float64. The same applies to methods returning boolean values:

    In [22]: s.str.isdigit()
    Out[22]:
    0    False
    1     <NA>
    2    False
    dtype: boolean

    In [23]: s.str.match("a")
    Out[23]:
    0     True
    1     <NA>
    2    False
    dtype: boolean

  2. Some string methods, like Series.str.decode(), are not available on StringArray because StringArray only holds strings, not bytes.

  3. In comparison operations, arrays.StringArray and Series backed by a StringArray will return an object with BooleanDtype, rather than a bool dtype object. Missing values in a StringArray will propagate in comparison operations, rather than always comparing unequal like numpy.nan.
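The comparison behavior described in the last item has no example in the list above, so here is a minimal sketch of it. It compares the same values stored as string dtype and as object dtype; the variable names are illustrative only.

```python
import pandas as pd

values = ["a", None, "b"]

# Same comparison, two storage dtypes.
string_result = pd.Series(values, dtype="string") == "a"
object_result = pd.Series(values, dtype="object") == "a"

print(string_result.dtype)            # nullable boolean dtype
print(string_result.isna().tolist())  # the missing value propagates
print(object_result.dtype)            # plain NumPy bool
print(object_result.tolist())         # the missing value compares unequal
```

If a plain bool result is needed from the string-dtype comparison, the missing entries have to be filled explicitly, for example with fillna(False).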

Everything else that follows in the rest of this document applies equally to string and object dtype.

String methods#

Series and Index are equipped with a set of string processing methods that make it easy to operate on each element of the array. Perhaps most importantly, these methods exclude missing/NA values automatically. These are accessed via the str attribute and generally have names matching the equivalent (scalar) built-in string methods:

In [24]: s = pd.Series(
   ....:     ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="string"
   ....: )
   ....:

In [25]: s.str.lower()
Out[25]:
0       a
1       b
2       c
3    aaba
4    baca
5    <NA>
6    caba
7     dog
8     cat
dtype: string

In [26]: s.str.upper()
Out[26]:
0       A
1       B
2       C
3    AABA
4    BACA
5    <NA>
6    CABA
7     DOG
8     CAT
dtype: string

In [27]: s.str.len()
Out[27]:
0       1
1       1
2       1
3       4
4       4
5    <NA>
6       4
7       3
8       3
dtype: Int64

In [28]: idx = pd.Index([" jack", "jill ", " jesse ", "frank"])

In [29]: idx.str.strip()
Out[29]: Index(['jack', 'jill', 'jesse', 'frank'], dtype='object')

In [30]: idx.str.lstrip()
Out[30]: Index(['jack', 'jill ', 'jesse ', 'frank'], dtype='object')

In [31]: idx.str.rstrip()
Out[31]: Index([' jack', 'jill', ' jesse', 'frank'], dtype='object')

The string methods on Index are especially useful for cleaning up or transforming DataFrame columns. For instance, you may have columns with leading or trailing whitespace:

In [32]: df = pd.DataFrame(
   ....:     np.random.randn(3, 2), columns=[" Column A ", " Column B "], index=range(3)
   ....: )
   ....:

In [33]: df
Out[33]:
   Column A   Column B
0   0.469112  -0.282863
1  -1.509059  -1.135632
2   1.212112  -0.173215

Since df.columns is an Index object, we can use the .str accessor:

In [34]: df.columns.str.strip()
Out[34]: Index(['Column A', 'Column B'], dtype='object')

In [35]: df.columns.str.lower()
Out[35]: Index([' column a ', ' column b '], dtype='object')

These string methods can then be used to clean up the columns as needed. Here we are removing leading and trailing whitespace, lowercasing all names, and replacing any remaining whitespace with underscores:

In [36]: df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

In [37]: df
Out[37]:
   column_a  column_b
0  0.469112 -0.282863
1 -1.509059 -1.135632
2  1.212112 -0.173215

Note

If you have a Series where lots of elements are repeated (i.e. the number of unique elements in the Series is a lot smaller than the length of the Series), it can be faster to convert the original Series to one of type category and then use .str.<method> or .dt.<property> on that. The performance difference comes from the fact that, for Series of type category, the string operations are done on the .categories and not on each element of the Series.

Please note that a Series of type category with string .categories has some limitations in comparison to Series of type string (e.g. you can’t add strings to each other: s + " " + s won’t work if s is a Series of type category). Also, .str methods which operate on elements of type list are not available on such a Series.
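A small sketch of the idea in this note, using made-up fruit names: the Series has thousands of rows but only three distinct strings, so after conversion to category the string method only needs to process the three categories. No timing is shown here; any speedup depends on the data and the pandas version.

```python
import pandas as pd

# Many rows, few distinct values (illustrative data).
s = pd.Series(["apple", "banana", "apple", "cherry"] * 1000)
cat = s.astype("category")

upper_from_str = s.str.upper()    # works element by element
upper_from_cat = cat.str.upper()  # works on the categories, then maps back

print(len(s), len(cat.cat.categories))
print(upper_from_cat.head(3).tolist())
```

Both calls give the same values, so the conversion is purely an optimization choice.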

Warning

The .str accessor infers the type of the Series' contents and checks it against the allowed types (i.e. strings).

Generally speaking, the .str accessor is intended to work only on strings. With very few exceptions, other uses are not supported, and may be disabled at a later point.
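To make the warning concrete, the sketch below tries the accessor on an integer Series, which is rejected, and then converts to string dtype first. The variable names are illustrative, and the exact wording of the error message may differ between pandas versions.

```python
import pandas as pd

numbers = pd.Series([1, 2, 3])

# The accessor refuses non-string data.
try:
    numbers.str.upper()
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# Converting to string dtype first makes the accessor available.
as_text = numbers.astype("string").str.zfill(3)

print(error_message)
print(as_text.tolist())
```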

Splitting and replacing strings#

Methods like split return a Series of lists:

In [38]: s2 = pd.Series(["a_b_c", "c_d_e", np.nan, "f_g_h"], dtype="string")

In [39]: s2.str.split("_")
Out[39]:
0    [a, b, c]
1    [c, d, e]
2         <NA>
3    [f, g, h]
dtype: object

Elements in the split lists can be accessed using get or [] notation:

In [40]: s2.str.split("_").str.get(1)
Out[40]:
0       b
1       d
2    <NA>
3       g
dtype: object

In [41]: s2.str.split("_").str[1]
Out[41]:
0       b
1       d
2    <NA>
3       g
dtype: object

It is easy to expand this to return a DataFrame using expand.

In [42]: s2.str.split("_", expand=True)
Out[42]:
      0     1     2
0     a     b     c
1     c     d     e
2  <NA>  <NA>  <NA>
3     f     g     h

When the original Series has StringDtype, the output columns will all be StringDtype as well.
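The dtype claim above can be checked directly. This is a minimal sketch that reuses the s2 values from the example and inspects the dtypes of the expanded frame; the helper name all_string is made up here.

```python
import numpy as np
import pandas as pd

s2 = pd.Series(["a_b_c", "c_d_e", np.nan, "f_g_h"], dtype="string")
parts = s2.str.split("_", expand=True)

# Every output column should carry a StringDtype, like the input.
all_string = all(isinstance(dtype, pd.StringDtype) for dtype in parts.dtypes)

print(parts.dtypes)
print(all_string)
```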

It is also possible to limit the number of splits:

In [43]: s2.str.split("_", expand=True, n=1)
Out[43]:
      0     1
0     a   b_c
1     c   d_e
2  <NA>  <NA>
3     f   g_h

rsplit is similar to split except it works in the reverse direction, i.e., from the end of the string to the beginning of the string:

In [44]: s2.str.rsplit("_", expand=True, n=1)
Out[44]:
      0     1
0   a_b     c
1   c_d     e
2  <NA>  <NA>
3   f_g     h

replace optionally uses regular expressions:

In [45]: s3 = pd.Series(
   ....:     ["A", "B", "C", "Aaba", "Baca", "", np.nan, "CABA", "dog", "cat"],
   ....:     dtype="string",
   ....: )
   ....:

In [46]: s3
Out[46]:
0       A
1       B
2       C
3    Aaba
4    Baca
5
6    <NA>
7    CABA
8     dog
9     cat
dtype: string

In [47]: s3.str.replace("^.a|dog", "XX-XX ", case=False, regex=True)
Out[47]:
0           A
1           B
2           C
3    XX-XX ba
4    XX-XX ca
5
6        <NA>
7    XX-XX BA
8      XX-XX
9     XX-XX t
dtype: string

Changed in version 2.0.

A single-character pattern with regex=True will also be treated as a regular expression:

In [48]: s4 = pd.Series(["a.b", ".", "b", np.nan, ""], dtype="string")

In [49]: s4
Out[49]:
0     a.b
1       .
2       b
3    <NA>
4
dtype: string

In [50]: s4.str.replace(".", "a", regex=True)
Out[50]:
0     aaa
1       a
2       a
3    <NA>
4
dtype: string

If you want literal replacement of a string (equivalent to str.replace()), you can set the optional regex parameter to False, rather than escaping each character. In this case both pat and repl must be strings:

In [51]: dollars = pd.Series(["12", "-$10", "$10,000"], dtype="string")

# These lines are equivalent
In [52]: dollars.str.replace(r"-\$", "-", regex=True)
Out[52]:
0         12
1        -10
2    $10,000
dtype: string

In [53]: dollars.str.replace("-$", "-", regex=False)
Out[53]:
0         12
1        -10
2    $10,000
dtype: string

The replace method can also take a callable as replacement. It is called on every pat using re.sub(). The callable should expect one positional argument (a regex match object) and return a string.

# Reverse every lowercase alphabetic word
In [54]: pat = r"[a-z]+"

In [55]: def repl(m):
   ....:     return m.group(0)[::-1]
   ....:

In [56]: pd.Series(["foo 123", "bar baz", np.nan], dtype="string").str.replace(
   ....:     pat, repl, regex=True
   ....: )
   ....:
Out[56]:
0    oof 123
1    rab zab
2       <NA>
dtype: string

# Using regex groups
In [57]: pat = r"(?P<one>\w+) (?P<two>\w+) (?P<three>\w+)"

In [58]: def repl(m):
   ....:     return m.group("two").swapcase()
   ....:

In [59]: pd.Series(["Foo Bar Baz", np.nan], dtype="string").str.replace(
   ....:     pat, repl, regex=True
   ....: )
   ....:
Out[59]:
0     bAR
1    <NA>
dtype: string

The replace method also accepts a compiled regular expression object from re.compile() as a pattern. All flags should be included in the compiled regular expression object.

In [60]: import re

In [61]: regex_pat = re.compile(r"^.a|dog", flags=re.IGNORECASE)

In [62]: s3.str.replace(regex_pat, "XX-XX ", regex=True)
Out[62]:
0           A
1           B
2           C
3    XX-XX ba
4    XX-XX ca
5
6        <NA>
7    XX-XX BA
8      XX-XX
9     XX-XX t
dtype: string

Including a flags argument when calling replace with a compiled regular expression object will raise a ValueError.

In [63]: s3.str.replace(regex_pat, 'XX-XX ', flags=re.IGNORECASE)
---------------------------------------------------------------------------
ValueError: case and flags cannot be set when pat is a compiled regex

removeprefix and removesuffix have the same effect as str.removeprefix and str.removesuffix, added in Python 3.9 (https://docs.python.org/3/library/stdtypes.html#str.removeprefix):

Added in version 1.4.0.

In [64]: s = pd.Series(["str_foo", "str_bar", "no_prefix"])

In [65]: s.str.removeprefix("str_")
Out[65]:
0          foo
1          bar
2    no_prefix
dtype: object

In [66]: s = pd.Series(["foo_str", "bar_str", "no_suffix"])

In [67]: s.str.removesuffix("_str")
Out[67]:
0          foo
1          bar
2    no_suffix
dtype: object

Concatenation#

There are several ways to concatenate a Series or Index, either with itself or others, all based on cat(), resp. Index.str.cat.

Concatenating a single Series into a string#

The content of a Series (or Index) can be concatenated:

In [68]: s = pd.Series(["a", "b", "c", "d"], dtype="string")

In [69]: s.str.cat(sep=",")
Out[69]: 'a,b,c,d'

If not specified, the keyword sep for the separator defaults to the empty string, sep='':

In [70]: s.str.cat()
Out[70]: 'abcd'

By default, missing values are ignored. Using na_rep, they can be given a representation:

In [71]: t = pd.Series(["a", "b", np.nan, "d"], dtype="string")

In [72]: t.str.cat(sep=",")
Out[72]: 'a,b,d'

In [73]: t.str.cat(sep=",", na_rep="-")
Out[73]: 'a,b,-,d'

Concatenating a Series and something list-like into a Series#

The first argument to cat() can be a list-like object, provided that it matches the length of the calling Series (or Index).

In [74]: s.str.cat(["A", "B", "C", "D"])
Out[74]:
0    aA
1    bB
2    cC
3    dD
dtype: string

Missing values on either side will result in missing values in the result as well, unless na_rep is specified:

In [75]: s.str.cat(t)
Out[75]:
0      aa
1      bb
2    <NA>
3      dd
dtype: string

In [76]: s.str.cat(t, na_rep="-")
Out[76]:
0    aa
1    bb
2    c-
3    dd
dtype: string

Concatenating a Series and something array-like into a Series#

The parameter others can also be two-dimensional. In this case, the number of rows must match the length of the calling Series (or Index).

In [77]: d = pd.concat([t, s], axis=1)

In [78]: s
Out[78]:
0    a
1    b
2    c
3    d
dtype: string

In [79]: d
Out[79]:
      0  1
0     a  a
1     b  b
2  <NA>  c
3     d  d

In [80]: s.str.cat(d, na_rep="-")
Out[80]:
0    aaa
1    bbb
2    c-c
3    ddd
dtype: string

Concatenating a Series and an indexed object into a Series, with alignment#

For concatenation with a Series or DataFrame, it is possible to align the indexes before concatenation by setting the join keyword.

In [81]: u = pd.Series(["b", "d", "a", "c"], index=[1, 3, 0, 2], dtype="string")

In [82]: s
Out[82]:
0    a
1    b
2    c
3    d
dtype: string

In [83]: u
Out[83]:
1    b
3    d
0    a
2    c
dtype: string

In [84]: s.str.cat(u)
Out[84]:
0    aa
1    bb
2    cc
3    dd
dtype: string

In [85]: s.str.cat(u, join="left")
Out[85]:
0    aa
1    bb
2    cc
3    dd
dtype: string

The usual options are available for join (one of 'left', 'outer', 'inner', 'right'). In particular, alignment also means that the different lengths do not need to coincide anymore.

In [86]: v = pd.Series(["z", "a", "b", "d", "e"], index=[-1, 0, 1, 3, 4], dtype="string")

In [87]: s
Out[87]:
0    a
1    b
2    c
3    d
dtype: string

In [88]: v
Out[88]:
-1    z
 0    a
 1    b
 3    d
 4    e
dtype: string

In [89]: s.str.cat(v, join="left", na_rep="-")
Out[89]:
0    aa
1    bb
2    c-
3    dd
dtype: string

In [90]: s.str.cat(v, join="outer", na_rep="-")
Out[90]:
-1    -z
 0    aa
 1    bb
 2    c-
 3    dd
 4    -e
dtype: string

The same alignment can be used when others is a DataFrame:

In [91]: f = d.loc[[3, 2, 1, 0], :]

In [92]: s
Out[92]:
0    a
1    b
2    c
3    d
dtype: string

In [93]: f
Out[93]:
      0  1
3     d  d
2  <NA>  c
1     b  b
0     a  a

In [94]: s.str.cat(f, join="left", na_rep="-")
Out[94]:
0    aaa
1    bbb
2    c-c
3    ddd
dtype: string

Concatenating a Series and many objects into a Series#

Several array-like items (specifically: Series, Index, and 1-dimensional variants of np.ndarray) can be combined in a list-like container (including iterators, dict-views, etc.).

In [95]: s
Out[95]:
0    a
1    b
2    c
3    d
dtype: string

In [96]: u
Out[96]:
1    b
3    d
0    a
2    c
dtype: string

In [97]: s.str.cat([u, u.to_numpy()], join="left")
Out[97]:
0    aab
1    bbd
2    cca
3    ddc
dtype: string

All elements without an index (e.g. np.ndarray) within the passed list-like must match in length to the calling Series (or Index), but Series and Index may have arbitrary length (as long as alignment is not disabled with join=None):

In [98]: v
Out[98]:
-1    z
 0    a
 1    b
 3    d
 4    e
dtype: string

In [99]: s.str.cat([v, u, u.to_numpy()], join="outer", na_rep="-")
Out[99]:
-1    -z--
0     aaab
1     bbbd
2     c-ca
3     dddc
4     -e--
dtype: string

If using join='right' on a list-like of others that contains different indexes, the union of these indexes will be used as the basis for the final concatenation:

In [100]: u.loc[[3]]
Out[100]:
3    d
dtype: string

In [101]: v.loc[[-1, 0]]
Out[101]:
-1    z
 0    a
dtype: string

In [102]: s.str.cat([u.loc[[3]], v.loc[[-1, 0]]], join="right", na_rep="-")
Out[102]:
 3    dd-
-1    --z
 0    a-a
dtype: string

Indexing with .str#

You can use [] notation to directly index by position locations. If you index past the end of the string, the result will be a missing value (NaN for object dtype, <NA> for StringDtype as in the output here).

In [103]: s = pd.Series(
   .....:     ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="string"
   .....: )
   .....:

In [104]: s.str[0]
Out[104]:
0       A
1       B
2       C
3       A
4       B
5    <NA>
6       C
7       d
8       c
dtype: string

In [105]: s.str[1]
Out[105]:
0    <NA>
1    <NA>
2    <NA>
3       a
4       a
5    <NA>
6       A
7       o
8       a
dtype: string
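The [] notation on .str also accepts slices and negative positions, mirroring Python string indexing. The short sketch below uses a three-element Series invented for this illustration.

```python
import pandas as pd

s = pd.Series(["Aaba", "dog", "B"], dtype="string")

first_two = s.str[:2]  # slice, like "Aaba"[:2]
last_char = s.str[-1]  # negative positions count from the end
third = s.str[2]       # "B" has no position 2, so that entry is missing

print(first_two.tolist())
print(last_char.tolist())
print(third.isna().tolist())
```

Note the asymmetry with plain Python: slicing past the end yields a shorter string, while indexing a single position past the end yields a missing value instead of raising IndexError.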

Extracting substrings#

Extract first match in each subject (extract)#

The extract method accepts a regular expression with at least one capture group.

Extracting a regular expression with more than one group returns a DataFrame with one column per group.

In [106]: pd.Series(
   .....:     ["a1", "b2", "c3"],
   .....:     dtype="string",
   .....: ).str.extract(r"([ab])(\d)", expand=False)
   .....:
Out[106]:
      0     1
0     a     1
1     b     2
2  <NA>  <NA>

Elements that do not match return a row filled with missing values. Thus, a Series of messy strings can be “converted” into a like-indexed Series or DataFrame of cleaned-up or more useful strings, without necessitating get() to access tuples or re.match objects. For object-dtype input, the dtype of the result is always object, even if no match is found and the result only contains NaN; with StringDtype input, as in these examples, the result keeps string dtype.

Named groups like

In [107]: pd.Series(["a1", "b2", "c3"], dtype="string").str.extract(
   .....:     r"(?P<letter>[ab])(?P<digit>\d)", expand=False
   .....: )
   .....:
Out[107]:
  letter digit
0      a     1
1      b     2
2   <NA>  <NA>

and optional groups like

In [108]: pd.Series(
   .....:     ["a1", "b2", "3"],
   .....:     dtype="string",
   .....: ).str.extract(r"([ab])?(\d)", expand=False)
   .....:
Out[108]:
      0  1
0     a  1
1     b  2
2  <NA>  3

can also be used. Note that any capture group names in the regular expression will be used for column names; otherwise capture group numbers will be used.

Extracting a regular expression with one group returns a DataFrame with one column if expand=True.

In [109]: pd.Series(["a1", "b2", "c3"], dtype="string").str.extract(r"[ab](\d)", expand=True)
Out[109]:
      0
0     1
1     2
2  <NA>

It returns a Series if expand=False.

In [110]: pd.Series(["a1", "b2", "c3"], dtype="string").str.extract(r"[ab](\d)", expand=False)
Out[110]:
0       1
1       2
2    <NA>
dtype: string

Calling on an Index with a regex with exactly one capture group returns a DataFrame with one column if expand=True.

In [111]: s = pd.Series(["a1", "b2", "c3"], ["A11", "B22", "C33"], dtype="string")

In [112]: s
Out[112]:
A11    a1
B22    b2
C33    c3
dtype: string

In [113]: s.index.str.extract("(?P<letter>[a-zA-Z])", expand=True)
Out[113]:
  letter
0      A
1      B
2      C

It returns an Index if expand=False.

In [114]: s.index.str.extract("(?P<letter>[a-zA-Z])", expand=False)
Out[114]: Index(['A', 'B', 'C'], dtype='object', name='letter')

Calling on an Index with a regex with more than one capture group returns a DataFrame if expand=True.

In [115]: s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=True)
Out[115]:
  letter   1
0      A  11
1      B  22
2      C  33

It raises ValueError if expand=False.

In [116]: s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=False)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[116], line 1
----> 1 s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=False)

File ~/work/pandas/pandas/pandas/core/strings/accessor.py:137, in forbid_nonstring_types.<locals>._forbid_nonstring_types.<locals>.wrapper(self, *args, **kwargs)
    132     msg = (
    133         f"Cannot use .str.{func_name} with values of "
    134         f"inferred dtype '{self._inferred_dtype}'."
    135     )
    136     raise TypeError(msg)
--> 137 return func(self, *args, **kwargs)

File ~/work/pandas/pandas/pandas/core/strings/accessor.py:2743, in StringMethods.extract(self, pat, flags, expand)
   2740     raise ValueError("pattern contains no capture groups")
   2742 if not expand and regex.groups > 1 and isinstance(self._data, ABCIndex):
-> 2743     raise ValueError("only one regex group is supported with Index")
   2745 obj = self._data
   2746 result_dtype = _result_dtype(obj)

ValueError: only one regex group is supported with Index

The table below summarizes the behavior of extract(expand=False) (input subject in first column, number of groups in regex in first row):

          1 group    >1 group
Index     Index      ValueError
Series    Series     DataFrame
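The four cells of this table can be reproduced with a short check. The sketch below records the type of each result in a dictionary; the patterns and the results dictionary are illustrative names, not part of the pandas API.

```python
import pandas as pd

s = pd.Series(["a1", "b2", "c3"], index=["A11", "B22", "C33"], dtype="string")
one_group = r"([a-zA-Z])"
two_groups = r"([a-zA-Z])(\d+)"

# Record the result type for each (subject, number of groups) combination.
results = {
    ("Series", "1 group"): type(s.str.extract(one_group, expand=False)).__name__,
    ("Series", ">1 group"): type(s.str.extract(two_groups, expand=False)).__name__,
    ("Index", "1 group"): type(s.index.str.extract(one_group, expand=False)).__name__,
}

# The remaining combination is not supported and raises.
try:
    s.index.str.extract(two_groups, expand=False)
except ValueError as exc:
    results[("Index", ">1 group")] = f"ValueError: {exc}"

for key, value in results.items():
    print(key, "->", value)
```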

Extract all matches in each subject (extractall)#

Unlike extract (which returns only the first match),

In [117]: s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"], dtype="string")

In [118]: s
Out[118]:
A    a1a2
B      b1
C      c1
dtype: string

In [119]: two_groups = "(?P<letter>[a-z])(?P<digit>[0-9])"

In [120]: s.str.extract(two_groups, expand=True)
Out[120]:
  letter digit
A      a     1
B      b     1
C      c     1

the extractall method returns every match. The result of extractall is always a DataFrame with a MultiIndex on its rows. The last level of the MultiIndex is named match and indicates the order in the subject.

In [121]: s.str.extractall(two_groups)
Out[121]:
        letter digit
  match
A 0          a     1
  1          a     2
B 0          b     1
C 0          c     1

When each subject string in the Series has exactly one match,

In [122]: s = pd.Series(["a3", "b3", "c2"], dtype="string")

In [123]: s
Out[123]:
0    a3
1    b3
2    c2
dtype: string

then extractall(pat).xs(0, level='match') gives the same result as extract(pat).

In [124]: extract_result = s.str.extract(two_groups, expand=True)

In [125]: extract_result
Out[125]:
  letter digit
0      a     3
1      b     3
2      c     2

In [126]: extractall_result = s.str.extractall(two_groups)

In [127]: extractall_result
Out[127]:
        letter digit
  match
0 0          a     3
1 0          b     3
2 0          c     2

In [128]: extractall_result.xs(0, level="match")
Out[128]:
  letter digit
0      a     3
1      b     3
2      c     2

Index also supports .str.extractall. It returns a DataFrame which has the same result as a Series.str.extractall with a default index (starts from 0).

In [129]: pd.Index(["a1a2", "b1", "c1"]).str.extractall(two_groups)
Out[129]:
        letter digit
  match
0 0          a     1
  1          a     2
1 0          b     1
2 0          c     1

In [130]: pd.Series(["a1a2", "b1", "c1"], dtype="string").str.extractall(two_groups)
Out[130]:
        letter digit
  match
0 0          a     1
  1          a     2
1 0          b     1
2 0          c     1

Testing for strings that match or contain a pattern#

You can check whether elements contain a pattern:

In [131]: pattern = r"[0-9][a-z]"

In [132]: pd.Series(
   .....:     ["1", "2", "3a", "3b", "03c", "4dx"],
   .....:     dtype="string",
   .....: ).str.contains(pattern)
   .....:
Out[132]:
0    False
1    False
2     True
3     True
4     True
5     True
dtype: boolean

Or whether elements match a pattern:

In [133]: pd.Series(
   .....:     ["1", "2", "3a", "3b", "03c", "4dx"],
   .....:     dtype="string",
   .....: ).str.match(pattern)
   .....:
Out[133]:
0    False
1    False
2     True
3     True
4    False
5     True
dtype: boolean

In [134]: pd.Series(
   .....:     ["1", "2", "3a", "3b", "03c", "4dx"],
   .....:     dtype="string",
   .....: ).str.fullmatch(pattern)
   .....:
Out[134]:
0    False
1    False
2     True
3     True
4    False
5    False
dtype: boolean

Note

The distinction between match, fullmatch, and contains is strictness: fullmatch tests whether the entire string matches the regular expression; match tests whether there is a match of the regular expression that begins at the first character of the string; and contains tests whether there is a match of the regular expression at any position within the string.

The corresponding functions in the re package for these three match modes are re.fullmatch, re.match, and re.search, respectively.
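The correspondence with the re package can be checked side by side. This sketch applies the three re functions to the same values used in the examples above and compares the results with the .str methods; the dictionary names with_re and with_pandas are made up for the comparison.

```python
import re

import pandas as pd

pattern = r"[0-9][a-z]"
values = ["1", "2", "3a", "3b", "03c", "4dx"]
s = pd.Series(values, dtype="string")

# Plain-Python equivalents, one string at a time.
with_re = {
    "contains": [re.search(pattern, v) is not None for v in values],
    "match": [re.match(pattern, v) is not None for v in values],
    "fullmatch": [re.fullmatch(pattern, v) is not None for v in values],
}

# The vectorized pandas methods.
with_pandas = {
    "contains": s.str.contains(pattern).tolist(),
    "match": s.str.match(pattern).tolist(),
    "fullmatch": s.str.fullmatch(pattern).tolist(),
}

print(with_re == with_pandas)
```

The value "03c" shows the difference between the modes: a match exists inside the string, but not at the first character and not across the whole string.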

Methods like match, fullmatch, contains, startswith, and endswith take an extra na argument so missing values can be considered True or False:

In [135]: s4 = pd.Series(
   .....:     ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="string"
   .....: )
   .....:

In [136]: s4.str.contains("A", na=False)
Out[136]:
0     True
1    False
2    False
3     True
4    False
5    False
6     True
7    False
8    False
dtype: boolean

Creating indicator variables#

You can extract dummy variables from string columns. For example, if they are separated by a '|':

In [137]: s = pd.Series(["a", "a|b", np.nan, "a|c"], dtype="string")

In [138]: s.str.get_dummies(sep="|")
Out[138]:
   a  b  c
0  1  0  0
1  1  1  0
2  0  0  0
3  1  0  1

String Index also supports get_dummies, which returns a MultiIndex.

In [139]: idx = pd.Index(["a", "a|b", np.nan, "a|c"])

In [140]: idx.str.get_dummies(sep="|")
Out[140]:
MultiIndex([(1, 0, 0),
            (1, 1, 0),
            (0, 0, 0),
            (1, 0, 1)],
           names=['a', 'b', 'c'])

See also get_dummies().

Method summary#

Method            Description
cat()             Concatenate strings
split()           Split strings on delimiter
rsplit()          Split strings on delimiter working from the end of the string
get()             Index into each element (retrieve i-th element)
join()            Join strings in each element of the Series with passed separator
get_dummies()     Split strings on the delimiter returning DataFrame of dummy variables
contains()        Return boolean array if each string contains pattern/regex
replace()         Replace occurrences of pattern/regex/string with some other string or the return value of a callable given the occurrence
removeprefix()    Remove prefix from string, i.e. only remove if string starts with prefix
removesuffix()    Remove suffix from string, i.e. only remove if string ends with suffix
repeat()          Duplicate values (s.str.repeat(3) equivalent to x * 3)
pad()             Add whitespace to the sides of strings
center()          Equivalent to str.center
ljust()           Equivalent to str.ljust
rjust()           Equivalent to str.rjust
zfill()           Equivalent to str.zfill
wrap()            Split long strings into lines with length less than a given width
slice()           Slice each string in the Series
slice_replace()   Replace slice in each string with passed value
count()           Count occurrences of pattern
startswith()      Equivalent to str.startswith(pat) for each element
endswith()        Equivalent to str.endswith(pat) for each element
findall()         Compute list of all occurrences of pattern/regex for each string
match()           Call re.match on each element, returning a boolean for each element
extract()         Call re.search on each element, returning DataFrame with one row for each element and one column for each regex capture group
extractall()      Call re.findall on each element, returning DataFrame with one row for each match and one column for each regex capture group
len()             Compute string lengths
strip()           Equivalent to str.strip
rstrip()          Equivalent to str.rstrip
lstrip()          Equivalent to str.lstrip
partition()       Equivalent to str.partition
rpartition()      Equivalent to str.rpartition
lower()           Equivalent to str.lower
casefold()        Equivalent to str.casefold
upper()           Equivalent to str.upper
find()            Equivalent to str.find
rfind()           Equivalent to str.rfind
index()           Equivalent to str.index
rindex()          Equivalent to str.rindex
capitalize()      Equivalent to str.capitalize
swapcase()        Equivalent to str.swapcase
normalize()       Return Unicode normal form. Equivalent to unicodedata.normalize
translate()       Equivalent to str.translate
isalnum()         Equivalent to str.isalnum
isalpha()         Equivalent to str.isalpha
isdigit()         Equivalent to str.isdigit
isspace()         Equivalent to str.isspace
islower()         Equivalent to str.islower
isupper()         Equivalent to str.isupper
istitle()         Equivalent to str.istitle
isnumeric()       Equivalent to str.isnumeric
isdecimal()       Equivalent to str.isdecimal
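Several methods in this summary are not demonstrated elsewhere on this page. The sketch below tries four of them on small made-up inputs; each comment names the plain-Python operation the method is described as being equivalent to.

```python
import pandas as pd

s = pd.Series(["7", "42"], dtype="string")

zero_filled = s.str.zfill(3)                      # like "7".zfill(3)
padded = s.str.pad(4, side="left", fillchar="*")  # pad on the left to width 4
repeated = s.str.repeat(2)                        # like "7" * 2

# partition splits at the first separator and returns a three-column DataFrame.
pieces = pd.Series(["a_b_c"], dtype="string").str.partition("_")

print(zero_filled.tolist())
print(padded.tolist())
print(repeated.tolist())
print(pieces.iloc[0].tolist())
```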

