Working with text data#
Text data types#
There are two ways to store text data in pandas:

1. object-dtype NumPy array.
2. StringDtype extension type.

We recommend using StringDtype to store text data.
Prior to pandas 1.0, object dtype was the only option. This was unfortunate for many reasons:

- You can accidentally store a mixture of strings and non-strings in an object dtype array. It's better to have a dedicated dtype.
- object dtype breaks dtype-specific operations like DataFrame.select_dtypes(). There isn't a clear way to select just text while excluding non-text but still object-dtype columns (see the sketch after this list).
- When reading code, the contents of an object dtype array is less clear than 'string'.
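As a rough sketch of the select_dtypes() point above (the column names here are invented for illustration), a dedicated string dtype lets you select text columns directly, whereas object dtype cannot distinguish text from other Python objects:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "text": pd.array(["a", "b", "c"], dtype="string"),
        "other_objects": pd.Series([["x"], {"y": 1}, None], dtype="object"),
        "numbers": [1, 2, 3],
    }
)

# Selects only the true text column; an object-dtype "text" column would be
# indistinguishable from "other_objects" here.
print(df.select_dtypes(include="string").columns)
```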
Currently, the performance of object dtype arrays of strings and arrays.StringArray are about the same. We expect future enhancements to significantly increase the performance and lower the memory overhead of StringArray.
Warning
StringArray is currently considered experimental. The implementation and parts of the API may change without warning.
For backwards-compatibility, object dtype remains the default type we infer a list of strings to
In [1]: pd.Series(["a", "b", "c"])
Out[1]:
0    a
1    b
2    c
dtype: object
To explicitly request string dtype, specify the dtype
In [2]: pd.Series(["a", "b", "c"], dtype="string")
Out[2]:
0    a
1    b
2    c
dtype: string

In [3]: pd.Series(["a", "b", "c"], dtype=pd.StringDtype())
Out[3]:
0    a
1    b
2    c
dtype: string
Or astype after the Series or DataFrame is created
In [4]: s = pd.Series(["a", "b", "c"])

In [5]: s
Out[5]:
0    a
1    b
2    c
dtype: object

In [6]: s.astype("string")
Out[6]:
0    a
1    b
2    c
dtype: string
You can also use StringDtype / "string" as the dtype on non-string data and it will be converted to string dtype:
In [7]: s = pd.Series(["a", 2, np.nan], dtype="string")

In [8]: s
Out[8]:
0       a
1       2
2    <NA>
dtype: string

In [9]: type(s[1])
Out[9]: str
or convert from existing pandas data:
In [10]: s1 = pd.Series([1, 2, np.nan], dtype="Int64")

In [11]: s1
Out[11]:
0       1
1       2
2    <NA>
dtype: Int64

In [12]: s2 = s1.astype("string")

In [13]: s2
Out[13]:
0       1
1       2
2    <NA>
dtype: string

In [14]: type(s2[0])
Out[14]: str
Behavior differences#
These are places where the behavior of StringDtype objects differs from object dtype.
For StringDtype, string accessor methods that return numeric output will always return a nullable integer dtype, rather than either int or float dtype, depending on the presence of NA values. Methods returning boolean output will return a nullable boolean dtype.

In [15]: s = pd.Series(["a", None, "b"], dtype="string")

In [16]: s
Out[16]:
0       a
1    <NA>
2       b
dtype: string

In [17]: s.str.count("a")
Out[17]:
0       1
1    <NA>
2       0
dtype: Int64

In [18]: s.dropna().str.count("a")
Out[18]:
0    1
2    0
dtype: Int64
Both outputs are Int64 dtype. Compare that with object-dtype:

In [19]: s2 = pd.Series(["a", None, "b"], dtype="object")

In [20]: s2.str.count("a")
Out[20]:
0    1.0
1    NaN
2    0.0
dtype: float64

In [21]: s2.dropna().str.count("a")
Out[21]:
0    1
2    0
dtype: int64
When NA values are present, the output dtype is float64. Similarly for methods returning boolean values.
In [22]: s.str.isdigit()
Out[22]:
0    False
1     <NA>
2    False
dtype: boolean

In [23]: s.str.match("a")
Out[23]:
0     True
1     <NA>
2    False
dtype: boolean
Some string methods, like Series.str.decode(), are not available on StringArray because StringArray only holds strings, not bytes.

In comparison operations, arrays.StringArray and Series backed by a StringArray will return an object with BooleanDtype, rather than a bool dtype object. Missing values in a StringArray will propagate in comparison operations, rather than always comparing unequal like numpy.nan (illustrated below).
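A minimal sketch of the comparison behavior described above (expected outputs shown as comments):

```python
import pandas as pd
import numpy as np

s = pd.Series(["a", None, "b"], dtype="string")

# StringDtype comparisons return a nullable boolean and propagate <NA>.
print(s == "a")
# 0     True
# 1     <NA>
# 2    False
# dtype: boolean

# object dtype with np.nan yields plain bools; the missing value compares unequal.
print(pd.Series(["a", np.nan, "b"], dtype="object") == "a")
# 0     True
# 1    False
# 2    False
# dtype: bool
```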
Everything else that follows in the rest of this document applies equally to string and object dtype.
String methods#
Series and Index are equipped with a set of string processing methods that make it easy to operate on each element of the array. Perhaps most importantly, these methods exclude missing/NA values automatically. These are accessed via the str attribute and generally have names matching the equivalent (scalar) built-in string methods:
In [24]:s=pd.Series( ....:["A","B","C","Aaba","Baca",np.nan,"CABA","dog","cat"],dtype="string" ....:) ....:In [25]:s.str.lower()Out[25]:0 a1 b2 c3 aaba4 baca5 <NA>6 caba7 dog8 catdtype: stringIn [26]:s.str.upper()Out[26]:0 A1 B2 C3 AABA4 BACA5 <NA>6 CABA7 DOG8 CATdtype: stringIn [27]:s.str.len()Out[27]:0 11 12 13 44 45 <NA>6 47 38 3dtype: Int64
In [28]:idx=pd.Index([" jack","jill "," jesse ","frank"])In [29]:idx.str.strip()Out[29]:Index(['jack', 'jill', 'jesse', 'frank'], dtype='object')In [30]:idx.str.lstrip()Out[30]:Index(['jack', 'jill ', 'jesse ', 'frank'], dtype='object')In [31]:idx.str.rstrip()Out[31]:Index([' jack', 'jill', ' jesse', 'frank'], dtype='object')
The string methods on Index are especially useful for cleaning up ortransforming DataFrame columns. For instance, you may have columns withleading or trailing whitespace:
In [32]:df=pd.DataFrame( ....:np.random.randn(3,2),columns=[" Column A "," Column B "],index=range(3) ....:) ....:In [33]:dfOut[33]: Column A Column B0 0.469112 -0.2828631 -1.509059 -1.1356322 1.212112 -0.173215
Since df.columns is an Index object, we can use the .str accessor
In [34]:df.columns.str.strip()Out[34]:Index(['Column A', 'Column B'], dtype='object')In [35]:df.columns.str.lower()Out[35]:Index([' column a ', ' column b '], dtype='object')
These string methods can then be used to clean up the columns as needed. Here we are removing leading and trailing whitespaces, lowercasing all names, and replacing any remaining whitespaces with underscores:
In [36]:df.columns=df.columns.str.strip().str.lower().str.replace(" ","_")In [37]:dfOut[37]: column_a column_b0 0.469112 -0.2828631 -1.509059 -1.1356322 1.212112 -0.173215
Note
If you have a Series where lots of elements are repeated (i.e. the number of unique elements in the Series is a lot smaller than the length of the Series), it can be faster to convert the original Series to one of type category and then use .str.<method> or .dt.<property> on that. The performance difference comes from the fact that, for Series of type category, the string operations are done on the .categories and not on each element of the Series.

Please note that a Series of type category with string .categories has some limitations in comparison to a Series of type string (e.g. you can't add strings to each other: s + " " + s won't work if s is a Series of type category). Also, .str methods which operate on elements of type list are not available on such a Series. A minimal sketch of the conversion follows.
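A minimal sketch of the conversion described in this note (whether it is actually faster depends on the data):

```python
import pandas as pd

# Many repeated values: after converting to category, the string operation
# runs once per unique category rather than once per element.
s = pd.Series(["apple", "banana", "apple", "banana"] * 1000, dtype="string")

as_category = s.astype("category")
result = as_category.str.upper()

print(result.head(3))
```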
Warning
The type of the Series is inferred and the allowed types (i.e. strings) are enforced more rigorously.

Generally speaking, the .str accessor is intended to work only on strings. With very few exceptions, other uses are not supported, and may be disabled at a later point (see the sketch below).
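As a rough illustration of this warning (the exact error message may vary by pandas version), using .str on numeric data is rejected:

```python
import pandas as pd

s = pd.Series([1, 2, 3])

try:
    s.str.upper()  # .str only works on string-like data
except AttributeError as err:
    print(err)  # e.g. "Can only use .str accessor with string values!"
```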
Splitting and replacing strings#
Methods like split return a Series of lists:
In [38]:s2=pd.Series(["a_b_c","c_d_e",np.nan,"f_g_h"],dtype="string")In [39]:s2.str.split("_")Out[39]:0 [a, b, c]1 [c, d, e]2 <NA>3 [f, g, h]dtype: object
Elements in the split lists can be accessed using get or [] notation:
In [40]:s2.str.split("_").str.get(1)Out[40]:0 b1 d2 <NA>3 gdtype: objectIn [41]:s2.str.split("_").str[1]Out[41]:0 b1 d2 <NA>3 gdtype: object
It is easy to expand this to return a DataFrame using expand.
In [42]:s2.str.split("_",expand=True)Out[42]: 0 1 20 a b c1 c d e2 <NA> <NA> <NA>3 f g h
When the original Series has StringDtype, the output columns will all be StringDtype as well.
It is also possible to limit the number of splits:
In [43]:s2.str.split("_",expand=True,n=1)Out[43]: 0 10 a b_c1 c d_e2 <NA> <NA>3 f g_h
rsplit is similar to split except it works in the reverse direction, i.e., from the end of the string to the beginning of the string:
In [44]:s2.str.rsplit("_",expand=True,n=1)Out[44]: 0 10 a_b c1 c_d e2 <NA> <NA>3 f_g h
replace optionally uses regular expressions:
In [45]:s3=pd.Series( ....:["A","B","C","Aaba","Baca","",np.nan,"CABA","dog","cat"], ....:dtype="string", ....:) ....:In [46]:s3Out[46]:0 A1 B2 C3 Aaba4 Baca56 <NA>7 CABA8 dog9 catdtype: stringIn [47]:s3.str.replace("^.a|dog","XX-XX ",case=False,regex=True)Out[47]:0 A1 B2 C3 XX-XX ba4 XX-XX ca56 <NA>7 XX-XX BA8 XX-XX9 XX-XX tdtype: string
Changed in version 2.0.
A single character pattern with regex=True will also be treated as a regular expression:
In [48]:s4=pd.Series(["a.b",".","b",np.nan,""],dtype="string")In [49]:s4Out[49]:0 a.b1 .2 b3 <NA>4dtype: stringIn [50]:s4.str.replace(".","a",regex=True)Out[50]:0 aaa1 a2 a3 <NA>4dtype: string
If you want literal replacement of a string (equivalent to str.replace()), you can set the optional regex parameter to False, rather than escaping each character. In this case both pat and repl must be strings:
In [51]:dollars=pd.Series(["12","-$10","$10,000"],dtype="string")# These lines are equivalentIn [52]:dollars.str.replace(r"-\$","-",regex=True)Out[52]:0 121 -102 $10,000dtype: stringIn [53]:dollars.str.replace("-$","-",regex=False)Out[53]:0 121 -102 $10,000dtype: string
The replace method can also take a callable as replacement. It is called on every pat using re.sub(). The callable should expect one positional argument (a regex match object) and return a string.
# Reverse every lowercase alphabetic wordIn [54]:pat=r"[a-z]+"In [55]:defrepl(m): ....:returnm.group(0)[::-1] ....:In [56]:pd.Series(["foo 123","bar baz",np.nan],dtype="string").str.replace( ....:pat,repl,regex=True ....:) ....:Out[56]:0 oof 1231 rab zab2 <NA>dtype: string# Using regex groupsIn [57]:pat=r"(?P<one>\w+) (?P<two>\w+) (?P<three>\w+)"In [58]:defrepl(m): ....:returnm.group("two").swapcase() ....:In [59]:pd.Series(["Foo Bar Baz",np.nan],dtype="string").str.replace( ....:pat,repl,regex=True ....:) ....:Out[59]:0 bAR1 <NA>dtype: string
The replace method also accepts a compiled regular expression object from re.compile() as a pattern. All flags should be included in the compiled regular expression object.
In [60]:importreIn [61]:regex_pat=re.compile(r"^.a|dog",flags=re.IGNORECASE)In [62]:s3.str.replace(regex_pat,"XX-XX ",regex=True)Out[62]:0 A1 B2 C3 XX-XX ba4 XX-XX ca56 <NA>7 XX-XX BA8 XX-XX9 XX-XX tdtype: string
Including a flags argument when calling replace with a compiled regular expression object will raise a ValueError.
In [63]:s3.str.replace(regex_pat,'XX-XX ',flags=re.IGNORECASE)---------------------------------------------------------------------------ValueError: case and flags cannot be set when pat is a compiled regex
removeprefix and removesuffix have the same effect as str.removeprefix and str.removesuffix added in Python 3.9 (https://docs.python.org/3/library/stdtypes.html#str.removeprefix):
Added in version 1.4.0.
In [64]:s=pd.Series(["str_foo","str_bar","no_prefix"])In [65]:s.str.removeprefix("str_")Out[65]:0 foo1 bar2 no_prefixdtype: objectIn [66]:s=pd.Series(["foo_str","bar_str","no_suffix"])In [67]:s.str.removesuffix("_str")Out[67]:0 foo1 bar2 no_suffixdtype: object
Concatenation#
There are several ways to concatenate a Series or Index, either with itself or others, all based on cat(), resp. Index.str.cat.
Concatenating a single Series into a string#
The content of a Series (or Index) can be concatenated:
In [68]:s=pd.Series(["a","b","c","d"],dtype="string")In [69]:s.str.cat(sep=",")Out[69]:'a,b,c,d'
If not specified, the keyword sep for the separator defaults to the empty string, sep='':
In [70]:s.str.cat()Out[70]:'abcd'
By default, missing values are ignored. Using na_rep, they can be given a representation:
In [71]:t=pd.Series(["a","b",np.nan,"d"],dtype="string")In [72]:t.str.cat(sep=",")Out[72]:'a,b,d'In [73]:t.str.cat(sep=",",na_rep="-")Out[73]:'a,b,-,d'
Concatenating a Series and something list-like into a Series#
The first argument to cat() can be a list-like object, provided that it matches the length of the calling Series (or Index).
In [74]:s.str.cat(["A","B","C","D"])Out[74]:0 aA1 bB2 cC3 dDdtype: string
Missing values on either side will result in missing values in the result as well, unless na_rep is specified:
In [75]:s.str.cat(t)Out[75]:0 aa1 bb2 <NA>3 dddtype: stringIn [76]:s.str.cat(t,na_rep="-")Out[76]:0 aa1 bb2 c-3 dddtype: string
Concatenating a Series and something array-like into a Series#
The parameter others can also be two-dimensional. In this case, the number of rows must match the length of the calling Series (or Index).
In [77]:d=pd.concat([t,s],axis=1)In [78]:sOut[78]:0 a1 b2 c3 ddtype: stringIn [79]:dOut[79]: 0 10 a a1 b b2 <NA> c3 d dIn [80]:s.str.cat(d,na_rep="-")Out[80]:0 aaa1 bbb2 c-c3 ddddtype: string
Concatenating a Series and an indexed object into a Series, with alignment#
For concatenation with a Series or DataFrame, it is possible to align the indexes before concatenation by setting the join keyword.
In [81]:u=pd.Series(["b","d","a","c"],index=[1,3,0,2],dtype="string")In [82]:sOut[82]:0 a1 b2 c3 ddtype: stringIn [83]:uOut[83]:1 b3 d0 a2 cdtype: stringIn [84]:s.str.cat(u)Out[84]:0 aa1 bb2 cc3 dddtype: stringIn [85]:s.str.cat(u,join="left")Out[85]:0 aa1 bb2 cc3 dddtype: string
The usual options are available for join (one of 'left', 'outer', 'inner', 'right'). In particular, alignment also means that the different lengths do not need to coincide anymore.
In [86]:v=pd.Series(["z","a","b","d","e"],index=[-1,0,1,3,4],dtype="string")In [87]:sOut[87]:0 a1 b2 c3 ddtype: stringIn [88]:vOut[88]:-1 z 0 a 1 b 3 d 4 edtype: stringIn [89]:s.str.cat(v,join="left",na_rep="-")Out[89]:0 aa1 bb2 c-3 dddtype: stringIn [90]:s.str.cat(v,join="outer",na_rep="-")Out[90]:-1 -z 0 aa 1 bb 2 c- 3 dd 4 -edtype: string
The same alignment can be used when others is a DataFrame:
In [91]:f=d.loc[[3,2,1,0],:]In [92]:sOut[92]:0 a1 b2 c3 ddtype: stringIn [93]:fOut[93]: 0 13 d d2 <NA> c1 b b0 a aIn [94]:s.str.cat(f,join="left",na_rep="-")Out[94]:0 aaa1 bbb2 c-c3 ddddtype: string
Concatenating a Series and many objects into a Series#
Several array-like items (specifically: Series, Index, and 1-dimensional variants of np.ndarray) can be combined in a list-like container (including iterators, dict-views, etc.).
In [95]:sOut[95]:0 a1 b2 c3 ddtype: stringIn [96]:uOut[96]:1 b3 d0 a2 cdtype: stringIn [97]:s.str.cat([u,u.to_numpy()],join="left")Out[97]:0 aab1 bbd2 cca3 ddcdtype: string
All elements without an index (e.g. np.ndarray) within the passed list-like must match in length to the calling Series (or Index), but Series and Index may have arbitrary length (as long as alignment is not disabled with join=None):
In [98]:vOut[98]:-1 z 0 a 1 b 3 d 4 edtype: stringIn [99]:s.str.cat([v,u,u.to_numpy()],join="outer",na_rep="-")Out[99]:-1 -z--0 aaab1 bbbd2 c-ca3 dddc4 -e--dtype: string
If using join='right' on a list-like of others that contains different indexes, the union of these indexes will be used as the basis for the final concatenation:
In [100]:u.loc[[3]]Out[100]:3 ddtype: stringIn [101]:v.loc[[-1,0]]Out[101]:-1 z 0 adtype: stringIn [102]:s.str.cat([u.loc[[3]],v.loc[[-1,0]]],join="right",na_rep="-")Out[102]: 3 dd--1 --z 0 a-adtype: string
Indexing with .str#
You can use [] notation to directly index by position locations. If you index past the end of the string, the result will be a NaN.
In [103]:s=pd.Series( .....:["A","B","C","Aaba","Baca",np.nan,"CABA","dog","cat"],dtype="string" .....:) .....:In [104]:s.str[0]Out[104]:0 A1 B2 C3 A4 B5 <NA>6 C7 d8 cdtype: stringIn [105]:s.str[1]Out[105]:0 <NA>1 <NA>2 <NA>3 a4 a5 <NA>6 A7 o8 adtype: string
Extracting substrings#
Extract first match in each subject (extract)#
The extract method accepts a regular expression with at least one capture group.
Extracting a regular expression with more than one group returns a DataFrame with one column per group.
In [106]:pd.Series( .....:["a1","b2","c3"], .....:dtype="string", .....:).str.extract(r"([ab])(\d)",expand=False) .....:Out[106]: 0 10 a 11 b 22 <NA> <NA>
Elements that do not match return a row filled with NaN. Thus, a Series of messy strings can be "converted" into a like-indexed Series or DataFrame of cleaned-up or more useful strings, without necessitating get() to access tuples or re.match objects. The dtype of the result is always object, even if no match is found and the result only contains NaN.
Named groups like
In [107]:pd.Series(["a1","b2","c3"],dtype="string").str.extract( .....:r"(?P<letter>[ab])(?P<digit>\d)",expand=False .....:) .....:Out[107]: letter digit0 a 11 b 22 <NA> <NA>
and optional groups like
In [108]:pd.Series( .....:["a1","b2","3"], .....:dtype="string", .....:).str.extract(r"([ab])?(\d)",expand=False) .....:Out[108]: 0 10 a 11 b 22 <NA> 3
can also be used. Note that any capture group names in the regular expression will be used for column names; otherwise capture group numbers will be used.
Extracting a regular expression with one group returns a DataFrame with one column if expand=True.
In [109]:pd.Series(["a1","b2","c3"],dtype="string").str.extract(r"[ab](\d)",expand=True)Out[109]: 00 11 22 <NA>
It returns a Series if expand=False.
In [110]:pd.Series(["a1","b2","c3"],dtype="string").str.extract(r"[ab](\d)",expand=False)Out[110]:0 11 22 <NA>dtype: string
Calling on an Index with a regex with exactly one capture group returns a DataFrame with one column if expand=True.
In [111]:s=pd.Series(["a1","b2","c3"],["A11","B22","C33"],dtype="string")In [112]:sOut[112]:A11 a1B22 b2C33 c3dtype: stringIn [113]:s.index.str.extract("(?P<letter>[a-zA-Z])",expand=True)Out[113]: letter0 A1 B2 C
It returns an Index if expand=False.
In [114]:s.index.str.extract("(?P<letter>[a-zA-Z])",expand=False)Out[114]:Index(['A', 'B', 'C'], dtype='object', name='letter')
Calling on an Index with a regex with more than one capture group returns a DataFrame if expand=True.
In [115]:s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)",expand=True)Out[115]: letter 10 A 111 B 222 C 33
It raises ValueError if expand=False.
In [116]:s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)",expand=False)---------------------------------------------------------------------------ValueErrorTraceback (most recent call last)CellIn[116],line1---->1s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)",expand=False)File ~/work/pandas/pandas/pandas/core/strings/accessor.py:137, inforbid_nonstring_types.<locals>._forbid_nonstring_types.<locals>.wrapper(self, *args, **kwargs)132msg=(133f"Cannot use .str.{func_name} with values of "134f"inferred dtype '{self._inferred_dtype}'."135)136raiseTypeError(msg)-->137returnfunc(self,*args,**kwargs)File ~/work/pandas/pandas/pandas/core/strings/accessor.py:2743, inStringMethods.extract(self, pat, flags, expand)2740raiseValueError("pattern contains no capture groups")2742ifnotexpandandregex.groups>1andisinstance(self._data,ABCIndex):->2743raiseValueError("only one regex group is supported with Index")2745obj=self._data2746result_dtype=_result_dtype(obj)ValueError: only one regex group is supported with Index
The table below summarizes the behavior of extract(expand=False) (input subject in first column, number of groups in regex in first row):
|        | 1 group | >1 group   |
|--------|---------|------------|
| Index  | Index   | ValueError |
| Series | Series  | DataFrame  |
Extract all matches in each subject (extractall)#
Unlike extract (which returns only the first match),
In [117]:s=pd.Series(["a1a2","b1","c1"],index=["A","B","C"],dtype="string")In [118]:sOut[118]:A a1a2B b1C c1dtype: stringIn [119]:two_groups="(?P<letter>[a-z])(?P<digit>[0-9])"In [120]:s.str.extract(two_groups,expand=True)Out[120]: letter digitA a 1B b 1C c 1
the extractall method returns every match. The result of extractall is always a DataFrame with a MultiIndex on its rows. The last level of the MultiIndex is named match and indicates the order in the subject.
In [121]:s.str.extractall(two_groups)Out[121]: letter digit matchA 0 a 1 1 a 2B 0 b 1C 0 c 1
When each subject string in the Series has exactly one match,
In [122]:s=pd.Series(["a3","b3","c2"],dtype="string")In [123]:sOut[123]:0 a31 b32 c2dtype: string
then extractall(pat).xs(0, level='match') gives the same result as extract(pat).
In [124]:extract_result=s.str.extract(two_groups,expand=True)In [125]:extract_resultOut[125]: letter digit0 a 31 b 32 c 2In [126]:extractall_result=s.str.extractall(two_groups)In [127]:extractall_resultOut[127]: letter digit match0 0 a 31 0 b 32 0 c 2In [128]:extractall_result.xs(0,level="match")Out[128]: letter digit0 a 31 b 32 c 2
Index also supports .str.extractall. It returns a DataFrame which has the same result as a Series.str.extractall with a default index (starts from 0).
In [129]:pd.Index(["a1a2","b1","c1"]).str.extractall(two_groups)Out[129]: letter digit match0 0 a 1 1 a 21 0 b 12 0 c 1In [130]:pd.Series(["a1a2","b1","c1"],dtype="string").str.extractall(two_groups)Out[130]: letter digit match0 0 a 1 1 a 21 0 b 12 0 c 1
Testing for strings that match or contain a pattern#
You can check whether elements contain a pattern:
In [131]:pattern=r"[0-9][a-z]"In [132]:pd.Series( .....:["1","2","3a","3b","03c","4dx"], .....:dtype="string", .....:).str.contains(pattern) .....:Out[132]:0 False1 False2 True3 True4 True5 Truedtype: boolean
Or whether elements match a pattern:
In [133]:pd.Series( .....:["1","2","3a","3b","03c","4dx"], .....:dtype="string", .....:).str.match(pattern) .....:Out[133]:0 False1 False2 True3 True4 False5 Truedtype: boolean
In [134]:pd.Series( .....:["1","2","3a","3b","03c","4dx"], .....:dtype="string", .....:).str.fullmatch(pattern) .....:Out[134]:0 False1 False2 True3 True4 False5 Falsedtype: boolean
Note
The distinction between match, fullmatch, and contains is strictness: fullmatch tests whether the entire string matches the regular expression; match tests whether there is a match of the regular expression that begins at the first character of the string; and contains tests whether there is a match of the regular expression at any position within the string.
The corresponding functions in the re package for these three match modes are re.fullmatch, re.match, and re.search, respectively (a small side-by-side sketch follows below).
Methods like match, fullmatch, contains, startswith, and endswith take an extra na argument so missing values can be considered True or False:
In [135]:s4=pd.Series( .....:["A","B","C","Aaba","Baca",np.nan,"CABA","dog","cat"],dtype="string" .....:) .....:In [136]:s4.str.contains("A",na=False)Out[136]:0 True1 False2 False3 True4 False5 False6 True7 False8 Falsedtype: boolean
Creating indicator variables#
You can extract dummy variables from string columns. For example, if they are separated by a '|':
In [137]:s=pd.Series(["a","a|b",np.nan,"a|c"],dtype="string")In [138]:s.str.get_dummies(sep="|")Out[138]: a b c0 1 0 01 1 1 02 0 0 03 1 0 1
A string Index also supports get_dummies, which returns a MultiIndex.
In [139]:idx=pd.Index(["a","a|b",np.nan,"a|c"])In [140]:idx.str.get_dummies(sep="|")Out[140]:MultiIndex([(1, 0, 0), (1, 1, 0), (0, 0, 0), (1, 0, 1)], names=['a', 'b', 'c'])
See also get_dummies().
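For comparison, a rough sketch of the difference between Series.str.get_dummies() and the top-level get_dummies() on the same (made-up) data:

```python
import pandas as pd

s = pd.Series(["a", "a|b", "a|c"], dtype="string")

# str.get_dummies splits each value on the separator before encoding.
print(s.str.get_dummies(sep="|").columns.tolist())  # ['a', 'b', 'c']

# pd.get_dummies treats each full string as a single category.
print(pd.get_dummies(s).columns.tolist())  # ['a', 'a|b', 'a|c']
```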
Method summary#
| Method | Description |
|---|---|
| cat() | Concatenate strings |
| split() | Split strings on delimiter |
| rsplit() | Split strings on delimiter working from the end of the string |
| get() | Index into each element (retrieve i-th element) |
| join() | Join strings in each element of the Series with passed separator |
| get_dummies() | Split strings on the delimiter returning DataFrame of dummy variables |
| contains() | Return boolean array if each string contains pattern/regex |
| replace() | Replace occurrences of pattern/regex/string with some other string or the return value of a callable given the occurrence |
| removeprefix() | Remove prefix from string, i.e. only remove if string starts with prefix |
| removesuffix() | Remove suffix from string, i.e. only remove if string ends with suffix |
| repeat() | Duplicate values (s.str.repeat(3) equivalent to x * 3) |
| pad() | Add whitespace to the sides of strings |
| center() | Equivalent to str.center |
| ljust() | Equivalent to str.ljust |
| rjust() | Equivalent to str.rjust |
| zfill() | Equivalent to str.zfill |
| wrap() | Split long strings into lines with length less than a given width |
| slice() | Slice each string in the Series |
| slice_replace() | Replace slice in each string with passed value |
| count() | Count occurrences of pattern |
| startswith() | Equivalent to str.startswith(pat) for each element |
| endswith() | Equivalent to str.endswith(pat) for each element |
| findall() | Compute list of all occurrences of pattern/regex for each string |
| match() | Call re.match on each element, returning matched groups as list |
| extract() | Call re.search on each element, returning DataFrame with one row for each element and one column for each regex capture group |
| extractall() | Call re.findall on each element, returning DataFrame with one row for each match and one column for each regex capture group |
| len() | Compute string lengths |
| strip() | Equivalent to str.strip |
| rstrip() | Equivalent to str.rstrip |
| lstrip() | Equivalent to str.lstrip |
| partition() | Equivalent to str.partition |
| rpartition() | Equivalent to str.rpartition |
| lower() | Equivalent to str.lower |
| casefold() | Equivalent to str.casefold |
| upper() | Equivalent to str.upper |
| find() | Equivalent to str.find |
| rfind() | Equivalent to str.rfind |
| index() | Equivalent to str.index |
| rindex() | Equivalent to str.rindex |
| capitalize() | Equivalent to str.capitalize |
| swapcase() | Equivalent to str.swapcase |
| normalize() | Return Unicode normal form. Equivalent to unicodedata.normalize |
| translate() | Equivalent to str.translate |
| isalnum() | Equivalent to str.isalnum |
| isalpha() | Equivalent to str.isalpha |
| isdigit() | Equivalent to str.isdigit |
| isspace() | Equivalent to str.isspace |
| islower() | Equivalent to str.islower |
| isupper() | Equivalent to str.isupper |
| istitle() | Equivalent to str.istitle |
| isnumeric() | Equivalent to str.isnumeric |
| isdecimal() | Equivalent to str.isdecimal |