Working with text data#
Text data types#
There are two ways to store text data in pandas:

1. object-dtype NumPy array.
2. StringDtype extension type.

We recommend using StringDtype to store text data.
Prior to pandas 1.0, object dtype was the only option. This was unfortunate for many reasons:

- You can accidentally store a mixture of strings and non-strings in an object dtype array. It's better to have a dedicated dtype.
- object dtype breaks dtype-specific operations like DataFrame.select_dtypes(). There isn't a clear way to select just text while excluding non-text but still object-dtype columns (see the sketch after this list).
- When reading code, the contents of an object dtype array is less clear than 'string'.
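As a rough sketch of the select_dtypes() point above (the column names here are invented for illustration), a dedicated string dtype lets you select text columns directly, whereas object dtype cannot distinguish text from other Python objects:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "text": pd.array(["a", "b", "c"], dtype="string"),
        "other_objects": pd.Series([["x"], {"y": 1}, None], dtype="object"),
        "numbers": [1, 2, 3],
    }
)

# Selects only the true text column; an object-dtype "text" column would be
# indistinguishable from "other_objects" here.
print(df.select_dtypes(include="string").columns)
```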
Currently, the performance of object dtype arrays of strings and arrays.StringArray are about the same. We expect future enhancements to significantly increase the performance and lower the memory overhead of StringArray.
Warning
StringArray is currently considered experimental. The implementation and parts of the API may change without warning.
For backwards-compatibility, object dtype remains the default type we infer a list of strings to
In [1]: pd.Series(["a", "b", "c"])
Out[1]:
0    a
1    b
2    c
dtype: object
To explicitly request string dtype, specify the dtype
In [2]: pd.Series(["a", "b", "c"], dtype="string")
Out[2]:
0    a
1    b
2    c
dtype: string

In [3]: pd.Series(["a", "b", "c"], dtype=pd.StringDtype())
Out[3]:
0    a
1    b
2    c
dtype: string
Or astype after the Series or DataFrame is created
In [4]: s = pd.Series(["a", "b", "c"])

In [5]: s
Out[5]:
0    a
1    b
2    c
dtype: object

In [6]: s.astype("string")
Out[6]:
0    a
1    b
2    c
dtype: string
You can also use StringDtype / "string" as the dtype on non-string data and it will be converted to string dtype:
In [7]: s = pd.Series(["a", 2, np.nan], dtype="string")

In [8]: s
Out[8]:
0       a
1       2
2    <NA>
dtype: string

In [9]: type(s[1])
Out[9]: str
or convert from existing pandas data:
In [10]: s1 = pd.Series([1, 2, np.nan], dtype="Int64")

In [11]: s1
Out[11]:
0       1
1       2
2    <NA>
dtype: Int64

In [12]: s2 = s1.astype("string")

In [13]: s2
Out[13]:
0       1
1       2
2    <NA>
dtype: string

In [14]: type(s2[0])
Out[14]: str
Behavior differences#
These are places where the behavior of StringDtype objects differs from object dtype.
For StringDtype, string accessor methods that return numeric output will always return a nullable integer dtype, rather than either int or float dtype, depending on the presence of NA values. Methods returning boolean output will return a nullable boolean dtype.

In [15]: s = pd.Series(["a", None, "b"], dtype="string")

In [16]: s
Out[16]:
0       a
1    <NA>
2       b
dtype: string

In [17]: s.str.count("a")
Out[17]:
0       1
1    <NA>
2       0
dtype: Int64

In [18]: s.dropna().str.count("a")
Out[18]:
0    1
2    0
dtype: Int64
Both outputs are Int64 dtype. Compare that with object-dtype:

In [19]: s2 = pd.Series(["a", None, "b"], dtype="object")

In [20]: s2.str.count("a")
Out[20]:
0    1.0
1    NaN
2    0.0
dtype: float64

In [21]: s2.dropna().str.count("a")
Out[21]:
0    1
2    0
dtype: int64
When NA values are present, the output dtype is float64. Similarly for methods returning boolean values.
In [22]: s.str.isdigit()
Out[22]:
0    False
1     <NA>
2    False
dtype: boolean

In [23]: s.str.match("a")
Out[23]:
0     True
1     <NA>
2    False
dtype: boolean
Some string methods, like Series.str.decode(), are not available on StringArray because StringArray only holds strings, not bytes.

In comparison operations, arrays.StringArray and Series backed by a StringArray will return an object with BooleanDtype, rather than a bool dtype object. Missing values in a StringArray will propagate in comparison operations, rather than always comparing unequal like numpy.nan (illustrated below).
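A minimal sketch of the comparison behavior described above (expected outputs shown as comments):

```python
import pandas as pd
import numpy as np

s = pd.Series(["a", None, "b"], dtype="string")

# StringDtype comparisons return a nullable boolean and propagate <NA>.
print(s == "a")
# 0     True
# 1     <NA>
# 2    False
# dtype: boolean

# object dtype with np.nan yields plain bools; the missing value compares unequal.
print(pd.Series(["a", np.nan, "b"], dtype="object") == "a")
# 0     True
# 1    False
# 2    False
# dtype: bool
```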
Everything else that follows in the rest of this document applies equally to string and object dtype.
String methods#
Series and Index are equipped with a set of string processing methods that make it easy to operate on each element of the array. Perhaps most importantly, these methods exclude missing/NA values automatically. These are accessed via the str attribute and generally have names matching the equivalent (scalar) built-in string methods:
In [24]:s=pd.Series( ....:["A","B","C","Aaba","Baca",np.nan,"CABA","dog","cat"],dtype="string" ....:) ....:In [25]:s.str.lower()Out[25]:0 a1 b2 c3 aaba4 baca5 <NA>6 caba7 dog8 catdtype: stringIn [26]:s.str.upper()Out[26]:0 A1 B2 C3 AABA4 BACA5 <NA>6 CABA7 DOG8 CATdtype: stringIn [27]:s.str.len()Out[27]:0 11 12 13 44 45 <NA>6 47 38 3dtype: Int64
In [28]:idx=pd.Index([" jack","jill "," jesse ","frank"])In [29]:idx.str.strip()Out[29]:Index(['jack', 'jill', 'jesse', 'frank'], dtype='object')In [30]:idx.str.lstrip()Out[30]:Index(['jack', 'jill ', 'jesse ', 'frank'], dtype='object')In [31]:idx.str.rstrip()Out[31]:Index([' jack', 'jill', ' jesse', 'frank'], dtype='object')
The string methods on Index are especially useful for cleaning up ortransforming DataFrame columns. For instance, you may have columns withleading or trailing whitespace:
In [32]:df=pd.DataFrame( ....:np.random.randn(3,2),columns=[" Column A "," Column B "],index=range(3) ....:) ....:In [33]:dfOut[33]: Column A Column B0 0.469112 -0.2828631 -1.509059 -1.1356322 1.212112 -0.173215
Since df.columns is an Index object, we can use the .str accessor
In [34]:df.columns.str.strip()Out[34]:Index(['Column A', 'Column B'], dtype='object')In [35]:df.columns.str.lower()Out[35]:Index([' column a ', ' column b '], dtype='object')
These string methods can then be used to clean up the columns as needed. Here we are removing leading and trailing whitespaces, lowercasing all names, and replacing any remaining whitespaces with underscores:
In [36]:df.columns=df.columns.str.strip().str.lower().str.replace(" ","_")In [37]:dfOut[37]: column_a column_b0 0.469112 -0.2828631 -1.509059 -1.1356322 1.212112 -0.173215
Note
If you have a Series where lots of elements are repeated (i.e. the number of unique elements in the Series is a lot smaller than the length of the Series), it can be faster to convert the original Series to one of type category and then use .str.<method> or .dt.<property> on that. The performance difference comes from the fact that, for Series of type category, the string operations are done on the .categories and not on each element of the Series.

Please note that a Series of type category with string .categories has some limitations in comparison to a Series of type string (e.g. you can't add strings to each other: s + " " + s won't work if s is a Series of type category). Also, .str methods which operate on elements of type list are not available on such a Series. A minimal sketch of the conversion follows.
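A minimal sketch of the conversion described in this note (whether it is actually faster depends on the data):

```python
import pandas as pd

# Many repeated values: after converting to category, the string operation
# runs once per unique category rather than once per element.
s = pd.Series(["apple", "banana", "apple", "banana"] * 1000, dtype="string")

as_category = s.astype("category")
result = as_category.str.upper()

print(result.head(3))
```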
Warning
The type of the Series is inferred and the allowed types (i.e. strings) are enforced more rigorously.

Generally speaking, the .str accessor is intended to work only on strings. With very few exceptions, other uses are not supported, and may be disabled at a later point (see the sketch below).
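As a rough illustration of this warning (the exact error message may vary by pandas version), using .str on numeric data is rejected:

```python
import pandas as pd

s = pd.Series([1, 2, 3])

try:
    s.str.upper()  # .str only works on string-like data
except AttributeError as err:
    print(err)  # e.g. "Can only use .str accessor with string values!"
```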
Splitting and replacing strings#
Methods like split return a Series of lists:
In [38]:s2=pd.Series(["a_b_c","c_d_e",np.nan,"f_g_h"],dtype="string")In [39]:s2.str.split("_")Out[39]:0 [a, b, c]1 [c, d, e]2 <NA>3 [f, g, h]dtype: object
Elements in the split lists can be accessed using get or [] notation:
In [40]:s2.str.split("_").str.get(1)Out[40]:0 b1 d2 <NA>3 gdtype: objectIn [41]:s2.str.split("_").str[1]Out[41]:0 b1 d2 <NA>3 gdtype: object
It is easy to expand this to return a DataFrame using expand.
In [42]:s2.str.split("_",expand=True)Out[42]: 0 1 20 a b c1 c d e2 <NA> <NA> <NA>3 f g h
When the original Series has StringDtype, the output columns will all be StringDtype as well.
It is also possible to limit the number of splits:
In [43]:s2.str.split("_",expand=True,n=1)Out[43]: 0 10 a b_c1 c d_e2 <NA> <NA>3 f g_h
rsplit is similar to split except it works in the reverse direction, i.e., from the end of the string to the beginning of the string:
In [44]:s2.str.rsplit("_",expand=True,n=1)Out[44]: 0 10 a_b c1 c_d e2 <NA> <NA>3 f_g h
replace optionally uses regular expressions:
In [45]:s3=pd.Series( ....:["A","B","C","Aaba","Baca","",np.nan,"CABA","dog","cat"], ....:dtype="string", ....:) ....:In [46]:s3Out[46]:0 A1 B2 C3 Aaba4 Baca56 <NA>7 CABA8 dog9 catdtype: stringIn [47]:s3.str.replace("^.a|dog","XX-XX ",case=False,regex=True)Out[47]:0 A1 B2 C3 XX-XX ba4 XX-XX ca56 <NA>7 XX-XX BA8 XX-XX9 XX-XX tdtype: string
Changed in version 2.0.
A single character pattern with regex=True will also be treated as a regular expression:
In [48]:s4=pd.Series(["a.b",".","b",np.nan,""],dtype="string")In [49]:s4Out[49]:0 a.b1 .2 b3 <NA>4dtype: stringIn [50]:s4.str.replace(".","a",regex=True)Out[50]:0 aaa1 a2 a3 <NA>4dtype: string
If you want literal replacement of a string (equivalent to str.replace()), you can set the optional regex parameter to False, rather than escaping each character. In this case both pat and repl must be strings:
In [51]:dollars=pd.Series(["12","-$10","$10,000"],dtype="string")# These lines are equivalentIn [52]:dollars.str.replace(r"-\$","-",regex=True)Out[52]:0 121 -102 $10,000dtype: stringIn [53]:dollars.str.replace("-$","-",regex=False)Out[53]:0 121 -102 $10,000dtype: string
The replace method can also take a callable as replacement. It is called on every pat using re.sub(). The callable should expect one positional argument (a regex match object) and return a string.
# Reverse every lowercase alphabetic wordIn [54]:pat=r"[a-z]+"In [55]:defrepl(m): ....:returnm.group(0)[::-1] ....:In [56]:pd.Series(["foo 123","bar baz",np.nan],dtype="string").str.replace( ....:pat,repl,regex=True ....:) ....:Out[56]:0 oof 1231 rab zab2 <NA>dtype: string# Using regex groupsIn [57]:pat=r"(?P<one>\w+) (?P<two>\w+) (?P<three>\w+)"In [58]:defrepl(m): ....:returnm.group("two").swapcase() ....:In [59]:pd.Series(["Foo Bar Baz",np.nan],dtype="string").str.replace( ....:pat,repl,regex=True ....:) ....:Out[59]:0 bAR1 <NA>dtype: string
The replace method also accepts a compiled regular expression object from re.compile() as a pattern. All flags should be included in the compiled regular expression object.
In [60]:importreIn [61]:regex_pat=re.compile(r"^.a|dog",flags=re.IGNORECASE)In [62]:s3.str.replace(regex_pat,"XX-XX ",regex=True)Out[62]:0 A1 B2 C3 XX-XX ba4 XX-XX ca56 <NA>7 XX-XX BA8 XX-XX9 XX-XX tdtype: string
Including a flags argument when calling replace with a compiled regular expression object will raise a ValueError.
In [63]:s3.str.replace(regex_pat,'XX-XX ',flags=re.IGNORECASE)---------------------------------------------------------------------------ValueError: case and flags cannot be set when pat is a compiled regex
removeprefix and removesuffix have the same effect as str.removeprefix and str.removesuffix added in Python 3.9 (https://docs.python.org/3/library/stdtypes.html#str.removeprefix):
Added in version 1.4.0.
In [64]:s=pd.Series(["str_foo","str_bar","no_prefix"])In [65]:s.str.removeprefix("str_")Out[65]:0 foo1 bar2 no_prefixdtype: objectIn [66]:s=pd.Series(["foo_str","bar_str","no_suffix"])In [67]:s.str.removesuffix("_str")Out[67]:0 foo1 bar2 no_suffixdtype: object
Concatenation#
There are several ways to concatenate a Series or Index, either with itself or others, all based on cat(), resp. Index.str.cat.
Concatenating a single Series into a string#
The content of a Series (or Index) can be concatenated:
In [68]:s=pd.Series(["a","b","c","d"],dtype="string")In [69]:s.str.cat(sep=",")Out[69]:'a,b,c,d'
If not specified, the keyword sep for the separator defaults to the empty string, sep='':
In [70]:s.str.cat()Out[70]:'abcd'
By default, missing values are ignored. Using na_rep, they can be given a representation:
In [71]:t=pd.Series(["a","b",np.nan,"d"],dtype="string")In [72]:t.str.cat(sep=",")Out[72]:'a,b,d'In [73]:t.str.cat(sep=",",na_rep="-")Out[73]:'a,b,-,d'
Concatenating a Series and something list-like into a Series#
The first argument to cat() can be a list-like object, provided that it matches the length of the calling Series (or Index).
In [74]:s.str.cat(["A","B","C","D"])Out[74]:0 aA1 bB2 cC3 dDdtype: string
Missing values on either side will result in missing values in the result as well, unless na_rep is specified:
In [75]:s.str.cat(t)Out[75]:0 aa1 bb2 <NA>3 dddtype: stringIn [76]:s.str.cat(t,na_rep="-")Out[76]:0 aa1 bb2 c-3 dddtype: string
Concatenating a Series and something array-like into a Series#
The parameter others can also be two-dimensional. In this case, the number of rows must match the length of the calling Series (or Index).
In [77]:d=pd.concat([t,s],axis=1)In [78]:sOut[78]:0 a1 b2 c3 ddtype: stringIn [79]:dOut[79]: 0 10 a a1 b b2 <NA> c3 d dIn [80]:s.str.cat(d,na_rep="-")Out[80]:0 aaa1 bbb2 c-c3 ddddtype: string
Concatenating a Series and an indexed object into a Series, with alignment#
For concatenation with a Series or DataFrame, it is possible to align the indexes before concatenation by setting the join keyword.
In [81]:u=pd.Series(["b","d","a","c"],index=[1,3,0,2],dtype="string")In [82]:sOut[82]:0 a1 b2 c3 ddtype: stringIn [83]:uOut[83]:1 b3 d0 a2 cdtype: stringIn [84]:s.str.cat(u)Out[84]:0 aa1 bb2 cc3 dddtype: stringIn [85]:s.str.cat(u,join="left")Out[85]:0 aa1 bb2 cc3 dddtype: string
The usual options are available for join (one of 'left', 'outer', 'inner', 'right'). In particular, alignment also means that the different lengths do not need to coincide anymore.
In [86]:v=pd.Series(["z","a","b","d","e"],index=[-1,0,1,3,4],dtype="string")In [87]:sOut[87]:0 a1 b2 c3 ddtype: stringIn [88]:vOut[88]:-1 z 0 a 1 b 3 d 4 edtype: stringIn [89]:s.str.cat(v,join="left",na_rep="-")Out[89]:0 aa1 bb2 c-3 dddtype: stringIn [90]:s.str.cat(v,join="outer",na_rep="-")Out[90]:-1 -z 0 aa 1 bb 2 c- 3 dd 4 -edtype: string
The same alignment can be used when others is a DataFrame:
In [91]:f=d.loc[[3,2,1,0],:]In [92]:sOut[92]:0 a1 b2 c3 ddtype: stringIn [93]:fOut[93]: 0 13 d d2 <NA> c1 b b0 a aIn [94]:s.str.cat(f,join="left",na_rep="-")Out[94]:0 aaa1 bbb2 c-c3 ddddtype: string
Concatenating a Series and many objects into a Series#
Several array-like items (specifically: Series, Index, and 1-dimensional variants of np.ndarray) can be combined in a list-like container (including iterators, dict-views, etc.).
In [95]:sOut[95]:0 a1 b2 c3 ddtype: stringIn [96]:uOut[96]:1 b3 d0 a2 cdtype: stringIn [97]:s.str.cat([u,u.to_numpy()],join="left")Out[97]:0 aab1 bbd2 cca3 ddcdtype: string
All elements without an index (e.g. np.ndarray) within the passed list-like must match in length to the calling Series (or Index), but Series and Index may have arbitrary length (as long as alignment is not disabled with join=None):
In [98]:vOut[98]:-1 z 0 a 1 b 3 d 4 edtype: stringIn [99]:s.str.cat([v,u,u.to_numpy()],join="outer",na_rep="-")Out[99]:-1 -z--0 aaab1 bbbd2 c-ca3 dddc4 -e--dtype: string
If using join='right' on a list-like of others that contains different indexes, the union of these indexes will be used as the basis for the final concatenation:
In [100]:u.loc[[3]]Out[100]:3 ddtype: stringIn [101]:v.loc[[-1,0]]Out[101]:-1 z 0 adtype: stringIn [102]:s.str.cat([u.loc[[3]],v.loc[[-1,0]]],join="right",na_rep="-")Out[102]: 3 dd--1 --z 0 a-adtype: string
Indexing with .str#
You can use [] notation to directly index by position locations. If you index past the end of the string, the result will be a NaN.
In [103]:s=pd.Series( .....:["A","B","C","Aaba","Baca",np.nan,"CABA","dog","cat"],dtype="string" .....:) .....:In [104]:s.str[0]Out[104]:0 A1 B2 C3 A4 B5 <NA>6 C7 d8 cdtype: stringIn [105]:s.str[1]Out[105]:0 <NA>1 <NA>2 <NA>3 a4 a5 <NA>6 A7 o8 adtype: string
Extracting substrings#
Extract first match in each subject (extract)#
The extract method accepts a regular expression with at least one capture group.
Extracting a regular expression with more than one group returns a DataFrame with one column per group.
In [106]:pd.Series( .....:["a1","b2","c3"], .....:dtype="string", .....:).str.extract(r"([ab])(\d)",expand=False) .....:Out[106]: 0 10 a 11 b 22 <NA> <NA>
Elements that do not match return a row filled with NaN. Thus, a Series of messy strings can be "converted" into a like-indexed Series or DataFrame of cleaned-up or more useful strings, without necessitating get() to access tuples or re.match objects. The dtype of the result is always object, even if no match is found and the result only contains NaN.
Named groups like
In [107]:pd.Series(["a1","b2","c3"],dtype="string").str.extract( .....:r"(?P<letter>[ab])(?P<digit>\d)",expand=False .....:) .....:Out[107]: letter digit0 a 11 b 22 <NA> <NA>
and optional groups like
In [108]:pd.Series( .....:["a1","b2","3"], .....:dtype="string", .....:).str.extract(r"([ab])?(\d)",expand=False) .....:Out[108]: 0 10 a 11 b 22 <NA> 3
can also be used. Note that any capture group names in the regular expression will be used for column names; otherwise capture group numbers will be used.
Extracting a regular expression with one group returns a DataFrame with one column if expand=True.
In [109]:pd.Series(["a1","b2","c3"],dtype="string").str.extract(r"[ab](\d)",expand=True)Out[109]: 00 11 22 <NA>
It returns a Series if expand=False.
In [110]:pd.Series(["a1","b2","c3"],dtype="string").str.extract(r"[ab](\d)",expand=False)Out[110]:0 11 22 <NA>dtype: string
Calling on an Index with a regex with exactly one capture group returns a DataFrame with one column if expand=True.
In [111]:s=pd.Series(["a1","b2","c3"],["A11","B22","C33"],dtype="string")In [112]:sOut[112]:A11 a1B22 b2C33 c3dtype: stringIn [113]:s.index.str.extract("(?P<letter>[a-zA-Z])",expand=True)Out[113]: letter0 A1 B2 C
It returns an Index if expand=False.
In [114]:s.index.str.extract("(?P<letter>[a-zA-Z])",expand=False)Out[114]:Index(['A', 'B', 'C'], dtype='object', name='letter')
Calling on an Index with a regex with more than one capture group returns a DataFrame if expand=True.
In [115]:s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)",expand=True)Out[115]: letter 10 A 111 B 222 C 33
It raises ValueError if expand=False.
In [116]:s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)",expand=False)---------------------------------------------------------------------------ValueErrorTraceback (most recent call last)CellIn[116],line1---->1s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)",expand=False)File ~/work/pandas/pandas/pandas/core/strings/accessor.py:137, inforbid_nonstring_types.<locals>._forbid_nonstring_types.<locals>.wrapper(self, *args, **kwargs)132msg=(133f"Cannot use .str.{func_name} with values of "134f"inferred dtype '{self._inferred_dtype}'."135)136raiseTypeError(msg)-->137returnfunc(self,*args,**kwargs)File ~/work/pandas/pandas/pandas/core/strings/accessor.py:2743, inStringMethods.extract(self, pat, flags, expand)2740raiseValueError("pattern contains no capture groups")2742ifnotexpandandregex.groups>1andisinstance(self._data,ABCIndex):->2743raiseValueError("only one regex group is supported with Index")2745obj=self._data2746result_dtype=_result_dtype(obj)ValueError: only one regex group is supported with Index
The table below summarizes the behavior of extract(expand=False) (input subject in first column, number of groups in regex in first row):
|        | 1 group | >1 group   |
|--------|---------|------------|
| Index  | Index   | ValueError |
| Series | Series  | DataFrame  |
Extract all matches in each subject (extractall)#
Unlike extract (which returns only the first match),
In [117]:s=pd.Series(["a1a2","b1","c1"],index=["A","B","C"],dtype="string")In [118]:sOut[118]:A a1a2B b1C c1dtype: stringIn [119]:two_groups="(?P<letter>[a-z])(?P<digit>[0-9])"In [120]:s.str.extract(two_groups,expand=True)Out[120]: letter digitA a 1B b 1C c 1
the extractall method returns every match. The result of extractall is always a DataFrame with a MultiIndex on its rows. The last level of the MultiIndex is named match and indicates the order in the subject.
In [121]:s.str.extractall(two_groups)Out[121]: letter digit matchA 0 a 1 1 a 2B 0 b 1C 0 c 1
When each subject string in the Series has exactly one match,
In [122]:s=pd.Series(["a3","b3","c2"],dtype="string")In [123]:sOut[123]:0 a31 b32 c2dtype: string
then extractall(pat).xs(0, level='match') gives the same result as extract(pat).
In [124]:extract_result=s.str.extract(two_groups,expand=True)In [125]:extract_resultOut[125]: letter digit0 a 31 b 32 c 2In [126]:extractall_result=s.str.extractall(two_groups)In [127]:extractall_resultOut[127]: letter digit match0 0 a 31 0 b 32 0 c 2In [128]:extractall_result.xs(0,level="match")Out[128]: letter digit0 a 31 b 32 c 2
Index also supports .str.extractall. It returns a DataFrame which has the same result as a Series.str.extractall with a default index (starts from 0).
In [129]:pd.Index(["a1a2","b1","c1"]).str.extractall(two_groups)Out[129]: letter digit match0 0 a 1 1 a 21 0 b 12 0 c 1In [130]:pd.Series(["a1a2","b1","c1"],dtype="string").str.extractall(two_groups)Out[130]: letter digit match0 0 a 1 1 a 21 0 b 12 0 c 1
Testing for strings that match or contain a pattern#
You can check whether elements contain a pattern:
In [131]:pattern=r"[0-9][a-z]"In [132]:pd.Series( .....:["1","2","3a","3b","03c","4dx"], .....:dtype="string", .....:).str.contains(pattern) .....:Out[132]:0 False1 False2 True3 True4 True5 Truedtype: boolean
Or whether elements match a pattern:
In [133]:pd.Series( .....:["1","2","3a","3b","03c","4dx"], .....:dtype="string", .....:).str.match(pattern) .....:Out[133]:0 False1 False2 True3 True4 False5 Truedtype: boolean
In [134]:pd.Series( .....:["1","2","3a","3b","03c","4dx"], .....:dtype="string", .....:).str.fullmatch(pattern) .....:Out[134]:0 False1 False2 True3 True4 False5 Falsedtype: boolean
Note
The distinction between match, fullmatch, and contains is strictness: fullmatch tests whether the entire string matches the regular expression; match tests whether there is a match of the regular expression that begins at the first character of the string; and contains tests whether there is a match of the regular expression at any position within the string.
The corresponding functions in the re package for these three match modes are re.fullmatch, re.match, and re.search, respectively (a small side-by-side sketch follows below).
Methods like match, fullmatch, contains, startswith, and endswith take an extra na argument so missing values can be considered True or False:
In [135]:s4=pd.Series( .....:["A","B","C","Aaba","Baca",np.nan,"CABA","dog","cat"],dtype="string" .....:) .....:In [136]:s4.str.contains("A",na=False)Out[136]:0 True1 False2 False3 True4 False5 False6 True7 False8 Falsedtype: boolean
Creating indicator variables#
You can extract dummy variables from string columns. For example, if they are separated by a '|':
In [137]:s=pd.Series(["a","a|b",np.nan,"a|c"],dtype="string")In [138]:s.str.get_dummies(sep="|")Out[138]: a b c0 1 0 01 1 1 02 0 0 03 1 0 1
A string Index also supports get_dummies, which returns a MultiIndex.
In [139]:idx=pd.Index(["a","a|b",np.nan,"a|c"])In [140]:idx.str.get_dummies(sep="|")Out[140]:MultiIndex([(1, 0, 0), (1, 1, 0), (0, 0, 0), (1, 0, 1)], names=['a', 'b', 'c'])
See also get_dummies().
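For comparison, a rough sketch of the difference between Series.str.get_dummies() and the top-level get_dummies() on the same (made-up) data:

```python
import pandas as pd

s = pd.Series(["a", "a|b", "a|c"], dtype="string")

# str.get_dummies splits each value on the separator before encoding.
print(s.str.get_dummies(sep="|").columns.tolist())  # ['a', 'b', 'c']

# pd.get_dummies treats each full string as a single category.
print(pd.get_dummies(s).columns.tolist())  # ['a', 'a|b', 'a|c']
```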
Method summary#
| Method | Description |
|---|---|
| cat() | Concatenate strings |
| split() | Split strings on delimiter |
| rsplit() | Split strings on delimiter working from the end of the string |
| get() | Index into each element (retrieve i-th element) |
| join() | Join strings in each element of the Series with passed separator |
| get_dummies() | Split strings on the delimiter returning DataFrame of dummy variables |
| contains() | Return boolean array if each string contains pattern/regex |
| replace() | Replace occurrences of pattern/regex/string with some other string or the return value of a callable given the occurrence |
| removeprefix() | Remove prefix from string, i.e. only remove if string starts with prefix |
| removesuffix() | Remove suffix from string, i.e. only remove if string ends with suffix |
| repeat() | Duplicate values (s.str.repeat(3) equivalent to x * 3) |
| pad() | Add whitespace to the sides of strings |
| center() | Equivalent to str.center |
| ljust() | Equivalent to str.ljust |
| rjust() | Equivalent to str.rjust |
| zfill() | Equivalent to str.zfill |
| wrap() | Split long strings into lines with length less than a given width |
| slice() | Slice each string in the Series |
| slice_replace() | Replace slice in each string with passed value |
| count() | Count occurrences of pattern |
| startswith() | Equivalent to str.startswith(pat) for each element |
| endswith() | Equivalent to str.endswith(pat) for each element |
| findall() | Compute list of all occurrences of pattern/regex for each string |
| match() | Call re.match on each element, returning matched groups as list |
| extract() | Call re.search on each element, returning DataFrame with one row for each element and one column for each regex capture group |
| extractall() | Call re.findall on each element, returning DataFrame with one row for each match and one column for each regex capture group |
| len() | Compute string lengths |
| strip() | Equivalent to str.strip |
| rstrip() | Equivalent to str.rstrip |
| lstrip() | Equivalent to str.lstrip |
| partition() | Equivalent to str.partition |
| rpartition() | Equivalent to str.rpartition |
| lower() | Equivalent to str.lower |
| casefold() | Equivalent to str.casefold |
| upper() | Equivalent to str.upper |
| find() | Equivalent to str.find |
| rfind() | Equivalent to str.rfind |
| index() | Equivalent to str.index |
| rindex() | Equivalent to str.rindex |
| capitalize() | Equivalent to str.capitalize |
| swapcase() | Equivalent to str.swapcase |
| normalize() | Return Unicode normal form. Equivalent to unicodedata.normalize |
| translate() | Equivalent to str.translate |
| isalnum() | Equivalent to str.isalnum |
| isalpha() | Equivalent to str.isalpha |
| isdigit() | Equivalent to str.isdigit |
| isspace() | Equivalent to str.isspace |
| islower() | Equivalent to str.islower |
| isupper() | Equivalent to str.isupper |
| istitle() | Equivalent to str.istitle |
| isnumeric() | Equivalent to str.isnumeric |
| isdecimal() | Equivalent to str.isdecimal |