Migration guide for the new string data type (pandas 3.0)#
The upcoming pandas 3.0 release introduces a new, default string data type. This will most likely cause some work when upgrading to pandas 3.0, and this page provides an overview of the issues you might run into and gives guidance on how to address them.
This new dtype is already available in the pandas 2.3 release, and you can enable it with:
pd.options.future.infer_string = True
This allows you to test your code before the final 3.0 release.
Background#
Historically, pandas has always used the NumPy object dtype as the default to store text data. This has two primary drawbacks. First, object dtype is not specific to strings: any Python object can be stored in an object-dtype array, not just strings, and seeing object as the dtype for a column with strings is confusing for users. Second, this is not always very efficient (both performance-wise and for memory usage).
Since pandas 1.0, an opt-in string data type has been available, but it has not yet been made the default, and it uses the pd.NA scalar to represent missing values.
Pandas 3.0 changes the default dtype for strings to a new string data type, a variant of the existing optional string data type but using NaN as the missing value indicator, to be consistent with the other default data types.
To improve performance, the new string data type will use the pyarrow package by default, if installed (and otherwise it uses object dtype under the hood as a fallback).
See PDEP-14: Dedicated string data type for pandas 3.0 for more background and details.
Brief introduction to the new default string dtype#
By default, pandas will infer this new string dtype instead of object dtype for string data (when creating pandas objects, such as in constructors or IO functions).
Being a default dtype means that the string dtype will be used in IO methods or constructors when the dtype is being inferred and the input is inferred to be string data:
>>> pd.Series(["a", "b", None])
0      a
1      b
2    NaN
dtype: str
It can also be specified explicitly using the "str" alias:
>>> pd.Series(["a", "b", None], dtype="str")
0      a
1      b
2    NaN
dtype: str
Similarly, functions like read_csv(), read_parquet(), and others will now use the new string dtype when reading string data.
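For example, a minimal sketch of reading CSV data under the new default (the exact dtypes display may vary slightly by version):
>>> from io import StringIO
>>> pd.read_csv(StringIO("col\na\nb")).dtypes
col    str
dtype: object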
In contrast to the current object dtype, the new string dtype will only store strings. This also means that it will raise an error if you try to store a non-string value in it (see below for more details).
Missing values with the new string dtype are always represented as NaN (np.nan), and the missing value behavior is similar to that of other default dtypes.
This new string dtype should otherwise behave the same as the existing object dtype users are used to. For example, all string-specific methods through the str accessor will work the same:
>>> ser = pd.Series(["a", "b", None], dtype="str")
>>> ser.str.upper()
0      A
1      B
2    NaN
dtype: str
Note
The new default string dtype is an instance of the pandas.StringDtype class. The dtype can be constructed as pd.StringDtype(na_value=np.nan), but for general usage we recommend using the shorter "str" alias.
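For illustration, a minimal sketch of constructing the dtype explicitly (which gives the same result as the "str" alias):
>>> import numpy as np
>>> pd.Series(["a", "b", None], dtype=pd.StringDtype(na_value=np.nan))
0      a
1      b
2    NaN
dtype: str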
Overview of behavior differences and how to address them#
The dtype is no longer a numpy “object” dtype#
When inferring or reading string data, the data type of the resulting DataFrame column or Series will silently start being the new "str" dtype instead of the numpy "object" dtype, and this can have some impact on your code.
The new string dtype is a pandas data type (“extension dtype”), and no longer a numpy np.dtype instance. Therefore, passing the dtype of a string column to numpy functions will no longer work (e.g. passing it to a dtype= argument of a numpy function, or using np.issubdtype to check the dtype).
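A minimal sketch illustrating the difference with an isinstance check:
>>> import numpy as np
>>> ser = pd.Series(["a", "b", "c"])             # inferred as "str" in pandas 3
>>> isinstance(ser.dtype, np.dtype)              # extension dtype, not a numpy dtype
False
>>> isinstance(pd.Series(["a"], dtype="object").dtype, np.dtype)
True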
Checking the dtype#
When checking the dtype, code might currently do something like:
>>> ser = pd.Series(["a", "b", "c"])
>>> ser.dtype == "object"
to check for columns with string data (by checking for the dtype being "object"). This will no longer work in pandas 3+, since ser.dtype will now be "str" with the new default string dtype, and the above check will return False.
To check for columns with string data, you should instead use:
>>> ser.dtype == "str"
How to write compatible code?
For code that should work on both pandas 2.x and 3.x, you can use the pandas.api.types.is_string_dtype() function:
>>> pd.api.types.is_string_dtype(ser.dtype)
True
This will return True for both the object dtype and the string dtypes.
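For example, the same check on an object-dtype column (a minimal sketch, mirroring the pandas 2.x situation):
>>> ser_obj = pd.Series(["a", "b", "c"], dtype="object")
>>> pd.api.types.is_string_dtype(ser_obj.dtype)
True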
Hardcoded use of object dtype#
If you have code where the dtype is hardcoded in constructors, like
>>> pd.Series(["a", "b", "c"], dtype="object")
this will keep using the object dtype. You will want to update this code to ensure you get the benefits of the new string dtype.
How to write compatible code?
First, in many cases it can be sufficient to remove the specific data type and let pandas do the inference. But if you want to be specific, you can specify the "str" dtype:
>>> pd.Series(["a", "b", "c"], dtype="str")
This is actually compatible with pandas 2.x as well, since in pandas < 3, dtype="str" was essentially treated as an alias for object dtype.
Attention
While using dtype="str" in constructors is compatible with pandas 2.x, specifying it as the dtype in astype() runs into the issue of also stringifying missing values in pandas 2.x. See the section astype(str) preserving missing values for more details.
For selecting string columns with select_dtypes() in a pandas 2.x and 3.x compatible way, it is not possible to use "str". While this works for pandas 3.x, it raises an error in pandas 2.x. As an alternative, you can select both object (for pandas 2.x) and "string" (for pandas 3.x; this will also select the default str dtype and does not error on pandas 2.x):
# can use ``include=["str"]`` for pandas >= 3
>>> df.select_dtypes(include=["object", "string"])
The missing value sentinel is now always NaN#
When using object dtype, multiple possible missing value sentinels are supported, including None and np.nan. With the new default string dtype, the missing value sentinel is always NaN (np.nan):
# with object dtype, None is preserved as None and seen as missing
>>> ser = pd.Series(["a", "b", None], dtype="object")
>>> ser
0       a
1       b
2    None
dtype: object
>>> print(ser[2])
None

# with the new string dtype, any missing value like None is coerced to NaN
>>> ser = pd.Series(["a", "b", None], dtype="str")
>>> ser
0      a
1      b
2    NaN
dtype: str
>>> print(ser[2])
nan
Generally this should be no problem when relying on missing value behavior in pandas methods (for example, ser.isna() will give the same result as before). But when you relied on the exact value of None being present, that can impact your code.
How to write compatible code?
When checking for a missing value, instead of checking for the exact value of None or np.nan, you should use the pandas.isna() function. This is the most robust way to check for missing values, as it will work regardless of the dtype and the exact missing value sentinel:
>>> pd.isna(ser[2])
True
One caveat: this function works both on scalars and on array-likes, and in the latter case it will return an array of bools. When using it in a Boolean context (for example, if pd.isna(..): ..), be sure to only pass a scalar to it.
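A minimal sketch showing the difference between the scalar and array-like cases:
>>> ser = pd.Series(["a", "b", None], dtype="str")
>>> pd.isna(ser[2])        # scalar input -> single boolean
True
>>> pd.isna(ser)           # array-like input -> boolean Series
0    False
1    False
2     True
dtype: bool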
“setitem” operations will now raise an error for non-string data#
With the new string dtype, any attempt to set a non-string value in a Series or DataFrame will raise an error:
>>> ser = pd.Series(["a", "b", None], dtype="str")
>>> ser[1] = 2.5
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
...
TypeError: Invalid value '2.5' for dtype 'str'. Value should be a string or missing value, got 'float' instead.
If you relied on the flexible nature of object dtype being able to hold any Python object, but your initial data was inferred as strings, your code might be impacted by this change.
How to write compatible code?
You can update your code to ensure you only set string values in such columns, or otherwise you can explicitly ensure the column has object dtype first. This can be done by specifying the dtype explicitly in the constructor, or by using the astype() method:
>>> ser = pd.Series(["a", "b", None], dtype="str")
>>> ser = ser.astype("object")
>>> ser[1] = 2.5
This astype("object") call will be redundant when using pandas 2.x, but this code will work for all versions.
Invalid unicode input#
Python allows having a built-in str object that represents invalid unicode data. And since the object dtype can hold any Python object, you can have a pandas Series with such invalid unicode data:
>>> ser = pd.Series(["\u2600", "\ud83d"], dtype=object)
>>> ser
0         ☀
1    \ud83d
dtype: object
However, when the string dtype uses pyarrow under the hood, it can only store valid unicode data, and it will otherwise raise an error:
>>> ser = pd.Series(["\u2600", "\ud83d"])
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
...
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed
If you want to keep the previous behaviour, you can explicitly specify dtype=object to keep working with object dtype.
When you have byte data that you want to convert to strings using decode(), the decode() method now has a dtype parameter so you can specify object dtype instead of the default string dtype for this use case.
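A minimal sketch of that use case (assuming a pandas version where decode() accepts the dtype parameter):
>>> ser_bytes = pd.Series([b"a", b"b"])
>>> ser_bytes.str.decode("utf-8", dtype=object)
0    a
1    b
dtype: object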
Series.values now returns an ExtensionArray#
With object dtype, using .values on a Series will return the underlying NumPy array:
>>> ser = pd.Series(["a", "b", np.nan], dtype="object")
>>> type(ser.values)
<class 'numpy.ndarray'>
However, with the new string dtype, the underlying ExtensionArray is returned instead:
>>> ser = pd.Series(["a", "b", pd.NA], dtype="str")
>>> ser.values
<ArrowStringArray>
['a', 'b', nan]
Length: 3, dtype: str
If your code requires a NumPy array, you should use Series.to_numpy():
>>> ser = pd.Series(["a", "b", pd.NA], dtype="str")
>>> ser.to_numpy()
array(['a', 'b', nan], dtype=object)
In general, you should always prefer Series.to_numpy() to get a NumPy array, or Series.array to get the ExtensionArray, over using Series.values.
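For example, Series.array returns the ExtensionArray directly (the same data that .values now returns):
>>> ser = pd.Series(["a", "b", pd.NA], dtype="str")
>>> ser.array
<ArrowStringArray>
['a', 'b', nan]
Length: 3, dtype: str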
Notable bug fixes#
astype(str) preserving missing values#
The stringifying of missing values is a long-standing “bug” or misfeature, as discussed in pandas-dev/pandas#25353, but fixing it introduces a significant behaviour change.
With pandas < 3, when using astype(str) or astype("str"), the operation would convert every element to a string, including the missing values:
# OLD behavior in pandas < 3
>>> ser = pd.Series([1.5, np.nan])
>>> ser
0    1.5
1    NaN
dtype: float64
>>> ser.astype("str")
0    1.5
1    nan
dtype: object
>>> ser.astype("str").to_numpy()
array(['1.5', 'nan'], dtype=object)
Note how NaN (np.nan) was converted to the string "nan". This was not the intended behavior, and it was inconsistent with how other dtypes handled missing values.
With pandas 3, this behavior has been fixed, and now astype("str") will cast to the new string dtype, which preserves the missing values:
# NEW behavior in pandas 3
>>> pd.options.future.infer_string = True
>>> ser = pd.Series([1.5, np.nan])
>>> ser.astype("str")
0    1.5
1    NaN
dtype: str
>>> ser.astype("str").to_numpy()
array(['1.5', nan], dtype=object)
If you want to preserve the old behaviour of converting every object to a string, you can use ser.map(str) instead. If you want to do such a conversion while preserving the missing values in a way that works with both pandas 2.x and 3.x, you can use ser.map(str, na_action="ignore") (for pandas 3.x only, you can do ser.astype("str")).
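A minimal sketch of the version-compatible conversion mentioned above (the same result on pandas 2.x and 3.x):
>>> import numpy as np
>>> ser = pd.Series([1.5, np.nan])
>>> ser.map(str, na_action="ignore").to_numpy()
array(['1.5', nan], dtype=object)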
If you want to convert to object or string dtype for pandas 2.x and 3.x, respectively, without needing to stringify each individual element, you will have to use a conditional check on the pandas version. For example, to convert a categorical Series with string categories to its dense non-categorical version with object or string dtype:
>>> import pandas as pd
>>> ser = pd.Series(["a", np.nan], dtype="category")
>>> ser.astype(object if pd.__version__ < "3" else "str")
prod() raising for string data#
In pandas < 3, calling the prod() method on a Series with string data would generally raise an error, except when the Series was empty or contained only a single string (potentially with missing values):
>>> ser = pd.Series(["a", None], dtype=object)
>>> ser.prod()
'a'
When the Series contains multiple strings, it will raise a TypeError. This behaviour stays the same in pandas 3 when using the flexible object dtype. But with the new string dtype, this now consistently raises an error regardless of the number of strings:
>>> ser = pd.Series(["a", None], dtype="str")
>>> ser.prod()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
...
TypeError: Cannot perform reduction 'prod' with string dtype