Nullable integer data type#
Note
IntegerArray is currently experimental. Its API or implementation may change without warning. It uses pandas.NA as the missing value.
In Working with missing data, we saw that pandas primarily uses NaN to represent missing data. Because NaN is a float, this forces an array of integers with any missing values to become floating point. In some cases, this may not matter much. But if your integer column is, say, an identifier, casting to float can be problematic. Some integers cannot even be represented as floating point numbers.
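The precision problem can be seen directly. The following is a minimal sketch, assuming NumPy and pandas are installed; the value `2**53 + 1` and the name `big_id` are chosen for illustration only, because that is the smallest positive integer a 64-bit float cannot store exactly.

```python
import numpy as np
import pandas as pd

# Illustrative identifier: 2**53 + 1 has no exact float64 representation.
big_id = 2**53 + 1

# Mixing the integer with NaN forces float64, which silently alters the value.
as_float = np.array([big_id, np.nan])
round_trip = int(as_float[0])
print(round_trip == big_id)  # False

# The nullable integer dtype keeps the integer exact next to a missing value.
nullable = pd.array([big_id, None], dtype="Int64")
print(int(nullable[0]) == big_id)  # True
```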
Construction#
pandas can represent integer data with possibly missing values using arrays.IntegerArray. This is an extension type implemented within pandas.
In [1]: arr = pd.array([1, 2, None], dtype=pd.Int64Dtype())

In [2]: arr
Out[2]:
<IntegerArray>
[1, 2, <NA>]
Length: 3, dtype: Int64
Or use the string alias "Int64" (note the capital "I") to differentiate it from NumPy's 'int64' dtype:
In [3]: pd.array([1, 2, np.nan], dtype="Int64")
Out[3]:
<IntegerArray>
[1, 2, <NA>]
Length: 3, dtype: Int64
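Because the two aliases differ only in case, it can help to check which dtype was actually produced. The snippet below is a small sketch; the variable names are illustrative, and the exact exception type raised by the lowercase alias is an assumption, so both likely candidates are caught.

```python
import pandas as pd

numpy_backed = pd.Series([1, 2], dtype="int64")  # NumPy int64
nullable = pd.Series([1, 2], dtype="Int64")      # pandas nullable Int64

print(str(numpy_backed.dtype))  # int64
print(str(nullable.dtype))      # Int64
print(isinstance(nullable.dtype, pd.Int64Dtype))  # True

# The lowercase (NumPy) alias has no way to store a missing value.
try:
    pd.Series([1, None], dtype="int64")
    lowercase_accepts_missing = True
except (TypeError, ValueError):
    lowercase_accepts_missing = False
```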
All NA-like values are replaced with pandas.NA.
In [4]: pd.array([1, 2, np.nan, None, pd.NA], dtype="Int64")
Out[4]:
<IntegerArray>
[1, 2, <NA>, <NA>, <NA>]
Length: 5, dtype: Int64
This array can be stored in a DataFrame or Series like any NumPy array.
In [5]: pd.Series(arr)
Out[5]:
0       1
1       2
2    <NA>
dtype: Int64
You can also pass the list-like object to the Series constructor with the dtype.
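As a brief sketch of passing a list straight to the constructor (the list contents and variable names are illustrative):

```python
import pandas as pd

values = [1, 2, None]

# Passing the dtype explicitly yields a Series backed by an IntegerArray.
s = pd.Series(values, dtype="Int64")

print(s.dtype)            # Int64
print(s.isna().tolist())  # [False, False, True]
print(isinstance(s.array, pd.arrays.IntegerArray))  # True
```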
Warning
Currently pandas.array() and pandas.Series() use different rules for dtype inference. pandas.array() will infer a nullable-integer dtype:
In [6]: pd.array([1, None])
Out[6]:
<IntegerArray>
[1, <NA>]
Length: 2, dtype: Int64

In [7]: pd.array([1, 2])
Out[7]:
<IntegerArray>
[1, 2]
Length: 2, dtype: Int64
For backwards compatibility, Series infers these as either integer or float dtype.
In [8]: pd.Series([1, None])
Out[8]:
0    1.0
1    NaN
dtype: float64

In [9]: pd.Series([1, 2])
Out[9]:
0    1
1    2
dtype: int64
We recommend explicitly providing the dtype to avoid confusion.
In [10]: pd.array([1, None], dtype="Int64")
Out[10]:
<IntegerArray>
[1, <NA>]
Length: 2, dtype: Int64

In [11]: pd.Series([1, None], dtype="Int64")
Out[11]:
0       1
1    <NA>
dtype: Int64
In the future, we may provide an option for Series to infer a nullable-integer dtype.
Operations#
Operations involving an integer array behave similarly to those on NumPy arrays. Missing values will be propagated, and the data will be coerced to another dtype if needed.
In [12]: s = pd.Series([1, 2, None], dtype="Int64")

# arithmetic
In [13]: s + 1
Out[13]:
0       2
1       3
2    <NA>
dtype: Int64

# comparison
In [14]: s == 1
Out[14]:
0     True
1    False
2     <NA>
dtype: boolean

# slicing operation
In [15]: s.iloc[1:3]
Out[15]:
1       2
2    <NA>
dtype: Int64

# operate with other dtypes
In [16]: s + s.iloc[1:3].astype("Int8")
Out[16]:
0    <NA>
1       4
2    <NA>
dtype: Int64

# coerce when needed
In [17]: s + 0.01
Out[17]:
0    1.01
1    2.01
2    <NA>
dtype: Float64
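Since a comparison propagates missing values into a nullable boolean result, one way to use that result for filtering is to decide explicitly how missing entries should count. This is a sketch rather than the only approach; treating NA as False via fillna is an assumption made for the example.

```python
import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")

# The comparison yields a nullable "boolean" Series containing <NA>.
mask = s == 1
print(str(mask.dtype))  # boolean

# Assumption for this example: a missing comparison result counts as False.
explicit_mask = mask.fillna(False)
selected = s[explicit_mask]
print(selected.tolist())  # [1]
```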
These dtypes can operate as part of a DataFrame.
In [18]: df = pd.DataFrame({"A": s, "B": [1, 1, 3], "C": list("aab")})

In [19]: df
Out[19]:
      A  B  C
0     1  1  a
1     2  1  a
2  <NA>  3  b

In [20]: df.dtypes
Out[20]:
A     Int64
B     int64
C    object
dtype: object
These dtypes can be merged, reshaped, and cast.
In [21]: pd.concat([df[["A"]], df[["B", "C"]]], axis=1).dtypes
Out[21]:
A     Int64
B     int64
C    object
dtype: object

In [22]: df["A"].astype(float)
Out[22]:
0    1.0
1    2.0
2    NaN
Name: A, dtype: float64
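Casting to float works because NaN can stand in for the missing value; a NumPy integer dtype has no such placeholder. The sketch below illustrates one way to handle that case. Filling with 0 is an arbitrary choice made for the example, not a recommendation, and both likely exception types are caught because the exact one is assumed.

```python
import pandas as pd

a = pd.Series([1, 2, None], dtype="Int64")

# Casting directly to NumPy int64 fails while a missing value is present.
try:
    a.astype("int64")
    cast_succeeded = True
except (ValueError, TypeError):
    cast_succeeded = False
print(cast_succeeded)  # False

# Assumption for illustration: replace missing values with 0 before casting.
filled = a.fillna(0).astype("int64")
print(filled.tolist())  # [1, 2, 0]
```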
Reduction and groupby operations such as sum() work as well.
In [23]: df.sum(numeric_only=True)
Out[23]:
A    3
B    5
dtype: Int64

In [24]: df.sum()
Out[24]:
A      3
B      5
C    aab
dtype: object

In [25]: df.groupby("B").A.sum()
Out[25]:
B
1    3
3    0
Name: A, dtype: Int64
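The 0 reported for the all-missing group reflects that missing values are skipped by default. A short sketch of how the standard skipna and min_count arguments of sum() change that behavior on a nullable integer Series (variable names are illustrative):

```python
import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")

# Missing values are skipped by default.
total = s.sum()
print(total)  # 3

# With skipna=False, any missing value makes the result missing.
strict_total = s.sum(skipna=False)
print(strict_total is pd.NA)  # True

# Summing only missing values gives 0 unless min_count requires real data.
all_missing = s[s.isna()]
default_sum = all_missing.sum()
guarded_sum = all_missing.sum(min_count=1)
print(default_sum)           # 0
print(guarded_sum is pd.NA)  # True
```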
Scalar NA Value#
arrays.IntegerArray uses pandas.NA as its scalar missing value. Indexing a single element that's missing will return pandas.NA:
In [26]: a = pd.array([1, None], dtype="Int64")

In [27]: a[1]
Out[27]: <NA>
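Because pandas.NA propagates through comparisons and has no truth value, code that inspects a single element usually needs an explicit missing-value check. A minimal sketch, with illustrative variable names:

```python
import pandas as pd

a = pd.array([1, None], dtype="Int64")
missing = a[1]

# pandas.NA is a singleton, so identity checks and pd.isna() both work.
print(missing is pd.NA)  # True
print(pd.isna(missing))  # True

# Comparisons with pandas.NA propagate the missing value.
print((missing == 1) is pd.NA)  # True

# Using pandas.NA in a boolean context is ambiguous and raises TypeError.
try:
    bool(missing)
    truth_value_is_ambiguous = False
except TypeError:
    truth_value_is_ambiguous = True
print(truth_value_is_ambiguous)  # True
```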