# Frequently Asked Questions (FAQ)
## DataFrame memory usage

The memory usage of a `DataFrame` (including the index) is shown when calling `info()`. A configuration option, `display.memory_usage` (see the list of options), specifies whether the `DataFrame` memory usage will be displayed when invoking the `info()` method.
For example, the memory usage of the `DataFrame` below is shown when calling `info()`:
```
In [1]: dtypes = [
   ...:     "int64",
   ...:     "float64",
   ...:     "datetime64[ns]",
   ...:     "timedelta64[ns]",
   ...:     "complex128",
   ...:     "object",
   ...:     "bool",
   ...: ]

In [2]: n = 5000

In [3]: data = {t: np.random.randint(100, size=n).astype(t) for t in dtypes}

In [4]: df = pd.DataFrame(data)

In [5]: df["categorical"] = df["object"].astype("category")

In [6]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   int64           5000 non-null   int64
 1   float64         5000 non-null   float64
 2   datetime64[ns]  5000 non-null   datetime64[ns]
 3   timedelta64[ns] 5000 non-null   timedelta64[ns]
 4   complex128      5000 non-null   complex128
 5   object          5000 non-null   object
 6   bool            5000 non-null   bool
 7   categorical     5000 non-null   category
dtypes: bool(1), category(1), complex128(1), datetime64[ns](1), float64(1), int64(1), object(1), timedelta64[ns](1)
memory usage: 288.2+ KB
```
The `+` symbol indicates that the true memory usage could be higher, because pandas does not count the memory used by values in columns with `dtype=object`.
Passing `memory_usage='deep'` will enable a more accurate memory usage report, accounting for the full usage of the contained objects. This is optional as it can be expensive to do this deeper introspection.
```
In [7]: df.info(memory_usage="deep")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   int64           5000 non-null   int64
 1   float64         5000 non-null   float64
 2   datetime64[ns]  5000 non-null   datetime64[ns]
 3   timedelta64[ns] 5000 non-null   timedelta64[ns]
 4   complex128      5000 non-null   complex128
 5   object          5000 non-null   object
 6   bool            5000 non-null   bool
 7   categorical     5000 non-null   category
dtypes: bool(1), category(1), complex128(1), datetime64[ns](1), float64(1), int64(1), object(1), timedelta64[ns](1)
memory usage: 424.7 KB
```
By default the display option is set to `True` but can be explicitly overridden by passing the `memory_usage` argument when invoking `info()`.
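For instance, both routes can be combined in one session (a minimal sketch; it captures the `info()` output in a buffer only so the difference is easy to inspect programmatically):

```python
import io

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Suppress the memory-usage line globally via the display option ...
pd.set_option("display.memory_usage", False)
buf = io.StringIO()
df.info(buf=buf)
without_usage = buf.getvalue()  # no "memory usage:" line

# ... or restore the default and request the deep report per call
pd.set_option("display.memory_usage", True)
buf = io.StringIO()
df.info(memory_usage="deep", buf=buf)
with_deep_usage = buf.getvalue()  # includes "memory usage:"
```

The per-call `memory_usage` argument always wins over the display option for that invocation.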
The memory usage of each column can be found by calling the `memory_usage()` method. This returns a `Series` with an index represented by column names and the memory usage of each column shown in bytes. For the `DataFrame` above, the memory usage of each column and the total memory usage can be found with the `memory_usage()` method:
```
In [8]: df.memory_usage()
Out[8]:
Index                128
int64              40000
float64            40000
datetime64[ns]     40000
timedelta64[ns]    40000
complex128         80000
object             40000
bool                5000
categorical         9968
dtype: int64

# total memory usage of dataframe
In [9]: df.memory_usage().sum()
Out[9]: 295096
```
By default the memory usage of the `DataFrame` index is shown in the returned `Series`; the memory usage of the index can be suppressed by passing the `index=False` argument:
```
In [10]: df.memory_usage(index=False)
Out[10]:
int64             40000
float64           40000
datetime64[ns]    40000
timedelta64[ns]   40000
complex128        80000
object            40000
bool               5000
categorical        9968
dtype: int64
```
The memory usage displayed by the `info()` method utilizes the `memory_usage()` method to determine the memory usage of a `DataFrame` while also formatting the output in human-readable units (base-2 representation; i.e., 1 KB = 1024 bytes).
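As a rough cross-check (a sketch; the exact byte counts depend on your platform and data), the KB figure reported by `info()` can be reproduced by summing `memory_usage()` and dividing by 1024:

```python
import pandas as pd

df = pd.DataFrame({"a": range(1000), "b": [str(i) for i in range(1000)]})

# deep=True counts the actual object memory, matching info(memory_usage="deep")
total_bytes = df.memory_usage(deep=True).sum()
total_kb = total_bytes / 1024  # base-2 units, as used by info()
print(f"{total_kb:.1f} KB")
```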
See also Categorical Memory Usage.
## Using if/truth statements with pandas

pandas follows the NumPy convention of raising an error when you try to convert something to a `bool`. This happens in an `if` statement or when using the boolean operations `and`, `or`, and `not`. It is not clear what the result of the following code should be:
```
>>> if pd.Series([False, True, False]):
...     pass
```
Should it be `True` because it's not zero-length, or `False` because there are `False` values? It is unclear, so instead, pandas raises a `ValueError`:
```
In [11]: if pd.Series([False, True, False]):
   ....:     print("I was true")
   ....:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-5c782b38cd2f> in ?()
----> 1 if pd.Series([False, True, False]):
      2     print("I was true")

~/work/pandas/pandas/pandas/core/generic.py in ?(self)
   1575 @final
   1576 def __nonzero__(self) -> NoReturn:
-> 1577     raise ValueError(
   1578         f"The truth value of a {type(self).__name__} is ambiguous. "
   1579         "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
   1580     )

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
```
You need to explicitly choose what you want to do with the `DataFrame`, e.g. use `any()`, `all()` or `empty`. Alternatively, you might want to compare if the pandas object is `None`:
```
In [12]: if pd.Series([False, True, False]) is not None:
   ....:     print("I was not None")
   ....:
I was not None
```
Below is how to check if any of the values are `True`:
```
In [13]: if pd.Series([False, True, False]).any():
   ....:     print("I am any")
   ....:
I am any
```
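Likewise, `all()` is the explicit choice when every element must be `True`, for example when comparing two `Series` element-wise, and `empty` checks for zero length (a minimal sketch):

```python
import pandas as pd

s1 = pd.Series([1, 2, 3])
s2 = pd.Series([1, 2, 3])

# (s1 == s2) is an element-wise boolean Series; reduce it explicitly
if (s1 == s2).all():
    print("all values are equal")

# To branch on zero length, use the .empty attribute
print(pd.Series([], dtype="float64").empty)  # True
```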
### Bitwise boolean

Bitwise boolean operators like `==` and `!=` return a boolean `Series` which performs an element-wise comparison when compared to a scalar.
```
In [14]: s = pd.Series(range(5))

In [15]: s == 4
Out[15]:
0    False
1    False
2    False
3    False
4     True
dtype: bool
```
See boolean comparisons for more examples.
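Whereas the Python keywords `and`, `or`, and `not` would raise the `ValueError` shown above, the element-wise equivalents are the operators `&`, `|`, and `~` (a brief sketch):

```python
import pandas as pd

s = pd.Series(range(5))

# Element-wise boolean combinations; parentheses are required because
# & and | bind more tightly than the comparison operators
mask = (s > 1) & (s < 4)       # "and", element-wise
either = (s == 0) | (s == 4)   # "or", element-wise
negated = ~(s == 2)            # "not", element-wise

print(s[mask].tolist())  # [2, 3]
```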
### Using the `in` operator

Using the Python `in` operator on a `Series` tests for membership in the index, not membership among the values.
```
In [16]: s = pd.Series(range(5), index=list("abcde"))

In [17]: 2 in s
Out[17]: False

In [18]: 'b' in s
Out[18]: True
```
If this behavior is surprising, keep in mind that using `in` on a Python dictionary tests keys, not values, and `Series` are dict-like. To test for membership in the values, use the method `isin()`:
```
In [19]: s.isin([2])
Out[19]:
a    False
b    False
c     True
d    False
e    False
dtype: bool

In [20]: s.isin([2]).any()
Out[20]: True
```
For `DataFrame`, likewise, `in` applies to the column axis, testing for membership in the list of column names.
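A short sketch of the `DataFrame` case, using `isin()` for value membership as above:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

print("a" in df)  # True: "a" is a column label
print(1 in df)    # False: 1 is a value, not a column label

# To test membership among the values instead, reduce an isin() mask
print(df.isin([1]).any().any())  # True
```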
## Mutating with User Defined Function (UDF) methods

This section applies to pandas methods that take a UDF, in particular `DataFrame.apply()`, `DataFrame.aggregate()`, `DataFrame.transform()`, and `DataFrame.filter()`.
It is a general rule in programming that one should not mutate a container while it is being iterated over. Mutation will invalidate the iterator, causing unexpected behavior. Consider the example:
```
In [21]: values = [0, 1, 2, 3, 4, 5]

In [22]: n_removed = 0

In [23]: for k, value in enumerate(values):
   ....:     idx = k - n_removed
   ....:     if value % 2 == 1:
   ....:         del values[idx]
   ....:         n_removed += 1
   ....:     else:
   ....:         values[idx] = value + 1
   ....:

In [24]: values
Out[24]: [1, 4, 5]
```
One probably would have expected that the result would be `[1, 3, 5]`. When using a pandas method that takes a UDF, internally pandas is often iterating over the `DataFrame` or other pandas object. Therefore, if the UDF mutates (changes) the `DataFrame`, unexpected behavior can arise.
Here is a similar example with `DataFrame.apply()`:
```
In [25]: def f(s):
   ....:     s.pop("a")
   ....:     return s
   ....:

In [26]: df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

In [27]: df.apply(f, axis="columns")
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/work/pandas/pandas/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key)
   3804 try:
-> 3805     return self._engine.get_loc(casted_key)
   3806 except KeyError as err:

File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'a'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[27], line 1
----> 1 df.apply(f, axis="columns")

File ~/work/pandas/pandas/pandas/core/frame.py:10374, in DataFrame.apply(self, func, axis, raw, result_type, args, by_row, engine, engine_kwargs, **kwargs)
  10360 from pandas.core.apply import frame_apply
  10362 op = frame_apply(
  10363     self,
  10364     func=func,
   (...)
  10372     kwargs=kwargs,
  10373 )
> 10374 return op.apply().__finalize__(self, method="apply")

File ~/work/pandas/pandas/pandas/core/apply.py:916, in FrameApply.apply(self)
    913 elif self.raw:
    914     return self.apply_raw(engine=self.engine, engine_kwargs=self.engine_kwargs)
--> 916 return self.apply_standard()

File ~/work/pandas/pandas/pandas/core/apply.py:1063, in FrameApply.apply_standard(self)
   1061 def apply_standard(self):
   1062     if self.engine == "python":
-> 1063         results, res_index = self.apply_series_generator()
   1064     else:
   1065         results, res_index = self.apply_series_numba()

File ~/work/pandas/pandas/pandas/core/apply.py:1081, in FrameApply.apply_series_generator(self)
   1078 with option_context("mode.chained_assignment", None):
   1079     for i, v in enumerate(series_gen):
   1080         # ignore SettingWithCopy here in case the user mutates
-> 1081         results[i] = self.func(v, *self.args, **self.kwargs)
   1082         if isinstance(results[i], ABCSeries):
   1083             # If we have a view on v, we need to make a copy because
   1084             # series_generator will swap out the underlying data
   1085             results[i] = results[i].copy(deep=False)

Cell In[25], line 2, in f(s)
      1 def f(s):
----> 2     s.pop("a")
      3     return s

File ~/work/pandas/pandas/pandas/core/series.py:5391, in Series.pop(self, item)
   5366 def pop(self, item: Hashable) -> Any:
   5367     """
   5368     Return item and drops from series. Raise KeyError if not found.
   5369     (...)
   5389     dtype: int64
   5390     """
-> 5391     return super().pop(item=item)

File ~/work/pandas/pandas/pandas/core/generic.py:947, in NDFrame.pop(self, item)
    946 def pop(self, item: Hashable) -> Series | Any:
--> 947     result = self[item]
    948     del self[item]
    950     return result

File ~/work/pandas/pandas/pandas/core/series.py:1121, in Series.__getitem__(self, key)
   1118     return self._values[key]
   1120 elif key_is_scalar:
-> 1121     return self._get_value(key)
   1123 # Convert generator to list before going through hashable part
   1124 # (We will iterate through the generator there to check for slices)
   1125 if is_iterator(key):

File ~/work/pandas/pandas/pandas/core/series.py:1237, in Series._get_value(self, label, takeable)
   1234     return self._values[label]
   1236 # Similar to Index.get_value, but we do not fall back to positional
-> 1237 loc = self.index.get_loc(label)
   1239 if is_integer(loc):
   1240     return self._values[loc]

File ~/work/pandas/pandas/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
   3807     if isinstance(casted_key, slice) or (
   3808         isinstance(casted_key, abc.Iterable)
   3809         and any(isinstance(x, slice) for x in casted_key)
   3810     ):
   3811         raise InvalidIndexError(key)
-> 3812     raise KeyError(key) from err
   3813 except TypeError:
   3814     # If we have a listlike key, _check_indexing_error will raise
   3815     # InvalidIndexError. Otherwise we fall through and re-raise
   3816     # the TypeError.
   3817     self._check_indexing_error(key)

KeyError: 'a'
```
To resolve this issue, one can make a copy so that the mutation does not apply to the container being iterated over.
```
In [28]: values = [0, 1, 2, 3, 4, 5]

In [29]: n_removed = 0

In [30]: for k, value in enumerate(values.copy()):
   ....:     idx = k - n_removed
   ....:     if value % 2 == 1:
   ....:         del values[idx]
   ....:         n_removed += 1
   ....:     else:
   ....:         values[idx] = value + 1
   ....:

In [31]: values
Out[31]: [1, 3, 5]
```
```
In [32]: def f(s):
   ....:     s = s.copy()
   ....:     s.pop("a")
   ....:     return s
   ....:

In [33]: df = pd.DataFrame({"a": [1, 2, 3], 'b': [4, 5, 6]})

In [34]: df.apply(f, axis="columns")
Out[34]:
   b
0  4
1  5
2  6
```
## Missing value representation for NumPy types

### `np.nan` as the `NA` representation for NumPy types

For lack of `NA` (missing) support from the ground up in NumPy and Python in general, `NA` could have been represented with:
- A masked array solution: an array of data and an array of boolean values indicating whether a value is there or is missing.
- Using a special sentinel value, bit pattern, or set of sentinel values to denote `NA` across the dtypes.
The special value `np.nan` (Not-A-Number) was chosen as the `NA` value for NumPy types, and there are API functions like `DataFrame.isna()` and `DataFrame.notna()` which can be used across the dtypes to detect NA values. However, this choice has the downside of coercing missing integer data to float types, as shown in Support for integer NA.
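For example, `isna()` and `notna()` detect `np.nan` regardless of the surrounding dtype (a minimal sketch):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# Detect missing values element-wise
print(s.isna().tolist())  # [False, True, False]
print(s.notna().sum())    # 2
```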
### `NA` type promotions for NumPy types

When introducing NAs into an existing `Series` or `DataFrame` via `reindex()` or some other means, boolean and integer types will be promoted to a different dtype in order to store the NAs. The promotions are summarized in this table:
| Typeclass | Promotion dtype for storing NAs |
|---|---|
| `floating` | no change |
| `object` | no change |
| `integer` | cast to `float64` |
| `boolean` | cast to `object` |
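The promotions in the table can be observed directly (a quick sketch): introducing an NA via `reindex()` promotes a boolean `Series` to `object` and an integer `Series` to `float64`:

```python
import pandas as pd

b = pd.Series([True, False])
print(b.dtype)                    # bool
print(b.reindex(range(3)).dtype)  # object: NaN cannot live in a bool array

i = pd.Series([1, 2])
print(i.dtype)                    # int64
print(i.reindex(range(3)).dtype)  # float64: NaN forces the float cast
```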
### Support for integer `NA`

In the absence of high-performance `NA` support being built into NumPy from the ground up, the primary casualty is the ability to represent NAs in integer arrays. For example:
```
In [35]: s = pd.Series([1, 2, 3, 4, 5], index=list("abcde"))

In [36]: s
Out[36]:
a    1
b    2
c    3
d    4
e    5
dtype: int64

In [37]: s.dtype
Out[37]: dtype('int64')

In [38]: s2 = s.reindex(["a", "b", "c", "f", "u"])

In [39]: s2
Out[39]:
a    1.0
b    2.0
c    3.0
f    NaN
u    NaN
dtype: float64

In [40]: s2.dtype
Out[40]: dtype('float64')
```
This trade-off is made largely for memory and performance reasons, and also so that the resulting `Series` continues to be "numeric".
If you need to represent integers with possibly missing values, use one of the nullable-integer extension dtypes provided by pandas or pyarrow:
```
In [41]: s_int = pd.Series([1, 2, 3, 4, 5], index=list("abcde"), dtype=pd.Int64Dtype())

In [42]: s_int
Out[42]:
a    1
b    2
c    3
d    4
e    5
dtype: Int64

In [43]: s_int.dtype
Out[43]: Int64Dtype()

In [44]: s2_int = s_int.reindex(["a", "b", "c", "f", "u"])

In [45]: s2_int
Out[45]:
a       1
b       2
c       3
f    <NA>
u    <NA>
dtype: Int64

In [46]: s2_int.dtype
Out[46]: Int64Dtype()

In [47]: s_int_pa = pd.Series([1, 2, None], dtype="int64[pyarrow]")

In [48]: s_int_pa
Out[48]:
0       1
1       2
2    <NA>
dtype: int64[pyarrow]
```
See Nullable integer data type and PyArrow Functionality for more.
### Why not make NumPy like R?

Many people have suggested that NumPy should simply emulate the `NA` support present in the more domain-specific statistical programming language R. Part of the reason is the NumPy type hierarchy.
The R language, by contrast, only has a handful of built-in data types: `integer`, `numeric` (floating-point), `character`, and `boolean`. `NA` types are implemented by reserving special bit patterns for each type to be used as the missing value. While doing this with the full NumPy type hierarchy would be possible, it would be a more substantial trade-off (especially for the 8- and 16-bit data types) and implementation undertaking.
However, R `NA` semantics are now available by using masked NumPy types such as `Int64Dtype` or PyArrow types (`ArrowDtype`).
## Differences with NumPy

For `Series` and `DataFrame` objects, `var()` normalizes by `N-1` to produce unbiased estimates of the population variance, while NumPy's `numpy.var()` normalizes by `N`, which measures the variance of the sample. Note that `cov()` normalizes by `N-1` in both pandas and NumPy.
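Concretely, pandas' `var()` corresponds to `numpy.var(..., ddof=1)` (a small sketch):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])
# squared deviations from the mean 2.5 sum to 5.0

print(s.var())                        # 5.0 / 3, normalized by N-1
print(np.var(s.to_numpy()))           # 5.0 / 4 = 1.25, normalized by N
print(np.var(s.to_numpy(), ddof=1))   # matches s.var()
```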
## Thread-safety

pandas is not 100% thread safe. The known issues relate to the `copy()` method. If you are doing a lot of copying of `DataFrame` objects shared among threads, we recommend holding locks inside the threads where the data copying occurs.
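One way to follow that recommendation (a hypothetical sketch; the lock name and worker structure are illustrative, not a pandas API):

```python
import threading

import pandas as pd

df = pd.DataFrame({"a": range(1000)})
copy_lock = threading.Lock()  # hypothetical: serializes copies of the shared frame

def worker(results, i):
    # Hold the lock only while copying the shared DataFrame ...
    with copy_lock:
        local = df.copy()
    # ... then operate on the private copy lock-free
    results[i] = local["a"].sum()

results = [None] * 4
threads = [threading.Thread(target=worker, args=(results, i)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```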
See this link for more information.
## Byte-ordering issues

Occasionally you may have to deal with data that were created on a machine with a different byte order than the one on which you are running Python. A common symptom of this issue is an error like:
```
Traceback
    ...
ValueError: Big-endian buffer not supported on little-endian compiler
```
To deal with this issue you should convert the underlying NumPy array to the native system byte order *before* passing it to `Series` or `DataFrame` constructors, using something similar to the following:
```
In [49]: x = np.array(list(range(10)), ">i4")  # big endian

In [50]: newx = x.byteswap().view(x.dtype.newbyteorder())  # force native byteorder

In [51]: s = pd.Series(newx)
```
See the NumPy documentation on byte order for more details.