# Frequently Asked Questions (FAQ)
## DataFrame memory usage

The memory usage of a `DataFrame` (including the index) is shown when calling `info()`. A configuration option, `display.memory_usage` (see the list of options), specifies whether the `DataFrame` memory usage will be displayed when invoking the `info()` method.
For example, the memory usage of the `DataFrame` below is shown when calling `info()`:
```
In [1]: dtypes = [
   ...:     "int64",
   ...:     "float64",
   ...:     "datetime64[ns]",
   ...:     "timedelta64[ns]",
   ...:     "complex128",
   ...:     "object",
   ...:     "bool",
   ...: ]

In [2]: n = 5000

In [3]: data = {t: np.random.randint(100, size=n).astype(t) for t in dtypes}

In [4]: df = pd.DataFrame(data)

In [5]: df["categorical"] = df["object"].astype("category")

In [6]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   int64           5000 non-null   int64
 1   float64         5000 non-null   float64
 2   datetime64[ns]  5000 non-null   datetime64[ns]
 3   timedelta64[ns] 5000 non-null   timedelta64[ns]
 4   complex128      5000 non-null   complex128
 5   object          5000 non-null   object
 6   bool            5000 non-null   bool
 7   categorical     5000 non-null   category
dtypes: bool(1), category(1), complex128(1), datetime64[ns](1), float64(1), int64(1), object(1), timedelta64[ns](1)
memory usage: 288.2+ KB
```
The `+` symbol indicates that the true memory usage could be higher, because pandas does not count the memory used by values in columns with `dtype=object`.
Passing `memory_usage='deep'` will enable a more accurate memory usage report, accounting for the full usage of the contained objects. This is optional as it can be expensive to do this deeper introspection.
```
In [7]: df.info(memory_usage="deep")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   int64           5000 non-null   int64
 1   float64         5000 non-null   float64
 2   datetime64[ns]  5000 non-null   datetime64[ns]
 3   timedelta64[ns] 5000 non-null   timedelta64[ns]
 4   complex128      5000 non-null   complex128
 5   object          5000 non-null   object
 6   bool            5000 non-null   bool
 7   categorical     5000 non-null   category
dtypes: bool(1), category(1), complex128(1), datetime64[ns](1), float64(1), int64(1), object(1), timedelta64[ns](1)
memory usage: 424.7 KB
```
By default the display option is set to `True` but can be explicitly overridden by passing the `memory_usage` argument when invoking `info()`.
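For instance, both routes can be combined in one session (a minimal sketch; it captures the `info()` output in a buffer only so the difference is easy to inspect programmatically):

```python
import io

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Suppress the memory-usage line globally via the display option ...
pd.set_option("display.memory_usage", False)
buf = io.StringIO()
df.info(buf=buf)
without_usage = buf.getvalue()  # no "memory usage:" line

# ... or restore the default and request the deep report per call
pd.set_option("display.memory_usage", True)
buf = io.StringIO()
df.info(memory_usage="deep", buf=buf)
with_deep_usage = buf.getvalue()  # includes "memory usage:"
```

The per-call `memory_usage` argument always wins over the display option for that invocation.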
The memory usage of each column can be found by calling the `memory_usage()` method. This returns a `Series` with an index represented by column names and the memory usage of each column shown in bytes. For the `DataFrame` above, the memory usage of each column and the total memory usage can be found with the `memory_usage()` method:
```
In [8]: df.memory_usage()
Out[8]:
Index                128
int64              40000
float64            40000
datetime64[ns]     40000
timedelta64[ns]    40000
complex128         80000
object             40000
bool                5000
categorical         9968
dtype: int64

# total memory usage of dataframe
In [9]: df.memory_usage().sum()
Out[9]: 295096
```
By default the memory usage of the `DataFrame` index is shown in the returned `Series`; the memory usage of the index can be suppressed by passing the `index=False` argument:
```
In [10]: df.memory_usage(index=False)
Out[10]:
int64             40000
float64           40000
datetime64[ns]    40000
timedelta64[ns]   40000
complex128        80000
object            40000
bool               5000
categorical        9968
dtype: int64
```
The memory usage displayed by the `info()` method utilizes the `memory_usage()` method to determine the memory usage of a `DataFrame` while also formatting the output in human-readable units (base-2 representation; i.e., 1 KB = 1024 bytes).
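As a rough cross-check (a sketch; the exact byte counts depend on your platform and data), the KB figure reported by `info()` can be reproduced by summing `memory_usage()` and dividing by 1024:

```python
import pandas as pd

df = pd.DataFrame({"a": range(1000), "b": [str(i) for i in range(1000)]})

# deep=True counts the actual object memory, matching info(memory_usage="deep")
total_bytes = df.memory_usage(deep=True).sum()
total_kb = total_bytes / 1024  # base-2 units, as used by info()
print(f"{total_kb:.1f} KB")
```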
See also Categorical Memory Usage.
## Using if/truth statements with pandas

pandas follows the NumPy convention of raising an error when you try to convert something to a `bool`. This happens in an `if` statement or when using the boolean operations `and`, `or`, and `not`. It is not clear what the result of the following code should be:
```
>>> if pd.Series([False, True, False]):
...     pass
```
Should it be `True` because it's not zero-length, or `False` because there are `False` values? It is unclear, so instead, pandas raises a `ValueError`:
```
In [11]: if pd.Series([False, True, False]):
   ....:     print("I was true")
   ....:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-5c782b38cd2f> in ?()
----> 1 if pd.Series([False, True, False]):
      2     print("I was true")

~/work/pandas/pandas/pandas/core/generic.py in ?(self)
   1575 @final
   1576 def __nonzero__(self) -> NoReturn:
-> 1577     raise ValueError(
   1578         f"The truth value of a {type(self).__name__} is ambiguous. "
   1579         "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
   1580     )

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
```
You need to explicitly choose what you want to do with the `DataFrame`, e.g. use `any()`, `all()` or `empty`. Alternatively, you might want to compare if the pandas object is `None`:
```
In [12]: if pd.Series([False, True, False]) is not None:
   ....:     print("I was not None")
   ....:
I was not None
```
Below is how to check if any of the values are `True`:
```
In [13]: if pd.Series([False, True, False]).any():
   ....:     print("I am any")
   ....:
I am any
```
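Likewise, `all()` is the explicit choice when every element must be `True`, for example when comparing two `Series` element-wise, and `empty` checks for zero length (a minimal sketch):

```python
import pandas as pd

s1 = pd.Series([1, 2, 3])
s2 = pd.Series([1, 2, 3])

# (s1 == s2) is an element-wise boolean Series; reduce it explicitly
if (s1 == s2).all():
    print("all values are equal")

# To branch on zero length, use the .empty attribute
print(pd.Series([], dtype="float64").empty)  # True
```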
### Bitwise boolean

Bitwise boolean operators like `==` and `!=` return a boolean `Series` which performs an element-wise comparison when compared to a scalar.
```
In [14]: s = pd.Series(range(5))

In [15]: s == 4
Out[15]:
0    False
1    False
2    False
3    False
4     True
dtype: bool
```
See boolean comparisons for more examples.
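Whereas the Python keywords `and`, `or`, and `not` would raise the `ValueError` shown above, the element-wise equivalents are the operators `&`, `|`, and `~` (a brief sketch):

```python
import pandas as pd

s = pd.Series(range(5))

# Element-wise boolean combinations; parentheses are required because
# & and | bind more tightly than the comparison operators
mask = (s > 1) & (s < 4)       # "and", element-wise
either = (s == 0) | (s == 4)   # "or", element-wise
negated = ~(s == 2)            # "not", element-wise

print(s[mask].tolist())  # [2, 3]
```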
### Using the `in` operator

Using the Python `in` operator on a `Series` tests for membership in the index, not membership among the values.
```
In [16]: s = pd.Series(range(5), index=list("abcde"))

In [17]: 2 in s
Out[17]: False

In [18]: 'b' in s
Out[18]: True
```
If this behavior is surprising, keep in mind that using `in` on a Python dictionary tests keys, not values, and `Series` are dict-like. To test for membership in the values, use the method `isin()`:
```
In [19]: s.isin([2])
Out[19]:
a    False
b    False
c     True
d    False
e    False
dtype: bool

In [20]: s.isin([2]).any()
Out[20]: True
```
For `DataFrame`, likewise, `in` applies to the column axis, testing for membership in the list of column names.
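A short sketch of the `DataFrame` case, using `isin()` for value membership as above:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

print("a" in df)  # True: "a" is a column label
print(1 in df)    # False: 1 is a value, not a column label

# To test membership among the values instead, reduce an isin() mask
print(df.isin([1]).any().any())  # True
```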
## Mutating with User Defined Function (UDF) methods

This section applies to pandas methods that take a UDF, in particular `DataFrame.apply()`, `DataFrame.aggregate()`, `DataFrame.transform()`, and `DataFrame.filter()`.
It is a general rule in programming that one should not mutate a container while it is being iterated over. Mutation will invalidate the iterator, causing unexpected behavior. Consider the example:
```
In [21]: values = [0, 1, 2, 3, 4, 5]

In [22]: n_removed = 0

In [23]: for k, value in enumerate(values):
   ....:     idx = k - n_removed
   ....:     if value % 2 == 1:
   ....:         del values[idx]
   ....:         n_removed += 1
   ....:     else:
   ....:         values[idx] = value + 1
   ....:

In [24]: values
Out[24]: [1, 4, 5]
```
One probably would have expected that the result would be `[1, 3, 5]`. When using a pandas method that takes a UDF, internally pandas is often iterating over the `DataFrame` or other pandas object. Therefore, if the UDF mutates (changes) the `DataFrame`, unexpected behavior can arise.
Here is a similar example with `DataFrame.apply()`:
```
In [25]: def f(s):
   ....:     s.pop("a")
   ....:     return s
   ....:

In [26]: df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

In [27]: df.apply(f, axis="columns")
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/work/pandas/pandas/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key)
   3804 try:
-> 3805     return self._engine.get_loc(casted_key)
   3806 except KeyError as err:

File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'a'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[27], line 1
----> 1 df.apply(f, axis="columns")

File ~/work/pandas/pandas/pandas/core/frame.py:10374, in DataFrame.apply(self, func, axis, raw, result_type, args, by_row, engine, engine_kwargs, **kwargs)
  10360 from pandas.core.apply import frame_apply
  10362 op = frame_apply(
  10363     self,
  10364     func=func,
   (...)
  10372     kwargs=kwargs,
  10373 )
> 10374 return op.apply().__finalize__(self, method="apply")

File ~/work/pandas/pandas/pandas/core/apply.py:916, in FrameApply.apply(self)
    913 elif self.raw:
    914     return self.apply_raw(engine=self.engine, engine_kwargs=self.engine_kwargs)
--> 916 return self.apply_standard()

File ~/work/pandas/pandas/pandas/core/apply.py:1063, in FrameApply.apply_standard(self)
   1061 def apply_standard(self):
   1062     if self.engine == "python":
-> 1063         results, res_index = self.apply_series_generator()
   1064     else:
   1065         results, res_index = self.apply_series_numba()

File ~/work/pandas/pandas/pandas/core/apply.py:1081, in FrameApply.apply_series_generator(self)
   1078 with option_context("mode.chained_assignment", None):
   1079     for i, v in enumerate(series_gen):
   1080         # ignore SettingWithCopy here in case the user mutates
-> 1081         results[i] = self.func(v, *self.args, **self.kwargs)
   1082         if isinstance(results[i], ABCSeries):
   1083             # If we have a view on v, we need to make a copy because
   1084             # series_generator will swap out the underlying data
   1085             results[i] = results[i].copy(deep=False)

Cell In[25], line 2, in f(s)
      1 def f(s):
----> 2     s.pop("a")
      3     return s

File ~/work/pandas/pandas/pandas/core/series.py:5391, in Series.pop(self, item)
   5366 def pop(self, item: Hashable) -> Any:
   5367     """
   5368     Return item and drops from series. Raise KeyError if not found.
   5369     (...)
   5389     dtype: int64
   5390     """
-> 5391     return super().pop(item=item)

File ~/work/pandas/pandas/pandas/core/generic.py:947, in NDFrame.pop(self, item)
    946 def pop(self, item: Hashable) -> Series | Any:
--> 947     result = self[item]
    948     del self[item]
    950     return result

File ~/work/pandas/pandas/pandas/core/series.py:1121, in Series.__getitem__(self, key)
   1118     return self._values[key]
   1120 elif key_is_scalar:
-> 1121     return self._get_value(key)
   1123 # Convert generator to list before going through hashable part
   1124 # (We will iterate through the generator there to check for slices)
   1125 if is_iterator(key):

File ~/work/pandas/pandas/pandas/core/series.py:1237, in Series._get_value(self, label, takeable)
   1234     return self._values[label]
   1236 # Similar to Index.get_value, but we do not fall back to positional
-> 1237 loc = self.index.get_loc(label)
   1239 if is_integer(loc):
   1240     return self._values[loc]

File ~/work/pandas/pandas/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
   3807     if isinstance(casted_key, slice) or (
   3808         isinstance(casted_key, abc.Iterable)
   3809         and any(isinstance(x, slice) for x in casted_key)
   3810     ):
   3811         raise InvalidIndexError(key)
-> 3812     raise KeyError(key) from err
   3813 except TypeError:
   3814     # If we have a listlike key, _check_indexing_error will raise
   3815     # InvalidIndexError. Otherwise we fall through and re-raise
   3816     # the TypeError.
   3817     self._check_indexing_error(key)

KeyError: 'a'
```
To resolve this issue, one can make a copy so that the mutation does not apply to the container being iterated over.
```
In [28]: values = [0, 1, 2, 3, 4, 5]

In [29]: n_removed = 0

In [30]: for k, value in enumerate(values.copy()):
   ....:     idx = k - n_removed
   ....:     if value % 2 == 1:
   ....:         del values[idx]
   ....:         n_removed += 1
   ....:     else:
   ....:         values[idx] = value + 1
   ....:

In [31]: values
Out[31]: [1, 3, 5]
```
```
In [32]: def f(s):
   ....:     s = s.copy()
   ....:     s.pop("a")
   ....:     return s
   ....:

In [33]: df = pd.DataFrame({"a": [1, 2, 3], 'b': [4, 5, 6]})

In [34]: df.apply(f, axis="columns")
Out[34]:
   b
0  4
1  5
2  6
```
## Missing value representation for NumPy types

### `np.nan` as the `NA` representation for NumPy types

For lack of `NA` (missing) support from the ground up in NumPy and Python in general, `NA` could have been represented with:
- A masked array solution: an array of data and an array of boolean values indicating whether a value is there or is missing.
- Using a special sentinel value, bit pattern, or set of sentinel values to denote `NA` across the dtypes.
The special value `np.nan` (Not-A-Number) was chosen as the `NA` value for NumPy types, and there are API functions like `DataFrame.isna()` and `DataFrame.notna()` which can be used across the dtypes to detect NA values. However, this choice has the downside of coercing missing integer data to float types, as shown in Support for integer NA.
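For example, `isna()` and `notna()` detect `np.nan` regardless of the surrounding dtype (a minimal sketch):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# Detect missing values element-wise
print(s.isna().tolist())  # [False, True, False]
print(s.notna().sum())    # 2
```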
### `NA` type promotions for NumPy types

When introducing NAs into an existing `Series` or `DataFrame` via `reindex()` or some other means, boolean and integer types will be promoted to a different dtype in order to store the NAs. The promotions are summarized in this table:
| Typeclass | Promotion dtype for storing NAs |
|---|---|
| `floating` | no change |
| `object` | no change |
| `integer` | cast to `float64` |
| `boolean` | cast to `object` |
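The promotions in the table can be observed directly (a quick sketch): introducing an NA via `reindex()` promotes a boolean `Series` to `object` and an integer `Series` to `float64`:

```python
import pandas as pd

b = pd.Series([True, False])
print(b.dtype)                    # bool
print(b.reindex(range(3)).dtype)  # object: NaN cannot live in a bool array

i = pd.Series([1, 2])
print(i.dtype)                    # int64
print(i.reindex(range(3)).dtype)  # float64: NaN forces the float cast
```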
### Support for integer `NA`

In the absence of high-performance `NA` support being built into NumPy from the ground up, the primary casualty is the ability to represent NAs in integer arrays. For example:
```
In [35]: s = pd.Series([1, 2, 3, 4, 5], index=list("abcde"))

In [36]: s
Out[36]:
a    1
b    2
c    3
d    4
e    5
dtype: int64

In [37]: s.dtype
Out[37]: dtype('int64')

In [38]: s2 = s.reindex(["a", "b", "c", "f", "u"])

In [39]: s2
Out[39]:
a    1.0
b    2.0
c    3.0
f    NaN
u    NaN
dtype: float64

In [40]: s2.dtype
Out[40]: dtype('float64')
```
This trade-off is made largely for memory and performance reasons, and also so that the resulting `Series` continues to be "numeric".
If you need to represent integers with possibly missing values, use one of the nullable-integer extension dtypes provided by pandas or pyarrow:
```
In [41]: s_int = pd.Series([1, 2, 3, 4, 5], index=list("abcde"), dtype=pd.Int64Dtype())

In [42]: s_int
Out[42]:
a    1
b    2
c    3
d    4
e    5
dtype: Int64

In [43]: s_int.dtype
Out[43]: Int64Dtype()

In [44]: s2_int = s_int.reindex(["a", "b", "c", "f", "u"])

In [45]: s2_int
Out[45]:
a       1
b       2
c       3
f    <NA>
u    <NA>
dtype: Int64

In [46]: s2_int.dtype
Out[46]: Int64Dtype()

In [47]: s_int_pa = pd.Series([1, 2, None], dtype="int64[pyarrow]")

In [48]: s_int_pa
Out[48]:
0       1
1       2
2    <NA>
dtype: int64[pyarrow]
```
See Nullable integer data type and PyArrow Functionality for more.
### Why not make NumPy like R?

Many people have suggested that NumPy should simply emulate the `NA` support present in the more domain-specific statistical programming language R. Part of the reason is the NumPy type hierarchy.
The R language, by contrast, only has a handful of built-in data types: `integer`, `numeric` (floating-point), `character`, and `boolean`. `NA` types are implemented by reserving special bit patterns for each type to be used as the missing value. While doing this with the full NumPy type hierarchy would be possible, it would be a more substantial trade-off (especially for the 8- and 16-bit data types) and implementation undertaking.
However, R `NA` semantics are now available by using masked NumPy types such as `Int64Dtype` or PyArrow types (`ArrowDtype`).
## Differences with NumPy

For `Series` and `DataFrame` objects, `var()` normalizes by `N-1` to produce unbiased estimates of the population variance, while NumPy's `numpy.var()` normalizes by `N`, which measures the variance of the sample. Note that `cov()` normalizes by `N-1` in both pandas and NumPy.
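Concretely, pandas' `var()` corresponds to `numpy.var(..., ddof=1)` (a small sketch):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])
# squared deviations from the mean 2.5 sum to 5.0

print(s.var())                        # 5.0 / 3, normalized by N-1
print(np.var(s.to_numpy()))           # 5.0 / 4 = 1.25, normalized by N
print(np.var(s.to_numpy(), ddof=1))   # matches s.var()
```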
## Thread-safety

pandas is not 100% thread safe. The known issues relate to the `copy()` method. If you are doing a lot of copying of `DataFrame` objects shared among threads, we recommend holding locks inside the threads where the data copying occurs.
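One way to follow that recommendation (a hypothetical sketch; the lock name and worker structure are illustrative, not a pandas API):

```python
import threading

import pandas as pd

df = pd.DataFrame({"a": range(1000)})
copy_lock = threading.Lock()  # hypothetical: serializes copies of the shared frame

def worker(results, i):
    # Hold the lock only while copying the shared DataFrame ...
    with copy_lock:
        local = df.copy()
    # ... then operate on the private copy lock-free
    results[i] = local["a"].sum()

results = [None] * 4
threads = [threading.Thread(target=worker, args=(results, i)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```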
See this link for more information.
## Byte-ordering issues

Occasionally you may have to deal with data that were created on a machine with a different byte order than the one on which you are running Python. A common symptom of this issue is an error like:
```
Traceback
    ...
ValueError: Big-endian buffer not supported on little-endian compiler
```
To deal with this issue you should convert the underlying NumPy array to the native system byte order *before* passing it to `Series` or `DataFrame` constructors, using something similar to the following:
```
In [49]: x = np.array(list(range(10)), ">i4")  # big endian

In [50]: newx = x.byteswap().view(x.dtype.newbyteorder())  # force native byteorder

In [51]: s = pd.Series(newx)
```
See the NumPy documentation on byte order for more details.