Copy-on-Write (CoW)
Note
Copy-on-Write will become the default in pandas 3.0. We recommend turning it on now to benefit from all improvements.
Copy-on-Write was first introduced in version 1.5.0. Starting from version 2.0, most of the optimizations that become possible through CoW are implemented and supported. All possible optimizations are supported starting from pandas 2.1.
CoW will be enabled by default in version 3.0.
CoW will lead to more predictable behavior since it is not possible to update more than one object with one statement, e.g. indexing operations or methods won’t have side-effects. Additionally, through delaying copies as long as possible, the average performance and memory usage will improve.
Previous behavior
pandas indexing behavior is tricky to understand. Some operations return views while others return copies. Depending on the result of the operation, mutating one object might accidentally mutate another:
In [1]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
In [2]: subset = df["foo"]
In [3]: subset.iloc[0] = 100
In [4]: df
Out[4]:
   foo  bar
0  100    4
1    2    5
2    3    6
Mutating subset, e.g. updating its values, also updates df. The exact behavior is hard to predict. Copy-on-Write prevents accidentally modifying more than one object; it explicitly disallows this. With CoW enabled, df is unchanged:
In [5]: pd.options.mode.copy_on_write = True
In [6]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
In [7]: subset = df["foo"]
In [8]: subset.iloc[0] = 100
In [9]: df
Out[9]:
   foo  bar
0    1    4
1    2    5
2    3    6
The following sections will explain what this means and how it impacts existing applications.
Migrating to Copy-on-Write
Copy-on-Write will be the default and only mode in pandas 3.0. This means that users need to migrate their code to be compliant with CoW rules.
The default mode in pandas will raise warnings for certain cases that will actively change behavior and thus change user-intended behavior.
We added another mode, enabled through

pd.options.mode.copy_on_write = "warn"

that will warn for every operation that will change behavior with CoW. We expect this mode to be very noisy, since many cases that we don’t expect to influence users will also emit a warning. We recommend checking this mode and analyzing the warnings, but it is not necessary to address all of these warnings. The first two items of the following list are the only cases that need to be addressed to make existing code work with CoW.
The following few items describe the user visible changes:
Chained assignment will never work
loc should be used as an alternative. Check the chained assignment section for more details.
Accessing the underlying array of a pandas object will return a read-only view
In [10]: ser = pd.Series([1, 2, 3])
In [11]: ser.to_numpy()
Out[11]: array([1, 2, 3])
This example returns a NumPy array that is a view of the Series object. If this view were writeable, modifying it would also modify the pandas object, which is not compliant with CoW rules. The returned array is therefore set to non-writeable to protect against this behavior. Creating a copy of this array allows modification. You can also make the array writeable again if you don’t care about the pandas object anymore.
See the section about read-only NumPy arrays for more details.
Only one pandas object is updated at once
The following code snippet updates both df and subset without CoW (the output shown was produced with CoW enabled, so df is unchanged):
In [12]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
In [13]: subset = df["foo"]
In [14]: subset.iloc[0] = 100
In [15]: df
Out[15]:
   foo  bar
0    1    4
1    2    5
2    3    6
This won’t be possible anymore with CoW, since the CoW rules explicitly forbid this. This includes updating a single column as a Series and relying on the change propagating back to the parent DataFrame. This statement can be rewritten into a single statement with loc or iloc if this behavior is necessary. DataFrame.where() is another suitable alternative for this case.
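The single-statement rewrite described above might look like the following minimal sketch. It assumes pandas 2.x with CoW switched on through the option used earlier in this guide; the replacement values 100, 200 and -1 are purely illustrative.

```python
import pandas as pd

# Assumption: pandas 2.x, where CoW is opt-in through this option.
pd.options.mode.copy_on_write = True

df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})

# Label-based: update row 0 of column "foo" directly on the parent.
df.loc[0, "foo"] = 100

# Position-based equivalent for row 1, column 0.
df.iloc[1, 0] = 200

# DataFrame.where keeps values where the condition holds and replaces
# the rest. It returns a new object, so the result is assigned back.
df = df.where(df != 3, -1)

print(df)
```

Each statement addresses the parent DataFrame directly, so no intermediate Series has to propagate a change back.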
Updating a column selected from a DataFrame with an inplace method will also not work anymore.
In [16]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
In [17]: df["foo"].replace(1, 5, inplace=True)
In [18]: df
Out[18]:
   foo  bar
0    1    4
1    2    5
2    3    6
This is another form of chained assignment. This can generally be rewritten in two different forms:
In [19]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
In [20]: df.replace({"foo": {1: 5}}, inplace=True)
In [21]: df
Out[21]:
   foo  bar
0    5    4
1    2    5
2    3    6
A different alternative would be to not use inplace:
In [22]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
In [23]: df["foo"] = df["foo"].replace(1, 5)
In [24]: df
Out[24]:
   foo  bar
0    5    4
1    2    5
2    3    6
Constructors now copy NumPy arrays by default
The Series and DataFrame constructors will now copy a NumPy array by default when not otherwise specified. This was changed to avoid mutating a pandas object when the NumPy array is changed inplace outside of pandas. You can set copy=False to avoid this copy.
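As a rough illustration of the constructor change, the sketch below builds one Series with the default behavior and one with copy=False, then mutates the source array. It assumes pandas 2.x with CoW enabled; the variable names are made up for this example.

```python
import numpy as np
import pandas as pd

# Assumption: pandas 2.x, where CoW is opt-in through this option.
pd.options.mode.copy_on_write = True

arr = np.array([1, 2, 3])

copied = pd.Series(arr)              # default: the array is copied
shared = pd.Series(arr, copy=False)  # opt out of the copy

arr[0] = 100  # mutate the NumPy array outside of pandas

# The default Series is isolated from the change; the copy=False
# Series still looks at the same buffer as arr.
first_copied = copied.iloc[0]
first_shared = shared.iloc[0]
print(first_copied, first_shared)
```

Opting out with copy=False saves memory but reintroduces the coupling that the default copy is meant to prevent, so it is best reserved for arrays that are not modified afterwards.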
Description
CoW means that any DataFrame or Series derived from another in any way always behaves as a copy. As a consequence, we can only change the values of an object through modifying the object itself. CoW disallows updating a DataFrame or a Series that shares data with another DataFrame or Series object inplace.
This avoids side-effects when modifying values and hence, most methods can avoid actually copying the data and only trigger a copy when necessary.
The following example will operate inplace with CoW:
In [25]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
In [26]: df.iloc[0, 0] = 100
In [27]: df
Out[27]:
   foo  bar
0  100    4
1    2    5
2    3    6
The object df does not share any data with any other object and hence no copy is triggered when updating the values. In contrast, the following operation triggers a copy of the data under CoW:
In [28]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
In [29]: df2 = df.reset_index(drop=True)
In [30]: df2.iloc[0, 0] = 100
In [31]: df
Out[31]:
   foo  bar
0    1    4
1    2    5
2    3    6
In [32]: df2
Out[32]:
   foo  bar
0  100    4
1    2    5
2    3    6
reset_index returns a lazy copy with CoW, while it copies the data without CoW. Since both objects, df and df2, share the same data, a copy is triggered when modifying df2. The object df still has the same values as initially, while df2 was modified.
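One way to observe the lazy copy is to check whether the two objects share memory before and after the write. This is only a sketch: it assumes pandas 2.x with CoW enabled and uses np.shares_memory on the column arrays as a stand-in for inspecting pandas internals.

```python
import numpy as np
import pandas as pd

# Assumption: pandas 2.x, where CoW is opt-in through this option.
pd.options.mode.copy_on_write = True

df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
df2 = df.reset_index(drop=True)

# Before any write, both objects are backed by the same buffer.
shared_before = np.shares_memory(df["foo"].to_numpy(), df2["foo"].to_numpy())

df2.iloc[0, 0] = 100  # the first write to df2 triggers the deferred copy

# After the write, df2 owns its own data for the modified column.
shared_after = np.shares_memory(df["foo"].to_numpy(), df2["foo"].to_numpy())

print(shared_before, shared_after)
```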
If the object df isn’t needed anymore after performing the reset_index operation, you can emulate an inplace-like operation through assigning the output of reset_index to the same variable:
In [33]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
In [34]: df = df.reset_index(drop=True)
In [35]: df.iloc[0, 0] = 100
In [36]: df
Out[36]:
   foo  bar
0  100    4
1    2    5
2    3    6
The initial object goes out of scope as soon as the result of reset_index is reassigned, and hence df does not share data with any other object. No copy is necessary when modifying the object. This is generally true for all methods listed in Copy-on-Write optimizations.
Previously, when operating on views, both the view and the parent object were modified:
In [37]: with pd.option_context("mode.copy_on_write", False):
   ....:     df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
   ....:     view = df[:]
   ....:     df.iloc[0, 0] = 100
   ....:
In [38]: df
Out[38]:
   foo  bar
0  100    4
1    2    5
2    3    6
In [39]: view
Out[39]:
   foo  bar
0  100    4
1    2    5
2    3    6
CoW triggers a copy when df is changed to avoid mutating view as well:
In [40]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
In [41]: view = df[:]
In [42]: df.iloc[0, 0] = 100
In [43]: df
Out[43]:
   foo  bar
0  100    4
1    2    5
2    3    6
In [44]: view
Out[44]:
   foo  bar
0    1    4
1    2    5
2    3    6
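The same protection should hold in the other direction, that is, when the derived object is the one being written to. A minimal sketch, assuming pandas 2.x with CoW enabled:

```python
import pandas as pd

# Assumption: pandas 2.x, where CoW is opt-in through this option.
pd.options.mode.copy_on_write = True

df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
view = df[:]

# Write through the derived object this time; the parent is left alone.
view.iloc[0, 0] = 100

print(df)
print(view)
```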
Chained Assignment
Chained assignment refers to a technique where an object is updated through two subsequent indexing operations, e.g.
In [45]: with pd.option_context("mode.copy_on_write", False):
   ....:     df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
   ....:     df["foo"][df["bar"] > 5] = 100
   ....:     df
   ....:
The column foo is updated where the column bar is greater than 5. This violates the CoW principles though, because it would have to modify the view df["foo"] and df in one step. Hence, chained assignment will consistently never work and will raise a ChainedAssignmentError warning with CoW enabled:
In [46]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
In [47]: df["foo"][df["bar"] > 5] = 100
With Copy-on-Write this can be done by using loc.
In [48]: df.loc[df["bar"] > 5, "foo"] = 100
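The two approaches can be compared side by side in a standalone script. The sketch below assumes pandas 2.x with CoW enabled; the warning is silenced only to keep the output quiet, which is a choice made for this example rather than a recommendation.

```python
import warnings

import pandas as pd

# Assumption: pandas 2.x, where CoW is opt-in through this option.
pd.options.mode.copy_on_write = True

df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})

# Chained assignment: the write lands on a temporary object, so df
# stays untouched. pandas normally warns about this.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df["foo"][df["bar"] > 5] = 100
after_chained = df["foo"].tolist()

# Single loc statement: the write reaches df itself.
df.loc[df["bar"] > 5, "foo"] = 100
after_loc = df["foo"].tolist()

print(after_chained, after_loc)
```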
Read-only NumPy arrays
Accessing the underlying NumPy array of a DataFrame will return a read-only array if the array shares data with the initial DataFrame.
The array is a copy if the initial DataFrame consists of more than one array:
In [49]: df = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5]})
In [50]: df.to_numpy()
Out[50]:
array([[1. , 1.5],
       [2. , 2.5]])
The array shares data with the DataFrame if the DataFrame consists of only one NumPy array:
In [51]: df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
In [52]: df.to_numpy()
Out[52]:
array([[1, 3],
       [2, 4]])
This array is read-only, which means that it can’t be modified inplace:
In [53]: arr = df.to_numpy()
In [54]: arr[0, 0] = 100
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[54], line 1
----> 1 arr[0, 0] = 100

ValueError: assignment destination is read-only
The same holds true for a Series, since a Series always consists of a single array.
There are two potential solutions to this:
- Trigger a copy manually if you want to avoid updating DataFrames that share memory with your array.
- Make the array writeable. This is a more performant solution but circumvents Copy-on-Write rules, so it should be used with caution.
In [55]: arr = df.to_numpy()
In [56]: arr.flags.writeable = True
In [57]: arr[0, 0] = 100
In [58]: arr
Out[58]:
array([[100,   3],
       [  2,   4]])
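The first option, triggering a copy manually, can be sketched as follows. It assumes pandas 2.x with CoW enabled and uses NumPy's ndarray.copy() to obtain an independent, writeable array.

```python
import pandas as pd

# Assumption: pandas 2.x, where CoW is opt-in through this option.
pd.options.mode.copy_on_write = True

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# ndarray.copy() returns a new, writeable array with its own buffer.
arr = df.to_numpy().copy()
arr[0, 0] = 100

print(arr)
print(df)  # df keeps its original values
```

This costs one extra copy of the data but keeps the DataFrame and the array fully independent.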
Patterns to avoid
No defensive copy will be performed if two objects share the same data while you are modifying one object inplace.
In [59]: df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
In [60]: df2 = df.reset_index(drop=True)
In [61]: df2.iloc[0, 0] = 100
This creates two objects that share data and thus the setitem operation will trigger a copy. This is not necessary if the initial object df isn’t needed anymore. Simply reassigning to the same variable will invalidate the reference that is held by the object.
In [62]: df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
In [63]: df = df.reset_index(drop=True)
In [64]: df.iloc[0, 0] = 100
No copy is necessary in this example. Creating multiple references keeps unnecessary references alive and thus will hurt performance with Copy-on-Write.
Copy-on-Write optimizations
A new lazy copy mechanism defers the copy until the object in question is modified, and only if this object shares data with another object. This mechanism was added to methods that don’t require a copy of the underlying data. Popular examples are DataFrame.drop() for axis=1 and DataFrame.rename().
These methods return views when Copy-on-Write is enabled, which provides a significant performance improvement compared to the regular execution.
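To see that these methods return views rather than eager copies, one can check whether the results share memory with the parent. The sketch below assumes pandas 2.x with CoW enabled and uses np.shares_memory on the column arrays as an approximate check.

```python
import numpy as np
import pandas as pd

# Assumption: pandas 2.x, where CoW is opt-in through this option.
pd.options.mode.copy_on_write = True

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

renamed = df.rename(columns={"a": "A"})
dropped = df.drop("b", axis=1)

# Neither result has copied column "a" yet.
rename_shares = np.shares_memory(df["a"].to_numpy(), renamed["A"].to_numpy())
drop_shares = np.shares_memory(df["a"].to_numpy(), dropped["a"].to_numpy())

# The copy only happens once one of the objects is written to.
renamed.iloc[0, 0] = 100

print(rename_shares, drop_shares)
print(df)
```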
How to enable CoW
Copy-on-Write can be enabled through the configuration option copy_on_write. The option can be turned on globally through either of the following:
In [65]: pd.set_option("mode.copy_on_write", True)
In [66]: pd.options.mode.copy_on_write = True
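Earlier examples in this guide use pd.option_context to switch CoW off for a single block. The same context manager can presumably be used to turn the option on for a limited scope instead of globally. A minimal sketch, assuming pandas 2.x:

```python
import pandas as pd

# Assumption: pandas 2.x. The option is only changed inside the block
# and restored to its previous value afterwards.
with pd.option_context("mode.copy_on_write", True):
    enabled_inside = pd.get_option("mode.copy_on_write")

    df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
    subset = df["foo"]
    subset.iloc[0] = 100  # does not propagate to df under CoW

    parent_value = df.loc[0, "foo"]

print(enabled_inside, parent_value)
```

Scoping the option this way can help when migrating a large code base one module or test at a time.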