Copy-on-Write (CoW)

Note

Copy-on-Write will become the default in pandas 3.0. We recommend turning it on now to benefit from all improvements.

Copy-on-Write was first introduced in version 1.5.0. Starting from version 2.0, most of the optimizations that become possible through CoW are implemented and supported. All possible optimizations are supported starting from pandas 2.1.

CoW will be enabled by default in version 3.0.

CoW leads to more predictable behavior because a single statement can no longer update more than one object; for example, indexing operations and methods won’t have side effects. Additionally, by delaying copies as long as possible, average performance and memory usage improve.

Previous behavior

pandas indexing behavior is tricky to understand. Some operations return views while others return copies. Depending on the result of the operation, mutating one object might accidentally mutate another:

    In [1]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
    In [2]: subset = df["foo"]
    In [3]: subset.iloc[0] = 100
    In [4]: df
    Out[4]:
       foo  bar
    0  100    4
    1    2    5
    2    3    6

Mutating subset, e.g. updating its values, also updates df. The exact behavior is hard to predict. Copy-on-Write solves the problem of accidentally modifying more than one object by explicitly disallowing it. With CoW enabled, df is unchanged:

    In [5]: pd.options.mode.copy_on_write = True
    In [6]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
    In [7]: subset = df["foo"]
    In [8]: subset.iloc[0] = 100
    In [9]: df
    Out[9]:
       foo  bar
    0    1    4
    1    2    5
    2    3    6

The following sections will explain what this means and how it impacts existing applications.

Migrating to Copy-on-Write

Copy-on-Write will be the default and only mode in pandas 3.0. This means that users need to migrate their code to be compliant with CoW rules.

In its default mode, pandas raises warnings for certain cases that will actively change behavior and thus change the behavior the user intended.

We added another mode,

    pd.options.mode.copy_on_write = "warn"

that will warn for every operation that will change behavior with CoW. We expect this mode to be very noisy, since it also emits warnings in many cases that we don’t expect to affect users. We recommend checking this mode and analyzing the warnings, but it is not necessary to address all of these warnings. The first two items of the following list are the only cases that need to be addressed to make existing code work with CoW.

The following few items describe the user visible changes:

Chained assignment will never work

loc should be used as an alternative. Check the chained assignment section for more details.

Accessing the underlying array of a pandas object will return a read-only view

    In [10]: ser = pd.Series([1, 2, 3])
    In [11]: ser.to_numpy()
    Out[11]: array([1, 2, 3])

This example returns a NumPy array that is a view of the Series object. If this view were writeable, modifying it would also modify the pandas object, which is not compliant with CoW rules. The returned array is therefore set to non-writeable to protect against this behavior. Creating a copy of this array allows modification. You can also make the array writeable again if you don’t care about the pandas object anymore.

See the section about read-only NumPy arrays for more details.

Only one pandas object is updated at once

Without CoW, the following code snippet updated both df and subset. The output shown here was produced with CoW enabled, so df is unchanged:

    In [12]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
    In [13]: subset = df["foo"]
    In [14]: subset.iloc[0] = 100
    In [15]: df
    Out[15]:
       foo  bar
    0    1    4
    1    2    5
    2    3    6

This won’t be possible anymore with CoW, since the CoW rules explicitly forbid this. This includes updating a single column as a Series and relying on the change propagating back to the parent DataFrame. This statement can be rewritten into a single statement with loc or iloc if this behavior is necessary. DataFrame.where() is another suitable alternative for this case.
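As a minimal sketch of the rewrite described above (the frame, the condition, and the replacement values are illustrative assumptions), a single loc or iloc statement updates the parent DataFrame directly, while DataFrame.where() returns a new object instead of mutating an existing one:

```python
import pandas as pd

df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})

# One statement on the parent object, so no intermediate Series is involved.
df.iloc[0, 0] = 100
df.loc[df["bar"] > 5, "foo"] = 300

# where() keeps values where the condition holds and replaces the rest.
# It returns a new DataFrame; df itself is not modified.
capped = df.where(df < 100, 99)

print(df)
print(capped)
```

Both forms avoid going through a separately held Series, which is what the CoW rules forbid.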

Updating a column selected from a DataFrame with an inplace method will also not work anymore.

    In [16]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
    In [17]: df["foo"].replace(1, 5, inplace=True)
    In [18]: df
    Out[18]:
       foo  bar
    0    1    4
    1    2    5
    2    3    6

This is another form of chained assignment. It can generally be rewritten in two different forms:

    In [19]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
    In [20]: df.replace({"foo": {1: 5}}, inplace=True)
    In [21]: df
    Out[21]:
       foo  bar
    0    5    4
    1    2    5
    2    3    6

A different alternative would be to not use inplace:

    In [22]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
    In [23]: df["foo"] = df["foo"].replace(1, 5)
    In [24]: df
    Out[24]:
       foo  bar
    0    5    4
    1    2    5
    2    3    6

Constructors now copy NumPy arrays by default

The Series and DataFrame constructors will now copy a NumPy array by default when not otherwise specified. This was changed to avoid mutating a pandas object when the NumPy array is changed inplace outside of pandas. You can set copy=False to avoid this copy.
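The difference between the default and copy=False can be sketched as follows. This assumes a pandas version in which the constructor change described above is active and CoW is enabled; the variable names are illustrative:

```python
import numpy as np
import pandas as pd

# Enable CoW explicitly; redundant once CoW is the default mode.
pd.set_option("mode.copy_on_write", True)

arr = np.array([1, 2, 3])

copied = pd.Series(arr)              # default: the constructor copies arr
shared = pd.Series(arr, copy=False)  # opt out: the Series reuses arr's memory

arr[0] = 100  # in-place change made outside of pandas

print(copied.iloc[0])  # not affected by the change to arr
print(shared.iloc[0])  # reflects the change to arr
print(np.shares_memory(arr, shared.to_numpy()))
```

With copy=False, keeping the Series consistent is the caller's responsibility, since pandas cannot track changes made to arr directly.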

Description

CoW means that any DataFrame or Series derived from another in any way always behaves as a copy. As a consequence, we can only change the values of an object through modifying the object itself. CoW disallows updating a DataFrame or a Series that shares data with another DataFrame or Series object inplace.

This avoids side effects when modifying values. Hence, most methods can avoid actually copying the data and only trigger a copy when necessary.

The following example will operate inplace with CoW:

    In [25]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
    In [26]: df.iloc[0, 0] = 100
    In [27]: df
    Out[27]:
       foo  bar
    0  100    4
    1    2    5
    2    3    6

The object df does not share any data with any other object and hence no copy is triggered when updating the values. In contrast, the following operation triggers a copy of the data under CoW:

    In [28]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
    In [29]: df2 = df.reset_index(drop=True)
    In [30]: df2.iloc[0, 0] = 100
    In [31]: df
    Out[31]:
       foo  bar
    0    1    4
    1    2    5
    2    3    6
    In [32]: df2
    Out[32]:
       foo  bar
    0  100    4
    1    2    5
    2    3    6

reset_index returns a lazy copy with CoW, whereas it copies the data without CoW. Since both objects, df and df2, share the same data, a copy is triggered when modifying df2. The object df still has the same values as initially, while df2 was modified.
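One way to observe the lazy copy is to check whether the two objects use the same memory before and after the write. The sketch below assumes CoW is enabled; shares_foo is a hypothetical helper introduced only for this illustration, built on np.shares_memory:

```python
import numpy as np
import pandas as pd

# Enable CoW explicitly; redundant once CoW is the default mode.
pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
df2 = df.reset_index(drop=True)


def shares_foo(left, right):
    """Hypothetical helper: do the "foo" columns use the same memory?"""
    return np.shares_memory(left["foo"].to_numpy(), right["foo"].to_numpy())


before = shares_foo(df, df2)  # lazy copy: the data is still shared
df2.iloc[0, 0] = 100          # the first write to df2 triggers the real copy
after = shares_foo(df, df2)   # df2 now owns separate data for "foo"

print(before, after)
```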

If the object df isn’t needed anymore after performing the reset_index operation, you can emulate an inplace-like operation through assigning the output of reset_index to the same variable:

    In [33]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
    In [34]: df = df.reset_index(drop=True)
    In [35]: df.iloc[0, 0] = 100
    In [36]: df
    Out[36]:
       foo  bar
    0  100    4
    1    2    5
    2    3    6

The initial object goes out of scope as soon as the result of reset_index is reassigned and hence df does not share data with any other object. No copy is necessary when modifying the object. This is generally true for all methods listed in Copy-on-Write optimizations.

Previously, when operating on views, both the view and the parent object were modified:

    In [37]: with pd.option_context("mode.copy_on_write", False):
       ....:     df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
       ....:     view = df[:]
       ....:     df.iloc[0, 0] = 100
       ....:
    In [38]: df
    Out[38]:
       foo  bar
    0  100    4
    1    2    5
    2    3    6
    In [39]: view
    Out[39]:
       foo  bar
    0  100    4
    1    2    5
    2    3    6

CoW triggers a copy when df is changed to avoid mutating view as well:

    In [40]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
    In [41]: view = df[:]
    In [42]: df.iloc[0, 0] = 100
    In [43]: df
    Out[43]:
       foo  bar
    0  100    4
    1    2    5
    2    3    6
    In [44]: view
    Out[44]:
       foo  bar
    0    1    4
    1    2    5
    2    3    6

Chained Assignment

Chained assignment refers to a technique where an object is updated through two subsequent indexing operations, e.g.

    In [45]: with pd.option_context("mode.copy_on_write", False):
       ....:     df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
       ....:     df["foo"][df["bar"] > 5] = 100
       ....:     df
       ....:

The column foo is updated where the column bar is greater than 5. This violates the CoW principles though, because it would have to modify the view df["foo"] and df in one step. Hence, chained assignment will consistently never work and raise a ChainedAssignmentError warning with CoW enabled:

    In [46]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
    In [47]: df["foo"][df["bar"] > 5] = 100

With Copy-on-Write, this can be done by using loc:

    In [48]: df.loc[df["bar"] > 5, "foo"] = 100
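To make the difference between the two forms visible, the following sketch records the values of foo after each statement. It assumes CoW is enabled and silences the warning only for the purpose of this demonstration; chained_result and loc_result are illustrative names:

```python
import warnings

import pandas as pd

# Enable CoW explicitly; redundant once CoW is the default mode.
pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})

with warnings.catch_warnings():
    # Silence the ChainedAssignmentError warning for this demonstration only.
    warnings.simplefilter("ignore")
    df["foo"][df["bar"] > 5] = 100  # chained: writes to a temporary Series

chained_result = df["foo"].tolist()  # df itself was not updated

df.loc[df["bar"] > 5, "foo"] = 100   # single statement on df
loc_result = df["foo"].tolist()

print(chained_result)
print(loc_result)
```

In real code the warning should stay visible, since it points at exactly the statements that need to be rewritten.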

Read-only NumPy arrays

Accessing the underlying NumPy array of a DataFrame will return a read-only array if the array shares data with the initial DataFrame.

The array is a copy if the initial DataFrame consists of more than one array:

    In [49]: df = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5]})
    In [50]: df.to_numpy()
    Out[50]:
    array([[1. , 1.5],
           [2. , 2.5]])

The array shares data with the DataFrame if the DataFrame consists of only one NumPy array:

    In [51]: df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
    In [52]: df.to_numpy()
    Out[52]:
    array([[1, 3],
           [2, 4]])

This array is read-only, which means that it can’t be modified inplace:

    In [53]: arr = df.to_numpy()
    In [54]: arr[0, 0] = 100
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    Cell In[54], line 1
    ----> 1 arr[0, 0] = 100

    ValueError: assignment destination is read-only

The same holds true for a Series, since a Series always consists of a single array.

There are two potential solutions to this:

  • Trigger a copy manually if you want to avoid updating DataFrames that share memory with your array.

  • Make the array writeable. This is a more performant solution but circumvents Copy-on-Write rules, so it should be used with caution.

    In [55]: arr = df.to_numpy()
    In [56]: arr.flags.writeable = True
    In [57]: arr[0, 0] = 100
    In [58]: arr
    Out[58]:
    array([[100,   3],
           [  2,   4]])
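The first solution, triggering a copy manually, could look like the sketch below. Calling copy() on the returned array is one way to do this; the frame is the same illustrative one as above:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# An explicit copy owns its data, so it is writeable and detached from df.
arr = df.to_numpy().copy()
arr[0, 0] = 100

print(arr)
print(df)  # df keeps its original values
```

This costs one extra copy of the data but keeps the DataFrame and the array fully independent.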

Patterns to avoid

pandas makes no defensive copy up front when two objects share the same data. Instead, the copy happens when you modify one of them inplace.

    In [59]: df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
    In [60]: df2 = df.reset_index(drop=True)
    In [61]: df2.iloc[0, 0] = 100

This creates two objects that share data and thus the setitem operation will trigger a copy. This is not necessary if the initial object df isn’t needed anymore. Simply reassigning to the same variable will invalidate the reference that is held by the object.

    In [62]: df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
    In [63]: df = df.reset_index(drop=True)
    In [64]: df.iloc[0, 0] = 100

No copy is necessary in this example. Creating multiple references keeps unnecessary references alive and thus will hurt performance with Copy-on-Write.

Copy-on-Write optimizations

CoW uses a new lazy copy mechanism that defers the copy until the object in question is modified, and copies only if this object shares data with another object. This mechanism was added to methods that don’t require a copy of the underlying data. Popular examples are DataFrame.drop() for axis=1 and DataFrame.rename().

These methods return views when Copy-on-Write is enabled, which provides a significant performance improvement compared to the regular execution.
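A rough way to check that these methods return views is to compare memory with np.shares_memory. The sketch assumes CoW is enabled; the column names and the rename mapping are illustrative:

```python
import numpy as np
import pandas as pd

# Enable CoW explicitly; redundant once CoW is the default mode.
pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

renamed = df.rename(columns={"a": "x"})
dropped = df.drop(columns="b")  # same as df.drop("b", axis=1)

# Both results still point at the data of df; nothing has been copied yet.
rename_shares = np.shares_memory(df["a"].to_numpy(), renamed["x"].to_numpy())
drop_shares = np.shares_memory(df["a"].to_numpy(), dropped["a"].to_numpy())

renamed.loc[0, "x"] = 100  # the first write copies; df keeps its values

print(rename_shares, drop_shares)
print(df)
```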

How to enable CoW

Copy-on-Write can be enabled through the configuration option copy_on_write. The option can be turned on globally through either of the following:

    In [65]: pd.set_option("mode.copy_on_write", True)
    In [66]: pd.options.mode.copy_on_write = True
