Copy on write

Copy on Write is a mechanism to simplify the indexing API and improve performance by avoiding copies where possible. CoW means that any DataFrame or Series derived from another in any way always behaves as a copy. An explanation of how to use Copy on Write efficiently can be found here.

Reference tracking

To determine whether we have to make a copy when writing into a DataFrame, we have to know whether the values are shared with another DataFrame. pandas keeps track of all Blocks that share values with another block internally so that it can tell when a copy needs to be triggered. The reference tracking mechanism is implemented at the Block level.

We use a custom reference tracker object, BlockValuesRefs, that keeps track of every block whose values share memory with those of another block. Each reference is held through a weak reference. Every pair of blocks that share some memory should point to the same BlockValuesRefs object. If one block goes out of scope, the reference to this block dies. As a consequence, the reference tracker object always knows how many blocks are alive and share memory.
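The weak-reference idea can be illustrated with a toy model built only on the standard library. This is a minimal sketch, not pandas's implementation: ToyBlock and ToyValuesRefs are hypothetical names invented here, and the real BlockValuesRefs may differ in its details.

```python
import gc
import weakref


class ToyBlock:
    """Hypothetical stand-in for a Block: it only holds some values."""

    def __init__(self, values):
        self.values = values


class ToyValuesRefs:
    """Toy tracker that remembers, via weak references, which blocks share values."""

    def __init__(self, block):
        self._refs = [weakref.ref(block)]

    def add_reference(self, block):
        # A weak reference does not keep the block alive.
        self._refs.append(weakref.ref(block))

    def alive_count(self):
        # Drop entries whose block has already been garbage collected.
        self._refs = [r for r in self._refs if r() is not None]
        return len(self._refs)


values = [1, 2, 3]
first = ToyBlock(values)
second = ToyBlock(values)  # shares the same list object as `first`

tracker = ToyValuesRefs(first)
tracker.add_reference(second)
count_both = tracker.alive_count()  # both blocks are alive

del second    # the only strong reference to the second block goes away
gc.collect()  # not needed on CPython, but makes the sketch robust elsewhere
count_after = tracker.alive_count()  # only `first` is left
```

Because the tracker never holds a strong reference, a block that goes out of scope simply disappears from the count without any explicit deregistration step.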

Whenever a DataFrame or Series object is sharing data with another object, it is required that each of those objects has its own BlockManager and Block objects. In other words, one Block instance (that is held by a DataFrame, not necessarily for intermediate objects) should always be used by only a single DataFrame/Series object. For example, when you want to use the same Block for another object, you can create a shallow copy of the Block instance with block.copy(deep=False) (which will create a new Block instance with the same underlying values and which will correctly set up the references).

Before writing into the values, we can ask the reference tracking object whether another block that shares data with us is alive. If there is one, we can trigger a copy before writing.
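The check-then-copy step can be sketched the same way, again as a toy model rather than pandas's actual code. ToyBlock, its has_reference and setitem methods, and the use of weakref.WeakSet as the tracker are all assumptions made for illustration; only the copy(deep=False) call shape is taken from the text above.

```python
import weakref


class ToyBlock:
    """Hypothetical block; `refs` is a WeakSet of all live blocks sharing `values`."""

    def __init__(self, values, refs=None):
        self.values = values
        # An empty WeakSet is falsy, so compare against None explicitly.
        self.refs = refs if refs is not None else weakref.WeakSet()
        self.refs.add(self)

    def copy(self, deep=False):
        if deep:
            # Deep copy: private values and a fresh tracker.
            return ToyBlock(list(self.values))
        # Shallow copy: a new block with the same values and the same tracker.
        return ToyBlock(self.values, self.refs)

    def has_reference(self):
        # True if at least one other live block shares our values.
        return len(self.refs) > 1

    def setitem(self, index, value):
        if self.has_reference():
            # Someone else shares the values: detach before writing.
            self.refs.discard(self)
            self.values = list(self.values)
            self.refs = weakref.WeakSet([self])
        self.values[index] = value


original = [1, 2, 3]
parent = ToyBlock(original)
child = parent.copy(deep=False)
shared_before = child.values is parent.values  # the values are shared

child.setitem(0, 100)             # triggers a copy, parent is untouched
parent_after = list(parent.values)
child_after = list(child.values)

parent.setitem(1, 50)             # no other block shares the values now
wrote_in_place = parent.values is original
```

In this sketch the first write pays for one copy, after which both blocks own their values and later writes go straight to the underlying data.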