Sparse data structures#

pandas provides data structures for efficiently storing sparse data. These are not necessarily sparse in the typical “mostly 0” sense. Rather, you can view these objects as being “compressed” where any data matching a specific value (NaN / missing value, though any value can be chosen, including 0) is omitted. The compressed values are not actually stored in the array.

In [1]: arr = np.random.randn(10)

In [2]: arr[2:-2] = np.nan

In [3]: ts = pd.Series(pd.arrays.SparseArray(arr))

In [4]: ts
Out[4]: 
0    0.469112
1   -0.282863
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8   -0.861849
9   -2.104569
dtype: Sparse[float64, nan]

Notice the dtype, Sparse[float64, nan]. The nan means that elements in the array that are nan aren’t actually stored; only the non-nan elements are. Those non-nan elements have a float64 dtype.
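As a minimal sketch of what "not actually stored" means, the stored values can be inspected directly on a SparseArray. The sample values below are illustrative and are not taken from the session above.

```python
import numpy as np
import pandas as pd

# Five values, three of which match the default fill value (NaN).
values = np.array([1.5, np.nan, np.nan, np.nan, -0.5])
sparr = pd.arrays.SparseArray(values)

# Only the values that differ from the fill value are physically kept.
stored = sparr.sp_values   # the non-fill values, as a NumPy array
n_stored = sparr.npoints   # how many values are stored
density = sparr.density    # stored values divided by total length

print(stored)
print(n_stored, density)
```

Here two of the five elements are stored, so the density is 2 / 5.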

The sparse objects exist for memory efficiency reasons. Suppose you had a large, mostly NA DataFrame:

In [5]: df = pd.DataFrame(np.random.randn(10000, 4))

In [6]: df.iloc[:9998] = np.nan

In [7]: sdf = df.astype(pd.SparseDtype("float", np.nan))

In [8]: sdf.head()
Out[8]: 
     0    1    2    3
0  NaN  NaN  NaN  NaN
1  NaN  NaN  NaN  NaN
2  NaN  NaN  NaN  NaN
3  NaN  NaN  NaN  NaN
4  NaN  NaN  NaN  NaN

In [9]: sdf.dtypes
Out[9]: 
0    Sparse[float64, nan]
1    Sparse[float64, nan]
2    Sparse[float64, nan]
3    Sparse[float64, nan]
dtype: object

In [10]: sdf.sparse.density
Out[10]: 0.0002

As you can see, the density (the fraction of values that have not been “compressed”) is extremely low. This sparse object takes up much less memory on disk (pickled) and in the Python interpreter.

In [11]: 'dense : {:0.2f} kB'.format(df.memory_usage().sum() / 1e3)
Out[11]: 'dense : 320.13 kB'

In [12]: 'sparse: {:0.2f} kB'.format(sdf.memory_usage().sum() / 1e3)
Out[12]: 'sparse: 0.22 kB'

Functionally, their behavior should be nearly identical to their dense counterparts.
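A small sketch of that equivalence, using illustrative data with 0.0 as the fill value: reductions and arithmetic on the sparse Series are expected to agree with the dense Series.

```python
import numpy as np
import pandas as pd

dense = pd.Series([0.0, 0.0, 1.5, 0.0, 2.5])
# Same data, but zeros are treated as the omitted fill value.
sparse = dense.astype(pd.SparseDtype("float", 0.0))

# Reductions give the same answer for both representations.
dense_total = dense.sum()
sparse_total = sparse.sum()

# Arithmetic keeps the result sparse; convert back to compare.
doubled = (sparse * 2).sparse.to_dense()

print(dense_total, sparse_total)
print(doubled.tolist())
```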

SparseArray#

arrays.SparseArray is an ExtensionArray for storing an array of sparse values (see dtypes for more on extension arrays). It is a 1-dimensional ndarray-like object storing only values distinct from the fill_value:

In [13]: arr = np.random.randn(10)

In [14]: arr[2:5] = np.nan

In [15]: arr[7:8] = np.nan

In [16]: sparr = pd.arrays.SparseArray(arr)

In [17]: sparr
Out[17]: 
[-1.9556635297215477, -1.6588664275960427, nan, nan, nan, 1.1589328886422277, 0.14529711373305043, nan, 0.6060271905134522, 1.3342113401317768]
Fill: nan
IntIndex
Indices: array([0, 1, 5, 6, 8, 9], dtype=int32)

A sparse array can be converted to a regular (dense) ndarray with numpy.asarray():

In [18]: np.asarray(sparr)
Out[18]: 
array([-1.9557, -1.6589,     nan,     nan,     nan,  1.1589,  0.1453,
           nan,  0.606 ,  1.3342])

SparseDtype#

The SparseArray.dtype property stores two pieces of information:

The dtype of the non-sparse values
The scalar fill value

In [19]: sparr.dtype
Out[19]: Sparse[float64, nan]
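Both pieces of information can be read back from the dtype object. The sketch below uses the subtype and fill_value attributes of SparseDtype on a small illustrative array.

```python
import numpy as np
import pandas as pd

sparr = pd.arrays.SparseArray([1.0, np.nan, 2.0])
dtype = sparr.dtype

# The dtype of the stored (non-sparse) values.
subtype = dtype.subtype
# The scalar that is omitted from storage.
fill_value = dtype.fill_value

print(subtype, fill_value)
```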

A SparseDtype may be constructed by passing only a dtype

In [20]: pd.SparseDtype(np.dtype('datetime64[ns]'))
Out[20]: Sparse[datetime64[ns], numpy.datetime64('NaT')]

in which case a default fill value will be used (for NumPy dtypes this is often the “missing” value for that dtype). To override this default an explicit fill value may be passed instead

In [21]: pd.SparseDtype(np.dtype('datetime64[ns]'),
   ....:                fill_value=pd.Timestamp('2017-01-01'))
   ....: 
Out[21]: Sparse[datetime64[ns], Timestamp('2017-01-01 00:00:00')]
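To make the default fill values concrete, the following sketch constructs a SparseDtype for a few common subtypes and reads back the fill value that pandas chose. The choice of subtypes is illustrative.

```python
import numpy as np
import pandas as pd

# No fill_value is passed, so each dtype falls back to its default.
int_fill = pd.SparseDtype(int).fill_value
float_fill = pd.SparseDtype(float).fill_value
bool_fill = pd.SparseDtype(bool).fill_value

# An explicit fill value overrides the default.
custom_fill = pd.SparseDtype(float, fill_value=0.0).fill_value

print(int_fill, float_fill, bool_fill, custom_fill)
```

Integer data defaults to 0, float data to NaN, and boolean data to False.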

Finally, the string alias 'Sparse[dtype]' may be used to specify a sparse dtype in many places

In [22]: pd.array([1, 0, 0, 2], dtype='Sparse[int]')
Out[22]: 
[1, 0, 0, 2]
Fill: 0
IntIndex
Indices: array([0, 3], dtype=int32)
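As one more place where the alias is accepted, this sketch passes it to Series.astype(). The data is illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 2.0])

# The string alias stands in for pd.SparseDtype("float64").
sparse_s = s.astype("Sparse[float64]")

print(sparse_s.dtype)
```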

Sparse accessor#

pandas provides a .sparse accessor, similar to .str for string data, .cat for categorical data, and .dt for datetime-like data. This namespace provides attributes and methods that are specific to sparse data.

In [23]: s = pd.Series([0, 0, 1, 2], dtype="Sparse[int]")

In [24]: s.sparse.density
Out[24]: 0.5

In [25]: s.sparse.fill_value
Out[25]: 0

This accessor is available only on data with SparseDtype, and on the Series class itself for creating a Series with sparse data from a scipy COO matrix with Series.sparse.from_coo().
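The restriction to sparse data can be seen in a short sketch: accessing .sparse on a dense Series raises AttributeError, while the same data converted to a sparse dtype exposes the accessor. The variable names are illustrative.

```python
import pandas as pd

dense = pd.Series([0, 0, 1, 2])

# A dense Series has no .sparse namespace.
try:
    dense.sparse
    has_accessor = True
except AttributeError as exc:
    has_accessor = False
    message = str(exc)

# After conversion the accessor is available.
sparse = dense.astype("Sparse[int]")
density = sparse.sparse.density

print(has_accessor, density)
```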

A .sparse accessor has been added for DataFrame as well. See Sparse accessor for more.

Sparse calculation#

You can apply NumPy ufuncs to arrays.SparseArray and get an arrays.SparseArray as a result.

In [26]: arr = pd.arrays.SparseArray([1., np.nan, np.nan, -2., np.nan])

In [27]: np.abs(arr)
Out[27]: 
[1.0, nan, nan, 2.0, nan]
Fill: nan
IntIndex
Indices: array([0, 3], dtype=int32)

The ufunc is also applied to fill_value. This is needed to get the correct dense result.

In [28]: arr = pd.arrays.SparseArray([1., -1, -1, -2., -1], fill_value=-1)

In [29]: np.abs(arr)
Out[29]: 
[1, 1, 1, 2.0, 1]
Fill: 1
IntIndex
Indices: array([3], dtype=int32)

In [30]: np.abs(arr).to_dense()
Out[30]: array([1., 1., 1., 2., 1.])
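A further sketch of why the fill value must be transformed too, using np.exp on illustrative data: a fill value of 0.0 becomes 1.0, so the dense view of the result still matches applying the ufunc to the dense data.

```python
import numpy as np
import pandas as pd

dense = np.array([0.0, 0.0, 1.0, 0.0])
arr = pd.arrays.SparseArray(dense, fill_value=0.0)

result = np.exp(arr)

# np.exp(0.0) is 1.0, so the omitted positions now stand for 1.0.
new_fill = result.fill_value
dense_result = np.asarray(result)

print(new_fill)
print(dense_result)
```

If the fill value were left at 0.0, the three omitted positions would wrongly read as 0.0 instead of 1.0 after conversion to dense.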

Conversion

To convert data from sparse to dense, use the .sparse accessors

In [31]: sdf.sparse.to_dense()
Out[31]: 
             0         1         2         3
0          NaN       NaN       NaN       NaN
1          NaN       NaN       NaN       NaN
2          NaN       NaN       NaN       NaN
3          NaN       NaN       NaN       NaN
4          NaN       NaN       NaN       NaN
...        ...       ...       ...       ...
9995       NaN       NaN       NaN       NaN
9996       NaN       NaN       NaN       NaN
9997       NaN       NaN       NaN       NaN
9998  0.509184 -0.774928 -1.369894 -0.382141
9999  0.280249 -1.648493  1.490865 -0.890819

[10000 rows x 4 columns]

From dense to sparse, use DataFrame.astype() with a SparseDtype.

In [32]: dense = pd.DataFrame({"A": [1, 0, 0, 1]})

In [33]: dtype = pd.SparseDtype(int, fill_value=0)

In [34]: dense.astype(dtype)
Out[34]: 
   A
0  1
1  0
2  0
3  1
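The two conversions can be combined into a round trip. The following sketch, on an illustrative two-column frame, checks that converting to sparse and back recovers the original values.

```python
import pandas as pd

dense = pd.DataFrame({"A": [1, 0, 0, 1], "B": [0, 0, 0, 5]})

# Dense to sparse: every column gets the same SparseDtype.
sparse = dense.astype(pd.SparseDtype(int, fill_value=0))

# Sparse to dense: the fill value is written back into the gaps.
roundtrip = sparse.sparse.to_dense()

print(sparse.dtypes)
print(roundtrip)
```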

Interaction with scipy.sparse#

Use DataFrame.sparse.from_spmatrix() to create a DataFrame with sparse values from a sparse matrix.

In [35]: from scipy.sparse import csr_matrix

In [36]: arr = np.random.random(size=(1000, 5))

In [37]: arr[arr < .9] = 0

In [38]: sp_arr = csr_matrix(arr)

In [39]: sp_arr
Out[39]: 
<Compressed Sparse Row sparse matrix of dtype 'float64'
    with 517 stored elements and shape (1000, 5)>

In [40]: sdf = pd.DataFrame.sparse.from_spmatrix(sp_arr)

In [41]: sdf.head()
Out[41]: 
          0  1  2         3  4
0   0.95638  0  0         0  0
1         0  0  0         0  0
2         0  0  0         0  0
3         0  0  0         0  0
4  0.999552  0  0  0.956153  0

In [42]: sdf.dtypes
Out[42]: 
0    Sparse[float64, 0]
1    Sparse[float64, 0]
2    Sparse[float64, 0]
3    Sparse[float64, 0]
4    Sparse[float64, 0]
dtype: object

All sparse formats are supported, but matrices that are not in COOrdinate format will be converted, copying data as needed. To convert back to a sparse SciPy matrix in COO format, you can use the DataFrame.sparse.to_coo() method:

In [43]: sdf.sparse.to_coo()
Out[43]: 
<COOrdinate sparse matrix of dtype 'float64'
    with 517 stored elements and shape (1000, 5)>
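A deterministic sketch of the same round trip, assuming SciPy is installed and using a small hand-written matrix instead of random data, confirms that the values survive the conversion in both directions.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

dense = np.array(
    [
        [0.0, 1.5, 0.0],
        [0.0, 0.0, 0.0],
        [2.5, 0.0, 3.5],
    ]
)
mat = csr_matrix(dense)

# SciPy CSR matrix to a DataFrame with sparse columns.
sdf = pd.DataFrame.sparse.from_spmatrix(mat)

# Back to a SciPy matrix in COO format.
coo = sdf.sparse.to_coo()

same = np.array_equal(coo.toarray(), dense)
print(coo.shape, coo.nnz, same)
```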

Series.sparse.to_coo() is implemented for transforming a Series with sparse values indexed by a MultiIndex to a scipy.sparse.coo_matrix.

The method requires a MultiIndex with two or more levels.

In [44]: s = pd.Series([3.0, np.nan, 1.0, 3.0, np.nan, np.nan])

In [45]: s.index = pd.MultiIndex.from_tuples(
   ....:     [
   ....:         (1, 2, "a", 0),
   ....:         (1, 2, "a", 1),
   ....:         (1, 1, "b", 0),
   ....:         (1, 1, "b", 1),
   ....:         (2, 1, "b", 0),
   ....:         (2, 1, "b", 1),
   ....:     ],
   ....:     names=["A", "B", "C", "D"],
   ....: )
   ....: 

In [46]: ss = s.astype('Sparse')

In [47]: ss
Out[47]: 
A  B  C  D
1  2  a  0    3.0
         1    NaN
   1  b  0    1.0
         1    3.0
2  1  b  0    NaN
         1    NaN
dtype: Sparse[float64, nan]

In the example below, we transform the Series to a sparse representation of a 2-d array by specifying that the first and second MultiIndex levels define labels for the rows and the third and fourth levels define labels for the columns. We also specify that the column and row labels should be sorted in the final sparse representation.

In [48]: A, rows, columns = ss.sparse.to_coo(
   ....:     row_levels=["A", "B"], column_levels=["C", "D"], sort_labels=True
   ....: )
   ....: 

In [49]: A
Out[49]: 
<COOrdinate sparse matrix of dtype 'float64'
    with 3 stored elements and shape (3, 4)>

In [50]: A.todense()
Out[50]: 
matrix([[0., 0., 1., 3.],
        [3., 0., 0., 0.],
        [0., 0., 0., 0.]])

In [51]: rows
Out[51]: [(1, 1), (1, 2), (2, 1)]

In [52]: columns
Out[52]: [('a', 0), ('a', 1), ('b', 0), ('b', 1)]

Specifying different row and column labels (and not sorting them) yields a different sparse matrix:

In [53]: A, rows, columns = ss.sparse.to_coo(
   ....:     row_levels=["A", "B", "C"], column_levels=["D"], sort_labels=False
   ....: )
   ....: 

In [54]: A
Out[54]: 
<COOrdinate sparse matrix of dtype 'float64'
    with 3 stored elements and shape (3, 2)>

In [55]: A.todense()
Out[55]: 
matrix([[3., 0.],
        [1., 3.],
        [0., 0.]])

In [56]: rows
Out[56]: [(1, 2, 'a'), (1, 1, 'b'), (2, 1, 'b')]

In [57]: columns
Out[57]: [(0,), (1,)]

A convenience method Series.sparse.from_coo() is implemented for creating a Series with sparse values from a scipy.sparse.coo_matrix.

In [58]: from scipy import sparse

In [59]: A = sparse.coo_matrix(([3.0, 1.0, 2.0], ([1, 0, 0], [0, 2, 3])), shape=(3, 4))

In [60]: A
Out[60]: 
<COOrdinate sparse matrix of dtype 'float64'
    with 3 stored elements and shape (3, 4)>

In [61]: A.todense()
Out[61]: 
matrix([[0., 0., 1., 2.],
        [3., 0., 0., 0.],
        [0., 0., 0., 0.]])

The default behaviour (with dense_index=False) simply returns a Series containing only the non-null entries.

In [62]: ss = pd.Series.sparse.from_coo(A)

In [63]: ss
Out[63]: 
0  2    1.0
   3    2.0
1  0    3.0
dtype: Sparse[float64, nan]

Specifying dense_index=True will result in an index that is the Cartesian product of the row and column coordinates of the matrix. Note that this will consume a significant amount of memory (relative to dense_index=False) if the sparse matrix is large (and sparse) enough.

In [64]: ss_dense = pd.Series.sparse.from_coo(A, dense_index=True)

In [65]: ss_dense
Out[65]: 
1  0    3.0
   2    NaN
   3    NaN
0  0    NaN
   2    1.0
   3    2.0
   0    NaN
   2    1.0
   3    2.0
dtype: Sparse[float64, nan]
