Sparse data structures#
pandas provides data structures for efficiently storing sparse data. These are not necessarily sparse in the typical "mostly 0" sense. Rather, you can view these objects as being "compressed": any data matching a specific value (NaN / missing value by default, though any value can be chosen, including 0) is omitted. The compressed values are not actually stored in the array.
In [1]: arr = np.random.randn(10)

In [2]: arr[2:-2] = np.nan

In [3]: ts = pd.Series(pd.arrays.SparseArray(arr))

In [4]: ts
Out[4]: 
0    0.469112
1   -0.282863
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8   -0.861849
9   -2.104569
dtype: Sparse[float64, nan]
Notice the dtype, Sparse[float64, nan]. The nan means that elements in the array that are nan aren't actually stored; only the non-nan elements are. Those non-nan elements have a float64 dtype.
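The two parts of the dtype can also be inspected programmatically. The following is a minimal sketch, assuming a small hand-written array in place of the random data above; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hand-written data (an assumption for reproducibility; the text uses random values).
values = np.array([0.5, -0.3, np.nan, np.nan, np.nan, 1.2])
ts = pd.Series(pd.arrays.SparseArray(values))

subtype = ts.dtype.subtype        # dtype of the values that are stored
fill_value = ts.dtype.fill_value  # value that is omitted from storage
stored = ts.sparse.sp_values      # NumPy array holding only the stored values

print(subtype, fill_value, stored)
```

Only the three non-nan values end up in the stored array; the three nan entries are represented by the fill value alone.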
The sparse objects exist for memory efficiency reasons. Suppose you had a large, mostly NA DataFrame:
In [5]: df = pd.DataFrame(np.random.randn(10000, 4))

In [6]: df.iloc[:9998] = np.nan

In [7]: sdf = df.astype(pd.SparseDtype("float", np.nan))

In [8]: sdf.head()
Out[8]: 
    0   1   2   3
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN

In [9]: sdf.dtypes
Out[9]: 
0    Sparse[float64, nan]
1    Sparse[float64, nan]
2    Sparse[float64, nan]
3    Sparse[float64, nan]
dtype: object

In [10]: sdf.sparse.density
Out[10]: 0.0002
As you can see, the density (the fraction of values that have not been "compressed") is extremely low. This sparse object takes up much less memory on disk (pickled) and in the Python interpreter.
In [11]: 'dense : {:0.2f} KB'.format(df.memory_usage().sum() / 1e3)
Out[11]: 'dense : 320.13 KB'

In [12]: 'sparse: {:0.2f} KB'.format(sdf.memory_usage().sum() / 1e3)
Out[12]: 'sparse: 0.22 KB'
Functionally, their behavior should be nearly identical to their dense counterparts.
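The memory comparison can be reproduced deterministically. This is a sketch under the assumption that a constant-filled frame is an acceptable stand-in for the random one above; the exact byte counts depend on the pandas version and platform, so they are printed rather than stated.

```python
import numpy as np
import pandas as pd

# A 10000 x 4 frame that is NaN everywhere except the last two rows.
df = pd.DataFrame(np.full((10_000, 4), np.nan))
df.iloc[-2:] = 1.0

sdf = df.astype(pd.SparseDtype("float", np.nan))

dense_bytes = int(df.memory_usage().sum())
sparse_bytes = int(sdf.memory_usage().sum())
density = sdf.sparse.density  # fraction of values that are actually stored

print(f"dense:   {dense_bytes} bytes")
print(f"sparse:  {sparse_bytes} bytes")
print(f"density: {density}")
```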
SparseArray#
arrays.SparseArray is an ExtensionArray for storing an array of sparse values (see dtypes for more on extension arrays). It is a 1-dimensional ndarray-like object storing only values distinct from the fill_value:
In [13]: arr = np.random.randn(10)

In [14]: arr[2:5] = np.nan

In [15]: arr[7:8] = np.nan

In [16]: sparr = pd.arrays.SparseArray(arr)

In [17]: sparr
Out[17]: 
[-1.9556635297215477, -1.6588664275960427, nan, nan, nan, 1.1589328886422277, 0.14529711373305043, nan, 0.6060271905134522, 1.3342113401317768]
Fill: nan
IntIndex
Indices: array([0, 1, 5, 6, 8, 9], dtype=int32)
A sparse array can be converted to a regular (dense) ndarray with numpy.asarray():
In [18]: np.asarray(sparr)
Out[18]: 
array([-1.9557, -1.6589,     nan,     nan,     nan,  1.1589,  0.1453,
           nan,  0.606 ,  1.3342])
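The same storage model applies when the fill value is something other than nan. A minimal sketch with fill_value=0 follows; it assumes the default integer index kind, which is what exposes the indices attribute on sp_index.

```python
import numpy as np
import pandas as pd

sparr = pd.arrays.SparseArray([0, 0, 1, 2, 0], fill_value=0)

stored = sparr.sp_values            # values that differ from fill_value
positions = sparr.sp_index.indices  # positions of the stored values (integer index kind assumed)
dense = np.asarray(sparr)           # fill_value is written back into the gaps

print(stored, positions, dense)
```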
SparseDtype#
The SparseArray.dtype property stores two pieces of information:
1. The dtype of the non-sparse values
2. The scalar fill value
In [19]: sparr.dtype
Out[19]: Sparse[float64, nan]
A SparseDtype may be constructed by passing only a dtype:
In [20]: pd.SparseDtype(np.dtype('datetime64[ns]'))
Out[20]: Sparse[datetime64[ns], numpy.datetime64('NaT')]
in which case a default fill value will be used (for NumPy dtypes this is often the "missing" value for that dtype). To override this default, an explicit fill value may be passed instead:
In [21]: pd.SparseDtype(np.dtype('datetime64[ns]'),
   ....:                fill_value=pd.Timestamp('2017-01-01'))
   ....: 
Out[21]: Sparse[datetime64[ns], Timestamp('2017-01-01 00:00:00')]
Finally, the string alias 'Sparse[dtype]' may be used to specify a sparse dtype in many places:
In [22]: pd.array([1, 0, 0, 2], dtype='Sparse[int]')
Out[22]: 
[1, 0, 0, 2]
Fill: 0
IntIndex
Indices: array([0, 3], dtype=int32)
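To see which default fill value a given subtype receives, the fill_value attribute of the constructed dtype can be checked directly. The sketch below covers three common subtypes and one explicit override; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Default fill values chosen by pandas when only a dtype is given.
int_default = pd.SparseDtype("int64").fill_value
float_default = pd.SparseDtype("float64").fill_value
dt_default = pd.SparseDtype("datetime64[ns]").fill_value

# An explicit fill value overrides the default.
custom = pd.SparseDtype("float64", fill_value=0.0)
arr = pd.array([1.0, 0.0, 0.0, 2.0], dtype=custom)

print(int_default, float_default, dt_default)
print(arr.npoints)  # number of values that differ from the custom fill value
```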
Sparse accessor#
pandas provides a .sparse accessor, similar to .str for string data, .cat for categorical data, and .dt for datetime-like data. This namespace provides attributes and methods that are specific to sparse data.
In [23]: s = pd.Series([0, 0, 1, 2], dtype="Sparse[int]")

In [24]: s.sparse.density
Out[24]: 0.5

In [25]: s.sparse.fill_value
Out[25]: 0
This accessor is available only on data with SparseDtype, and on the Series class itself for creating a Series with sparse data from a scipy COO matrix with Series.sparse.from_coo().
A .sparse accessor has been added for DataFrame as well. See Sparse accessor for more.
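The restriction to sparse data can be observed directly: accessing .sparse on a dense Series raises AttributeError. A minimal sketch, with illustrative variable names:

```python
import pandas as pd

sparse_s = pd.Series([0, 0, 1, 2], dtype="Sparse[int]")
dense_s = pd.Series([0, 0, 1, 2])

density = sparse_s.sparse.density   # fraction of values stored
n_stored = sparse_s.sparse.npoints  # count of values that differ from the fill value

# The accessor is rejected on data that does not have a SparseDtype.
try:
    dense_s.sparse
    dense_has_accessor = True
except AttributeError:
    dense_has_accessor = False

print(density, n_stored, dense_has_accessor)
```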
Sparse calculation#
You can apply NumPy ufuncs to arrays.SparseArray and get an arrays.SparseArray as a result.
In [26]: arr = pd.arrays.SparseArray([1., np.nan, np.nan, -2., np.nan])

In [27]: np.abs(arr)
Out[27]: 
[1.0, nan, nan, 2.0, nan]
Fill: nan
IntIndex
Indices: array([0, 3], dtype=int32)
The ufunc is also applied to fill_value. This is needed to get the correct dense result.
In [28]: arr = pd.arrays.SparseArray([1., -1, -1, -2., -1], fill_value=-1)

In [29]: np.abs(arr)
Out[29]: 
[1, 1, 1, 2.0, 1]
Fill: 1
IntIndex
Indices: array([3], dtype=int32)

In [30]: np.abs(arr).to_dense()
Out[30]: array([1., 1., 1., 2., 1.])
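Why the fill value must be transformed as well becomes clearer with a ufunc that does not map the fill value to itself. The sketch below uses np.exp on an array filled with 0.0, so the fill value has to become 1.0 for the dense result to be right; it is an illustration, not a description of the internal implementation.

```python
import numpy as np
import pandas as pd

arr = pd.arrays.SparseArray([0.0, 0.0, 1.0, 0.0], fill_value=0.0)

result = np.exp(arr)  # applied to both the stored values and the fill value

print(result.fill_value)   # the fill value after the ufunc
print(np.asarray(result))  # dense view of the result
```

If only the stored value were transformed, the three omitted positions would still read as 0.0 instead of exp(0.0).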
Conversion
To convert data from sparse to dense, use the .sparse accessors:
In [31]: sdf.sparse.to_dense()
Out[31]: 
             0         1         2         3
0          NaN       NaN       NaN       NaN
1          NaN       NaN       NaN       NaN
2          NaN       NaN       NaN       NaN
3          NaN       NaN       NaN       NaN
4          NaN       NaN       NaN       NaN
...        ...       ...       ...       ...
9995       NaN       NaN       NaN       NaN
9996       NaN       NaN       NaN       NaN
9997       NaN       NaN       NaN       NaN
9998  0.509184 -0.774928 -1.369894 -0.382141
9999  0.280249 -1.648493  1.490865 -0.890819

[10000 rows x 4 columns]
From dense to sparse, use DataFrame.astype() with a SparseDtype.
In [32]: dense = pd.DataFrame({"A": [1, 0, 0, 1]})

In [33]: dtype = pd.SparseDtype(int, fill_value=0)

In [34]: dense.astype(dtype)
Out[34]: 
   A
0  1
1  0
2  0
3  1
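The two conversions compose into a lossless round trip. A minimal sketch, assuming a small two-column frame (the column "B" is added here for illustration):

```python
import pandas as pd

dense = pd.DataFrame({"A": [1, 0, 0, 1], "B": [0, 0, 0, 5]})

# Dense -> sparse: zeros become the omitted fill value.
sparse_df = dense.astype(pd.SparseDtype("int64", fill_value=0))

# Sparse -> dense: the fill value is written back in.
restored = sparse_df.sparse.to_dense()

print(sparse_df.dtypes)
print(restored.equals(dense))
```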
Interaction with scipy.sparse#
Use DataFrame.sparse.from_spmatrix() to create a DataFrame with sparse values from a sparse matrix.
In [35]: from scipy.sparse import csr_matrix

In [36]: arr = np.random.random(size=(1000, 5))

In [37]: arr[arr < .9] = 0

In [38]: sp_arr = csr_matrix(arr)

In [39]: sp_arr
Out[39]: 
<Compressed Sparse Row sparse matrix of dtype 'float64'
    with 517 stored elements and shape (1000, 5)>

In [40]: sdf = pd.DataFrame.sparse.from_spmatrix(sp_arr)

In [41]: sdf.head()
Out[41]: 
          0  1  2         3  4
0   0.95638  0  0         0  0
1         0  0  0         0  0
2         0  0  0         0  0
3         0  0  0         0  0
4  0.999552  0  0  0.956153  0

In [42]: sdf.dtypes
Out[42]: 
0    Sparse[float64, 0]
1    Sparse[float64, 0]
2    Sparse[float64, 0]
3    Sparse[float64, 0]
4    Sparse[float64, 0]
dtype: object
All sparse formats are supported, but matrices that are not in COOrdinate format will be converted, copying data as needed. To convert back to a sparse SciPy matrix in COO format, you can use the DataFrame.sparse.to_coo() method:
In [43]: sdf.sparse.to_coo()
Out[43]: 
<COOrdinate sparse matrix of dtype 'float64'
    with 517 stored elements and shape (1000, 5)>
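A small deterministic round trip makes the conversion easier to check by eye. This sketch assumes SciPy is installed and uses a hand-written 3 x 3 matrix in place of the random data above.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

dense = np.array(
    [
        [0.0, 1.5, 0.0],
        [0.0, 0.0, 0.0],
        [2.5, 0.0, 0.0],
    ]
)

sp = csr_matrix(dense)                       # CSR input, converted internally as needed
sdf = pd.DataFrame.sparse.from_spmatrix(sp)  # columns get a sparse dtype with fill value 0
coo = sdf.sparse.to_coo()                    # back to SciPy, in COO format

round_trip = coo.toarray()
print(coo.format, coo.nnz)
print(round_trip)
```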
Series.sparse.to_coo() is implemented for transforming a Series with sparse values indexed by a MultiIndex to a scipy.sparse.coo_matrix.
The method requires a MultiIndex with two or more levels.
In [44]: s = pd.Series([3.0, np.nan, 1.0, 3.0, np.nan, np.nan])

In [45]: s.index = pd.MultiIndex.from_tuples(
   ....:     [
   ....:         (1, 2, "a", 0),
   ....:         (1, 2, "a", 1),
   ....:         (1, 1, "b", 0),
   ....:         (1, 1, "b", 1),
   ....:         (2, 1, "b", 0),
   ....:         (2, 1, "b", 1),
   ....:     ],
   ....:     names=["A", "B", "C", "D"],
   ....: )
   ....: 

In [46]: ss = s.astype('Sparse')

In [47]: ss
Out[47]: 
A  B  C  D
1  2  a  0    3.0
         1    NaN
   1  b  0    1.0
         1    3.0
2  1  b  0    NaN
         1    NaN
dtype: Sparse[float64, nan]
In the example below, we transform the Series to a sparse representation of a 2-d array by specifying that the first and second MultiIndex levels define labels for the rows and the third and fourth levels define labels for the columns. We also specify that the column and row labels should be sorted in the final sparse representation.
In [48]: A, rows, columns = ss.sparse.to_coo(
   ....:     row_levels=["A", "B"], column_levels=["C", "D"], sort_labels=True
   ....: )
   ....: 

In [49]: A
Out[49]: 
<COOrdinate sparse matrix of dtype 'float64'
    with 3 stored elements and shape (3, 4)>

In [50]: A.todense()
Out[50]: 
matrix([[0., 0., 1., 3.],
        [3., 0., 0., 0.],
        [0., 0., 0., 0.]])

In [51]: rows
Out[51]: [(1, 1), (1, 2), (2, 1)]

In [52]: columns
Out[52]: [('a', 0), ('a', 1), ('b', 0), ('b', 1)]
Specifying different row and column labels (and not sorting them) yields a different sparse matrix:
In [53]: A, rows, columns = ss.sparse.to_coo(
   ....:     row_levels=["A", "B", "C"], column_levels=["D"], sort_labels=False
   ....: )
   ....: 

In [54]: A
Out[54]: 
<COOrdinate sparse matrix of dtype 'float64'
    with 3 stored elements and shape (3, 2)>

In [55]: A.todense()
Out[55]: 
matrix([[3., 0.],
        [1., 3.],
        [0., 0.]])

In [56]: rows
Out[56]: [(1, 2, 'a'), (1, 1, 'b'), (2, 1, 'b')]

In [57]: columns
Out[57]: [(0,), (1,)]
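The smallest case the method accepts is a two-level MultiIndex, with one level for rows and one for columns. The sketch below assumes SciPy is installed; the level names "row" and "col" and the labels are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("x", "a"), ("x", "b"), ("y", "a"), ("y", "b")],
    names=["row", "col"],  # hypothetical level names
)
ss = pd.Series([3.0, np.nan, 1.0, 2.0], index=index).astype("Sparse[float64]")

A, rows, columns = ss.sparse.to_coo(
    row_levels=["row"], column_levels=["col"], sort_labels=True
)

dense = A.toarray()  # the NaN entry is not stored, so it appears as 0 here
print(dense)
print(rows, columns)
```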
A convenience method Series.sparse.from_coo() is implemented for creating a Series with sparse values from a scipy.sparse.coo_matrix.
In [58]: from scipy import sparse

In [59]: A = sparse.coo_matrix(([3.0, 1.0, 2.0], ([1, 0, 0], [0, 2, 3])), shape=(3, 4))

In [60]: A
Out[60]: 
<COOrdinate sparse matrix of dtype 'float64'
    with 3 stored elements and shape (3, 4)>

In [61]: A.todense()
Out[61]: 
matrix([[0., 0., 1., 2.],
        [3., 0., 0., 0.],
        [0., 0., 0., 0.]])
The default behaviour (with dense_index=False) simply returns a Series containing only the non-null entries.
In [62]: ss = pd.Series.sparse.from_coo(A)

In [63]: ss
Out[63]: 
0  2    1.0
   3    2.0
1  0    3.0
dtype: Sparse[float64, nan]
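The default result can be verified against the matrix: the Series has one entry per stored element, indexed by (row, column). A minimal sketch, assuming SciPy is installed and reusing the same matrix definition:

```python
import pandas as pd
from scipy import sparse

A = sparse.coo_matrix(([3.0, 1.0, 2.0], ([1, 0, 0], [0, 2, 3])), shape=(3, 4))

ss = pd.Series.sparse.from_coo(A)  # dense_index=False is the default

coords = ss.index.tolist()               # (row, column) pairs of the stored entries
values = ss.sparse.to_dense().tolist()   # the corresponding values

print(coords)
print(values)
```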
Specifying dense_index=True will result in an index that is the Cartesian product of the row and column coordinates of the matrix. Note that this will consume a significant amount of memory (relative to dense_index=False) if the sparse matrix is large (and sparse) enough.
In [64]: ss_dense = pd.Series.sparse.from_coo(A, dense_index=True)

In [65]: ss_dense
Out[65]: 
1  0    3.0
   2    NaN
   3    NaN
0  0    NaN
   2    1.0
   3    2.0
   0    NaN
   2    1.0
   3    2.0
dtype: Sparse[float64, nan]