Note
The SparsePanel class has been removed in 0.19.0.
We have implemented “sparse” versions of Series and DataFrame. These are not sparse in the typical “mostly 0” sense. Rather, you can view these objects as being “compressed”, where any data matching a specific value (NaN / missing value, though any value can be chosen) is omitted. A special SparseIndex object tracks where data has been “sparsified”. This will make much more sense in an example. All of the standard pandas data structures have a to_sparse method:
In [1]: ts = pd.Series(randn(10))

In [2]: ts[2:-2] = np.nan

In [3]: sts = ts.to_sparse()

In [4]: sts
Out[4]:
0    0.469112
1   -0.282863
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8   -0.861849
9   -2.104569
dtype: float64
BlockIndex
Block locations: array([0, 8], dtype=int32)
Block lengths: array([2, 2], dtype=int32)
The to_sparse method takes a kind argument (for the sparse index, see below) and a fill_value. So if we had a mostly zero Series, we could convert it to sparse with fill_value=0:
In [5]: ts.fillna(0).to_sparse(fill_value=0)
Out[5]:
0    0.469112
1   -0.282863
2    0.000000
3    0.000000
4    0.000000
5    0.000000
6    0.000000
7    0.000000
8   -0.861849
9   -2.104569
dtype: float64
BlockIndex
Block locations: array([0, 8], dtype=int32)
Block lengths: array([2, 2], dtype=int32)
The sparse objects exist for memory efficiency reasons. Suppose you had a large, mostly NA DataFrame:
In [6]: df = pd.DataFrame(randn(10000, 4))

In [7]: df.ix[:9998] = np.nan

In [8]: sdf = df.to_sparse()

In [9]: sdf
Out[9]:
             0         1         2         3
0          NaN       NaN       NaN       NaN
1          NaN       NaN       NaN       NaN
2          NaN       NaN       NaN       NaN
3          NaN       NaN       NaN       NaN
4          NaN       NaN       NaN       NaN
5          NaN       NaN       NaN       NaN
6          NaN       NaN       NaN       NaN
...        ...       ...       ...       ...
9993       NaN       NaN       NaN       NaN
9994       NaN       NaN       NaN       NaN
9995       NaN       NaN       NaN       NaN
9996       NaN       NaN       NaN       NaN
9997       NaN       NaN       NaN       NaN
9998       NaN       NaN       NaN       NaN
9999  0.280249 -1.648493  1.490865 -0.890819

[10000 rows x 4 columns]

In [10]: sdf.density
Out[10]: 0.0001
As you can see, the density (the fraction of values that have not been “compressed”) is extremely low. This sparse object takes up much less memory on disk (pickled) and in the Python interpreter. Functionally, sparse objects should behave nearly identically to their dense counterparts.
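To make the density figure concrete, here is a minimal plain-Python sketch of the calculation. It is not how pandas computes `density` internally; the helper name `density` and the list-of-rows layout are assumptions made for illustration.

```python
import math


def density(rows, fill_value=float("nan")):
    """Fraction of entries that differ from fill_value (NaN matches NaN)."""

    def is_fill(value):
        if isinstance(fill_value, float) and math.isnan(fill_value):
            return isinstance(value, float) and math.isnan(value)
        return value == fill_value

    total = sum(len(row) for row in rows)
    stored = sum(1 for row in rows for value in row if not is_fill(value))
    return stored / total


nan = float("nan")
# 9999 all-NaN rows followed by one fully populated row, as in the example above.
frame = [[nan] * 4 for _ in range(9999)]
frame.append([0.280249, -1.648493, 1.490865, -0.890819])

print(density(frame))  # 0.0001, i.e. 4 stored values out of 40000
```

Only 4 of the 40000 cells would need to be stored, which is where the memory saving comes from.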
Any sparse object can be converted back to the standard dense form by calling to_dense:
In [11]: sts.to_dense()
Out[11]:
0    0.469112
1   -0.282863
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8   -0.861849
9   -2.104569
dtype: float64
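The round trip between the dense and “compressed” forms can be sketched in plain Python. This is only an illustration of the idea under simple assumptions (values held in lists, positions held as plain integers); the function names `sparsify` and `densify` are hypothetical and are not part of pandas.

```python
import math


def _matches_fill(value, fill_value):
    """True if value equals fill_value, treating NaN as equal to NaN."""
    if isinstance(fill_value, float) and math.isnan(fill_value):
        return isinstance(value, float) and math.isnan(value)
    return value == fill_value


def sparsify(values, fill_value=float("nan")):
    """Keep only the entries that differ from fill_value, plus their positions."""
    positions = [i for i, v in enumerate(values) if not _matches_fill(v, fill_value)]
    stored = [values[i] for i in positions]
    return stored, positions


def densify(stored, positions, length, fill_value=float("nan")):
    """Rebuild the full-length list, putting fill_value wherever nothing was stored."""
    dense = [fill_value] * length
    for pos, value in zip(positions, stored):
        dense[pos] = value
    return dense


nan = float("nan")
ts = [0.469112, -0.282863, nan, nan, nan, nan, nan, nan, -0.861849, -2.104569]

stored, positions = sparsify(ts)
print(stored)     # [0.469112, -0.282863, -0.861849, -2.104569]
print(positions)  # [0, 1, 8, 9]

# The same data with zeros instead of NaN, compressed with fill_value=0.
zeros = [0.0 if math.isnan(v) else v for v in ts]
stored0, positions0 = sparsify(zeros, fill_value=0)
print(densify(stored0, positions0, len(zeros), fill_value=0) == zeros)  # True
```

Choosing the fill value to match whatever dominates the data (NaN in the first case, 0 in the second) is what keeps the stored part small.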
SparseArray is the base layer for all of the sparse indexed data structures. It is a 1-dimensional ndarray-like object storing only values distinct from the fill_value:
In [12]: arr = np.random.randn(10)

In [13]: arr[2:5] = np.nan; arr[7:8] = np.nan

In [14]: sparr = pd.SparseArray(arr)

In [15]: sparr
Out[15]:
[-1.95566352972, -1.6588664276, nan, nan, nan, 1.15893288864, 0.145297113733, nan, 0.606027190513, 1.33421134013]
Fill: nan
IntIndex
Indices: array([0, 1, 5, 6, 8, 9], dtype=int32)
Like the indexed objects (SparseSeries, SparseDataFrame), a SparseArray can be converted back to a regular ndarray by calling to_dense:
In [16]: sparr.to_dense()
Out[16]:
array([-1.9557, -1.6589,     nan,     nan,     nan,  1.1589,  0.1453,
           nan,  0.606 ,  1.3342])
The SparseList class has been deprecated and will be removed in a future version. See the docs of a previous version for documentation on SparseList.
Two kinds of SparseIndex are implemented, block and integer. We recommend using block as it’s more memory efficient. The integer format keeps an array of all of the locations where the data are not equal to the fill value. The block format tracks only the locations and sizes of blocks of data.
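The difference between the two index kinds can be sketched in plain Python. This is a simplified illustration, not pandas' implementation; `integer_index` and `block_index` are hypothetical helper names.

```python
def integer_index(positions):
    """Integer format: one entry per stored value."""
    return list(positions)


def block_index(positions):
    """Block format: (locations, lengths) of runs of consecutive positions.

    Assumes positions are sorted in increasing order.
    """
    locations, lengths = [], []
    for pos in positions:
        if locations and pos == locations[-1] + lengths[-1]:
            lengths[-1] += 1          # pos extends the current block
        else:
            locations.append(pos)     # pos starts a new block
            lengths.append(1)
    return locations, lengths


# Stored positions of the Series example: values at 0, 1, 8 and 9.
print(block_index([0, 1, 8, 9]))        # ([0, 8], [2, 2])

# Stored positions of the SparseArray example shown with an IntIndex.
print(integer_index([0, 1, 5, 6, 8, 9]))  # [0, 1, 5, 6, 8, 9]
print(block_index([0, 1, 5, 6, 8, 9]))    # ([0, 5, 8], [2, 2, 2])
```

In this sketch the block format needs two numbers per run while the integer format needs one number per stored value, so the block format is the smaller of the two when the stored values form long contiguous runs.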
Sparse data should have the same dtype as its dense representation. Currently, float64, int64 and bool dtypes are supported. The default fill_value depends on the original dtype:
float64: np.nan
int64: 0
bool: False

In [17]: s = pd.Series([1, np.nan, np.nan])

In [18]: s
Out[18]:
0    1.0
1    NaN
2    NaN
dtype: float64

In [19]: s.to_sparse()
Out[19]:
0    1.0
1    NaN
2    NaN
dtype: float64
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([1], dtype=int32)

In [20]: s = pd.Series([1, 0, 0])

In [21]: s
Out[21]:
0    1
1    0
2    0
dtype: int64

In [22]: s.to_sparse()
Out[22]:
0    1
1    0
2    0
dtype: int64
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([1], dtype=int32)

In [23]: s = pd.Series([True, False, True])

In [24]: s
Out[24]:
0     True
1    False
2     True
dtype: bool

In [25]: s.to_sparse()
Out[25]:
0     True
1    False
2     True
dtype: bool
BlockIndex
Block locations: array([0, 2], dtype=int32)
Block lengths: array([1, 1], dtype=int32)
You can change the dtype using .astype(); the result is also sparse. Note that .astype() also converts the fill_value, so that the dense representation stays the same.
In [26]: s = pd.Series([1, 0, 0, 0, 0])

In [27]: s
Out[27]:
0    1
1    0
2    0
3    0
4    0
dtype: int64

In [28]: ss = s.to_sparse()

In [29]: ss
Out[29]:
0    1
1    0
2    0
3    0
4    0
dtype: int64
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([1], dtype=int32)

In [30]: ss.astype(np.float64)
Out[30]:
0    1.0
1    0.0
2    0.0
3    0.0
4    0.0
dtype: float64
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([1], dtype=int32)
.astype() raises if any value, such as the fill_value in the example below, cannot be coerced to the specified dtype.
In [1]: ss = pd.Series([1, np.nan, np.nan]).to_sparse()
0    1.0
1    NaN
2    NaN
dtype: float64
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([1], dtype=int32)

In [2]: ss.astype(np.int64)
ValueError: unable to coerce current fill_value nan to int64 dtype
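Why the fill value matters for a dtype change can be sketched in plain Python. The function `cast_sparse` is hypothetical and uses Python's built-in `int` and `float` in place of NumPy dtypes; it is meant as an illustration of the behaviour described above, not as the pandas code path.

```python
def cast_sparse(stored, fill_value, caster):
    """Cast both the stored values and the fill value with `caster`.

    Raises ValueError if the fill value cannot be represented in the target type.
    """
    try:
        new_fill = caster(fill_value)
    except (ValueError, OverflowError) as exc:
        raise ValueError(
            "unable to coerce current fill_value {} to {} dtype".format(
                fill_value, caster.__name__
            )
        ) from exc
    return [caster(value) for value in stored], new_fill


# int -> float: the stored value and the fill value are both converted.
print(cast_sparse([1], 0, float))  # ([1.0], 0.0)

# float with a NaN fill value -> int: NaN has no integer representation.
try:
    cast_sparse([1.0], float("nan"), int)
except ValueError as err:
    print(err)  # unable to coerce current fill_value nan to int dtype
```

Even though every stored value here converts cleanly to an integer, the omitted positions are defined by the fill value, so the conversion as a whole has to fail.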
You can apply NumPy ufuncs to SparseArray and get a SparseArray as a result.
In [31]: arr = pd.SparseArray([1., np.nan, np.nan, -2., np.nan])

In [32]: np.abs(arr)
Out[32]:
[1.0, nan, nan, 2.0, nan]
Fill: nan
IntIndex
Indices: array([0, 3], dtype=int32)
The ufunc is also applied to fill_value. This is needed to get the correct dense result.
In [33]: arr = pd.SparseArray([1., -1, -1, -2., -1], fill_value=-1)

In [34]: np.abs(arr)
Out[34]:
[1.0, 1, 1, 2.0, 1]
Fill: 1
IntIndex
Indices: array([0, 3], dtype=int32)

In [35]: np.abs(arr).to_dense()
Out[35]: array([ 1.,  1.,  1.,  2.,  1.])
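The reason the function has to be applied to the fill value as well can be shown with a small plain-Python sketch. The helpers `apply_elementwise` and `to_dense` are hypothetical stand-ins, and the built-in `abs` plays the role of `np.abs`.

```python
def apply_elementwise(func, stored, positions, fill_value):
    """Apply func to the stored values and to the fill value; positions are unchanged."""
    return [func(v) for v in stored], positions, func(fill_value)


def to_dense(stored, positions, length, fill_value):
    """Expand the sparse pieces back into a full-length list."""
    dense = [fill_value] * length
    for pos, value in zip(positions, stored):
        dense[pos] = value
    return dense


# Sparse form of [1.0, -1, -1, -2.0, -1] with fill_value=-1.
stored, positions, fill_value, length = [1.0, -2.0], [0, 3], -1, 5

new_stored, new_positions, new_fill = apply_elementwise(abs, stored, positions, fill_value)
print(new_fill)                                               # 1
print(to_dense(new_stored, new_positions, length, new_fill))  # [1.0, 1, 1, 2.0, 1]

# Leaving the fill value untouched would give the wrong dense result.
print(to_dense(new_stored, new_positions, length, fill_value))  # [1.0, -1, -1, 2.0, -1]
```

Because the omitted positions all stand for the fill value, transforming it once is equivalent to transforming every omitted element.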
Experimental API to transform between sparse pandas and scipy.sparse structures.
A SparseSeries.to_coo() method is implemented for transforming a SparseSeries indexed by a MultiIndex to a scipy.sparse.coo_matrix.
The method requires a MultiIndex with two or more levels.
In [36]: s = pd.Series([3.0, np.nan, 1.0, 3.0, np.nan, np.nan])

In [37]: s.index = pd.MultiIndex.from_tuples([(1, 2, 'a', 0),
   ....:                                      (1, 2, 'a', 1),
   ....:                                      (1, 1, 'b', 0),
   ....:                                      (1, 1, 'b', 1),
   ....:                                      (2, 1, 'b', 0),
   ....:                                      (2, 1, 'b', 1)],
   ....:                                     names=['A', 'B', 'C', 'D'])
   ....:

In [38]: s
Out[38]:
A  B  C  D
1  2  a  0    3.0
         1    NaN
   1  b  0    1.0
         1    3.0
2  1  b  0    NaN
         1    NaN
dtype: float64

# SparseSeries
In [39]: ss = s.to_sparse()

In [40]: ss
Out[40]:
A  B  C  D
1  2  a  0    3.0
         1    NaN
   1  b  0    1.0
         1    3.0
2  1  b  0    NaN
         1    NaN
dtype: float64
BlockIndex
Block locations: array([0, 2], dtype=int32)
Block lengths: array([1, 2], dtype=int32)
In the example below, we transform the SparseSeries to a sparse representation of a 2-d array by specifying that the first and second MultiIndex levels define labels for the rows and the third and fourth levels define labels for the columns. We also specify that the column and row labels should be sorted in the final sparse representation.
In [41]: A, rows, columns = ss.to_coo(row_levels=['A', 'B'],
   ....:                              column_levels=['C', 'D'],
   ....:                              sort_labels=True)
   ....:

In [42]: A
Out[42]:
<3x4 sparse matrix of type '<type 'numpy.float64'>'
        with 3 stored elements in COOrdinate format>

In [43]: A.todense()
Out[43]:
matrix([[ 0.,  0.,  1.,  3.],
        [ 3.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.]])

In [44]: rows
Out[44]: [(1, 1), (1, 2), (2, 1)]

In [45]: columns
Out[45]: [('a', 0), ('a', 1), ('b', 0), ('b', 1)]
Specifying different row and column labels (and not sorting them) yields a different sparse matrix:
In [46]: A, rows, columns = ss.to_coo(row_levels=['A', 'B', 'C'],
   ....:                              column_levels=['D'],
   ....:                              sort_labels=False)
   ....:

In [47]: A
Out[47]:
<3x2 sparse matrix of type '<type 'numpy.float64'>'
        with 3 stored elements in COOrdinate format>

In [48]: A.todense()
Out[48]:
matrix([[ 3.,  0.],
        [ 1.,  3.],
        [ 0.,  0.]])

In [49]: rows
Out[49]: [(1, 2, 'a'), (1, 1, 'b'), (2, 1, 'b')]

In [50]: columns
Out[50]: [0, 1]
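How the index levels are split into row and column labels can be sketched in plain Python. This is an illustration that reproduces the two results above under simple assumptions (the series is a dict keyed by index tuples, and labels are taken from every index entry, including those whose value is NaN); `to_coo_sketch` and `to_dense_matrix` are hypothetical names and this is not the pandas or scipy implementation.

```python
import math


def _label(key, level_positions):
    """Pick the requested levels out of an index tuple (a scalar if only one level)."""
    picked = tuple(key[i] for i in level_positions)
    return picked[0] if len(picked) == 1 else picked


def _unique(labels, sort_labels):
    """Distinct labels in order of first appearance, optionally sorted."""
    ordered = list(dict.fromkeys(labels))
    return sorted(ordered) if sort_labels else ordered


def to_coo_sketch(series, names, row_levels, column_levels, sort_labels=False):
    """Return ((row, col, value) triplets, row labels, column labels)."""
    row_pos = [names.index(name) for name in row_levels]
    col_pos = [names.index(name) for name in column_levels]
    rows = _unique([_label(key, row_pos) for key in series], sort_labels)
    columns = _unique([_label(key, col_pos) for key in series], sort_labels)
    triplets = [
        (rows.index(_label(key, row_pos)), columns.index(_label(key, col_pos)), value)
        for key, value in series.items()
        if not math.isnan(value)  # NaN entries are not stored
    ]
    return triplets, rows, columns


def to_dense_matrix(triplets, n_rows, n_cols):
    """Expand COO triplets into a list of lists, with 0.0 where nothing is stored."""
    dense = [[0.0] * n_cols for _ in range(n_rows)]
    for r, c, value in triplets:
        dense[r][c] = value
    return dense


nan = float("nan")
names = ["A", "B", "C", "D"]
s = {
    (1, 2, "a", 0): 3.0, (1, 2, "a", 1): nan,
    (1, 1, "b", 0): 1.0, (1, 1, "b", 1): 3.0,
    (2, 1, "b", 0): nan, (2, 1, "b", 1): nan,
}

triplets, rows, columns = to_coo_sketch(s, names, ["A", "B"], ["C", "D"], sort_labels=True)
print(rows)     # [(1, 1), (1, 2), (2, 1)]
print(columns)  # [('a', 0), ('a', 1), ('b', 0), ('b', 1)]
print(to_dense_matrix(triplets, len(rows), len(columns)))
# [[0.0, 0.0, 1.0, 3.0], [3.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]

triplets2, rows2, columns2 = to_coo_sketch(s, names, ["A", "B", "C"], ["D"])
print(rows2)    # [(1, 2, 'a'), (1, 1, 'b'), (2, 1, 'b')]
print(to_dense_matrix(triplets2, len(rows2), len(columns2)))
# [[3.0, 0.0], [1.0, 3.0], [0.0, 0.0]]
```

Note that the label (2, 1) still gets a row of its own even though all of its values are missing, which is why the matrices above contain an all-zero row.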
A convenience method SparseSeries.from_coo() is implemented for creating a SparseSeries from a scipy.sparse.coo_matrix.
In [51]: from scipy import sparse

In [52]: A = sparse.coo_matrix(([3.0, 1.0, 2.0], ([1, 0, 0], [0, 2, 3])),
   ....:                       shape=(3, 4))
   ....:

In [53]: A
Out[53]:
<3x4 sparse matrix of type '<type 'numpy.float64'>'
        with 3 stored elements in COOrdinate format>

In [54]: A.todense()
Out[54]:
matrix([[ 0.,  0.,  1.,  2.],
        [ 3.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.]])
The default behaviour (with dense_index=False) simply returns a SparseSeries containing only the non-null entries.
In [55]: ss = pd.SparseSeries.from_coo(A)

In [56]: ss
Out[56]:
0  2    1.0
   3    2.0
1  0    3.0
dtype: float64
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([3], dtype=int32)
Specifying dense_index=True will result in an index that is the Cartesian product of the row and column coordinates of the matrix. Note that this will consume a significant amount of memory (relative to dense_index=False) if the sparse matrix is large (and sparse) enough.
In [57]: ss_dense = pd.SparseSeries.from_coo(A, dense_index=True)

In [58]: ss_dense
Out[58]:
0  0    NaN
   1    NaN
   2    1.0
   3    2.0
1  0    3.0
   1    NaN
   2    NaN
   3    NaN
2  0    NaN
   1    NaN
   2    NaN
   3    NaN
dtype: float64
BlockIndex
Block locations: array([2], dtype=int32)
Block lengths: array([3], dtype=int32)
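The effect of dense_index on the size of the index can be sketched in plain Python. The function `from_coo_sketch` is a hypothetical illustration working on (row, column, value) triplets; it is not the pandas implementation.

```python
import math
from itertools import product


def from_coo_sketch(triplets, shape, dense_index=False):
    """Return a list of ((row, col), value) pairs ordered by index.

    With dense_index=True the index is the Cartesian product of all row and
    column coordinates, and positions without a stored value hold NaN.
    """
    entries = {(r, c): value for r, c, value in triplets}
    if not dense_index:
        return sorted(entries.items())
    full_index = product(range(shape[0]), range(shape[1]))
    return [(key, entries.get(key, float("nan"))) for key in full_index]


# The 3x4 matrix from the example: 3.0 at (1, 0), 1.0 at (0, 2), 2.0 at (0, 3).
triplets = [(1, 0, 3.0), (0, 2, 1.0), (0, 3, 2.0)]

compact = from_coo_sketch(triplets, shape=(3, 4))
print(compact)       # [((0, 2), 1.0), ((0, 3), 2.0), ((1, 0), 3.0)]

dense = from_coo_sketch(triplets, shape=(3, 4), dense_index=True)
print(len(dense))    # 12 index entries for only 3 stored values
```

The stored values are the same in both cases; only the index grows, from one entry per stored value to rows times columns entries, which is the memory cost mentioned above.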