Sparse data structures

Note

The SparsePanel class has been removed in 0.19.0.

We have implemented “sparse” versions of Series and DataFrame. These are not sparse in the typical “mostly 0” sense. Rather, you can view these objects as being “compressed”: any data matching a specific value (NaN / missing value by default, though any value can be chosen) is omitted. A special SparseIndex object tracks where data has been “sparsified”. This will make much more sense in an example. All of the standard pandas data structures have a to_sparse method (the examples assume numpy is imported as np, pandas as pd, and randn from numpy.random):

In [1]: ts = pd.Series(randn(10))

In [2]: ts[2:-2] = np.nan

In [3]: sts = ts.to_sparse()

In [4]: sts
Out[4]: 
0    0.469112
1   -0.282863
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8   -0.861849
9   -2.104569
dtype: float64
BlockIndex
Block locations: array([0, 8], dtype=int32)
Block lengths: array([2, 2], dtype=int32)

The to_sparse method takes a kind argument (for the sparse index, see below) and a fill_value. So if we had a mostly zero Series, we could convert it to sparse with fill_value=0:

In [5]: ts.fillna(0).to_sparse(fill_value=0)
Out[5]: 
0    0.469112
1   -0.282863
2    0.000000
3    0.000000
4    0.000000
5    0.000000
6    0.000000
7    0.000000
8   -0.861849
9   -2.104569
dtype: float64
BlockIndex
Block locations: array([0, 8], dtype=int32)
Block lengths: array([2, 2], dtype=int32)
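The idea of omitting every value that matches fill_value can be illustrated without pandas. The following is a minimal pure-Python sketch, not the pandas implementation; the helper names sparsify and densify are hypothetical and exist only for this illustration.

```python
import math


def _matches(value, fill_value):
    """Return True if value equals fill_value, treating NaN as equal to NaN."""
    if isinstance(fill_value, float) and math.isnan(fill_value):
        return isinstance(value, float) and math.isnan(value)
    return value == fill_value


def sparsify(values, fill_value=math.nan):
    """Keep only the entries that differ from fill_value, plus their positions."""
    positions = [i for i, v in enumerate(values) if not _matches(v, fill_value)]
    stored = [values[i] for i in positions]
    return stored, positions


def densify(stored, positions, length, fill_value=math.nan):
    """Rebuild the full list by putting fill_value everywhere nothing was stored."""
    dense = [fill_value] * length
    for pos, val in zip(positions, stored):
        dense[pos] = val
    return dense


# Same values as the mostly-zero Series above (hypothetical stand-in for it).
dense = [0.469112, -0.282863, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -0.861849, -2.104569]
stored, positions = sparsify(dense, fill_value=0.0)
restored = densify(stored, positions, len(dense), fill_value=0.0)

# With the default NaN fill value, NaN entries are the ones that get dropped.
nan_stored, nan_positions = sparsify([1.0, math.nan, 2.0])
```

Only four of the ten values are stored here, and the positions list plays the role that the SparseIndex plays in pandas.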

The sparse objects exist for memory efficiency reasons. Suppose you had a large, mostly NA DataFrame:

In [6]: df = pd.DataFrame(randn(10000, 4))

In [7]: df.ix[:9998] = np.nan

In [8]: sdf = df.to_sparse()

In [9]: sdf
Out[9]: 
             0         1         2         3
0          NaN       NaN       NaN       NaN
1          NaN       NaN       NaN       NaN
2          NaN       NaN       NaN       NaN
3          NaN       NaN       NaN       NaN
4          NaN       NaN       NaN       NaN
5          NaN       NaN       NaN       NaN
6          NaN       NaN       NaN       NaN
...        ...       ...       ...       ...
9993       NaN       NaN       NaN       NaN
9994       NaN       NaN       NaN       NaN
9995       NaN       NaN       NaN       NaN
9996       NaN       NaN       NaN       NaN
9997       NaN       NaN       NaN       NaN
9998       NaN       NaN       NaN       NaN
9999  0.280249 -1.648493  1.490865 -0.890819

[10000 rows x 4 columns]

In [10]: sdf.density
Out[10]: 0.0001

As you can see, the density (the fraction of values that have not been “compressed”) is extremely low: 0.0001, or 0.01%. This sparse object takes up much less memory on disk (pickled) and in the Python interpreter. Functionally, sparse objects should behave nearly identically to their dense counterparts.
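The density figure can be reproduced by hand: it is the number of stored (non-fill) values divided by the total number of cells. A minimal pure-Python sketch follows, assuming NaN is the fill value; the density function is a hypothetical helper, not the pandas attribute.

```python
import math


def density(rows):
    """Fraction of cells that are not NaN, i.e. that would actually be stored."""
    total = sum(len(row) for row in rows)
    stored = sum(
        1
        for row in rows
        for value in row
        if not (isinstance(value, float) and math.isnan(value))
    )
    return stored / total


# 9999 all-NaN rows followed by one fully populated row, as in the frame above.
rows = [[math.nan] * 4 for _ in range(9999)]
rows.append([0.280249, -1.648493, 1.490865, -0.890819])

d = density(rows)  # 4 stored values out of 40000 cells
```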

Any sparse object can be converted back to the standard dense form by calling to_dense:

In [11]: sts.to_dense()
Out[11]: 
0    0.469112
1   -0.282863
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8   -0.861849
9   -2.104569
dtype: float64

SparseArray

SparseArray is the base layer for all of the sparse indexed data structures. It is a 1-dimensional ndarray-like object storing only values distinct from the fill_value:

In [12]: arr = np.random.randn(10)

In [13]: arr[2:5] = np.nan; arr[7:8] = np.nan

In [14]: sparr = pd.SparseArray(arr)

In [15]: sparr
Out[15]: 
[-1.95566352972, -1.6588664276, nan, nan, nan, 1.15893288864, 0.145297113733, nan, 0.606027190513, 1.33421134013]
Fill: nan
IntIndex
Indices: array([0, 1, 5, 6, 8, 9], dtype=int32)

Like the indexed objects (SparseSeries, SparseDataFrame), a SparseArray can be converted back to a regular ndarray by calling to_dense:

In [16]: sparr.to_dense()
Out[16]: 
array([-1.9557, -1.6589,     nan,     nan,     nan,  1.1589,  0.1453,
           nan,  0.606 ,  1.3342])

SparseList

The SparseList class has been deprecated and will be removed in a future version. See the docs of a previous version for documentation on SparseList.

SparseIndex objects

Two kinds of SparseIndex are implemented, block and integer. We recommend using block as it’s more memory efficient. The integer format keeps an array of all of the locations where the data are not equal to the fill value. The block format tracks only the locations and sizes of blocks of data.
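The relationship between the two formats can be shown with a small conversion. This is a pure-Python sketch of the idea only, assuming the integer positions are sorted and unique; to_blocks and from_blocks are hypothetical names, not pandas functions.

```python
def to_blocks(indices):
    """Collapse sorted integer positions into (block_locations, block_lengths)."""
    locations, lengths = [], []
    for pos in indices:
        # Extend the current block if pos directly follows its last element.
        if locations and pos == locations[-1] + lengths[-1]:
            lengths[-1] += 1
        else:
            locations.append(pos)
            lengths.append(1)
    return locations, lengths


def from_blocks(locations, lengths):
    """Expand (block_locations, block_lengths) back into integer positions."""
    return [
        loc + offset
        for loc, length in zip(locations, lengths)
        for offset in range(length)
    ]


# Positions 0, 1, 8, 9 form two runs of length two.
locations, lengths = to_blocks([0, 1, 8, 9])

# Six positions in three runs need only three (location, length) pairs.
locations2, lengths2 = to_blocks([0, 1, 5, 6, 8, 9])
round_trip = from_blocks(locations2, lengths2)
```

When stored values come in long contiguous runs, the block form needs two integers per run instead of one integer per value, which is where the memory saving comes from. If every stored value is isolated, the block form offers no advantage.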

Sparse Dtypes

Sparse data should have the same dtype as its dense representation. Currently, float64, int64 and bool dtypes are supported. The default fill_value depends on the original dtype:

  • float64: np.nan
  • int64: 0
  • bool: False

In [17]: s = pd.Series([1, np.nan, np.nan])

In [18]: s
Out[18]: 
0    1.0
1    NaN
2    NaN
dtype: float64

In [19]: s.to_sparse()
Out[19]: 
0    1.0
1    NaN
2    NaN
dtype: float64
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([1], dtype=int32)

In [20]: s = pd.Series([1, 0, 0])

In [21]: s
Out[21]: 
0    1
1    0
2    0
dtype: int64

In [22]: s.to_sparse()
Out[22]: 
0    1
1    0
2    0
dtype: int64
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([1], dtype=int32)

In [23]: s = pd.Series([True, False, True])

In [24]: s
Out[24]: 
0     True
1    False
2     True
dtype: bool

In [25]: s.to_sparse()
Out[25]: 
0     True
1    False
2     True
dtype: bool
BlockIndex
Block locations: array([0, 2], dtype=int32)
Block lengths: array([1, 1], dtype=int32)

You can change the dtype using .astype(); the result is also sparse. Note that .astype() also converts the fill_value, so that the dense representation is preserved.

In [26]: s = pd.Series([1, 0, 0, 0, 0])

In [27]: s
Out[27]: 
0    1
1    0
2    0
3    0
4    0
dtype: int64

In [28]: ss = s.to_sparse()

In [29]: ss
Out[29]: 
0    1
1    0
2    0
3    0
4    0
dtype: int64
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([1], dtype=int32)

In [30]: ss.astype(np.float64)
Out[30]: 
0    1.0
1    0.0
2    0.0
3    0.0
4    0.0
dtype: float64
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([1], dtype=int32)

.astype() raises an error if any value, including the fill_value, cannot be coerced to the specified dtype.

In [1]: ss = pd.Series([1, np.nan, np.nan]).to_sparse()
0    1.0
1    NaN
2    NaN
dtype: float64
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([1], dtype=int32)

In [2]: ss.astype(np.int64)
ValueError: unable to coerce current fill_value nan to int64 dtype
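Why the fill value matters for a dtype change can be seen in a small model. The sketch below is pure Python and only illustrates the idea; sparse_astype is a hypothetical helper, and Python's built-in int and float stand in for the NumPy dtypes.

```python
import math


def sparse_astype(stored, fill_value, caster):
    """Cast both the stored values and the fill value with caster.

    The fill value represents every omitted entry, so it has to be
    convertible as well; otherwise the dense result would be undefined.
    """
    try:
        new_fill = caster(fill_value)
    except (ValueError, OverflowError) as exc:
        raise ValueError(
            "unable to coerce fill_value {!r}".format(fill_value)
        ) from exc
    return [caster(v) for v in stored], new_fill


# int -> float: both the stored value and the fill value are converted.
float_stored, float_fill = sparse_astype([1], 0, float)

# float with a NaN fill -> int: NaN has no integer equivalent, so this fails.
try:
    sparse_astype([1.0], math.nan, int)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```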

Sparse Calculation

You can apply NumPy ufuncs to SparseArray and get a SparseArray as a result.

In [31]: arr = pd.SparseArray([1., np.nan, np.nan, -2., np.nan])

In [32]: np.abs(arr)
Out[32]: 
[1.0, nan, nan, 2.0, nan]
Fill: nan
IntIndex
Indices: array([0, 3], dtype=int32)

The ufunc is also applied to fill_value. This is needed to get the correct dense result.

In [33]: arr = pd.SparseArray([1., -1, -1, -2., -1], fill_value=-1)

In [34]: np.abs(arr)
Out[34]: 
[1.0, 1, 1, 2.0, 1]
Fill: 1
IntIndex
Indices: array([0, 3], dtype=int32)

In [35]: np.abs(arr).to_dense()
Out[35]: array([1., 1., 1., 2., 1.])
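The need to transform fill_value as well can be checked with a small model of the example above. This is a pure-Python sketch under the assumption that a sparse array is just stored values, their positions, and a fill value; apply_elementwise and to_dense are hypothetical helpers, and the built-in abs stands in for np.abs.

```python
def apply_elementwise(func, stored, fill_value):
    """Apply func to the stored values and to the fill value."""
    return [func(v) for v in stored], func(fill_value)


def to_dense(stored, positions, fill_value, length):
    """Expand the sparse pieces back into a full list."""
    dense = [fill_value] * length
    for pos, val in zip(positions, stored):
        dense[pos] = val
    return dense


original = [1.0, -1, -1, -2.0, -1]
stored, positions, fill_value = [1.0, -2.0], [0, 3], -1

new_stored, new_fill = apply_elementwise(abs, stored, fill_value)

# Correct: the omitted entries become abs(-1) == 1.
result = to_dense(new_stored, positions, new_fill, len(original))

# Incorrect: keeping the old fill value leaves -1 in the omitted slots.
wrong = to_dense(new_stored, positions, fill_value, len(original))
```

As in the output above, the stored positions stay at [0, 3] even though the stored value 1.0 now equals the new fill value. The sketch, like that output, does not re-compress after the operation.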

Interaction with scipy.sparse

pandas provides an experimental API for converting between sparse pandas and scipy.sparse structures.

A SparseSeries.to_coo() method is implemented for transforming a SparseSeries indexed by a MultiIndex to a scipy.sparse.coo_matrix.

The method requires a MultiIndex with two or more levels.

In [36]: s = pd.Series([3.0, np.nan, 1.0, 3.0, np.nan, np.nan])

In [37]: s.index = pd.MultiIndex.from_tuples([(1, 2, 'a', 0),
   ....:                                      (1, 2, 'a', 1),
   ....:                                      (1, 1, 'b', 0),
   ....:                                      (1, 1, 'b', 1),
   ....:                                      (2, 1, 'b', 0),
   ....:                                      (2, 1, 'b', 1)],
   ....:                                     names=['A', 'B', 'C', 'D'])
   ....: 

In [38]: s
Out[38]: 
A  B  C  D
1  2  a  0    3.0
         1    NaN
   1  b  0    1.0
         1    3.0
2  1  b  0    NaN
         1    NaN
dtype: float64

# SparseSeries
In [39]: ss = s.to_sparse()

In [40]: ss
Out[40]: 
A  B  C  D
1  2  a  0    3.0
         1    NaN
   1  b  0    1.0
         1    3.0
2  1  b  0    NaN
         1    NaN
dtype: float64
BlockIndex
Block locations: array([0, 2], dtype=int32)
Block lengths: array([1, 2], dtype=int32)

In the example below, we transform the SparseSeries to a sparse representation of a 2-d array by specifying that the first and second MultiIndex levels define labels for the rows and the third and fourth levels define labels for the columns. We also specify that the column and row labels should be sorted in the final sparse representation.

In [41]: A, rows, columns = ss.to_coo(row_levels=['A', 'B'],
   ....:                              column_levels=['C', 'D'],
   ....:                              sort_labels=True)
   ....: 

In [42]: A
Out[42]: 
<3x4 sparse matrix of type '<type 'numpy.float64'>'
        with 3 stored elements in COOrdinate format>

In [43]: A.todense()
Out[43]: 
matrix([[ 0.,  0.,  1.,  3.],
        [ 3.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.]])

In [44]: rows
Out[44]: [(1, 1), (1, 2), (2, 1)]

In [45]: columns
Out[45]: [('a', 0), ('a', 1), ('b', 0), ('b', 1)]

Specifying different row and column labels (and not sorting them) yields a different sparse matrix:

In [46]: A, rows, columns = ss.to_coo(row_levels=['A', 'B', 'C'],
   ....:                              column_levels=['D'],
   ....:                              sort_labels=False)
   ....: 

In [47]: A
Out[47]: 
<3x2 sparse matrix of type '<type 'numpy.float64'>'
        with 3 stored elements in COOrdinate format>

In [48]: A.todense()
Out[48]: 
matrix([[ 3.,  0.],
        [ 1.,  3.],
        [ 0.,  0.]])

In [49]: rows
Out[49]: [(1, 2, 'a'), (1, 1, 'b'), (2, 1, 'b')]

In [50]: columns
Out[50]: [0, 1]
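The way the two to_coo calls above split index levels into row and column labels can be modelled in a few lines. The sketch below is pure Python and is not the pandas implementation: to_coo_sketch and dense_from_triplets are hypothetical helpers, levels are selected by position rather than by name, and the output is a list of (row, column, value) triplets instead of a scipy matrix.

```python
import math


def _label(key, level_positions):
    """Pick the chosen levels from an index tuple (a scalar for a single level)."""
    if len(level_positions) == 1:
        return key[level_positions[0]]
    return tuple(key[p] for p in level_positions)


def to_coo_sketch(index, values, row_levels, column_levels, sort_labels=False):
    """Return (triplets, rows, columns) for the non-NaN entries."""

    def unique_labels(level_positions):
        seen = []
        for key in index:
            label = _label(key, level_positions)
            if label not in seen:
                seen.append(label)  # order of first appearance
        return sorted(seen) if sort_labels else seen

    rows = unique_labels(row_levels)
    columns = unique_labels(column_levels)
    triplets = []
    for key, value in zip(index, values):
        if isinstance(value, float) and math.isnan(value):
            continue  # NaN entries are not stored
        r = rows.index(_label(key, row_levels))
        c = columns.index(_label(key, column_levels))
        triplets.append((r, c, value))
    return triplets, rows, columns


def dense_from_triplets(triplets, n_rows, n_cols):
    """Expand triplets into a list of lists, using 0.0 for missing cells."""
    dense = [[0.0] * n_cols for _ in range(n_rows)]
    for r, c, value in triplets:
        dense[r][c] = value
    return dense


index = [(1, 2, 'a', 0), (1, 2, 'a', 1), (1, 1, 'b', 0),
         (1, 1, 'b', 1), (2, 1, 'b', 0), (2, 1, 'b', 1)]
values = [3.0, math.nan, 1.0, 3.0, math.nan, math.nan]

# Levels A, B as rows and C, D as columns, with sorted labels.
t1, rows1, cols1 = to_coo_sketch(index, values, [0, 1], [2, 3], sort_labels=True)
dense1 = dense_from_triplets(t1, len(rows1), len(cols1))

# Levels A, B, C as rows and D as columns, labels in order of appearance.
t2, rows2, cols2 = to_coo_sketch(index, values, [0, 1, 2], [3], sort_labels=False)
dense2 = dense_from_triplets(t2, len(rows2), len(cols2))
```

Labels are collected from every index entry, including those whose values are all NaN, which is why the all-zero third row appears in both results.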

A convenience method SparseSeries.from_coo() is implemented for creating a SparseSeries from a scipy.sparse.coo_matrix.

In [51]: from scipy import sparse

In [52]: A = sparse.coo_matrix(([3.0, 1.0, 2.0], ([1, 0, 0], [0, 2, 3])),
   ....:                       shape=(3, 4))
   ....: 

In [53]: A
Out[53]: 
<3x4 sparse matrix of type '<type 'numpy.float64'>'
        with 3 stored elements in COOrdinate format>

In [54]: A.todense()
Out[54]: 
matrix([[ 0.,  0.,  1.,  2.],
        [ 3.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.]])

The default behaviour (with dense_index=False) simply returns a SparseSeries containing only the non-null entries.

In [55]: ss = pd.SparseSeries.from_coo(A)

In [56]: ss
Out[56]: 
0  2    1.0
   3    2.0
1  0    3.0
dtype: float64
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([3], dtype=int32)

Specifying dense_index=True will result in an index that is the Cartesian product of the row and column coordinates of the matrix. Note that this will consume a significant amount of memory (relative to dense_index=False) if the sparse matrix is large (and sparse) enough.

In [57]: ss_dense = pd.SparseSeries.from_coo(A, dense_index=True)

In [58]: ss_dense
Out[58]: 
0  0    NaN
   1    NaN
   2    1.0
   3    2.0
1  0    3.0
   1    NaN
   2    NaN
   3    NaN
2  0    NaN
   1    NaN
   2    NaN
   3    NaN
dtype: float64
BlockIndex
Block locations: array([2], dtype=int32)
Block lengths: array([3], dtype=int32)
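The difference in index size between the two dense_index settings can be modelled as follows. This is a pure-Python sketch, not the pandas implementation: from_coo_sketch is a hypothetical helper, the matrix is given as (row, column, value) triplets, and the result is a list of ((row, column), value) pairs ordered by coordinate, which is an assumption made to mirror the output shown above.

```python
import math
from itertools import product


def from_coo_sketch(triplets, shape, dense_index=False):
    """Return ((row, col), value) pairs for a COO-style matrix."""
    stored = {(r, c): v for r, c, v in triplets}
    if dense_index:
        # One entry per cell: the Cartesian product of row and column positions.
        keys = product(range(shape[0]), range(shape[1]))
        return [(key, stored.get(key, math.nan)) for key in keys]
    # Only the explicitly stored cells.
    return sorted(stored.items())


# Same data as the coo_matrix above: values, then (row, col) coordinates.
triplets = [(1, 0, 3.0), (0, 2, 1.0), (0, 3, 2.0)]
shape = (3, 4)

compact = from_coo_sketch(triplets, shape)
full = from_coo_sketch(triplets, shape, dense_index=True)
non_nan = [(key, v) for key, v in full if not math.isnan(v)]
```

The compact form has 3 index entries while the full form has 3 x 4 = 12, so for an n x m matrix the dense index grows with n * m regardless of how few values are stored.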
