Partial Selection on Hierarichal Index Not Working as Expected Beyond 2 Levels #2995

New issue

Closed

Partial Selection on Hierarichal Index Not Working as Expected Beyond 2 Levels#2995

Description

dragoljub

opened

on Mar 9, 2013

The Pandas documentation states that a hierarchically indexed DataFrame can be partially selected from by passing a partial list of index values. For example if we have a 5-level index, then df.ix['A','B','C','D'] should return a DataFrame with the subset of rows where levels[0,1,2,3] match values ['A','B','C','D'] indexed with by the remaining lowest index level 5. I have found that this only works when partially selecting on the first two levels, anything beyond the first 2 levels requires the user to append a null slice at the end for .ix to work. This becomes cumbersome especially if you do not know exactly how many levels are available in a DataFrame after several processing steps. The same behavior persists regardless if your MultiIndex is int64 or string or mixed types.

It would be best if any partial tuple of index values passed to the .ix method returns a partial selection matching that tuple without requiring users to explicitly append the empty slice for the remaining levels. Let me know if this is how the .ix method is intended to work. It might be a simple fix of by checking if the tuple length == # levels and if not appending the empty slice within the .ix method.

Example Showing the Issue:

Take an example data set with 5 index levels, group by the first 4 levels and count the group sizes. Select the group with the largest size.

Results: (Clean Code Pasted Below IPython Output)

In [1]:importnumpyasnpIn [2]:importpandasaspdIn [3]:printpd.__version__0.11.0.dev-80945b6In [4]:# Generate Test DataFrame   ...:NUM_ROWS=100000   ...:In [5]:NUM_COLS=10In [6]:col_names= ['A'+numfornuminmap(str,np.arange(NUM_COLS).tolist())]In [7]:index_cols=col_names[:5]In [8]:# Set DataFrame to have 5 level Hierarchical Index.   ...:# The dtype does not matter try str or np.int64 same results.   ...:df=pd.DataFrame(np.random.randint(5,size=(NUM_ROWS,NUM_COLS)),dtype=np.int64,columns=col_names)   ...:In [9]:df=df.set_index(index_cols)In [10]:dfOut[10]:<class'pandas.core.frame.DataFrame'>MultiIndex:100000entries, (2,2,4,0,0)to (3,2,3,1,1)Datacolumns:A5100000non-nullvaluesA6100000non-nullvaluesA7100000non-nullvaluesA8100000non-nullvaluesA9100000non-nullvaluesdtypes:int64(5)In [11]:# Group by first 4 index columns.   ....:grp=df.groupby(level=index_cols[:-1])   ....:In [12]:# Compute group size and find index of largest group.   ....:grp_size=grp.size()   ....:In [13]:grp_size.sort()In [14]:grp_size[::-1]Out[14]:A0A1A2A3212120014132003141199411119624201904401190310018900431894121188014218722301873232187340186130186334185...01141362443135441213524321353341134423113332241334013133043413112101304414130214128023212733011211023106Length:625,dtype:int64In [15]:loc=grp.size().idxmax()In [16]:locOut[16]: (1,4,1,3)In [17]:# The following line does not work.   ....:# Can't use partial selection to select largest group.   ....:df.ix[loc]   ....:---------------------------------------------------------------------------IndexingErrorTraceback (mostrecentcalllast)<ipython-input-17-e26047d119d7>in<module>()1# The following line does not work.2# Can't use partial selection to select largest group.---->3df.ix[loc]4C:\Python27\lib\site-packages\pandas\core\indexing.pycin__getitem__(self,key)44pass45--->46returnself._getitem_tuple(key)47else:48returnself._getitem_axis(key,axis=0)C:\Python27\lib\site-packages\pandas\core\indexing.pycin_getitem_tuple(self,tup)235fori,keyinenumerate(tup):236ifi>=self.obj.ndim:-->237raiseIndexingError('Too many indexers')238239if_is_null_slice(key):IndexingError:ToomanyindexersIn [18]:# Appending the null slice at the end works!   ....:# I was hoping for partial selection to work above.   ....:df.ix[loc+(slice(None),)]   ....:Out[18]:<class'pandas.core.frame.DataFrame'>Int64Index:200entries,1to0Datacolumns:A5200non-nullvaluesA6200non-nullvaluesA7200non-nullvaluesA8200non-nullvaluesA9200non-nullvaluesdtypes:int64(5)In [19]:# Selecting on first level works.   ....:df.ix[loc[0]]   ....:Out[19]:<class'pandas.core.frame.DataFrame'>MultiIndex:19796entries, (3,1,1,1)to (4,1,3,0)Datacolumns:A519796non-nullvaluesA619796non-nullvaluesA719796non-nullvaluesA819796non-nullvaluesA919796non-nullvaluesdtypes:int64(5)In [20]:# Selecting on first two levels works.   ....:df.ix[loc[0],loc[1]]   ....:Out[20]:A1A2A3A4311130110240134314114444003241123143434212031202100303242244041224413034...024001421402143423211323330312124344220023244133134222414311111013141303Name:A9,Length:19796,dtype:int64In [21]:# Selecting on first 3 levels fails.   ....:df.ix[loc[0],loc[1],loc[2]]   ....:---------------------------------------------------------------------------IndexingErrorTraceback (mostrecentcalllast)<ipython-input-21-38eb049081e0>in<module>()1# Selecting on first 3 levels fails.---->2df.ix[loc[0],loc[1],loc[2]]3C:\Python27\lib\site-packages\pandas\core\indexing.pycin__getitem__(self,key)44pass45--->46returnself._getitem_tuple(key)47else:48returnself._getitem_axis(key,axis=0)C:\Python27\lib\site-packages\pandas\core\indexing.pycin_getitem_tuple(self,tup)235fori,keyinenumerate(tup):236ifi>=self.obj.ndim:-->237raiseIndexingError('Too many indexers')238239if_is_null_slice(key):IndexingError:ToomanyindexersIn [22]:# Adding the null slice and it works.   ....:df.ix[loc[0],loc[1],loc[2],slice(None)]Out[22]:<class'pandas.core.frame.DataFrame'>MultiIndex:848entries, (2,1)to (3,0)Datacolumns:A5848non-nullvaluesA6848non-nullvaluesA7848non-nullvaluesA8848non-nullvaluesA9848non-nullvaluesdtypes:int64(5)

Code:

importnumpyasnpimportpandasaspdprintpd.__version__# Generate Test DataFrameNUM_ROWS=100000NUM_COLS=10col_names= ['A'+numfornuminmap(str,np.arange(NUM_COLS).tolist())]index_cols=col_names[:5]# Set DataFrame to have 5 level Hierarchical Index.# The dtype does not matter try str or np.int64 same results.df=pd.DataFrame(np.random.randint(5,size=(NUM_ROWS,NUM_COLS)),dtype=np.int64,columns=col_names)df=df.set_index(index_cols)df# Group by first 4 index columns.grp=df.groupby(level=index_cols[:-1])# Compute group size and find index of largest group.grp_size=grp.size()grp_size.sort()grp_size[::-1]loc=grp.size().idxmax()loc# The following line does not work.# Can't use partial selection to select largest group.df.ix[loc]# Appending the null slice at the end works!# I was hoping for partial selection to work above.df.ix[loc+(slice(None),)]# Selecting on first level works.df.ix[loc[0]]# Selecting on first two levels works.df.ix[loc[0],loc[1]]# Selecting on first 3 levels fails.df.ix[loc[0],loc[1],loc[2]]# Adding the null slice and it works.df.ix[loc[0],loc[1],loc[2],slice(None)]

Metadata

Assignees

No one assigned

Labels

No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Movatterモバイル変換

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Partial Selection on Hierarichal Index Not Working as Expected Beyond 2 Levels #2995

Description

Example Showing the Issue:

Code:

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions