This section covers indexing with a MultiIndex and more advanced indexing features.
See Indexing and Selecting Data for general indexing documentation.
Warning
Whether a copy or a reference is returned for a setting operation may depend on the context. This is sometimes called chained assignment and should be avoided. See Returning a View versus Copy.
Warning
In 0.15.0, Index was internally refactored to no longer subclass ndarray but instead subclass PandasObject, similarly to the rest of the pandas objects. This should be a transparent change with only very limited API implications (see the Internal Refactoring).
See the cookbook for some advanced strategies.
Hierarchical / multi-level indexing is very exciting as it opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you to store and manipulate data with an arbitrary number of dimensions in lower dimensional data structures like Series (1d) and DataFrame (2d).
In this section, we will show what exactly we mean by “hierarchical” indexing and how it integrates with all of the pandas indexing functionality described above and in prior sections. Later, when discussing group by and pivoting and reshaping data, we’ll show non-trivial applications to illustrate how it aids in structuring data for analysis.
The MultiIndex object is the hierarchical analogue of the standard Index object which typically stores the axis labels in pandas objects. You can think of a MultiIndex as an array of tuples where each tuple is unique. A MultiIndex can be created from a list of arrays (using MultiIndex.from_arrays), an array of tuples (using MultiIndex.from_tuples), or a crossed set of iterables (using MultiIndex.from_product). The Index constructor will attempt to return a MultiIndex when it is passed a list of tuples. The following examples demonstrate different ways to initialize MultiIndexes.
In [1]: arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
   ...:           ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
   ...:

In [2]: tuples = list(zip(*arrays))

In [3]: tuples
Out[3]:
[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]

In [4]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

In [5]: index
Out[5]:
MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=[u'first', u'second'])

In [6]: s = pd.Series(np.random.randn(8), index=index)

In [7]: s
Out[7]:
first  second
bar    one       0.469112
       two      -0.282863
baz    one      -1.509059
       two      -1.135632
foo    one       1.212112
       two      -0.173215
qux    one       0.119209
       two      -1.044236
dtype: float64
When you want every pairing of the elements in two iterables, it can be easier to use the MultiIndex.from_product function:
In [8]: iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]

In [9]: pd.MultiIndex.from_product(iterables, names=['first', 'second'])
Out[9]:
MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=[u'first', u'second'])
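As a quick check that the three constructors describe the same object, the following sketch builds one small index three ways and compares the results. The variable names are illustrative only and the data is a shortened version of the labels used above.

```python
import pandas as pd

arrays = [['bar', 'bar', 'baz', 'baz'],
          ['one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))

from_arrays = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
from_tuples = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
from_product = pd.MultiIndex.from_product([['bar', 'baz'], ['one', 'two']],
                                          names=['first', 'second'])

# All three hold the same four (first, second) pairs in the same order.
same_index = (from_arrays.equals(from_tuples)
              and from_tuples.equals(from_product))
print(same_index)
print(list(from_product))
```

from_product is the shortest spelling only when every combination is wanted; for an irregular set of pairs, from_tuples or from_arrays is the natural choice.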
As a convenience, you can pass a list of arrays directly into Series or DataFrame to construct a MultiIndex automatically:
In [10]:arrays=[np.array(['bar','bar','baz','baz','foo','foo','qux','qux']), ....:np.array(['one','two','one','two','one','two','one','two'])] ....:In [11]:s=pd.Series(np.random.randn(8),index=arrays)In [12]:sOut[12]:bar one -0.861849 two -2.104569baz one -0.494929 two 1.071804foo one 0.721555 two -0.706771qux one -1.039575 two 0.271860dtype: float64In [13]:df=pd.DataFrame(np.random.randn(8,4),index=arrays)In [14]:dfOut[14]: 0 1 2 3bar one -0.424972 0.567020 0.276232 -1.087401 two -0.673690 0.113648 -1.478427 0.524988baz one 0.404705 0.577046 -1.715002 -1.039268 two -0.370647 -1.157892 -1.344312 0.844885foo one 1.075770 -0.109050 1.643563 -1.469388 two 0.357021 -0.674600 -1.776904 -0.968914qux one -1.294524 0.413738 0.276662 -0.472035 two -0.013960 -0.362543 -0.006154 -0.923061
All of the MultiIndex constructors accept a names argument which stores string names for the levels themselves. If no names are provided, None will be assigned:
In [15]: df.index.names
Out[15]: FrozenList([None, None])
This index can back any axis of a pandas object, and the number of levels of the index is up to you:
In [16]:df=pd.DataFrame(np.random.randn(3,8),index=['A','B','C'],columns=index)In [17]:dfOut[17]:first bar baz foo qux \second one two one two one two oneA 0.895717 0.805244 -1.206412 2.565646 1.431256 1.340309 -1.170299B 0.410835 0.813850 0.132003 -0.827317 -0.076467 -1.187678 1.130127C -1.413681 1.607920 1.024180 0.569605 0.875906 -2.211372 0.974466firstsecond twoA -0.226169B -1.436737C -2.006747In [18]:pd.DataFrame(np.random.randn(6,6),index=index[:6],columns=index[:6])Out[18]:first bar baz foosecond one two one two one twofirst secondbar one -0.410001 -0.078638 0.545952 -1.219217 -1.226825 0.769804 two -1.281247 -0.727707 -0.121306 -0.097883 0.695775 0.341734baz one 0.959726 -1.110336 -0.619976 0.149748 -0.732339 0.687738 two 0.176444 0.403310 -0.154951 0.301624 -2.179861 -1.369849foo one -0.954208 1.462696 -1.743161 -0.826591 -0.345352 1.314232 two 0.690579 0.995761 2.396780 0.014871 3.357427 -0.317441
We’ve “sparsified” the higher levels of the indexes to make the console output a bit easier on the eyes.
It’s worth keeping in mind that there’s nothing preventing you from using tuples as atomic labels on an axis:
In [19]: pd.Series(np.random.randn(8), index=tuples)
Out[19]:
(bar, one)   -1.236269
(bar, two)    0.896171
(baz, one)   -0.487602
(baz, two)   -0.082240
(foo, one)   -2.182937
(foo, two)    0.380396
(qux, one)    0.084844
(qux, two)    0.432390
dtype: float64
The reason that the MultiIndex matters is that it can allow you to do grouping, selection, and reshaping operations as we will describe below and in subsequent areas of the documentation. As you will see in later sections, you can find yourself working with hierarchically-indexed data without creating a MultiIndex explicitly yourself. However, when loading data from a file, you may wish to generate your own MultiIndex when preparing the data set.
Note that how the index is displayed can be controlled using the display.multi_sparse option in pandas.set_option:
In [20]:pd.set_option('display.multi_sparse',False)In [21]:dfOut[21]:first bar bar baz baz foo foo qux \second one two one two one two oneA 0.895717 0.805244 -1.206412 2.565646 1.431256 1.340309 -1.170299B 0.410835 0.813850 0.132003 -0.827317 -0.076467 -1.187678 1.130127C -1.413681 1.607920 1.024180 0.569605 0.875906 -2.211372 0.974466first quxsecond twoA -0.226169B -1.436737C -2.006747In [22]:pd.set_option('display.multi_sparse',True)
The method get_level_values will return a vector of the labels for each location at a particular level:
In [23]: index.get_level_values(0)
Out[23]: Index([u'bar', u'bar', u'baz', u'baz', u'foo', u'foo', u'qux', u'qux'], dtype='object', name=u'first')

In [24]: index.get_level_values('second')
Out[24]: Index([u'one', u'two', u'one', u'two', u'one', u'two', u'one', u'two'], dtype='object', name=u'second')
One of the important features of hierarchical indexing is that you can select data by a “partial” label identifying a subgroup in the data. Partial selection “drops” levels of the hierarchical index in the result in a completely analogous way to selecting a column in a regular DataFrame:
In [25]:df['bar']Out[25]:second one twoA 0.895717 0.805244B 0.410835 0.813850C -1.413681 1.607920In [26]:df['bar','one']Out[26]:A 0.895717B 0.410835C -1.413681Name: (bar, one), dtype: float64In [27]:df['bar']['one']Out[27]:A 0.895717B 0.410835C -1.413681Name: one, dtype: float64In [28]:s['qux']Out[28]:one -1.039575two 0.271860dtype: float64
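The level-dropping behaviour of partial selection can be seen on a small deterministic frame. This is a minimal sketch; the frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product([['bar', 'baz'], ['one', 'two']],
                                     names=['first', 'second'])
frame = pd.DataFrame(np.arange(12).reshape(3, 4),
                     index=['A', 'B', 'C'], columns=columns)

partial = frame['bar']         # DataFrame; only the 'second' level remains
full = frame['bar', 'one']     # Series; both levels are consumed
chained = frame['bar']['one']  # same values, reached in two steps

print(list(partial.columns))
print(full.name, chained.name)
```

The two Series hold the same data but carry different names: the full key keeps the whole tuple as its name, while the chained form only remembers the last label.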
See Cross-section with hierarchical index for how to select on a deeper level.
Note
The repr of a MultiIndex shows ALL the defined levels of an index, even if they are not actually used. When slicing an index, you may notice this. For example:
# original multi-indexIn [29]:df.columnsOut[29]:MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']], labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]], names=[u'first', u'second'])# slicedIn [30]:df[['foo','qux']].columnsOut[30]:MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']], labels=[[2, 2, 3, 3], [0, 1, 0, 1]], names=[u'first', u'second'])
This is done to avoid a recomputation of the levels in order to make slicing highly performant. If you want to see only the levels that are actually used:
In [31]: df[['foo', 'qux']].columns.values
Out[31]: array([('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')], dtype=object)

# for a specific level
In [32]: df[['foo', 'qux']].columns.get_level_values(0)
Out[32]: Index([u'foo', u'foo', u'qux', u'qux'], dtype='object', name=u'first')
To reconstruct the MultiIndex with only the used levels:
In [33]: pd.MultiIndex.from_tuples(df[['foo', 'qux']].columns.values)
Out[33]:
MultiIndex(levels=[[u'foo', u'qux'], [u'one', u'two']],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
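The sketch below reproduces the unused-levels effect on a bare MultiIndex. As a simplification it slices the index directly instead of selecting columns from a DataFrame, and it rebuilds the index from a plain list of tuples; the variable names are illustrative.

```python
import pandas as pd

full = pd.MultiIndex.from_product([['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
                                  names=['first', 'second'])
sliced = full[4:]   # keeps only the 'foo' and 'qux' entries

# The level definitions are carried over unchanged from the original index.
stale_levels = list(sliced.levels[0])

# The labels actually present can be read off with get_level_values.
used = list(sliced.get_level_values('first').unique())

# Rebuilding from the tuples recomputes the levels from scratch.
rebuilt = pd.MultiIndex.from_tuples(sliced.tolist(), names=sliced.names)
fresh_levels = list(rebuilt.levels[0])

print(stale_levels, used, fresh_levels)
```

The sliced and rebuilt indexes contain the same entries; only their internal level definitions differ.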
Operations between differently-indexed objects having a MultiIndex on the axes will work as you expect; data alignment will work the same as an Index of tuples:
In [34]:s+s[:-2]Out[34]:bar one -1.723698 two -4.209138baz one -0.989859 two 2.143608foo one 1.443110 two -1.413542qux one NaN two NaNdtype: float64In [35]:s+s[::2]Out[35]:bar one -1.723698 two NaNbaz one -0.989859 two NaNfoo one 1.443110 two NaNqux one -2.079150 two NaNdtype: float64
reindex can be called with another MultiIndex or even a list or array of tuples:
In [36]:s.reindex(index[:3])Out[36]:first secondbar one -0.861849 two -2.104569baz one -0.494929dtype: float64In [37]:s.reindex([('foo','two'),('bar','one'),('qux','one'),('baz','one')])Out[37]:foo two -0.706771bar one -0.861849qux one -1.039575baz one -0.494929dtype: float64
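A small deterministic sketch of the same two behaviours, alignment on full tuples and reindexing with a list of tuples, is given here. The Series values are arbitrary.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['bar', 'baz'], ['one', 'two']])
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=index)

# Addition aligns on the full (level 0, level 1) tuple; a label missing
# from either operand produces NaN in the result.
total = s + s[:-1]

# reindex accepts a list of tuples and returns the rows in that order.
picked = s.reindex([('baz', 'two'), ('bar', 'one')])

print(total)
print(picked)
```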
Syntactically integrating MultiIndex in advanced indexing with .loc/.ix is a bit challenging, but we’ve made every effort to do so. For example, the following works as you would expect:
In [38]:df=df.TIn [39]:dfOut[39]: A B Cfirst secondbar one 0.895717 0.410835 -1.413681 two 0.805244 0.813850 1.607920baz one -1.206412 0.132003 1.024180 two 2.565646 -0.827317 0.569605foo one 1.431256 -0.076467 0.875906 two 1.340309 -1.187678 -2.211372qux one -1.170299 1.130127 0.974466 two -0.226169 -1.436737 -2.006747In [40]:df.loc['bar']Out[40]: A B Csecondone 0.895717 0.410835 -1.413681two 0.805244 0.813850 1.607920In [41]:df.loc['bar','two']Out[41]:A 0.805244B 0.813850C 1.607920Name: (bar, two), dtype: float64
“Partial” slicing also works quite nicely.
In [42]:df.loc['baz':'foo']Out[42]: A B Cfirst secondbaz one -1.206412 0.132003 1.024180 two 2.565646 -0.827317 0.569605foo one 1.431256 -0.076467 0.875906 two 1.340309 -1.187678 -2.211372
You can slice with a ‘range’ of values, by providing a slice of tuples.
In [43]:df.loc[('baz','two'):('qux','one')]Out[43]: A B Cfirst secondbaz two 2.565646 -0.827317 0.569605foo one 1.431256 -0.076467 0.875906 two 1.340309 -1.187678 -2.211372qux one -1.170299 1.130127 0.974466In [44]:df.loc[('baz','two'):'foo']Out[44]: A B Cfirst secondbaz two 2.565646 -0.827317 0.569605foo one 1.431256 -0.076467 0.875906 two 1.340309 -1.187678 -2.211372
Passing a list of labels or tuples works similarly to reindexing:
In [45]:df.ix[[('bar','two'),('qux','one')]]Out[45]: A B Cfirst secondbar two 0.805244 0.813850 1.607920qux one -1.170299 1.130127 0.974466
New in version 0.14.0.
In 0.14.0 we added a new way to slice multi-indexed objects. You can slice a multi-index by providing multiple indexers.
You can provide any of the selectors as if you are indexing by label (see Selection by Label), including slices, lists of labels, labels, and boolean indexers.
You can use slice(None) to select all the contents of that level. You do not need to specify all the deeper levels; they will be implied as slice(None).
As usual, both sides of the slicers are included as this is label indexing.
Warning
You should specify all axes in the .loc specifier, meaning the indexer for the index and for the columns. There are some ambiguous cases where the passed indexer could be misinterpreted as indexing both axes, rather than into, say, the MultiIndex for the rows.
You should do this:
df.loc[(slice('A1','A3'),.....),:]
rather than this:
df.loc[(slice('A1','A3'),.....)]
In [46]: def mklbl(prefix, n):
   ....:     return ["%s%s" % (prefix, i) for i in range(n)]
   ....:

In [47]: miindex = pd.MultiIndex.from_product([mklbl('A', 4),
   ....:                                       mklbl('B', 2),
   ....:                                       mklbl('C', 4),
   ....:                                       mklbl('D', 2)])
   ....:

In [48]: micolumns = pd.MultiIndex.from_tuples([('a', 'foo'), ('a', 'bar'),
   ....:                                        ('b', 'foo'), ('b', 'bah')],
   ....:                                       names=['lvl0', 'lvl1'])
   ....:

In [49]: dfmi = pd.DataFrame(np.arange(len(miindex) * len(micolumns)).reshape((len(miindex), len(micolumns))),
   ....:                     index=miindex,
   ....:                     columns=micolumns).sort_index().sort_index(axis=1)
   ....:

In [50]: dfmi
Out[50]:
lvl0           a         b
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0    9    8   11   10
         D1   13   12   15   14
      C2 D0   17   16   19   18
         D1   21   20   23   22
      C3 D0   25   24   27   26
...          ...  ...  ...  ...
A3 B1 C0 D1  229  228  231  230
      C1 D0  233  232  235  234
         D1  237  236  239  238
      C2 D0  241  240  243  242
         D1  245  244  247  246
      C3 D0  249  248  251  250
         D1  253  252  255  254

[64 rows x 4 columns]
Basic multi-index slicing using slices, lists, and labels.
In [51]:dfmi.loc[(slice('A1','A3'),slice(None),['C1','C3']),:]Out[51]:lvl0 a blvl1 bar foo bah fooA1 B0 C1 D0 73 72 75 74 D1 77 76 79 78 C3 D0 89 88 91 90 D1 93 92 95 94 B1 C1 D0 105 104 107 106 D1 109 108 111 110 C3 D0 121 120 123 122... ... ... ... ...A3 B0 C1 D1 205 204 207 206 C3 D0 217 216 219 218 D1 221 220 223 222 B1 C1 D0 233 232 235 234 D1 237 236 239 238 C3 D0 249 248 251 250 D1 253 252 255 254[24 rows x 4 columns]
You can use a pd.IndexSlice to have a more natural syntax using : rather than using slice(None).
In [52]:idx=pd.IndexSliceIn [53]:dfmi.loc[idx[:,:,['C1','C3']],idx[:,'foo']]Out[53]:lvl0 a blvl1 foo fooA0 B0 C1 D0 8 10 D1 12 14 C3 D0 24 26 D1 28 30 B1 C1 D0 40 42 D1 44 46 C3 D0 56 58... ... ...A3 B0 C1 D1 204 206 C3 D0 216 218 D1 220 222 B1 C1 D0 232 234 D1 236 238 C3 D0 248 250 D1 252 254[32 rows x 2 columns]
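To make the equivalence between explicit slice objects and pd.IndexSlice concrete, here is a minimal sketch on a 12-row frame. The frame, its single column 'x', and the level labels are invented for the illustration.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['A0', 'A1', 'A2'], ['B0', 'B1'], ['C0', 'C1']])
frame = pd.DataFrame({'x': np.arange(len(index))}, index=index).sort_index()

# Explicit slice objects: rows with A1..A2 (both ends included),
# any B label, and C1 only. Both axes are specified.
explicit = frame.loc[(slice('A1', 'A2'), slice(None), ['C1']), :]

# The same request written with pd.IndexSlice.
idx = pd.IndexSlice
natural = frame.loc[idx['A1':'A2', :, ['C1']], :]

print(explicit)
```

Note that 'A2' is part of the result, because label slices include both endpoints.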
It is possible to perform quite complicated selections using this method on multiple axes at the same time.
In [54]:dfmi.loc['A1',(slice(None),'foo')]Out[54]:lvl0 a blvl1 foo fooB0 C0 D0 64 66 D1 68 70 C1 D0 72 74 D1 76 78 C2 D0 80 82 D1 84 86 C3 D0 88 90... ... ...B1 C0 D1 100 102 C1 D0 104 106 D1 108 110 C2 D0 112 114 D1 116 118 C3 D0 120 122 D1 124 126[16 rows x 2 columns]In [55]:dfmi.loc[idx[:,:,['C1','C3']],idx[:,'foo']]Out[55]:lvl0 a blvl1 foo fooA0 B0 C1 D0 8 10 D1 12 14 C3 D0 24 26 D1 28 30 B1 C1 D0 40 42 D1 44 46 C3 D0 56 58... ... ...A3 B0 C1 D1 204 206 C3 D0 216 218 D1 220 222 B1 C1 D0 232 234 D1 236 238 C3 D0 248 250 D1 252 254[32 rows x 2 columns]
Using a boolean indexer you can provide selection related to the values.
In [56]:mask=dfmi[('a','foo')]>200In [57]:dfmi.loc[idx[mask,:,['C1','C3']],idx[:,'foo']]Out[57]:lvl0 a blvl1 foo fooA3 B0 C1 D1 204 206 C3 D0 216 218 D1 220 222 B1 C1 D0 232 234 D1 236 238 C3 D0 248 250 D1 252 254
You can also specify the axis argument to .loc to interpret the passed slicers on a single axis.
In [58]:dfmi.loc(axis=0)[:,:,['C1','C3']]Out[58]:lvl0 a blvl1 bar foo bah fooA0 B0 C1 D0 9 8 11 10 D1 13 12 15 14 C3 D0 25 24 27 26 D1 29 28 31 30 B1 C1 D0 41 40 43 42 D1 45 44 47 46 C3 D0 57 56 59 58... ... ... ... ...A3 B0 C1 D1 205 204 207 206 C3 D0 217 216 219 218 D1 221 220 223 222 B1 C1 D0 233 232 235 234 D1 237 236 239 238 C3 D0 249 248 251 250 D1 253 252 255 254[32 rows x 4 columns]
Furthermore, you can set the values using these methods.
In [59]:df2=dfmi.copy()In [60]:df2.loc(axis=0)[:,:,['C1','C3']]=-10In [61]:df2Out[61]:lvl0 a blvl1 bar foo bah fooA0 B0 C0 D0 1 0 3 2 D1 5 4 7 6 C1 D0 -10 -10 -10 -10 D1 -10 -10 -10 -10 C2 D0 17 16 19 18 D1 21 20 23 22 C3 D0 -10 -10 -10 -10... ... ... ... ...A3 B1 C0 D1 229 228 231 230 C1 D0 -10 -10 -10 -10 D1 -10 -10 -10 -10 C2 D0 241 240 243 242 D1 245 244 247 246 C3 D0 -10 -10 -10 -10 D1 -10 -10 -10 -10[64 rows x 4 columns]
You can use a right-hand-side of an alignable object as well.
In [62]:df2=dfmi.copy()In [63]:df2.loc[idx[:,:,['C1','C3']],:]=df2*1000In [64]:df2Out[64]:lvl0 a blvl1 bar foo bah fooA0 B0 C0 D0 1 0 3 2 D1 5 4 7 6 C1 D0 9000 8000 11000 10000 D1 13000 12000 15000 14000 C2 D0 17 16 19 18 D1 21 20 23 22 C3 D0 25000 24000 27000 26000... ... ... ... ...A3 B1 C0 D1 229 228 231 230 C1 D0 233000 232000 235000 234000 D1 237000 236000 239000 238000 C2 D0 241 240 243 242 D1 245 244 247 246 C3 D0 249000 248000 251000 250000 D1 253000 252000 255000 254000[64 rows x 4 columns]
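The two assignment styles, a scalar and an alignable right-hand side, can be compared on a four-row frame. This is a sketch with made-up data; the names scalar and aligned are illustrative.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['A0', 'A1'], ['C0', 'C1']])
frame = pd.DataFrame({'x': np.arange(4), 'y': np.arange(4) * 10}, index=index)
idx = pd.IndexSlice

# Scalar assignment on the rows whose second level is 'C1'.
scalar = frame.copy()
scalar.loc[idx[:, 'C1'], :] = -1

# Assignment from an alignable object: only the selected rows are taken
# from the right-hand side; the other rows are left untouched.
aligned = frame.copy()
aligned.loc[idx[:, 'C1'], :] = frame * 1000

print(scalar)
print(aligned)
```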
The xs method of DataFrame additionally takes a level argument to make selecting data at a particular level of a MultiIndex easier.
In [65]:dfOut[65]: A B Cfirst secondbar one 0.895717 0.410835 -1.413681 two 0.805244 0.813850 1.607920baz one -1.206412 0.132003 1.024180 two 2.565646 -0.827317 0.569605foo one 1.431256 -0.076467 0.875906 two 1.340309 -1.187678 -2.211372qux one -1.170299 1.130127 0.974466 two -0.226169 -1.436737 -2.006747In [66]:df.xs('one',level='second')Out[66]: A B Cfirstbar 0.895717 0.410835 -1.413681baz -1.206412 0.132003 1.024180foo 1.431256 -0.076467 0.875906qux -1.170299 1.130127 0.974466
# using the slicers (new in 0.14.0)In [67]:df.loc[(slice(None),'one'),:]Out[67]: A B Cfirst secondbar one 0.895717 0.410835 -1.413681baz one -1.206412 0.132003 1.024180foo one 1.431256 -0.076467 0.875906qux one -1.170299 1.130127 0.974466
You can also select on the columns with xs(), by providing the axis argument.
In [68]:df=df.TIn [69]:df.xs('one',level='second',axis=1)Out[69]:first bar baz foo quxA 0.895717 -1.206412 1.431256 -1.170299B 0.410835 0.132003 -0.076467 1.130127C -1.413681 1.024180 0.875906 0.974466
# using the slicers (new in 0.14.0)In [70]:df.loc[:,(slice(None),'one')]Out[70]:first bar baz foo quxsecond one one one oneA 0.895717 -1.206412 1.431256 -1.170299B 0.410835 0.132003 -0.076467 1.130127C -1.413681 1.024180 0.875906 0.974466
xs() also allows selection with multiple keys
In [71]:df.xs(('one','bar'),level=('second','first'),axis=1)Out[71]:first barsecond oneA 0.895717B 0.410835C -1.413681
# using the slicers (new in 0.14.0)In [72]:df.loc[:,('bar','one')]Out[72]:A 0.895717B 0.410835C -1.413681Name: (bar, one), dtype: float64
New in version 0.13.0.
You can pass drop_level=False to xs() to retain the level that was selected.
In [73]:df.xs('one',level='second',axis=1,drop_level=False)Out[73]:first bar baz foo quxsecond one one one oneA 0.895717 -1.206412 1.431256 -1.170299B 0.410835 0.132003 -0.076467 1.130127C -1.413681 1.024180 0.875906 0.974466
versus the result with drop_level=True (the default value)
In [74]:df.xs('one',level='second',axis=1,drop_level=True)Out[74]:first bar baz foo quxA 0.895717 -1.206412 1.431256 -1.170299B 0.410835 0.132003 -0.076467 1.130127C -1.413681 1.024180 0.875906 0.974466
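The effect of drop_level, and its relation to the slicer form, can be checked on a row-indexed frame. This is a minimal sketch with invented data.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['bar', 'baz'], ['one', 'two']],
                                   names=['first', 'second'])
frame = pd.DataFrame({'A': np.arange(4)}, index=index)

dropped = frame.xs('one', level='second')                 # 'second' is removed
kept = frame.xs('one', level='second', drop_level=False)  # 'second' is retained
sliced = frame.loc[(slice(None), 'one'), :]               # slicer equivalent

print(dropped.index.nlevels, kept.index.nlevels)
```

With drop_level=False the result has the same index entries as the slicer form, which is convenient when the selected level is still needed downstream.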
The parameter level has been added to the reindex and align methods of pandas objects. This is useful to broadcast values across a level. For instance:
In [75]:midx=pd.MultiIndex(levels=[['zero','one'],['x','y']], ....:labels=[[1,1,0,0],[1,0,1,0]]) ....:In [76]:df=pd.DataFrame(np.random.randn(4,2),index=midx)In [77]:dfOut[77]: 0 1one y 1.519970 -0.493662 x 0.600178 0.274230zero y 0.132885 -0.023688 x 2.410179 1.450520In [78]:df2=df.mean(level=0)In [79]:df2Out[79]: 0 1zero 1.271532 0.713416one 1.060074 -0.109716In [80]:df2.reindex(df.index,level=0)Out[80]: 0 1one y 1.060074 -0.109716 x 1.060074 -0.109716zero y 1.271532 0.713416 x 1.271532 0.713416# aligningIn [81]:df_aligned,df2_aligned=df.align(df2,level=0)In [82]:df_alignedOut[82]: 0 1one y 1.519970 -0.493662 x 0.600178 0.274230zero y 0.132885 -0.023688 x 2.410179 1.450520In [83]:df2_alignedOut[83]: 0 1one y 1.060074 -0.109716 x 1.060074 -0.109716zero y 1.271532 0.713416 x 1.271532 0.713416
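A common use of this broadcast is centering each group on its mean. The sketch below uses fixed values so the result is easy to verify by hand. It computes the per-group means with groupby(level=0).mean(), which is an assumption made here because newer pandas releases no longer accept the level argument on mean itself.

```python
import pandas as pd

midx = pd.MultiIndex.from_product([['one', 'zero'], ['x', 'y']])
frame = pd.DataFrame({'v': [1.0, 3.0, 10.0, 20.0]}, index=midx)

# One row per outer label (stands in for df.mean(level=0)).
means = frame.groupby(level=0).mean()

# Broadcast each group's mean back onto every row of that group.
broadcast = means.reindex(frame.index, level=0)
centered = frame - broadcast

print(centered)
```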
The swaplevel function can switch the order of two levels:
In [84]:df[:5]Out[84]: 0 1one y 1.519970 -0.493662 x 0.600178 0.274230zero y 0.132885 -0.023688 x 2.410179 1.450520In [85]:df[:5].swaplevel(0,1,axis=0)Out[85]: 0 1y one 1.519970 -0.493662x one 0.600178 0.274230y zero 0.132885 -0.023688x zero 2.410179 1.450520
The reorder_levels function generalizes the swaplevel function, allowing you to permute the hierarchical index levels in one step:
In [86]:df[:5].reorder_levels([1,0],axis=0)Out[86]: 0 1y one 1.519970 -0.493662x one 0.600178 0.274230y zero 0.132885 -0.023688x zero 2.410179 1.450520
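For a two-level index the two methods give the same result, which the following sketch confirms. The level names 'outer' and 'inner' are illustrative; they show that reorder_levels also accepts names rather than positions.

```python
import pandas as pd

midx = pd.MultiIndex.from_product([['one', 'zero'], ['x', 'y']],
                                  names=['outer', 'inner'])
frame = pd.DataFrame({'v': range(4)}, index=midx)

swapped = frame.swaplevel(0, 1, axis=0)
reordered = frame.reorder_levels(['inner', 'outer'], axis=0)

print(list(swapped.index))
```

Neither call sorts the rows; only the order of the levels within each label changes.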
For MultiIndex-ed objects to be indexed and sliced effectively, they need to be sorted. As with any index, you can use sort_index.
In [87]: import random; random.shuffle(tuples)

In [88]: s = pd.Series(np.random.randn(8), index=pd.MultiIndex.from_tuples(tuples))

In [89]: s
Out[89]:
qux  one    0.206053
baz  two   -0.251905
bar  one   -2.213588
     two    1.063327
foo  one    1.266143
     two    0.299368
baz  one   -0.863838
qux  two    0.408204
dtype: float64

In [90]: s.sort_index()
Out[90]:
bar  one   -2.213588
     two    1.063327
baz  one   -0.863838
     two   -0.251905
foo  one    1.266143
     two    0.299368
qux  one    0.206053
     two    0.408204
dtype: float64

In [91]: s.sort_index(level=0)
Out[91]:
bar  one   -2.213588
     two    1.063327
baz  one   -0.863838
     two   -0.251905
foo  one    1.266143
     two    0.299368
qux  one    0.206053
     two    0.408204
dtype: float64

In [92]: s.sort_index(level=1)
Out[92]:
bar  one   -2.213588
baz  one   -0.863838
foo  one    1.266143
qux  one    0.206053
bar  two    1.063327
baz  two   -0.251905
foo  two    0.299368
qux  two    0.408204
dtype: float64
You may also pass a level name to sort_index if the MultiIndex levels are named.
In [93]:s.index.set_names(['L1','L2'],inplace=True)In [94]:s.sort_index(level='L1')Out[94]:L1 L2bar one -2.213588 two 1.063327baz one -0.863838 two -0.251905foo one 1.266143 two 0.299368qux one 0.206053 two 0.408204dtype: float64In [95]:s.sort_index(level='L2')Out[95]:L1 L2bar one -2.213588baz one -0.863838foo one 1.266143qux one 0.206053bar two 1.063327baz two -0.251905foo two 0.299368qux two 0.408204dtype: float64
On higher dimensional objects, you can sort any of the other axes by level if they have a MultiIndex:
In [96]:df.T.sort_index(level=1,axis=1)Out[96]: zero one zero one x x y y0 2.410179 0.600178 0.132885 1.5199701 1.450520 0.274230 -0.023688 -0.493662
Indexing will work even if the data are not sorted, but will be rather inefficient (and show a PerformanceWarning). It will also return a copy of the data rather than a view:
In [97]:dfm=pd.DataFrame({'jim':[0,0,1,1], ....:'joe':['x','x','z','y'], ....:'jolie':np.random.rand(4)}) ....:In [98]:dfm=dfm.set_index(['jim','joe'])In [99]:dfmOut[99]: joliejim joe0 x 0.490671 x 0.1202481 z 0.537020 y 0.110968
In [4]: dfm.loc[(1, 'z')]
PerformanceWarning: indexing past lexsort depth may impact performance.

Out[4]:
           jolie
jim joe
1   z    0.64094
Furthermore, if you try to index something that is not fully lexsorted, this can raise:
In [5]: dfm.loc[(0, 'y'):(1, 'z')]
KeyError: 'Key length (2) was greater than MultiIndex lexsort depth (1)'
The is_lexsorted() method on an Index shows if the index is sorted, and the lexsort_depth property returns the sort depth:
In [100]: dfm.index.is_lexsorted()
Out[100]: False

In [101]: dfm.index.lexsort_depth
Out[101]: 1
In [102]:dfm=dfm.sort_index()In [103]:dfmOut[103]: joliejim joe0 x 0.490671 x 0.1202481 y 0.110968 z 0.537020In [104]:dfm.index.is_lexsorted()Out[104]:TrueIn [105]:dfm.index.lexsort_depthOut[105]:2
And now selection works as expected.
In [106]:dfm.loc[(0,'y'):(1,'z')]Out[106]: joliejim joe1 y 0.110968 z 0.537020
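The failure on an unsorted index and the fix by sorting can be reproduced in a few lines. This sketch reuses the jim/joe/jolie layout with fixed values. It checks sortedness with is_monotonic_increasing, which is an assumption made so the example does not depend on is_lexsorted() being available in the installed pandas.

```python
import pandas as pd

dfm = pd.DataFrame({'jim': [0, 0, 1, 1],
                    'joe': ['x', 'x', 'z', 'y'],
                    'jolie': [0.1, 0.2, 0.3, 0.4]}).set_index(['jim', 'joe'])

# Slicing with full tuples on the unsorted index is rejected.
try:
    dfm.loc[(0, 'y'):(1, 'z')]
    raised = False
except KeyError:  # the error raised here is a KeyError (or a subclass of it)
    raised = True

# After sorting, the same slice succeeds.
dfm_sorted = dfm.sort_index()
selected = dfm_sorted.loc[(0, 'y'):(1, 'z')]

print(raised, dfm.index.is_monotonic_increasing,
      dfm_sorted.index.is_monotonic_increasing)
print(selected)
```

The lower bound (0, 'y') does not exist in the index; on a sorted index that is fine, because label slicing only needs to know where the bound would fall.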
Similar to numpy ndarrays, pandas Index, Series, and DataFrame also provide the take method that retrieves elements along a given axis at the given indices. The given indices must be either a list or an ndarray of integer index positions. take will also accept negative integers as relative positions to the end of the object.
In [107]:index=pd.Index(np.random.randint(0,1000,10))In [108]:indexOut[108]:Int64Index([214,502,712,567,786,175,993,133,758,329],dtype='int64')In [109]:positions=[0,9,3]In [110]:index[positions]Out[110]:Int64Index([214,329,567],dtype='int64')In [111]:index.take(positions)Out[111]:Int64Index([214,329,567],dtype='int64')In [112]:ser=pd.Series(np.random.randn(10))In [113]:ser.iloc[positions]Out[113]:0 -0.1796669 1.8243753 0.392149dtype: float64In [114]:ser.take(positions)Out[114]:0 -0.1796669 1.8243753 0.392149dtype: float64
For DataFrames, the given indices should be a 1d list or ndarray that specifiesrow or column positions.
In [115]:frm=pd.DataFrame(np.random.randn(5,3))In [116]:frm.take([1,4,3])Out[116]: 0 1 21 -1.237881 0.106854 -1.2768294 0.629675 -1.425966 1.8577043 0.979542 -1.633678 0.615855In [117]:frm.take([0,2],axis=1)Out[117]: 0 20 0.595974 0.6015441 -1.237881 -1.2768292 -0.767101 1.4995913 0.979542 0.6158554 0.629675 1.857704
It is important to note that the take method on pandas objects is not intended to work on boolean indices and may return unexpected results.
In [118]:arr=np.random.randn(10)In [119]:arr.take([False,False,True,True])Out[119]:array([-1.1935,-1.1935,0.6775,0.6775])In [120]:arr[[0,1]]Out[120]:array([-1.1935,0.6775])In [121]:ser=pd.Series(np.random.randn(10))In [122]:ser.take([False,False,True,True])Out[122]:0 0.2331410 0.2331411 -0.2235401 -0.223540dtype: float64In [123]:ser.ix[[0,1]]Out[123]:0 0.2331411 -0.223540dtype: float64
Finally, as a small note on performance, because the take method handles a narrower range of inputs, it can offer performance that is a good deal faster than fancy indexing.
We have discussed MultiIndex in the previous sections pretty extensively. DatetimeIndex and PeriodIndex are shown here. TimedeltaIndex is shown here.
In the following sub-sections we will highlight some other index types.
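The following sketch summarizes the take behaviour described in this section on fixed data: positional selection equivalent to iloc, negative positions, and column selection through axis=1. The values are arbitrary.

```python
import numpy as np
import pandas as pd

ser = pd.Series([10, 20, 30, 40, 50])
positions = [0, 4, 2]

by_take = ser.take(positions)
by_iloc = ser.iloc[positions]
from_end = ser.take([-1, -2])  # negative positions count from the end

frame = pd.DataFrame(np.arange(6).reshape(2, 3))
cols = frame.take([0, 2], axis=1)  # first and last column

print(by_take.tolist(), from_end.tolist())
print(cols)
```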
New in version 0.16.1.
We introduce a CategoricalIndex, a new type of index object that is useful for supporting indexing with duplicates. This is a container around a Categorical (introduced in v0.15.0) and allows efficient indexing and storage of an index with a large number of duplicated elements. Prior to 0.16.1, setting the index of a DataFrame/Series with a category dtype would convert this to a regular object-based Index.
In [124]:df=pd.DataFrame({'A':np.arange(6), .....:'B':list('aabbca')}) .....:In [125]:df['B']=df['B'].astype('category',categories=list('cab'))In [126]:dfOut[126]: A B0 0 a1 1 a2 2 b3 3 b4 4 c5 5 aIn [127]:df.dtypesOut[127]:A int64B categorydtype: objectIn [128]:df.B.cat.categoriesOut[128]:Index([u'c',u'a',u'b'],dtype='object')
Setting the index will create a CategoricalIndex.
In [129]:df2=df.set_index('B')In [130]:df2.indexOut[130]:CategoricalIndex([u'a',u'a',u'b',u'b',u'c',u'a'],categories=[u'c',u'a',u'b'],ordered=False,name=u'B',dtype='category')
Indexing with __getitem__/.iloc/.loc/.ix works similarly to an Index with duplicates. The indexers MUST be in the category or the operation will raise.
In [131]:df2.loc['a']Out[131]: ABa 0a 1a 5
These PRESERVE the CategoricalIndex.
In [132]:df2.loc['a'].indexOut[132]:CategoricalIndex([u'a',u'a',u'a'],categories=[u'c',u'a',u'b'],ordered=False,name=u'B',dtype='category')
Sorting will order by the order of the categories.
In [133]:df2.sort_index()Out[133]: ABc 4a 0a 1a 5b 2b 3
Groupby operations on the index will preserve the index nature as well.
In [134]:df2.groupby(level=0).sum()Out[134]: ABc 4a 6b 5In [135]:df2.groupby(level=0).sum().indexOut[135]:CategoricalIndex([u'c',u'a',u'b'],categories=[u'c',u'a',u'b'],ordered=False,name=u'B',dtype='category')
Reindexing operations will return a resulting index based on the type of the passed indexer, meaning that passing a list will return a plain-old Index; indexing with a Categorical will return a CategoricalIndex, indexed according to the categories of the PASSED Categorical dtype. This allows one to arbitrarily index these even with values NOT in the categories, similarly to how you can reindex ANY pandas index.
In [136]:df2.reindex(['a','e'])Out[136]: ABa 0.0a 1.0a 5.0e NaNIn [137]:df2.reindex(['a','e']).indexOut[137]:Index([u'a',u'a',u'a',u'e'],dtype='object',name=u'B')In [138]:df2.reindex(pd.Categorical(['a','e'],categories=list('abcde')))Out[138]: ABa 0.0a 1.0a 5.0e NaNIn [139]:df2.reindex(pd.Categorical(['a','e'],categories=list('abcde'))).indexOut[139]:CategoricalIndex([u'a',u'a',u'a',u'e'],categories=[u'a',u'b',u'c',u'd',u'e'],ordered=False,name=u'B',dtype='category')
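The basic CategoricalIndex behaviours covered in this sub-section (creation through set_index, label selection, sorting by category order, and rejection of unknown labels) can be checked with a short sketch. It builds the categorical column with pd.Categorical rather than astype, an assumption made so the example does not depend on a particular astype signature.

```python
import pandas as pd

df = pd.DataFrame({'A': range(6),
                   'B': pd.Categorical(list('aabbca'), categories=list('cab'))})
df2 = df.set_index('B')

is_categorical = isinstance(df2.index, pd.CategoricalIndex)
selected = df2.loc['a']                        # all rows labelled 'a'
sorted_labels = list(df2.sort_index().index)   # follows category order c, a, b

# A label that is not one of the categories is rejected by .loc.
try:
    df2.loc['e']
    missing_raises = False
except KeyError:
    missing_raises = True

print(is_categorical, sorted_labels, missing_raises)
```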
Warning
Reshaping and comparison operations on a CategoricalIndex must have the same categories or a TypeError will be raised.
In [9]: df3 = pd.DataFrame({'A': np.arange(6),
                            'B': pd.Series(list('aabbca')).astype('category')})

In [11]: df3 = df3.set_index('B')

In [11]: df3.index
Out[11]: CategoricalIndex([u'a', u'a', u'b', u'b', u'c', u'a'], categories=[u'a', u'b', u'c'], ordered=False, name=u'B', dtype='category')

In [12]: pd.concat([df2, df3])
TypeError: categories must match existing categories when appending
Warning
Indexing on an integer-based Index with floats has been clarified in 0.18.0; for a summary of the changes, see here.
Int64Index is a fundamental basic index in pandas. This is an immutable array implementing an ordered, sliceable set. Prior to 0.18.0, the Int64Index would provide the default index for all NDFrame objects.
RangeIndex is a subclass of Int64Index added in version 0.18.0, now providing the default index for all NDFrame objects. RangeIndex is an optimized version of Int64Index that can represent a monotonic ordered set. These are analogous to Python range types.
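A quick way to see the default index in action, assuming pandas 0.18.0 or later, is to construct a Series without an explicit index and inspect what it received:

```python
import pandas as pd

ser = pd.Series(['a', 'b', 'c'])
default_index = ser.index

# The default index is a RangeIndex that behaves like range(len(ser)).
is_range = isinstance(default_index, pd.RangeIndex)
labels = list(default_index)

print(is_range, labels)
```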
Note
As of 0.14.0, Float64Index is backed by a native float64 dtype array. Prior to 0.14.0, Float64Index was backed by an object dtype array. Using a float64 dtype in the backend speeds up arithmetic operations by about 30x and boolean indexing operations on the Float64Index itself are about 2x as fast.
New in version 0.13.0.
By default a Float64Index will be automatically created when passing floating, or mixed-integer-floating values in index creation. This enables a pure label-based slicing paradigm that makes [], ix, loc for scalar indexing and slicing work exactly the same.
In [140]:indexf=pd.Index([1.5,2,3,4.5,5])In [141]:indexfOut[141]:Float64Index([1.5,2.0,3.0,4.5,5.0],dtype='float64')In [142]:sf=pd.Series(range(5),index=indexf)In [143]:sfOut[143]:1.5 02.0 13.0 24.5 35.0 4dtype: int64
Scalar selection for [], .ix, .loc will always be label based. An integer will match an equal float index (e.g. 3 is equivalent to 3.0).
In [144]:sf[3]Out[144]:2In [145]:sf[3.0]Out[145]:2In [146]:sf.ix[3]Out[146]:2In [147]:sf.ix[3.0]Out[147]:2In [148]:sf.loc[3]Out[148]:2In [149]:sf.loc[3.0]Out[149]:2
The only positional indexing is via iloc.
In [150]: sf.iloc[3]
Out[150]: 3
A scalar index that is not found will raise KeyError.
Slicing is ALWAYS on the values of the index for [], ix, loc and ALWAYS positional with iloc.
In [151]:sf[2:4]Out[151]:2.0 13.0 2dtype: int64In [152]:sf.ix[2:4]Out[152]:2.0 13.0 2dtype: int64In [153]:sf.loc[2:4]Out[153]:2.0 13.0 2dtype: int64In [154]:sf.iloc[2:4]Out[154]:3.0 24.5 3dtype: int64
In float indexes, slicing using floats is allowed
In [155]:sf[2.1:4.6]Out[155]:3.0 24.5 3dtype: int64In [156]:sf.loc[2.1:4.6]Out[156]:3.0 24.5 3dtype: int64
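The label-versus-position rules for a float index are summarized in the sketch below. It uses only .loc and .iloc, so it does not rely on .ix or on the behaviour of [] in any particular pandas version.

```python
import pandas as pd

sf = pd.Series(range(5), index=[1.5, 2.0, 3.0, 4.5, 5.0])

by_label = sf.loc[3]            # the integer 3 matches the label 3.0
by_position = sf.iloc[3]        # fourth element, whose label is 4.5

label_slice = sf.loc[2:4]       # labels between 2 and 4, both ends included
float_slice = sf.loc[2.1:4.6]   # bounds need not be existing labels
position_slice = sf.iloc[2:4]   # positions 2 and 3, end excluded

print(by_label, by_position)
print(list(label_slice.index), list(float_slice.index), list(position_slice.index))
```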
In non-float indexes, slicing using floats will raise a TypeError.
In [1]: pd.Series(range(5))[3.5]
TypeError: the label [3.5] is not a proper indexer for this index type (Int64Index)

In [1]: pd.Series(range(5))[3.5:4.5]
TypeError: the slice start [3.5] is not a proper indexer for this index type (Int64Index)
Warning
Using a scalar float indexer for .iloc has been removed in 0.18.0, so the following will raise a TypeError:
In [3]:pd.Series(range(5)).iloc[3.0]TypeError: cannot do positional indexing on <class 'pandas.indexes.range.RangeIndex'> with these indexers [3.0] of <type 'float'>
Further, the treatment of .ix with a float indexer on a non-float index will be label based, and thus coerce the index.
In [157]:s2=pd.Series([1,2,3],index=list('abc'))In [158]:s2Out[158]:a 1b 2c 3dtype: int64In [159]:s2.ix[1.0]=10In [160]:s2Out[160]:a 1b 2c 31.0 10dtype: int64
Here is a typical use-case for using this type of indexing. Imagine that you have a somewhat irregular timedelta-like indexing scheme, but the data is recorded as floats. This could, for example, be millisecond offsets.
In [161]:dfir=pd.concat([pd.DataFrame(np.random.randn(5,2), .....:index=np.arange(5)*250.0, .....:columns=list('AB')), .....:pd.DataFrame(np.random.randn(6,2), .....:index=np.arange(4,10)*250.1, .....:columns=list('AB'))]) .....:In [162]:dfirOut[162]: A B0.0 0.997289 -1.693316250.0 -0.179129 -1.598062500.0 0.936914 0.912560750.0 -1.003401 1.6327811000.0 -0.724626 0.1782191000.4 0.310610 -0.1080021250.5 -0.974226 -1.1477081500.6 -2.281374 0.7600101750.7 -0.742532 1.5333182000.8 2.495362 -0.4327712250.9 -0.068954 0.043520
Selection operations then will always work on a value basis, for all selection operators.
In [163]:dfir[0:1000.4]Out[163]: A B0.0 0.997289 -1.693316250.0 -0.179129 -1.598062500.0 0.936914 0.912560750.0 -1.003401 1.6327811000.0 -0.724626 0.1782191000.4 0.310610 -0.108002In [164]:dfir.loc[0:1001,'A']Out[164]:0.0 0.997289250.0 -0.179129500.0 0.936914750.0 -1.0034011000.0 -0.7246261000.4 0.310610Name: A, dtype: float64In [165]:dfir.loc[1000.4]Out[165]:A 0.310610B -0.108002Name: 1000.4, dtype: float64
You could then easily pick out the first 1 second (1000 ms) of data.
In [166]:dfir[0:1000]Out[166]: A B0.0 0.997289 -1.693316250.0 -0.179129 -1.598062500.0 0.936914 0.912560750.0 -1.003401 1.6327811000.0 -0.724626 0.178219
Of course, if you need integer based selection, then use iloc.
In [167]:dfir.iloc[0:5]Out[167]: A B0.0 0.997289 -1.693316250.0 -0.179129 -1.598062500.0 0.936914 0.912560750.0 -1.003401 1.6327811000.0 -0.724626 0.178219