MultiIndex / Advanced Indexing

This section covers indexing with a MultiIndex and more advanced indexing features.

See the Indexing and Selecting Data section for general indexing documentation.

Warning

Whether a copy or a reference is returned for a setting operation may depend on the context. This is sometimes called chained assignment and should be avoided. See Returning a View versus Copy.

Warning

In 0.15.0 Index has internally been refactored to no longer subclass ndarray but instead subclass PandasObject, similarly to the rest of the pandas objects. This should be a transparent change with only very limited API implications (see the Internal Refactoring).

See the cookbook for some advanced strategies.

Hierarchical indexing (MultiIndex)

Hierarchical / Multi-level indexing is very exciting as it opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you to store and manipulate data with an arbitrary number of dimensions in lower dimensional data structures like Series (1d) and DataFrame (2d).

In this section, we will show what exactly we mean by “hierarchical” indexing and how it integrates with all of the pandas indexing functionality described above and in prior sections. Later, when discussing group by and pivoting and reshaping data, we’ll show non-trivial applications to illustrate how it aids in structuring data for analysis.

Creating a MultiIndex (hierarchical index) object

The MultiIndex object is the hierarchical analogue of the standard Index object which typically stores the axis labels in pandas objects. You can think of a MultiIndex as an array of tuples where each tuple is unique. A MultiIndex can be created from a list of arrays (using MultiIndex.from_arrays), an array of tuples (using MultiIndex.from_tuples), or a crossed set of iterables (using MultiIndex.from_product). The Index constructor will attempt to return a MultiIndex when it is passed a list of tuples. The following examples demonstrate different ways to initialize MultiIndexes.

In [1]: arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
   ...:           ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
   ...:

In [2]: tuples = list(zip(*arrays))

In [3]: tuples
Out[3]:
[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]

In [4]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

In [5]: index
Out[5]:
MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=[u'first', u'second'])

In [6]: s = pd.Series(np.random.randn(8), index=index)

In [7]: s
Out[7]:
first  second
bar    one       0.469112
       two      -0.282863
baz    one      -1.509059
       two      -1.135632
foo    one       1.212112
       two      -0.173215
qux    one       0.119209
       two      -1.044236
dtype: float64

When you want every pairing of the elements in two iterables, it can be easier to use the MultiIndex.from_product function:

In [8]: iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]

In [9]: pd.MultiIndex.from_product(iterables, names=['first', 'second'])
Out[9]:
MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=[u'first', u'second'])
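As a quick check of the three constructors described above, the following sketch builds the same small index three ways and confirms the results are equal. The variable names are illustrative only and do not appear elsewhere in this section.

```python
import pandas as pd

arrays = [['bar', 'bar', 'baz', 'baz'],
          ['one', 'two', 'one', 'two']]
names = ['first', 'second']

# 1. From a list of parallel arrays, one array per level.
from_arrays_idx = pd.MultiIndex.from_arrays(arrays, names=names)

# 2. From a list of tuples, one tuple per row label.
from_tuples_idx = pd.MultiIndex.from_tuples(list(zip(*arrays)), names=names)

# 3. From the cartesian product of the per-level values.
from_product_idx = pd.MultiIndex.from_product([['bar', 'baz'], ['one', 'two']],
                                              names=names)

# All three describe the same sequence of (first, second) labels.
same = (from_arrays_idx.equals(from_tuples_idx)
        and from_tuples_idx.equals(from_product_idx))
print(same)
```

from_product is usually the shortest spelling when every combination is wanted; from_arrays and from_tuples are the natural choice when only some combinations occur.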

As a convenience, you can pass a list of arrays directly into Series or DataFrame to construct a MultiIndex automatically:

In [10]: arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
   ....:           np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]
   ....:

In [11]: s = pd.Series(np.random.randn(8), index=arrays)

In [12]: s
Out[12]:
bar  one   -0.861849
     two   -2.104569
baz  one   -0.494929
     two    1.071804
foo  one    0.721555
     two   -0.706771
qux  one   -1.039575
     two    0.271860
dtype: float64

In [13]: df = pd.DataFrame(np.random.randn(8, 4), index=arrays)

In [14]: df
Out[14]:
                0         1         2         3
bar one -0.424972  0.567020  0.276232 -1.087401
    two -0.673690  0.113648 -1.478427  0.524988
baz one  0.404705  0.577046 -1.715002 -1.039268
    two -0.370647 -1.157892 -1.344312  0.844885
foo one  1.075770 -0.109050  1.643563 -1.469388
    two  0.357021 -0.674600 -1.776904 -0.968914
qux one -1.294524  0.413738  0.276662 -0.472035
    two -0.013960 -0.362543 -0.006154 -0.923061

All of the MultiIndex constructors accept a names argument which stores string names for the levels themselves. If no names are provided, None will be assigned:

In [15]: df.index.names
Out[15]: FrozenList([None, None])

This index can back any axis of a pandas object, and the number of levels of the index is up to you:

In [16]: df = pd.DataFrame(np.random.randn(3, 8), index=['A', 'B', 'C'], columns=index)

In [17]: df
Out[17]:
first        bar                 baz                 foo                 qux  \
second       one       two       one       two       one       two       one
A       0.895717  0.805244 -1.206412  2.565646  1.431256  1.340309 -1.170299
B       0.410835  0.813850  0.132003 -0.827317 -0.076467 -1.187678  1.130127
C      -1.413681  1.607920  1.024180  0.569605  0.875906 -2.211372  0.974466

first
second       two
A      -0.226169
B      -1.436737
C      -2.006747

In [18]: pd.DataFrame(np.random.randn(6, 6), index=index[:6], columns=index[:6])
Out[18]:
first              bar                 baz                 foo
second             one       two       one       two       one       two
first second
bar   one    -0.410001 -0.078638  0.545952 -1.219217 -1.226825  0.769804
      two    -1.281247 -0.727707 -0.121306 -0.097883  0.695775  0.341734
baz   one     0.959726 -1.110336 -0.619976  0.149748 -0.732339  0.687738
      two     0.176444  0.403310 -0.154951  0.301624 -2.179861 -1.369849
foo   one    -0.954208  1.462696 -1.743161 -0.826591 -0.345352  1.314232
      two     0.690579  0.995761  2.396780  0.014871  3.357427 -0.317441

We’ve “sparsified” the higher levels of the indexes to make the console output a bit easier on the eyes.

It’s worth keeping in mind that there’s nothing preventing you from using tuples as atomic labels on an axis:

In [19]: pd.Series(np.random.randn(8), index=tuples)
Out[19]:
(bar, one)   -1.236269
(bar, two)    0.896171
(baz, one)   -0.487602
(baz, two)   -0.082240
(foo, one)   -2.182937
(foo, two)    0.380396
(qux, one)    0.084844
(qux, two)    0.432390
dtype: float64

The reason that the MultiIndex matters is that it can allow you to do grouping, selection, and reshaping operations as we will describe below and in subsequent areas of the documentation. As you will see in later sections, you can find yourself working with hierarchically-indexed data without creating a MultiIndex explicitly yourself. However, when loading data from a file, you may wish to generate your own MultiIndex when preparing the data set.

Note that how the index is displayed can be controlled using the display.multi_sparse option in pandas.set_option:

In [20]: pd.set_option('display.multi_sparse', False)

In [21]: df
Out[21]:
first        bar       bar       baz       baz       foo       foo       qux  \
second       one       two       one       two       one       two       one
A       0.895717  0.805244 -1.206412  2.565646  1.431256  1.340309 -1.170299
B       0.410835  0.813850  0.132003 -0.827317 -0.076467 -1.187678  1.130127
C      -1.413681  1.607920  1.024180  0.569605  0.875906 -2.211372  0.974466

first        qux
second       two
A      -0.226169
B      -1.436737
C      -2.006747

In [22]: pd.set_option('display.multi_sparse', True)

Reconstructing the level labels

The method get_level_values will return a vector of the labels for each location at a particular level:

In [23]: index.get_level_values(0)
Out[23]: Index([u'bar', u'bar', u'baz', u'baz', u'foo', u'foo', u'qux', u'qux'], dtype='object', name=u'first')

In [24]: index.get_level_values('second')
Out[24]: Index([u'one', u'two', u'one', u'two', u'one', u'two', u'one', u'two'], dtype='object', name=u'second')

Basic indexing on axis with MultiIndex

One of the important features of hierarchical indexing is that you can select data by a “partial” label identifying a subgroup in the data. Partial selection “drops” levels of the hierarchical index in the result in a completely analogous way to selecting a column in a regular DataFrame:

In [25]: df['bar']
Out[25]:
second       one       two
A       0.895717  0.805244
B       0.410835  0.813850
C      -1.413681  1.607920

In [26]: df['bar', 'one']
Out[26]:
A    0.895717
B    0.410835
C   -1.413681
Name: (bar, one), dtype: float64

In [27]: df['bar']['one']
Out[27]:
A    0.895717
B    0.410835
C   -1.413681
Name: one, dtype: float64

In [28]: s['qux']
Out[28]:
one   -1.039575
two    0.271860
dtype: float64

See Cross-section with hierarchical index for how to select on a deeper level.
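The level-dropping behaviour of partial selection can be seen with deterministic data. This is a minimal sketch; the frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product([['bar', 'baz'], ['one', 'two']],
                                     names=['first', 'second'])
frame = pd.DataFrame(np.arange(12).reshape(3, 4),
                     index=['A', 'B', 'C'], columns=columns)

# A partial label selects the whole 'bar' subgroup; the 'first' level is dropped,
# so the remaining column labels come from the 'second' level only.
bar = frame['bar']
print(list(bar.columns), bar.columns.name)

# A full label (one value per level) selects a single column as a Series.
bar_one = frame['bar', 'one']
print(bar_one.tolist())
```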

Note

The repr of a MultiIndex shows ALL the defined levels of an index, even if they are not actually used. When slicing an index, you may notice this. For example:

# original multi-index
In [29]: df.columns
Out[29]:
MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=[u'first', u'second'])

# sliced
In [30]: df[['foo', 'qux']].columns
Out[30]:
MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
           labels=[[2, 2, 3, 3], [0, 1, 0, 1]],
           names=[u'first', u'second'])

This is done to avoid a recomputation of the levels in order to make slicing highly performant. If you want to see only the levels that are actually used:

In [31]: df[['foo', 'qux']].columns.values
Out[31]: array([('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')], dtype=object)

# for a specific level
In [32]: df[['foo', 'qux']].columns.get_level_values(0)
Out[32]: Index([u'foo', u'foo', u'qux', u'qux'], dtype='object', name=u'first')

To reconstruct the MultiIndex with only the used levels:

In [33]: pd.MultiIndex.from_tuples(df[['foo', 'qux']].columns.values)
Out[33]:
MultiIndex(levels=[[u'foo', u'qux'], [u'one', u'two']],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
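The difference between defined levels and used labels can be checked programmatically. The sketch below slices a MultiIndex directly (rather than through a DataFrame) and rebuilds it from its tuples; the names sliced and rebuilt are illustrative.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
                                   names=['first', 'second'])

# Slicing keeps the original level definitions, including unused ones.
sliced = index[4:]                      # only 'foo' and 'qux' labels remain
print(list(sliced.levels[0]))           # still lists all four first-level values

# Rebuilding from the tuples recomputes the levels from the labels in use.
rebuilt = pd.MultiIndex.from_tuples(list(sliced), names=sliced.names)
print(list(rebuilt.levels[0]))

# The labels themselves are identical either way.
print(list(sliced) == list(rebuilt))
```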

Data alignment and using reindex

Operations between differently-indexed objects having MultiIndex on the axes will work as you expect; data alignment will work the same as an Index of tuples:

In [34]: s + s[:-2]
Out[34]:
bar  one   -1.723698
     two   -4.209138
baz  one   -0.989859
     two    2.143608
foo  one    1.443110
     two   -1.413542
qux  one         NaN
     two         NaN
dtype: float64

In [35]: s + s[::2]
Out[35]:
bar  one   -1.723698
     two         NaN
baz  one   -0.989859
     two         NaN
foo  one    1.443110
     two         NaN
qux  one   -2.079150
     two         NaN
dtype: float64

reindex can be called with another MultiIndex or even a list or array of tuples:

In [36]: s.reindex(index[:3])
Out[36]:
first  second
bar    one      -0.861849
       two      -2.104569
baz    one      -0.494929
dtype: float64

In [37]: s.reindex([('foo', 'two'), ('bar', 'one'), ('qux', 'one'), ('baz', 'one')])
Out[37]:
foo  two   -0.706771
bar  one   -0.861849
qux  one   -1.039575
baz  one   -0.494929
dtype: float64
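With fixed values in place of random ones, the alignment and reindexing rules above are easy to verify by hand. This is a small sketch using made-up data.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['bar', 'baz', 'foo'], ['one', 'two']])
s = pd.Series(np.arange(6, dtype=float), index=index)

# Adding a shorter Series aligns on the full (first, second) label;
# labels missing from either operand produce NaN.
total = s + s.iloc[:-2]
print(total)

# reindex accepts a list of tuples and returns rows in the requested order.
picked = s.reindex([('foo', 'two'), ('bar', 'one')])
print(picked.tolist())
```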

Advanced indexing with hierarchical index

Syntactically integrating MultiIndex in advanced indexing with .loc/.ix is a bit challenging, but we’ve made every effort to do so. For example, the following works as you would expect:

In [38]: df = df.T

In [39]: df
Out[39]:
                     A         B         C
first second
bar   one     0.895717  0.410835 -1.413681
      two     0.805244  0.813850  1.607920
baz   one    -1.206412  0.132003  1.024180
      two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466
      two    -0.226169 -1.436737 -2.006747

In [40]: df.loc['bar']
Out[40]:
               A         B         C
second
one     0.895717  0.410835 -1.413681
two     0.805244  0.813850  1.607920

In [41]: df.loc['bar', 'two']
Out[41]:
A    0.805244
B    0.813850
C    1.607920
Name: (bar, two), dtype: float64

“Partial” slicing also works quite nicely.

In [42]: df.loc['baz':'foo']
Out[42]:
                     A         B         C
first second
baz   one    -1.206412  0.132003  1.024180
      two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372

You can slice with a ‘range’ of values, by providing a slice of tuples.

In [43]: df.loc[('baz', 'two'):('qux', 'one')]
Out[43]:
                     A         B         C
first second
baz   two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466

In [44]: df.loc[('baz', 'two'):'foo']
Out[44]:
                     A         B         C
first second
baz   two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
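The inclusive endpoints of tuple slices are easier to see with integer data. The following is a sketch on a small sorted frame; the column name value is an assumption made for the example.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
                                   names=['first', 'second'])
frame = pd.DataFrame({'value': np.arange(8)}, index=index)

# Both endpoints are included because this is label-based slicing.
window = frame.loc[('baz', 'two'):('qux', 'one')]
print(window['value'].tolist())

# A bare first-level label as the stop includes every row under that label.
through_foo = frame.loc[('baz', 'two'):'foo']
print(through_foo['value'].tolist())
```

Note that the index here is already sorted, which label slicing on a MultiIndex relies on.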

Passing a list of labels or tuples works similarly to reindexing:

In [45]: df.ix[[('bar', 'two'), ('qux', 'one')]]
Out[45]:
                     A         B         C
first second
bar   two     0.805244  0.813850  1.607920
qux   one    -1.170299  1.130127  0.974466

Using slicers

New in version 0.14.0.

In 0.14.0 we added a new way to slice multi-indexed objects. You can slice a multi-index by providing multiple indexers.

You can provide any of the selectors as if you are indexing by label (see Selection by Label), including slices, lists of labels, labels, and boolean indexers.

You can use slice(None) to select all the contents of that level. You do not need to specify all the deeper levels; they will be implied as slice(None).

As usual, both sides of the slicers are included, as this is label indexing.
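A compact illustration of these rules, assuming a small two-level frame invented for the purpose (the column names x and y are not from this section):

```python
import numpy as np
import pandas as pd

rows = pd.MultiIndex.from_product([['A0', 'A1', 'A2', 'A3'], ['B0', 'B1']])
frame = pd.DataFrame({'x': np.arange(8), 'y': np.arange(8) * 10}, index=rows)

# One indexer per level: a label slice on level 0, everything on level 1.
# The trailing ':' addresses the columns, so both axes are spelled out.
subset = frame.loc[(slice('A1', 'A2'), slice(None)), :]
print(subset['x'].tolist())      # 'A2' rows are included: the slice is inclusive

# A list of labels on the second level, with all of the first level.
b1_only = frame.loc[(slice(None), ['B1']), 'x']
print(b1_only.tolist())
```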

Warning

You should specify all axes in the .loc specifier, meaning the indexer for the index and for the columns. There are some ambiguous cases where the passed indexer could be misinterpreted as indexing both axes, rather than into, say, the MultiIndex for the rows.

You should do this:

df.loc[(slice('A1','A3'),.....),:]

rather than this:

df.loc[(slice('A1','A3'),.....)]

In [46]: def mklbl(prefix, n):
   ....:     return ["%s%s" % (prefix, i) for i in range(n)]
   ....:

In [47]: miindex = pd.MultiIndex.from_product([mklbl('A', 4),
   ....:                                       mklbl('B', 2),
   ....:                                       mklbl('C', 4),
   ....:                                       mklbl('D', 2)])
   ....:

In [48]: micolumns = pd.MultiIndex.from_tuples([('a', 'foo'), ('a', 'bar'),
   ....:                                        ('b', 'foo'), ('b', 'bah')],
   ....:                                       names=['lvl0', 'lvl1'])
   ....:

In [49]: dfmi = pd.DataFrame(np.arange(len(miindex) * len(micolumns)).reshape((len(miindex), len(micolumns))),
   ....:                     index=miindex,
   ....:                     columns=micolumns).sort_index().sort_index(axis=1)
   ....:

In [50]: dfmi
Out[50]:
lvl0           a         b
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0    9    8   11   10
         D1   13   12   15   14
      C2 D0   17   16   19   18
         D1   21   20   23   22
      C3 D0   25   24   27   26
...          ...  ...  ...  ...
A3 B1 C0 D1  229  228  231  230
      C1 D0  233  232  235  234
         D1  237  236  239  238
      C2 D0  241  240  243  242
         D1  245  244  247  246
      C3 D0  249  248  251  250
         D1  253  252  255  254

[64 rows x 4 columns]

Basic multi-index slicing using slices, lists, and labels.

In [51]: dfmi.loc[(slice('A1', 'A3'), slice(None), ['C1', 'C3']), :]
Out[51]:
lvl0           a         b
lvl1         bar  foo  bah  foo
A1 B0 C1 D0   73   72   75   74
         D1   77   76   79   78
      C3 D0   89   88   91   90
         D1   93   92   95   94
   B1 C1 D0  105  104  107  106
         D1  109  108  111  110
      C3 D0  121  120  123  122
...          ...  ...  ...  ...
A3 B0 C1 D1  205  204  207  206
      C3 D0  217  216  219  218
         D1  221  220  223  222
   B1 C1 D0  233  232  235  234
         D1  237  236  239  238
      C3 D0  249  248  251  250
         D1  253  252  255  254

[24 rows x 4 columns]

You can use a pd.IndexSlice to have a more natural syntax using : rather than using slice(None)

In [52]: idx = pd.IndexSlice

In [53]: dfmi.loc[idx[:, :, ['C1', 'C3']], idx[:, 'foo']]
Out[53]:
lvl0           a    b
lvl1         foo  foo
A0 B0 C1 D0    8   10
         D1   12   14
      C3 D0   24   26
         D1   28   30
   B1 C1 D0   40   42
         D1   44   46
      C3 D0   56   58
...          ...  ...
A3 B0 C1 D1  204  206
      C3 D0  216  218
         D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

[32 rows x 2 columns]

It is possible to perform quite complicated selections using this method on multiple axes at the same time.

In [54]: dfmi.loc['A1', (slice(None), 'foo')]
Out[54]:
lvl0        a    b
lvl1      foo  foo
B0 C0 D0   64   66
      D1   68   70
   C1 D0   72   74
      D1   76   78
   C2 D0   80   82
      D1   84   86
   C3 D0   88   90
...       ...  ...
B1 C0 D1  100  102
   C1 D0  104  106
      D1  108  110
   C2 D0  112  114
      D1  116  118
   C3 D0  120  122
      D1  124  126

[16 rows x 2 columns]

In [55]: dfmi.loc[idx[:, :, ['C1', 'C3']], idx[:, 'foo']]
Out[55]:
lvl0           a    b
lvl1         foo  foo
A0 B0 C1 D0    8   10
         D1   12   14
      C3 D0   24   26
         D1   28   30
   B1 C1 D0   40   42
         D1   44   46
      C3 D0   56   58
...          ...  ...
A3 B0 C1 D1  204  206
      C3 D0  216  218
         D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

[32 rows x 2 columns]

Using a boolean indexer you can provide selection related to the values.

In [56]: mask = dfmi[('a', 'foo')] > 200

In [57]: dfmi.loc[idx[mask, :, ['C1', 'C3']], idx[:, 'foo']]
Out[57]:
lvl0           a    b
lvl1         foo  foo
A3 B0 C1 D1  204  206
      C3 D0  216  218
         D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

You can also specify the axis argument to .loc to interpret the passed slicers on a single axis.

In [58]: dfmi.loc(axis=0)[:, :, ['C1', 'C3']]
Out[58]:
lvl0           a         b
lvl1         bar  foo  bah  foo
A0 B0 C1 D0    9    8   11   10
         D1   13   12   15   14
      C3 D0   25   24   27   26
         D1   29   28   31   30
   B1 C1 D0   41   40   43   42
         D1   45   44   47   46
      C3 D0   57   56   59   58
...          ...  ...  ...  ...
A3 B0 C1 D1  205  204  207  206
      C3 D0  217  216  219  218
         D1  221  220  223  222
   B1 C1 D0  233  232  235  234
         D1  237  236  239  238
      C3 D0  249  248  251  250
         D1  253  252  255  254

[32 rows x 4 columns]

Furthermore you can set the values using these methods

In [59]: df2 = dfmi.copy()

In [60]: df2.loc(axis=0)[:, :, ['C1', 'C3']] = -10

In [61]: df2
Out[61]:
lvl0           a         b
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0  -10  -10  -10  -10
         D1  -10  -10  -10  -10
      C2 D0   17   16   19   18
         D1   21   20   23   22
      C3 D0  -10  -10  -10  -10
...          ...  ...  ...  ...
A3 B1 C0 D1  229  228  231  230
      C1 D0  -10  -10  -10  -10
         D1  -10  -10  -10  -10
      C2 D0  241  240  243  242
         D1  245  244  247  246
      C3 D0  -10  -10  -10  -10
         D1  -10  -10  -10  -10

[64 rows x 4 columns]

You can use a right-hand-side of an alignable object as well.

In [62]: df2 = dfmi.copy()

In [63]: df2.loc[idx[:, :, ['C1', 'C3']], :] = df2 * 1000

In [64]: df2
Out[64]:
lvl0              a               b
lvl1            bar     foo     bah     foo
A0 B0 C0 D0       1       0       3       2
         D1       5       4       7       6
      C1 D0    9000    8000   11000   10000
         D1   13000   12000   15000   14000
      C2 D0      17      16      19      18
         D1      21      20      23      22
      C3 D0   25000   24000   27000   26000
...             ...     ...     ...     ...
A3 B1 C0 D1     229     228     231     230
      C1 D0  233000  232000  235000  234000
         D1  237000  236000  239000  238000
      C2 D0     241     240     243     242
         D1     245     244     247     246
      C3 D0  249000  248000  251000  250000
         D1  253000  252000  255000  254000

[64 rows x 4 columns]

Cross-section

The xs method of DataFrame additionally takes a level argument to make selecting data at a particular level of a MultiIndex easier.

In [65]: df
Out[65]:
                     A         B         C
first second
bar   one     0.895717  0.410835 -1.413681
      two     0.805244  0.813850  1.607920
baz   one    -1.206412  0.132003  1.024180
      two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466
      two    -0.226169 -1.436737 -2.006747

In [66]: df.xs('one', level='second')
Out[66]:
              A         B         C
first
bar    0.895717  0.410835 -1.413681
baz   -1.206412  0.132003  1.024180
foo    1.431256 -0.076467  0.875906
qux   -1.170299  1.130127  0.974466

# using the slicers (new in 0.14.0)
In [67]: df.loc[(slice(None), 'one'), :]
Out[67]:
                     A         B         C
first second
bar   one     0.895717  0.410835 -1.413681
baz   one    -1.206412  0.132003  1.024180
foo   one     1.431256 -0.076467  0.875906
qux   one    -1.170299  1.130127  0.974466

You can also select on the columns with xs(), by providing the axis argument

In [68]: df = df.T

In [69]: df.xs('one', level='second', axis=1)
Out[69]:
first       bar       baz       foo       qux
A      0.895717 -1.206412  1.431256 -1.170299
B      0.410835  0.132003 -0.076467  1.130127
C     -1.413681  1.024180  0.875906  0.974466

# using the slicers (new in 0.14.0)
In [70]: df.loc[:, (slice(None), 'one')]
Out[70]:
first        bar       baz       foo       qux
second       one       one       one       one
A       0.895717 -1.206412  1.431256 -1.170299
B       0.410835  0.132003 -0.076467  1.130127
C      -1.413681  1.024180  0.875906  0.974466

xs() also allows selection with multiple keys

In [71]: df.xs(('one', 'bar'), level=('second', 'first'), axis=1)
Out[71]:
first        bar
second       one
A       0.895717
B       0.410835
C      -1.413681

# using the slicers (new in 0.14.0)
In [72]: df.loc[:, ('bar', 'one')]
Out[72]:
A    0.895717
B    0.410835
C   -1.413681
Name: (bar, one), dtype: float64

New in version 0.13.0.

You can pass drop_level=False to xs() to retain the level that was selected

In [73]: df.xs('one', level='second', axis=1, drop_level=False)
Out[73]:
first        bar       baz       foo       qux
second       one       one       one       one
A       0.895717 -1.206412  1.431256 -1.170299
B       0.410835  0.132003 -0.076467  1.130127
C      -1.413681  1.024180  0.875906  0.974466

versus the result with drop_level=True (the default value)

In [74]: df.xs('one', level='second', axis=1, drop_level=True)
Out[74]:
first       bar       baz       foo       qux
A      0.895717 -1.206412  1.431256 -1.170299
B      0.410835  0.132003 -0.076467  1.130127
C     -1.413681  1.024180  0.875906  0.974466
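The relationship between xs(), drop_level, and the slicer form can be confirmed on a small frame with known values. This sketch selects on the row index rather than the columns; the frame is invented for the example.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['bar', 'baz'], ['one', 'two']],
                                   names=['first', 'second'])
frame = pd.DataFrame({'A': np.arange(4)}, index=index)

# Select 'one' on the inner level; by default that level is dropped.
dropped = frame.xs('one', level='second')
print(dropped.index.tolist())

# drop_level=False keeps both levels, matching the equivalent slicer form.
kept = frame.xs('one', level='second', drop_level=False)
via_loc = frame.loc[(slice(None), 'one'), :]
print(kept.index.tolist() == via_loc.index.tolist())
```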

Advanced reindexing and alignment

The parameter level has been added to the reindex and align methods of pandas objects. This is useful to broadcast values across a level. For instance:

In [75]: midx = pd.MultiIndex(levels=[['zero', 'one'], ['x', 'y']],
   ....:                      labels=[[1, 1, 0, 0], [1, 0, 1, 0]])
   ....:

In [76]: df = pd.DataFrame(np.random.randn(4, 2), index=midx)

In [77]: df
Out[77]:
               0         1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520

In [78]: df2 = df.mean(level=0)

In [79]: df2
Out[79]:
             0         1
zero  1.271532  0.713416
one   1.060074 -0.109716

In [80]: df2.reindex(df.index, level=0)
Out[80]:
               0         1
one  y  1.060074 -0.109716
     x  1.060074 -0.109716
zero y  1.271532  0.713416
     x  1.271532  0.713416

# aligning
In [81]: df_aligned, df2_aligned = df.align(df2, level=0)

In [82]: df_aligned
Out[82]:
               0         1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520

In [83]: df2_aligned
Out[83]:
               0         1
one  y  1.060074 -0.109716
     x  1.060074 -0.109716
zero y  1.271532  0.713416
     x  1.271532  0.713416
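A typical use of broadcasting across a level is subtracting a per-group mean from every row of the group. The sketch below uses fixed values so the arithmetic can be followed. As assumptions of this example, the index is built with MultiIndex.from_arrays and the group means come from groupby(level=0).mean(), which is swapped in here for the mean(level=0) call shown above.

```python
import pandas as pd

midx = pd.MultiIndex.from_arrays([['one', 'one', 'zero', 'zero'],
                                  ['y', 'x', 'y', 'x']])
frame = pd.DataFrame({'v': [1.0, 3.0, 10.0, 20.0]}, index=midx)

# One mean per first-level label: 'one' -> 2.0, 'zero' -> 15.0.
means = frame.groupby(level=0).mean()

# Broadcast each group mean back onto every row of its group.
broadcast = means.reindex(index=frame.index, level=0)
print(broadcast['v'].tolist())

# With matching indexes, ordinary arithmetic now removes the group mean.
demeaned = frame - broadcast
print(demeaned['v'].tolist())
```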

Swapping levels with swaplevel()

The swaplevel function can switch the order of two levels:

In [84]: df[:5]
Out[84]:
               0         1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520

In [85]: df[:5].swaplevel(0, 1, axis=0)
Out[85]:
               0         1
y one   1.519970 -0.493662
x one   0.600178  0.274230
y zero  0.132885 -0.023688
x zero  2.410179  1.450520

Reordering levels with reorder_levels()

The reorder_levels function generalizes the swaplevel function, allowing you to permute the hierarchical index levels in one step:

In [86]: df[:5].reorder_levels([1, 0], axis=0)
Out[86]:
               0         1
y one   1.519970 -0.493662
x one   0.600178  0.274230
y zero  0.132885 -0.023688
x zero  2.410179  1.450520
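With only two levels, swaplevel and reorder_levels give the same result, so the difference shows up with three or more levels. The following sketch uses an invented three-level index with level names outer, middle, and inner.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['a', 'b'], ['x', 'y'], [1, 2]],
                                   names=['outer', 'middle', 'inner'])
frame = pd.DataFrame({'v': np.arange(8)}, index=index)

# swaplevel exchanges exactly two levels.
swapped = frame.swaplevel('outer', 'inner')
print(list(swapped.index.names))

# reorder_levels takes the complete new order, so any permutation is possible.
rotated = frame.reorder_levels(['inner', 'outer', 'middle'])
print(list(rotated.index.names))

# Neither call moves rows; only the order of labels within each tuple changes.
print(rotated['v'].tolist() == frame['v'].tolist())
```

Because the rows are not re-sorted, a call to sort_index is usually needed afterwards if label slicing is planned.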

Sorting a MultiIndex

For MultiIndex-ed objects to be indexed & sliced effectively, they need to be sorted. As with any index, you can use sort_index.

In [87]: import random; random.shuffle(tuples)

In [88]: s = pd.Series(np.random.randn(8), index=pd.MultiIndex.from_tuples(tuples))

In [89]: s
Out[89]:
qux  one    0.206053
baz  two   -0.251905
bar  one   -2.213588
     two    1.063327
foo  one    1.266143
     two    0.299368
baz  one   -0.863838
qux  two    0.408204
dtype: float64

In [90]: s.sort_index()
Out[90]:
bar  one   -2.213588
     two    1.063327
baz  one   -0.863838
     two   -0.251905
foo  one    1.266143
     two    0.299368
qux  one    0.206053
     two    0.408204
dtype: float64

In [91]: s.sort_index(level=0)
Out[91]:
bar  one   -2.213588
     two    1.063327
baz  one   -0.863838
     two   -0.251905
foo  one    1.266143
     two    0.299368
qux  one    0.206053
     two    0.408204
dtype: float64

In [92]: s.sort_index(level=1)
Out[92]:
bar  one   -2.213588
baz  one   -0.863838
foo  one    1.266143
qux  one    0.206053
bar  two    1.063327
baz  two   -0.251905
foo  two    0.299368
qux  two    0.408204
dtype: float64

You may also pass a level name to sort_index if the MultiIndex levels are named.

In [93]: s.index.set_names(['L1', 'L2'], inplace=True)

In [94]: s.sort_index(level='L1')
Out[94]:
L1   L2
bar  one   -2.213588
     two    1.063327
baz  one   -0.863838
     two   -0.251905
foo  one    1.266143
     two    0.299368
qux  one    0.206053
     two    0.408204
dtype: float64

In [95]: s.sort_index(level='L2')
Out[95]:
L1   L2
bar  one   -2.213588
baz  one   -0.863838
foo  one    1.266143
qux  one    0.206053
bar  two    1.063327
baz  two   -0.251905
foo  two    0.299368
qux  two    0.408204
dtype: float64

On higher dimensional objects, you can sort any of the other axes by level if they have a MultiIndex:

In [96]: df.T.sort_index(level=1, axis=1)
Out[96]:
       zero       one      zero       one
          x         x         y         y
0  2.410179  0.600178  0.132885  1.519970
1  1.450520  0.274230 -0.023688 -0.493662

Indexing will work even if the data are not sorted, but will be rather inefficient (and show a PerformanceWarning). It will also return a copy of the data rather than a view:

In [97]: dfm = pd.DataFrame({'jim': [0, 0, 1, 1],
   ....:                     'joe': ['x', 'x', 'z', 'y'],
   ....:                     'jolie': np.random.rand(4)})
   ....:

In [98]: dfm = dfm.set_index(['jim', 'joe'])

In [99]: dfm
Out[99]:
            jolie
jim joe
0   x    0.490671
    x    0.120248
1   z    0.537020
    y    0.110968

In [4]: dfm.loc[(1, 'z')]
PerformanceWarning: indexing past lexsort depth may impact performance.

Out[4]:
           jolie
jim joe
1   z    0.64094

Furthermore, if you try to index something that is not fully lexsorted, this can raise:

In [5]: dfm.loc[(0, 'y'):(1, 'z')]
KeyError: 'Key length (2) was greater than MultiIndex lexsort depth (1)'

The is_lexsorted() method on an Index shows whether the index is sorted, and the lexsort_depth property returns the sort depth:

In [100]: dfm.index.is_lexsorted()
Out[100]: False

In [101]: dfm.index.lexsort_depth
Out[101]: 1

In [102]: dfm = dfm.sort_index()

In [103]: dfm
Out[103]:
            jolie
jim joe
0   x    0.490671
    x    0.120248
1   y    0.110968
    z    0.537020

In [104]: dfm.index.is_lexsorted()
Out[104]: True

In [105]: dfm.index.lexsort_depth
Out[105]: 2

And now selection works as expected.

In [106]: dfm.loc[(0, 'y'):(1, 'z')]
Out[106]:
            jolie
jim joe
1   y    0.110968
    z    0.537020
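The sort-then-slice workflow above can be wrapped in a short, repeatable check. This sketch uses fixed values for the jolie column and, as an assumption of the example, tests sortedness with the is_monotonic_increasing property rather than is_lexsorted().

```python
import pandas as pd

dfm = pd.DataFrame({'jim': [0, 0, 1, 1],
                    'joe': ['x', 'x', 'z', 'y'],
                    'jolie': [0.1, 0.2, 0.3, 0.4]}).set_index(['jim', 'joe'])

# The second level is out of order under jim == 1, so the index is not sorted.
print(dfm.index.is_monotonic_increasing)

try:
    dfm.loc[(0, 'y'):(1, 'z')]
except KeyError as exc:          # slicing past the sorted depth is rejected
    print('slice failed:', exc)

# After sorting, the same slice is well defined.
dfm_sorted = dfm.sort_index()
result = dfm_sorted.loc[(0, 'y'):(1, 'z')]
print(result)
```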

Take Methods

Similar to numpy ndarrays, pandas Index, Series, and DataFrame also provide the take method, which retrieves elements along a given axis at the given indices. The given indices must be either a list or an ndarray of integer index positions. take will also accept negative integers as relative positions to the end of the object.

In [107]: index = pd.Index(np.random.randint(0, 1000, 10))

In [108]: index
Out[108]: Int64Index([214, 502, 712, 567, 786, 175, 993, 133, 758, 329], dtype='int64')

In [109]: positions = [0, 9, 3]

In [110]: index[positions]
Out[110]: Int64Index([214, 329, 567], dtype='int64')

In [111]: index.take(positions)
Out[111]: Int64Index([214, 329, 567], dtype='int64')

In [112]: ser = pd.Series(np.random.randn(10))

In [113]: ser.iloc[positions]
Out[113]:
0   -0.179666
9    1.824375
3    0.392149
dtype: float64

In [114]: ser.take(positions)
Out[114]:
0   -0.179666
9    1.824375
3    0.392149
dtype: float64

For DataFrames, the given indices should be a 1d list or ndarray that specifies row or column positions.

In [115]: frm = pd.DataFrame(np.random.randn(5, 3))

In [116]: frm.take([1, 4, 3])
Out[116]:
          0         1         2
1 -1.237881  0.106854 -1.276829
4  0.629675 -1.425966  1.857704
3  0.979542 -1.633678  0.615855

In [117]: frm.take([0, 2], axis=1)
Out[117]:
          0         2
0  0.595974  0.601544
1 -1.237881 -1.276829
2 -0.767101  1.499591
3  0.979542  0.615855
4  0.629675  1.857704

It is important to note that the take method on pandas objects is not intended to work on boolean indices and may return unexpected results.

In [118]: arr = np.random.randn(10)

In [119]: arr.take([False, False, True, True])
Out[119]: array([-1.1935, -1.1935,  0.6775,  0.6775])

In [120]: arr[[0, 1]]
Out[120]: array([-1.1935,  0.6775])

In [121]: ser = pd.Series(np.random.randn(10))

In [122]: ser.take([False, False, True, True])
Out[122]:
0    0.233141
0    0.233141
1   -0.223540
1   -0.223540
dtype: float64

In [123]: ser.ix[[0, 1]]
Out[123]:
0    0.233141
1   -0.223540
dtype: float64

Finally, as a small note on performance, because the take method handles a narrower range of inputs, it can offer performance that is a good deal faster than fancy indexing.
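The positional behaviour of take described in this section can be summarized in one short sketch. The data are made up; string labels are used for the Series index to make clear that take works on positions, not labels.

```python
import numpy as np
import pandas as pd

ser = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0], index=list('abcde'))
positions = [0, 4, 2]

# take and iloc agree for integer positions, and both keep the requested order.
taken = ser.take(positions)
print(taken.tolist())
print(taken.equals(ser.iloc[positions]))

# Negative positions count back from the end of the object.
from_end = ser.take([-1, -2])
print(from_end.tolist())

# On a DataFrame, axis chooses between row and column positions.
frm = pd.DataFrame(np.arange(6).reshape(2, 3), columns=['p', 'q', 'r'])
cols = frm.take([0, 2], axis=1)
print(list(cols.columns))
```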

Index Types

We have discussed MultiIndex in the previous sections pretty extensively. DatetimeIndex and PeriodIndex are shown here. TimedeltaIndex is shown here.

In the following sub-sections we will highlight some other index types.

CategoricalIndex

New in version 0.16.1.

We introduce a CategoricalIndex, a new type of index object that is useful for supporting indexing with duplicates. This is a container around a Categorical (introduced in v0.15.0) and allows efficient indexing and storage of an index with a large number of duplicated elements. Prior to 0.16.1, setting the index of a DataFrame/Series with a category dtype would convert this to a regular object-based Index.

In [124]: df = pd.DataFrame({'A': np.arange(6),
   .....:                    'B': list('aabbca')})
   .....:

In [125]: df['B'] = df['B'].astype('category', categories=list('cab'))

In [126]: df
Out[126]:
   A  B
0  0  a
1  1  a
2  2  b
3  3  b
4  4  c
5  5  a

In [127]: df.dtypes
Out[127]:
A       int64
B    category
dtype: object

In [128]: df.B.cat.categories
Out[128]: Index([u'c', u'a', u'b'], dtype='object')

Setting the index will create a CategoricalIndex

In [129]: df2 = df.set_index('B')

In [130]: df2.index
Out[130]: CategoricalIndex([u'a', u'a', u'b', u'b', u'c', u'a'], categories=[u'c', u'a', u'b'], ordered=False, name=u'B', dtype='category')

Indexing with __getitem__/.iloc/.loc/.ix works similarly to an Index with duplicates. The indexers MUST be in the category or the operation will raise.

In [131]: df2.loc['a']
Out[131]:
   A
B
a  0
a  1
a  5

These PRESERVE the CategoricalIndex

In [132]: df2.loc['a'].index
Out[132]: CategoricalIndex([u'a', u'a', u'a'], categories=[u'c', u'a', u'b'], ordered=False, name=u'B', dtype='category')

Sorting will order by the order of the categories

In [133]: df2.sort_index()
Out[133]:
   A
B
c  4
a  0
a  1
a  5
b  2
b  3
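The steps so far (building the categorical column, setting it as the index, selecting by label, and sorting) can be collected into one sketch. As an assumption of this example, the category order is fixed with the pd.Categorical constructor instead of the astype('category', categories=...) call shown above.

```python
import numpy as np
import pandas as pd

# pd.Categorical fixes the category order explicitly as c, a, b.
cat = pd.Categorical(list('aabbca'), categories=list('cab'))
frame = pd.DataFrame({'A': np.arange(6), 'B': cat})

# Setting a categorical column as the index yields a CategoricalIndex.
indexed = frame.set_index('B')
print(type(indexed.index).__name__)

# Label selection returns every row carrying that label.
a_rows = indexed.loc['a', 'A']
print(a_rows.tolist())

# sort_index follows the category order (c, a, b), not alphabetical order.
sorted_labels = indexed.sort_index().index.tolist()
print(sorted_labels)
```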

Groupby operations on the index will preserve the index nature as well

In [134]: df2.groupby(level=0).sum()
Out[134]:
   A
B
c  4
a  6
b  5

In [135]: df2.groupby(level=0).sum().index
Out[135]: CategoricalIndex([u'c', u'a', u'b'], categories=[u'c', u'a', u'b'], ordered=False, name=u'B', dtype='category')

Reindexing operations will return a resulting index based on the type of the passed indexer: passing a list will return a plain-old Index, while indexing with a Categorical will return a CategoricalIndex, indexed according to the categories of the PASSED Categorical dtype. This allows one to arbitrarily index these even with values NOT in the categories, similarly to how you can reindex ANY pandas index.

In [136]: df2.reindex(['a', 'e'])
Out[136]:
     A
B
a  0.0
a  1.0
a  5.0
e  NaN

In [137]: df2.reindex(['a', 'e']).index
Out[137]: Index([u'a', u'a', u'a', u'e'], dtype='object', name=u'B')

In [138]: df2.reindex(pd.Categorical(['a', 'e'], categories=list('abcde')))
Out[138]:
     A
B
a  0.0
a  1.0
a  5.0
e  NaN

In [139]: df2.reindex(pd.Categorical(['a', 'e'], categories=list('abcde'))).index
Out[139]: CategoricalIndex([u'a', u'a', u'a', u'e'], categories=[u'a', u'b', u'c', u'd', u'e'], ordered=False, name=u'B', dtype='category')

Warning

Reshaping and comparison operations on a CategoricalIndex must have the same categories or a TypeError will be raised.

In [9]: df3 = pd.DataFrame({'A': np.arange(6), 'B': pd.Series(list('aabbca')).astype('category')})

In [10]: df3 = df3.set_index('B')

In [11]: df3.index
Out[11]: CategoricalIndex([u'a', u'a', u'b', u'b', u'c', u'a'], categories=[u'a', u'b', u'c'], ordered=False, name=u'B', dtype='category')

In [12]: pd.concat([df2, df3])
TypeError: categories must match existing categories when appending
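The session above covers the reshaping case as it behaved in this release; how concat treats mismatched categories may differ in other pandas versions. The comparison case can be sketched as follows, assuming a recent pandas release. The two indexes here are illustrative and deliberately use different sets of categories, not merely a different ordering of the same set.

```python
import pandas as pd

left = pd.CategoricalIndex(['a', 'b'], categories=['a', 'b'])
right = pd.CategoricalIndex(['a', 'b'], categories=['a', 'b', 'c'])

# The category sets differ, so an element-wise comparison is refused.
try:
    left == right
    raised = False
except TypeError:
    raised = True

# With identical categories the comparison is element-wise as usual.
same = list(left == pd.CategoricalIndex(['a', 'a'], categories=['a', 'b']))

print(raised)
print(same)
```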

Int64Index and RangeIndex

Warning

Indexing on an integer-based Index with floats has been clarified in 0.18.0. For a summary of the changes, see here.

Int64Index is a fundamental basic index in pandas. This is an immutable array implementing an ordered, sliceable set. Prior to 0.18.0, the Int64Index would provide the default index for all NDFrame objects.

RangeIndex is a sub-class of Int64Index added in version 0.18.0, now providing the default index for all NDFrame objects. RangeIndex is an optimized version of Int64Index that can represent a monotonic ordered set. These are analogous to Python range types.
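A small sketch of both points, assuming a pandas release that includes RangeIndex (0.18.0 or later). It checks only the default index type and the range-like values; the class hierarchy described above has changed in later releases, so it is not tested here.

```python
import pandas as pd

# A Series built without an explicit index gets a RangeIndex by default.
s = pd.Series([10, 20, 30])
default_index = s.index

# A RangeIndex can also be built explicitly from start/stop/step,
# mirroring Python's range(start, stop, step).
explicit = pd.RangeIndex(start=0, stop=10, step=2)
values = list(explicit)

print(type(default_index).__name__)
print(values)
```

Like range, a RangeIndex only needs its start, stop and step to describe its values, which is what makes it cheaper than materializing every integer label.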

Float64Index

Note

As of 0.14.0, Float64Index is backed by a native float64 dtype array. Prior to 0.14.0, Float64Index was backed by an object dtype array. Using a float64 dtype in the backend speeds up arithmetic operations by about 30x, and boolean indexing operations on the Float64Index itself are about 2x as fast.

New in version 0.13.0.

By default a Float64Index will be automatically created when passing floating, or mixed-integer-floating values in index creation. This enables a pure label-based slicing paradigm that makes [], ix, loc for scalar indexing and slicing work exactly the same.

In [140]: indexf = pd.Index([1.5, 2, 3, 4.5, 5])

In [141]: indexf
Out[141]: Float64Index([1.5, 2.0, 3.0, 4.5, 5.0], dtype='float64')

In [142]: sf = pd.Series(range(5), index=indexf)

In [143]: sf
Out[143]:
1.5    0
2.0    1
3.0    2
4.5    3
5.0    4
dtype: int64

Scalar selection for [], .ix, .loc will always be label based. An integer will match an equal float index (e.g. 3 is equivalent to 3.0).

In [144]: sf[3]
Out[144]: 2

In [145]: sf[3.0]
Out[145]: 2

In [146]: sf.ix[3]
Out[146]: 2

In [147]: sf.ix[3.0]
Out[147]: 2

In [148]: sf.loc[3]
Out[148]: 2

In [149]: sf.loc[3.0]
Out[149]: 2

The only positional indexing is via iloc.

In [150]: sf.iloc[3]
Out[150]: 3

A scalar index that is not found will raise KeyError.
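For instance, a minimal sketch using .loc, assuming a recent pandas release (the label 3.5 is an arbitrary value that is absent from the index):

```python
import pandas as pd

sf = pd.Series(range(5), index=[1.5, 2, 3, 4.5, 5])

# The integer label 3 matches the float label 3.0.
found = sf.loc[3]

# 3.5 is not one of the labels, so the lookup raises.
try:
    sf.loc[3.5]
    raised = False
except KeyError:
    raised = True

print(found)
print(raised)
```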

Slicing is ALWAYS on the values of the index for [], ix, loc, and ALWAYS positional with iloc.

In [151]: sf[2:4]
Out[151]:
2.0    1
3.0    2
dtype: int64

In [152]: sf.ix[2:4]
Out[152]:
2.0    1
3.0    2
dtype: int64

In [153]: sf.loc[2:4]
Out[153]:
2.0    1
3.0    2
dtype: int64

In [154]: sf.iloc[2:4]
Out[154]:
3.0    2
4.5    3
dtype: int64

In float indexes, slicing using floats is allowed.

In [155]: sf[2.1:4.6]
Out[155]:
3.0    2
4.5    3
dtype: int64

In [156]: sf.loc[2.1:4.6]
Out[156]:
3.0    2
4.5    3
dtype: int64

In non-float indexes, indexing or slicing using floats will raise a TypeError.

In [1]: pd.Series(range(5))[3.5]
TypeError: the label [3.5] is not a proper indexer for this index type (Int64Index)

In [1]: pd.Series(range(5))[3.5:4.5]
TypeError: the slice start [3.5] is not a proper indexer for this index type (Int64Index)

Warning

Using a scalar float indexer for .iloc has been removed in 0.18.0, so the following will raise a TypeError.

In [3]: pd.Series(range(5)).iloc[3.0]
TypeError: cannot do positional indexing on <class 'pandas.indexes.range.RangeIndex'> with these indexers [3.0] of <type 'float'>

Further, the treatment of .ix with a float indexer on a non-float index will be label based, and thus coerce the index.

In [157]: s2 = pd.Series([1, 2, 3], index=list('abc'))

In [158]: s2
Out[158]:
a    1
b    2
c    3
dtype: int64

In [159]: s2.ix[1.0] = 10

In [160]: s2
Out[160]:
a       1
b       2
c       3
1.0    10
dtype: int64

Here is a typical use-case for using this type of indexing. Imagine that you have a somewhat irregular timedelta-like indexing scheme, but the data is recorded as floats. This could for example be millisecond offsets.

In [161]: dfir = pd.concat([pd.DataFrame(np.random.randn(5, 2),
   .....:                                index=np.arange(5) * 250.0,
   .....:                                columns=list('AB')),
   .....:                   pd.DataFrame(np.random.randn(6, 2),
   .....:                                index=np.arange(4, 10) * 250.1,
   .....:                                columns=list('AB'))])
   .....:

In [162]: dfir
Out[162]:
               A         B
0.0     0.997289 -1.693316
250.0  -0.179129 -1.598062
500.0   0.936914  0.912560
750.0  -1.003401  1.632781
1000.0 -0.724626  0.178219
1000.4  0.310610 -0.108002
1250.5 -0.974226 -1.147708
1500.6 -2.281374  0.760010
1750.7 -0.742532  1.533318
2000.8  2.495362 -0.432771
2250.9 -0.068954  0.043520

Selection operations then will always work on a value basis, for all selection operators.

In [163]: dfir[0:1000.4]
Out[163]:
               A         B
0.0     0.997289 -1.693316
250.0  -0.179129 -1.598062
500.0   0.936914  0.912560
750.0  -1.003401  1.632781
1000.0 -0.724626  0.178219
1000.4  0.310610 -0.108002

In [164]: dfir.loc[0:1001, 'A']
Out[164]:
0.0       0.997289
250.0    -0.179129
500.0     0.936914
750.0    -1.003401
1000.0   -0.724626
1000.4    0.310610
Name: A, dtype: float64

In [165]: dfir.loc[1000.4]
Out[165]:
A    0.310610
B   -0.108002
Name: 1000.4, dtype: float64

You could then easily pick out the first 1 second (1000 ms) of data.

In [166]: dfir[0:1000]
Out[166]:
               A         B
0.0     0.997289 -1.693316
250.0  -0.179129 -1.598062
500.0   0.936914  0.912560
750.0  -1.003401  1.632781
1000.0 -0.724626  0.178219

Of course, if you need integer-based selection, then use iloc.

In [167]: dfir.iloc[0:5]
Out[167]:
               A         B
0.0     0.997289 -1.693316
250.0  -0.179129 -1.598062
500.0   0.936914  0.912560
750.0  -1.003401  1.632781
1000.0 -0.724626  0.178219
