
MultiIndex / advanced indexing#

This section covers indexing with a MultiIndex and other advanced indexing features.

See Indexing and Selecting Data for general indexing documentation.

Warning

Whether a copy or a reference is returned for a setting operation may depend on the context. This is sometimes called chained assignment and should be avoided. See Returning a View versus Copy.

See the cookbook for some advanced strategies.

Hierarchical indexing (MultiIndex)#

Hierarchical / multi-level indexing opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you to store and manipulate data with an arbitrary number of dimensions in lower dimensional data structures like Series (1d) and DataFrame (2d).

In this section, we will show what exactly we mean by “hierarchical” indexing and how it integrates with all of the pandas indexing functionality described above and in prior sections. Later, when discussing group by and pivoting and reshaping data, we’ll show non-trivial applications to illustrate how it aids in structuring data for analysis.

See the cookbook for some advanced strategies.

Creating a MultiIndex (hierarchical index) object#

The MultiIndex object is the hierarchical analogue of the standard Index object which typically stores the axis labels in pandas objects. You can think of MultiIndex as an array of tuples where each tuple is unique. A MultiIndex can be created from a list of arrays (using MultiIndex.from_arrays()), an array of tuples (using MultiIndex.from_tuples()), a crossed set of iterables (using MultiIndex.from_product()), or a DataFrame (using MultiIndex.from_frame()). The Index constructor will attempt to return a MultiIndex when it is passed a list of tuples. The following examples demonstrate different ways to initialize MultiIndexes.

# The examples in this section assume: import numpy as np; import pandas as pd
In [1]: arrays = [
   ...:     ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
   ...:     ["one", "two", "one", "two", "one", "two", "one", "two"],
   ...: ]
   ...:

In [2]: tuples = list(zip(*arrays))

In [3]: tuples
Out[3]:
[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]

In [4]: index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])

In [5]: index
Out[5]:
MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])

In [6]: s = pd.Series(np.random.randn(8), index=index)

In [7]: s
Out[7]:
first  second
bar    one       0.469112
       two      -0.282863
baz    one      -1.509059
       two      -1.135632
foo    one       1.212112
       two      -0.173215
qux    one       0.119209
       two      -1.044236
dtype: float64

When you want every pairing of the elements in two iterables, it can be easier to use the MultiIndex.from_product() method:

In [8]: iterables = [["bar", "baz", "foo", "qux"], ["one", "two"]]

In [9]: pd.MultiIndex.from_product(iterables, names=["first", "second"])
Out[9]:
MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])

You can also construct a MultiIndex from a DataFrame directly, using the method MultiIndex.from_frame(). This is a complementary method to MultiIndex.to_frame().

In [10]: df = pd.DataFrame(
   ....:     [["bar", "one"], ["bar", "two"], ["foo", "one"], ["foo", "two"]],
   ....:     columns=["first", "second"],
   ....: )
   ....:

In [11]: pd.MultiIndex.from_frame(df)
Out[11]:
MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('foo', 'one'),
            ('foo', 'two')],
           names=['first', 'second'])
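To make the relationship between from_frame() and to_frame() concrete, here is a minimal round-trip sketch. The variable names (`frame`, `mi`, `round_trip`) are illustrative only, and `index=False` is passed so that the result gets a default integer index rather than reusing the MultiIndex as its row index.

```python
import pandas as pd

frame = pd.DataFrame(
    [["bar", "one"], ["bar", "two"], ["foo", "one"], ["foo", "two"]],
    columns=["first", "second"],
)

# Column names of the frame become the level names of the MultiIndex.
mi = pd.MultiIndex.from_frame(frame)

# Convert back to a DataFrame; index=False requests a default RangeIndex
# instead of using the MultiIndex itself as the row index.
round_trip = mi.to_frame(index=False)

print(list(mi.names))
print(round_trip.equals(frame))
```

With `index=False`, the round trip should reproduce the original frame, which is a quick way to check that no level was lost.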

As a convenience, you can pass a list of arrays directly into Series or DataFrame to construct a MultiIndex automatically:

In [12]: arrays = [
   ....:     np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),
   ....:     np.array(["one", "two", "one", "two", "one", "two", "one", "two"]),
   ....: ]
   ....:

In [13]: s = pd.Series(np.random.randn(8), index=arrays)

In [14]: s
Out[14]:
bar  one   -0.861849
     two   -2.104569
baz  one   -0.494929
     two    1.071804
foo  one    0.721555
     two   -0.706771
qux  one   -1.039575
     two    0.271860
dtype: float64

In [15]: df = pd.DataFrame(np.random.randn(8, 4), index=arrays)

In [16]: df
Out[16]:
                0         1         2         3
bar one -0.424972  0.567020  0.276232 -1.087401
    two -0.673690  0.113648 -1.478427  0.524988
baz one  0.404705  0.577046 -1.715002 -1.039268
    two -0.370647 -1.157892 -1.344312  0.844885
foo one  1.075770 -0.109050  1.643563 -1.469388
    two  0.357021 -0.674600 -1.776904 -0.968914
qux one -1.294524  0.413738  0.276662 -0.472035
    two -0.013960 -0.362543 -0.006154 -0.923061

All of the MultiIndex constructors accept a names argument which stores string names for the levels themselves. If no names are provided, None will be assigned:

In [17]: df.index.names
Out[17]: FrozenList([None, None])

This index can back any axis of a pandas object, and the number of levels of the index is up to you:

In [18]: df = pd.DataFrame(np.random.randn(3, 8), index=["A", "B", "C"], columns=index)

In [19]: df
Out[19]:
first        bar                 baz  ...       foo       qux
second       one       two       one  ...       two       one       two
A       0.895717  0.805244 -1.206412  ...  1.340309 -1.170299 -0.226169
B       0.410835  0.813850  0.132003  ... -1.187678  1.130127 -1.436737
C      -1.413681  1.607920  1.024180  ... -2.211372  0.974466 -2.006747

[3 rows x 8 columns]

In [20]: pd.DataFrame(np.random.randn(6, 6), index=index[:6], columns=index[:6])
Out[20]:
first              bar                 baz                 foo
second             one       two       one       two       one       two
first second
bar   one    -0.410001 -0.078638  0.545952 -1.219217 -1.226825  0.769804
      two    -1.281247 -0.727707 -0.121306 -0.097883  0.695775  0.341734
baz   one     0.959726 -1.110336 -0.619976  0.149748 -0.732339  0.687738
      two     0.176444  0.403310 -0.154951  0.301624 -2.179861 -1.369849
foo   one    -0.954208  1.462696 -1.743161 -0.826591 -0.345352  1.314232
      two     0.690579  0.995761  2.396780  0.014871  3.357427 -0.317441

We’ve “sparsified” the higher levels of the indexes to make the console output a bit easier on the eyes. How the index is displayed can be controlled using the display.multi_sparse option, set through pandas.set_option() or temporarily through pandas.option_context():

In [21]: with pd.option_context("display.multi_sparse", False):
   ....:     df
   ....:
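Because a bare expression inside a `with` block is not echoed, the effect of the option is easier to see by capturing the text representation explicitly. The following is a small sketch under that assumption; the names `ser`, `sparse_text`, and `dense_text` are illustrative.

```python
import pandas as pd

index = pd.MultiIndex.from_product([["bar"], ["one", "two"]])
ser = pd.Series([1, 2], index=index)

# Default display: repeated outer labels are blanked out ("sparsified").
sparse_text = repr(ser)

# With multi_sparse disabled, every row shows all of its level labels.
with pd.option_context("display.multi_sparse", False):
    dense_text = repr(ser)

print(sparse_text)
print(dense_text)
```

The option only affects display; the underlying index is identical in both cases.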

It’s worth keeping in mind that there’s nothing preventing you from using tuples as atomic labels on an axis:

In [22]: pd.Series(np.random.randn(8), index=tuples)
Out[22]:
(bar, one)   -1.236269
(bar, two)    0.896171
(baz, one)   -0.487602
(baz, two)   -0.082240
(foo, one)   -2.182937
(foo, two)    0.380396
(qux, one)    0.084844
(qux, two)    0.432390
dtype: float64

The reason that the MultiIndex matters is that it can allow you to do grouping, selection, and reshaping operations as we will describe below and in subsequent areas of the documentation. As you will see in later sections, you can find yourself working with hierarchically-indexed data without creating a MultiIndex explicitly yourself. However, when loading data from a file, you may wish to generate your own MultiIndex when preparing the data set.
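As a sketch of the file-loading case, the example below reads a small in-memory CSV (standing in for a real file) and builds the MultiIndex either at read time or afterwards. The CSV contents and column names are made up for illustration.

```python
import io

import pandas as pd

# Stand-in for a file on disk; the columns are hypothetical.
csv_text = """first,second,value
bar,one,1.0
bar,two,2.0
baz,one,3.0
"""

# Option 1: ask read_csv to build the MultiIndex from two columns.
loaded = pd.read_csv(io.StringIO(csv_text), index_col=["first", "second"])

# Option 2: load flat, then promote the columns to index levels.
flat = pd.read_csv(io.StringIO(csv_text))
indexed = flat.set_index(["first", "second"])

print(loaded.index.names)
print(loaded.loc[("bar", "two"), "value"])
```

Both routes should give the same hierarchically indexed frame; set_index() is convenient when the key columns need cleaning before they become index levels.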

Reconstructing the level labels#

The method get_level_values() will return a vector of the labels for each location at a particular level:

In [23]: index.get_level_values(0)
Out[23]: Index(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], dtype='object', name='first')

In [24]: index.get_level_values("second")
Out[24]: Index(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'], dtype='object', name='second')

Basic indexing on axis with MultiIndex#

One of the important features of hierarchical indexing is that you can select data by a “partial” label identifying a subgroup in the data. Partial selection “drops” levels of the hierarchical index in the result in a completely analogous way to selecting a column in a regular DataFrame:

In [25]: df["bar"]
Out[25]:
second       one       two
A       0.895717  0.805244
B       0.410835  0.813850
C      -1.413681  1.607920

In [26]: df["bar", "one"]
Out[26]:
A    0.895717
B    0.410835
C   -1.413681
Name: (bar, one), dtype: float64

In [27]: df["bar"]["one"]
Out[27]:
A    0.895717
B    0.410835
C   -1.413681
Name: one, dtype: float64

In [28]: s["qux"]
Out[28]:
one   -1.039575
two    0.271860
dtype: float64

See Cross-section with hierarchical index for how to select on a deeper level.

Defined levels#

The MultiIndex keeps all the defined levels of an index, even if they are not actually used. When slicing an index, you may notice this. For example:

In [29]: df.columns.levels  # original MultiIndex
Out[29]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])

In [30]: df[["foo", "qux"]].columns.levels  # sliced
Out[30]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])

This is done to avoid a recomputation of the levels in order to make slicing highly performant. If you want to see only the used levels, you can use the get_level_values() method.

In [31]: df[["foo", "qux"]].columns.to_numpy()
Out[31]:
array([('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')],
      dtype=object)

# for a specific level
In [32]: df[["foo", "qux"]].columns.get_level_values(0)
Out[32]: Index(['foo', 'foo', 'qux', 'qux'], dtype='object', name='first')

To reconstruct the MultiIndex with only the used levels, the remove_unused_levels() method may be used.

In [33]: new_mi = df[["foo", "qux"]].columns.remove_unused_levels()

In [34]: new_mi.levels
Out[34]: FrozenList([['foo', 'qux'], ['one', 'two']])
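The following self-contained sketch repeats the idea with fixed data, comparing the stored levels before and after trimming. The names `frame`, `sliced`, and `trimmed` are illustrative.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product(
    [["bar", "baz", "foo", "qux"], ["one", "two"]], names=["first", "second"]
)
frame = pd.DataFrame(np.zeros((3, 8)), index=["A", "B", "C"], columns=columns)

# Selecting a subset of columns keeps the original level definitions around.
sliced = frame[["foo", "qux"]].columns
stale = list(sliced.levels[0])

# remove_unused_levels() rebuilds the levels from the labels actually present.
trimmed = sliced.remove_unused_levels()
fresh = list(trimmed.levels[0])

print(stale)
print(fresh)
```

The trimmed index still compares equal to the sliced one label for label; only the internal level storage differs.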

Data alignment and using reindex#

Operations between differently-indexed objects having MultiIndex on the axes will work as you expect; data alignment will work the same as an Index of tuples:

In [35]: s + s[:-2]
Out[35]:
bar  one   -1.723698
     two   -4.209138
baz  one   -0.989859
     two    2.143608
foo  one    1.443110
     two   -1.413542
qux  one         NaN
     two         NaN
dtype: float64

In [36]: s + s[::2]
Out[36]:
bar  one   -1.723698
     two         NaN
baz  one   -0.989859
     two         NaN
foo  one    1.443110
     two         NaN
qux  one   -2.079150
     two         NaN
dtype: float64

The reindex() method of Series/DataFrames can be called with another MultiIndex, or even a list or array of tuples:

In [37]: s.reindex(index[:3])
Out[37]:
first  second
bar    one      -0.861849
       two      -2.104569
baz    one      -0.494929
dtype: float64

In [38]: s.reindex([("foo", "two"), ("bar", "one"), ("qux", "one"), ("baz", "one")])
Out[38]:
foo  two   -0.706771
bar  one   -0.861849
qux  one   -1.039575
baz  one   -0.494929
dtype: float64

Advanced indexing with hierarchical index#

Syntactically integrating MultiIndex in advanced indexing with .loc is a bit challenging, but we’ve made every effort to do so. In general, MultiIndex keys take the form of tuples. For example, the following works as you would expect:

In [39]: df = df.T

In [40]: df
Out[40]:
                     A         B         C
first second
bar   one     0.895717  0.410835 -1.413681
      two     0.805244  0.813850  1.607920
baz   one    -1.206412  0.132003  1.024180
      two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466
      two    -0.226169 -1.436737 -2.006747

In [41]: df.loc[("bar", "two")]
Out[41]:
A    0.805244
B    0.813850
C    1.607920
Name: (bar, two), dtype: float64

Note that df.loc['bar', 'two'] would also work in this example, but this shorthand notation can lead to ambiguity in general.

If you also want to index a specific column with .loc, you must use a tuple like this:

In [42]: df.loc[("bar", "two"), "A"]
Out[42]: 0.8052440253863785

You don’t have to specify all levels of the MultiIndex: you can pass only the first elements of the tuple. For example, you can use “partial” indexing to get all elements with bar in the first level as follows:

In [43]: df.loc["bar"]
Out[43]:
               A         B         C
second
one     0.895717  0.410835 -1.413681
two     0.805244  0.813850  1.607920

This is a shortcut for the slightly more verbose notation df.loc[('bar',),] (equivalent to df.loc['bar',] in this example).

“Partial” slicing also works quite nicely.

In [44]: df.loc["baz":"foo"]
Out[44]:
                     A         B         C
first second
baz   one    -1.206412  0.132003  1.024180
      two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372

You can slice with a ‘range’ of values, by providing a slice of tuples.

In [45]: df.loc[("baz", "two"):("qux", "one")]
Out[45]:
                     A         B         C
first second
baz   two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466

In [46]: df.loc[("baz", "two"):"foo"]
Out[46]:
                     A         B         C
first second
baz   two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372

Passing a list of labels or tuples works similarly to reindexing:

In [47]: df.loc[[("bar", "two"), ("qux", "one")]]
Out[47]:
                     A         B         C
first second
bar   two     0.805244  0.813850  1.607920
qux   one    -1.170299  1.130127  0.974466

Note

It is important to note that tuples and lists are not treated identically in pandas when it comes to indexing. Whereas a tuple is interpreted as one multi-level key, a list is used to specify several keys. Or in other words, tuples go horizontally (traversing levels), lists go vertically (scanning levels).

Importantly, a list of tuples indexes several complete MultiIndex keys, whereas a tuple of lists refers to several values within a level:

In [48]: s = pd.Series(
   ....:     [1, 2, 3, 4, 5, 6],
   ....:     index=pd.MultiIndex.from_product([["A", "B"], ["c", "d", "e"]]),
   ....: )
   ....:

In [49]: s.loc[[("A", "c"), ("B", "d")]]  # list of tuples
Out[49]:
A  c    1
B  d    5
dtype: int64

In [50]: s.loc[(["A", "B"], ["c", "d"])]  # tuple of lists
Out[50]:
A  c    1
   d    2
B  c    4
   d    5
dtype: int64

Using slicers#

You can slice a MultiIndex by providing multiple indexers.

You can provide any of the selectors as if you are indexing by label, see Selection by Label, including slices, lists of labels, labels, and boolean indexers.

You can use slice(None) to select all the contents of that level. You do not need to specify all the deeper levels; they will be implied as slice(None).

As usual, both sides of the slicers are included as this is label indexing.
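As a small illustration of the inclusive endpoints, the sketch below slices the first level from "A1" through "A2" and keeps everything on the second level. The frame and its labels are made up for the example.

```python
import pandas as pd

mi = pd.MultiIndex.from_product([["A0", "A1", "A2", "A3"], ["B0", "B1"]])
frame = pd.DataFrame({"value": range(len(mi))}, index=mi)

# Label slices include both endpoints, so "A2" rows are part of the result.
# slice(None) selects every label on the second level.
picked = frame.loc[(slice("A1", "A2"), slice(None)), :]

print(picked)
```

Note that the column indexer `:` is spelled out so that the tuple is unambiguously applied to the row index.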

Warning

You should specify all axes in the .loc specifier, meaning the indexer for the index and for the columns. There are some ambiguous cases where the passed indexer could be misinterpreted as indexing both axes, rather than into say the MultiIndex for the rows.

You should do this:

df.loc[(slice("A1", "A3"), ...), :]  # noqa: E999

You should not do this:

df.loc[(slice("A1", "A3"), ...)]  # noqa: E999

In [51]: def mklbl(prefix, n):
   ....:     return ["%s%s" % (prefix, i) for i in range(n)]
   ....:

In [52]: miindex = pd.MultiIndex.from_product(
   ....:     [mklbl("A", 4), mklbl("B", 2), mklbl("C", 4), mklbl("D", 2)]
   ....: )
   ....:

In [53]: micolumns = pd.MultiIndex.from_tuples(
   ....:     [("a", "foo"), ("a", "bar"), ("b", "foo"), ("b", "bah")], names=["lvl0", "lvl1"]
   ....: )
   ....:

In [54]: dfmi = (
   ....:     pd.DataFrame(
   ....:         np.arange(len(miindex) * len(micolumns)).reshape(
   ....:             (len(miindex), len(micolumns))
   ....:         ),
   ....:         index=miindex,
   ....:         columns=micolumns,
   ....:     )
   ....:     .sort_index()
   ....:     .sort_index(axis=1)
   ....: )
   ....:

In [55]: dfmi
Out[55]:
lvl0           a         b
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0    9    8   11   10
         D1   13   12   15   14
      C2 D0   17   16   19   18
...          ...  ...  ...  ...
A3 B1 C1 D1  237  236  239  238
      C2 D0  241  240  243  242
         D1  245  244  247  246
      C3 D0  249  248  251  250
         D1  253  252  255  254

[64 rows x 4 columns]

Basic MultiIndex slicing using slices, lists, and labels.

In [56]: dfmi.loc[(slice("A1", "A3"), slice(None), ["C1", "C3"]), :]
Out[56]:
lvl0           a         b
lvl1         bar  foo  bah  foo
A1 B0 C1 D0   73   72   75   74
         D1   77   76   79   78
      C3 D0   89   88   91   90
         D1   93   92   95   94
   B1 C1 D0  105  104  107  106
...          ...  ...  ...  ...
A3 B0 C3 D1  221  220  223  222
   B1 C1 D0  233  232  235  234
         D1  237  236  239  238
      C3 D0  249  248  251  250
         D1  253  252  255  254

[24 rows x 4 columns]

You can use pandas.IndexSlice to facilitate a more natural syntax using :, rather than using slice(None).

In [57]: idx = pd.IndexSlice

In [58]: dfmi.loc[idx[:, :, ["C1", "C3"]], idx[:, "foo"]]
Out[58]:
lvl0           a    b
lvl1         foo  foo
A0 B0 C1 D0    8   10
         D1   12   14
      C3 D0   24   26
         D1   28   30
   B1 C1 D0   40   42
...          ...  ...
A3 B0 C3 D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

[32 rows x 2 columns]

It is possible to perform quite complicated selections using this method on multiple axes at the same time.

In [59]: dfmi.loc["A1", (slice(None), "foo")]
Out[59]:
lvl0        a    b
lvl1      foo  foo
B0 C0 D0   64   66
      D1   68   70
   C1 D0   72   74
      D1   76   78
   C2 D0   80   82
...       ...  ...
B1 C1 D1  108  110
   C2 D0  112  114
      D1  116  118
   C3 D0  120  122
      D1  124  126

[16 rows x 2 columns]

In [60]: dfmi.loc[idx[:, :, ["C1", "C3"]], idx[:, "foo"]]
Out[60]:
lvl0           a    b
lvl1         foo  foo
A0 B0 C1 D0    8   10
         D1   12   14
      C3 D0   24   26
         D1   28   30
   B1 C1 D0   40   42
...          ...  ...
A3 B0 C3 D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

[32 rows x 2 columns]

Using a boolean indexer you can provide selection related to the values.

In [61]: mask = dfmi[("a", "foo")] > 200

In [62]: dfmi.loc[idx[mask, :, ["C1", "C3"]], idx[:, "foo"]]
Out[62]:
lvl0           a    b
lvl1         foo  foo
A3 B0 C1 D1  204  206
      C3 D0  216  218
         D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

You can also specify the axis argument to .loc to interpret the passed slicers on a single axis.

In [63]: dfmi.loc(axis=0)[:, :, ["C1", "C3"]]
Out[63]:
lvl0           a         b
lvl1         bar  foo  bah  foo
A0 B0 C1 D0    9    8   11   10
         D1   13   12   15   14
      C3 D0   25   24   27   26
         D1   29   28   31   30
   B1 C1 D0   41   40   43   42
...          ...  ...  ...  ...
A3 B0 C3 D1  221  220  223  222
   B1 C1 D0  233  232  235  234
         D1  237  236  239  238
      C3 D0  249  248  251  250
         D1  253  252  255  254

[32 rows x 4 columns]

Furthermore, you can set the values using the following methods.

In [64]: df2 = dfmi.copy()

In [65]: df2.loc(axis=0)[:, :, ["C1", "C3"]] = -10

In [66]: df2
Out[66]:
lvl0           a         b
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0  -10  -10  -10  -10
         D1  -10  -10  -10  -10
      C2 D0   17   16   19   18
...          ...  ...  ...  ...
A3 B1 C1 D1  -10  -10  -10  -10
      C2 D0  241  240  243  242
         D1  245  244  247  246
      C3 D0  -10  -10  -10  -10
         D1  -10  -10  -10  -10

[64 rows x 4 columns]

You can use a right-hand-side of an alignable object as well.

In [67]: df2 = dfmi.copy()

In [68]: df2.loc[idx[:, :, ["C1", "C3"]], :] = df2 * 1000

In [69]: df2
Out[69]:
lvl0              a               b
lvl1            bar     foo     bah     foo
A0 B0 C0 D0       1       0       3       2
         D1       5       4       7       6
      C1 D0    9000    8000   11000   10000
         D1   13000   12000   15000   14000
      C2 D0      17      16      19      18
...             ...     ...     ...     ...
A3 B1 C1 D1  237000  236000  239000  238000
      C2 D0     241     240     243     242
         D1     245     244     247     246
      C3 D0  249000  248000  251000  250000
         D1  253000  252000  255000  254000

[64 rows x 4 columns]

Cross-section#

The xs() method of DataFrame additionally takes a level argument to make selecting data at a particular level of a MultiIndex easier.

In [70]: df
Out[70]:
                     A         B         C
first second
bar   one     0.895717  0.410835 -1.413681
      two     0.805244  0.813850  1.607920
baz   one    -1.206412  0.132003  1.024180
      two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466
      two    -0.226169 -1.436737 -2.006747

In [71]: df.xs("one", level="second")
Out[71]:
              A         B         C
first
bar    0.895717  0.410835 -1.413681
baz   -1.206412  0.132003  1.024180
foo    1.431256 -0.076467  0.875906
qux   -1.170299  1.130127  0.974466

# using the slicers
In [72]: df.loc[(slice(None), "one"), :]
Out[72]:
                     A         B         C
first second
bar   one     0.895717  0.410835 -1.413681
baz   one    -1.206412  0.132003  1.024180
foo   one     1.431256 -0.076467  0.875906
qux   one    -1.170299  1.130127  0.974466

You can also select on the columns with xs, by providing the axis argument.

In [73]: df = df.T

In [74]: df.xs("one", level="second", axis=1)
Out[74]:
first       bar       baz       foo       qux
A      0.895717 -1.206412  1.431256 -1.170299
B      0.410835  0.132003 -0.076467  1.130127
C     -1.413681  1.024180  0.875906  0.974466

# using the slicers
In [75]: df.loc[:, (slice(None), "one")]
Out[75]:
first        bar       baz       foo       qux
second       one       one       one       one
A       0.895717 -1.206412  1.431256 -1.170299
B       0.410835  0.132003 -0.076467  1.130127
C      -1.413681  1.024180  0.875906  0.974466

xs also allows selection with multiple keys.

In [76]: df.xs(("one", "bar"), level=("second", "first"), axis=1)
Out[76]:
first        bar
second       one
A       0.895717
B       0.410835
C      -1.413681

# using the slicers
In [77]: df.loc[:, ("bar", "one")]
Out[77]:
A    0.895717
B    0.410835
C   -1.413681
Name: (bar, one), dtype: float64

You can pass drop_level=False to xs to retain the level that was selected.

In [78]: df.xs("one", level="second", axis=1, drop_level=False)
Out[78]:
first        bar       baz       foo       qux
second       one       one       one       one
A       0.895717 -1.206412  1.431256 -1.170299
B       0.410835  0.132003 -0.076467  1.130127
C      -1.413681  1.024180  0.875906  0.974466

Compare the above with the result using drop_level=True (the default value).

In [79]: df.xs("one", level="second", axis=1, drop_level=True)
Out[79]:
first       bar       baz       foo       qux
A      0.895717 -1.206412  1.431256 -1.170299
B      0.410835  0.132003 -0.076467  1.130127
C     -1.413681  1.024180  0.875906  0.974466

Advanced reindexing and alignment#

Using the parameter level in the reindex() and align() methods of pandas objects is useful to broadcast values across a level. For instance:

In [80]: midx = pd.MultiIndex(
   ....:     levels=[["zero", "one"], ["x", "y"]], codes=[[1, 1, 0, 0], [1, 0, 1, 0]]
   ....: )
   ....:

In [81]: df = pd.DataFrame(np.random.randn(4, 2), index=midx)

In [82]: df
Out[82]:
               0         1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520

In [83]: df2 = df.groupby(level=0).mean()

In [84]: df2
Out[84]:
             0         1
one   1.060074 -0.109716
zero  1.271532  0.713416

In [85]: df2.reindex(df.index, level=0)
Out[85]:
               0         1
one  y  1.060074 -0.109716
     x  1.060074 -0.109716
zero y  1.271532  0.713416
     x  1.271532  0.713416

# aligning
In [86]: df_aligned, df2_aligned = df.align(df2, level=0)

In [87]: df_aligned
Out[87]:
               0         1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520

In [88]: df2_aligned
Out[88]:
               0         1
one  y  1.060074 -0.109716
     x  1.060074 -0.109716
zero y  1.271532  0.713416
     x  1.271532  0.713416
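One common use of this broadcasting is subtracting a per-group statistic from each row. The sketch below uses fixed numbers so the arithmetic is easy to follow; the frame and the `value` column are hypothetical.

```python
import pandas as pd

midx = pd.MultiIndex.from_product([["one", "zero"], ["x", "y"]])
frame = pd.DataFrame({"value": [1.0, 3.0, 10.0, 20.0]}, index=midx)

# One row per outer label: mean of "one" is 2.0, mean of "zero" is 15.0.
group_means = frame.groupby(level=0).mean()

# Broadcast each group's mean back onto every row of that group.
broadcast = group_means.reindex(frame.index, level=0)

# Row-wise deviation from the group's mean.
demeaned = frame - broadcast

print(demeaned)
```

Because `broadcast` carries the same MultiIndex as `frame`, the subtraction aligns row by row without any further bookkeeping.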

Swapping levels with swaplevel#

The swaplevel() method can switch the order of two levels:

In [89]: df[:5]
Out[89]:
               0         1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520

In [90]: df[:5].swaplevel(0, 1, axis=0)
Out[90]:
               0         1
y one   1.519970 -0.493662
x one   0.600178  0.274230
y zero  0.132885 -0.023688
x zero  2.410179  1.450520

Reordering levels with reorder_levels#

The reorder_levels() method generalizes the swaplevel method, allowing you to permute the hierarchical index levels in one step:

In [91]: df[:5].reorder_levels([1, 0], axis=0)
Out[91]:
               0         1
y one   1.519970 -0.493662
x one   0.600178  0.274230
y zero  0.132885 -0.023688
x zero  2.410179  1.450520
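With only two levels, reorder_levels and swaplevel look identical. A three-level sketch shows the difference: a rotation of all levels in one call, here addressed by level name. The level names and data are made up for illustration.

```python
import pandas as pd

mi = pd.MultiIndex.from_tuples(
    [("a", 1, "x"), ("b", 2, "y")], names=["outer", "middle", "inner"]
)
ser = pd.Series([10, 20], index=mi)

# Permute all three levels at once; levels may be given by name or position.
rotated = ser.reorder_levels(["inner", "outer", "middle"])

print(rotated)
```

Only the order of the levels changes; the rows stay in their original order and are not re-sorted.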

Renaming names of an Index or MultiIndex#

The rename() method is used to rename the labels of a MultiIndex, and is typically used to rename the columns of a DataFrame. The columns argument of rename allows a dictionary to be specified that includes only the columns you wish to rename.

In [92]: df.rename(columns={0: "col0", 1: "col1"})
Out[92]:
            col0      col1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520

This method can also be used to rename specific labels of the main index of the DataFrame.

In [93]: df.rename(index={"one": "two", "y": "z"})
Out[93]:
               0         1
two  z  1.519970 -0.493662
     x  0.600178  0.274230
zero z  0.132885 -0.023688
     x  2.410179  1.450520

The rename_axis() method is used to rename the name of an Index or MultiIndex. In particular, the names of the levels of a MultiIndex can be specified, which is useful if reset_index() is later used to move the values from the MultiIndex to a column.

In [94]: df.rename_axis(index=["abc", "def"])
Out[94]:
                 0         1
abc  def
one  y    1.519970 -0.493662
     x    0.600178  0.274230
zero y    0.132885 -0.023688
     x    2.410179  1.450520

Note that the columns of a DataFrame are an index, so that using rename_axis with the columns argument will change the name of that index.

In [95]: df.rename_axis(columns="Cols").columns
Out[95]: RangeIndex(start=0, stop=2, step=1, name='Cols')

Both rename and rename_axis support specifying a dictionary, Series or a mapping function to map labels/names to new values.
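A brief sketch of the mapping-function form follows. It assumes all labels and level names are strings, since `str.upper` would fail on non-string values; the frame itself is illustrative.

```python
import pandas as pd

midx = pd.MultiIndex.from_product(
    [["one", "zero"], ["x", "y"]], names=["abc", "def"]
)
frame = pd.DataFrame({"value": range(4)}, index=midx)

# rename applies the function to the labels of the row index.
relabeled = frame.rename(index=str.upper)

# rename_axis applies the function to the level names instead.
renamed_levels = frame.rename_axis(index=str.upper)

print(relabeled.index.tolist())
print(list(renamed_levels.index.names))
```

The distinction to keep in mind is that rename changes the labels stored in the index, while rename_axis changes only the names attached to its levels.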

When working with an Index object directly, rather than via a DataFrame, Index.set_names() can be used to change the names.

In [96]: mi = pd.MultiIndex.from_product([[1, 2], ["a", "b"]], names=["x", "y"])

In [97]: mi.names
Out[97]: FrozenList(['x', 'y'])

In [98]: mi2 = mi.rename("new name", level=0)

In [99]: mi2
Out[99]:
MultiIndex([(1, 'a'),
            (1, 'b'),
            (2, 'a'),
            (2, 'b')],
           names=['new name', 'y'])

You cannot set the names of the MultiIndex via a level.

In [100]: mi.levels[0].name = "name via level"
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[100], line 1
----> 1 mi.levels[0].name = "name via level"

File ~/work/pandas/pandas/pandas/core/indexes/base.py:1690, in Index.name(self, value)
   1686 @name.setter
   1687 def name(self, value: Hashable) -> None:
   1688     if self._no_setting_name:
   1689         # Used in MultiIndex.levels to avoid silently ignoring name updates.
-> 1690         raise RuntimeError(
   1691             "Cannot set name on a level of a MultiIndex. Use "
   1692             "'MultiIndex.set_names' instead."
   1693         )
   1694     maybe_extract_name(value, None, type(self))
   1695     self._name = value

RuntimeError: Cannot set name on a level of a MultiIndex. Use 'MultiIndex.set_names' instead.

Use Index.set_names() instead.
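A minimal sketch of set_names(), renaming either every level or a single one; the new names are arbitrary.

```python
import pandas as pd

mi = pd.MultiIndex.from_product([[1, 2], ["a", "b"]], names=["x", "y"])

# Rename every level at once by passing a list of names.
renamed_all = mi.set_names(["outer", "inner"])

# Rename a single level by passing the level to change.
renamed_one = mi.set_names("outer", level=0)

print(list(renamed_all.names))
print(list(renamed_one.names))
print(list(mi.names))
```

set_names() returns a new index by default, so the original `mi` keeps its names unless the result is assigned back.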

Sorting a MultiIndex#

For MultiIndex-ed objects to be indexed and sliced effectively, they need to be sorted. As with any index, you can use sort_index().

In [101]: import random

In [102]: random.shuffle(tuples)

In [103]: s = pd.Series(np.random.randn(8), index=pd.MultiIndex.from_tuples(tuples))

In [104]: s
Out[104]:
baz  one    0.206053
bar  one   -0.251905
baz  two   -2.213588
qux  two    1.063327
bar  two    1.266143
qux  one    0.299368
foo  two   -0.863838
     one    0.408204
dtype: float64

In [105]: s.sort_index()
Out[105]:
bar  one   -0.251905
     two    1.266143
baz  one    0.206053
     two   -2.213588
foo  one    0.408204
     two   -0.863838
qux  one    0.299368
     two    1.063327
dtype: float64

In [106]: s.sort_index(level=0)
Out[106]:
bar  one   -0.251905
     two    1.266143
baz  one    0.206053
     two   -2.213588
foo  one    0.408204
     two   -0.863838
qux  one    0.299368
     two    1.063327
dtype: float64

In [107]: s.sort_index(level=1)
Out[107]:
bar  one   -0.251905
baz  one    0.206053
foo  one    0.408204
qux  one    0.299368
bar  two    1.266143
baz  two   -2.213588
foo  two   -0.863838
qux  two    1.063327
dtype: float64

You may also pass a level name to sort_index if the MultiIndex levels are named.

In [108]: s.index = s.index.set_names(["L1", "L2"])

In [109]: s.sort_index(level="L1")
Out[109]:
L1   L2
bar  one   -0.251905
     two    1.266143
baz  one    0.206053
     two   -2.213588
foo  one    0.408204
     two   -0.863838
qux  one    0.299368
     two    1.063327
dtype: float64

In [110]: s.sort_index(level="L2")
Out[110]:
L1   L2
bar  one   -0.251905
baz  one    0.206053
foo  one    0.408204
qux  one    0.299368
bar  two    1.266143
baz  two   -2.213588
foo  two   -0.863838
qux  two    1.063327
dtype: float64

On higher dimensional objects, you can sort any of the other axes by level if they have a MultiIndex:

In [111]: df.T.sort_index(level=1, axis=1)
Out[111]:
        one      zero       one      zero
          x         x         y         y
0  0.600178  2.410179  1.519970  0.132885
1  0.274230  1.450520 -0.493662 -0.023688

Indexing will work even if the data are not sorted, but will be rather inefficient (and show a PerformanceWarning). It will also return a copy of the data rather than a view:

In [112]: dfm = pd.DataFrame(
   .....:     {"jim": [0, 0, 1, 1], "joe": ["x", "x", "z", "y"], "jolie": np.random.rand(4)}
   .....: )
   .....:

In [113]: dfm = dfm.set_index(["jim", "joe"])

In [114]: dfm
Out[114]:
            jolie
jim joe
0   x    0.490671
    x    0.120248
1   z    0.537020
    y    0.110968

In [115]: dfm.loc[(1, 'z')]
Out[115]:
           jolie
jim joe
1   z    0.53702

Furthermore, if you try to index something that is not fully lexsorted, this can raise:

In [116]: dfm.loc[(0, 'y'):(1, 'z')]
---------------------------------------------------------------------------
UnsortedIndexError                        Traceback (most recent call last)
Cell In[116], line 1
----> 1 dfm.loc[(0, 'y'):(1, 'z')]

File ~/work/pandas/pandas/pandas/core/indexing.py:1191, in _LocationIndexer.__getitem__(self, key)
   1189 maybe_callable = com.apply_if_callable(key, self.obj)
   1190 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable)
-> 1191 return self._getitem_axis(maybe_callable, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexing.py:1411, in _LocIndexer._getitem_axis(self, key, axis)
   1409 if isinstance(key, slice):
   1410     self._validate_key(key, axis)
-> 1411     return self._get_slice_axis(key, axis=axis)
   1412 elif com.is_bool_indexer(key):
   1413     return self._getbool_axis(key, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexing.py:1443, in _LocIndexer._get_slice_axis(self, slice_obj, axis)
   1440     return obj.copy(deep=False)
   1442 labels = obj._get_axis(axis)
-> 1443 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step)
   1445 if isinstance(indexer, slice):
   1446     return self.obj._slice(indexer, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6662, in Index.slice_indexer(self, start, end, step)
   6618 def slice_indexer(
   6619     self,
   6620     start: Hashable | None = None,
   6621     end: Hashable | None = None,
   6622     step: int | None = None,
   6623 ) -> slice:
   6624     """
   6625     Compute the slice indexer for input labels and step.
   6626
   (...)
   6660     slice(1, 3, None)
   6661     """
-> 6662     start_slice, end_slice = self.slice_locs(start, end, step=step)
   6664     # return a slice
   6665     if not is_scalar(start_slice):

File ~/work/pandas/pandas/pandas/core/indexes/multi.py:2904, in MultiIndex.slice_locs(self, start, end, step)
   2852 """
   2853 For an ordered MultiIndex, compute the slice locations for input
   2854 labels.
   (...)
   2900                       sequence of such.
   2901 """
   2902 # This function adds nothing to its parent implementation (the magic
   2903 # happens in get_slice_bound method), but it adds meaningful doc.
-> 2904 return super().slice_locs(start, end, step)

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6879, in Index.slice_locs(self, start, end, step)
   6877 start_slice = None
   6878 if start is not None:
-> 6879     start_slice = self.get_slice_bound(start, "left")
   6880 if start_slice is None:
   6881     start_slice = 0

File ~/work/pandas/pandas/pandas/core/indexes/multi.py:2848, in MultiIndex.get_slice_bound(self, label, side)
   2846 if not isinstance(label, tuple):
   2847     label = (label,)
-> 2848 return self._partial_tup_index(label, side=side)

File ~/work/pandas/pandas/pandas/core/indexes/multi.py:2908, in MultiIndex._partial_tup_index(self, tup, side)
   2906 def _partial_tup_index(self, tup: tuple, side: Literal["left", "right"] = "left"):
   2907     if len(tup) > self._lexsort_depth:
-> 2908         raise UnsortedIndexError(
   2909             f"Key length ({len(tup)}) was greater than MultiIndex lexsort depth "
   2910             f"({self._lexsort_depth})"
   2911         )
   2913     n = len(tup)
   2914     start, end = 0, len(self)

UnsortedIndexError: 'Key length (2) was greater than MultiIndex lexsort depth (1)'

The is_monotonic_increasing attribute on a MultiIndex shows whether the index is sorted:

In [117]: dfm.index.is_monotonic_increasing
Out[117]: False

In [118]: dfm = dfm.sort_index()

In [119]: dfm
Out[119]:
            jolie
jim joe
0   x    0.490671
    x    0.120248
1   y    0.110968
    z    0.537020

In [120]: dfm.index.is_monotonic_increasing
Out[120]: True

And now selection works as expected.

In [121]: dfm.loc[(0, "y"):(1, "z")]
Out[121]:
            jolie
jim joe
1   y    0.110968
    z    0.537020
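The check-then-sort pattern above can be wrapped in a small helper. The following is a minimal sketch, not part of pandas: the function name slice_sorted and the fixed jolie values are assumptions made for illustration.

```python
import pandas as pd
from pandas.errors import UnsortedIndexError

# Same shape as dfm above, but with fixed values so the result is reproducible.
dfm = pd.DataFrame(
    {"jim": [0, 0, 1, 1], "joe": ["x", "x", "z", "y"], "jolie": [0.1, 0.2, 0.3, 0.4]}
).set_index(["jim", "joe"])


def slice_sorted(frame, start, stop):
    """Hypothetical helper: sort the index if needed, then slice by label."""
    if not frame.index.is_monotonic_increasing:
        frame = frame.sort_index()
    return frame.loc[start:stop]


# Slicing the unsorted frame with full-depth tuple keys raises.
try:
    dfm.loc[(0, "y"):(1, "z")]
    raised = False
except UnsortedIndexError:
    raised = True

result = slice_sorted(dfm, (0, "y"), (1, "z"))
print(raised)
print(result)
```

Sorting inside the helper returns a new frame, so the caller's original row order is left untouched.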

Take methods#

Similar to NumPy ndarrays, pandas Index, Series, and DataFrame also provide the take() method, which retrieves elements along a given axis at the given indices. The given indices must be either a list or an ndarray of integer index positions. take will also accept negative integers as positions relative to the end of the object.

In [122]: index = pd.Index(np.random.randint(0, 1000, 10))

In [123]: index
Out[123]: Index([214, 502, 712, 567, 786, 175, 993, 133, 758, 329], dtype='int64')

In [124]: positions = [0, 9, 3]

In [125]: index[positions]
Out[125]: Index([214, 329, 567], dtype='int64')

In [126]: index.take(positions)
Out[126]: Index([214, 329, 567], dtype='int64')

In [127]: ser = pd.Series(np.random.randn(10))

In [128]: ser.iloc[positions]
Out[128]:
0   -0.179666
9    1.824375
3    0.392149
dtype: float64

In [129]: ser.take(positions)
Out[129]:
0   -0.179666
9    1.824375
3    0.392149
dtype: float64

For DataFrames, the given indices should be a 1d list or ndarray that specifies row or column positions.

In [130]: frm = pd.DataFrame(np.random.randn(5, 3))

In [131]: frm.take([1, 4, 3])
Out[131]:
          0         1         2
1 -1.237881  0.106854 -1.276829
4  0.629675 -1.425966  1.857704
3  0.979542 -1.633678  0.615855

In [132]: frm.take([0, 2], axis=1)
Out[132]:
          0         2
0  0.595974  0.601544
1 -1.237881 -1.276829
2 -0.767101  1.499591
3  0.979542  0.615855
4  0.629675  1.857704

It is important to note that the take method on pandas objects is not intended to work on boolean indices and may return unexpected results.

In [133]: arr = np.random.randn(10)

In [134]: arr.take([False, False, True, True])
Out[134]: array([-1.1935, -1.1935,  0.6775,  0.6775])

In [135]: arr[[0, 1]]
Out[135]: array([-1.1935,  0.6775])

In [136]: ser = pd.Series(np.random.randn(10))

In [137]: ser.take([False, False, True, True])
Out[137]:
0    0.233141
0    0.233141
1   -0.223540
1   -0.223540
dtype: float64

In [138]: ser.iloc[[0, 1]]
Out[138]:
0    0.233141
1   -0.223540
dtype: float64
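If you start from a boolean mask and still want to use take, one option is to convert the mask to integer positions first. The sketch below assumes a small Series with made-up values; np.flatnonzero is used here as one way to get the positions of the True entries.

```python
import numpy as np
import pandas as pd

ser = pd.Series([10.0, 20.0, 30.0, 40.0])
mask = np.array([False, False, True, True])

# Convert the boolean mask into integer positions before calling take.
positions = np.flatnonzero(mask)
taken = ser.take(positions)

# Ordinary boolean indexing selects the same rows.
masked = ser[mask]

print(positions)
print(taken)
```

Both routes select the rows where the mask is True, whereas passing the mask straight to take would treat False and True as the positions 0 and 1.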

Finally, as a small note on performance: because the take method handles a narrower range of inputs, it can be a good deal faster than fancy indexing.

In [139]: arr = np.random.randn(10000, 5)

In [140]: indexer = np.arange(10000)

In [141]: random.shuffle(indexer)

In [142]: %timeit arr[indexer]
   .....: %timeit arr.take(indexer, axis=0)
   .....:
262 us +- 15.4 us per loop (mean +- std. dev. of 7 runs, 1,000 loops each)
75.7 us +- 3.63 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each)

In [143]: ser = pd.Series(arr[:, 0])

In [144]: %timeit ser.iloc[indexer]
   .....: %timeit ser.take(indexer)
   .....:
141 us +- 6.06 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each)
140 us +- 7.41 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each)

Index types#

We have discussed MultiIndex in the previous sections pretty extensively. Documentation about DatetimeIndex and PeriodIndex is shown here, and documentation about TimedeltaIndex is found here.

In the following sub-sections we will highlight some other index types.

CategoricalIndex#

CategoricalIndex is a type of index that is useful for supporting indexing with duplicates. This is a container around a Categorical and allows efficient indexing and storage of an index with a large number of duplicated elements.

In [145]: from pandas.api.types import CategoricalDtype

In [146]: df = pd.DataFrame({"A": np.arange(6), "B": list("aabbca")})

In [147]: df["B"] = df["B"].astype(CategoricalDtype(list("cab")))

In [148]: df
Out[148]:
   A  B
0  0  a
1  1  a
2  2  b
3  3  b
4  4  c
5  5  a

In [149]: df.dtypes
Out[149]:
A       int64
B    category
dtype: object

In [150]: df["B"].cat.categories
Out[150]: Index(['c', 'a', 'b'], dtype='object')

Setting the index will create a CategoricalIndex.

In [151]: df2 = df.set_index("B")

In [152]: df2.index
Out[152]: CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B')

Indexing with __getitem__/.iloc/.loc works similarly to an Index with duplicates. The indexers must be in the category or the operation will raise a KeyError.

In [153]: df2.loc["a"]
Out[153]:
   A
B
a  0
a  1
a  5

The CategoricalIndex is preserved after indexing:

In [154]: df2.loc["a"].index
Out[154]: CategoricalIndex(['a', 'a', 'a'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B')

Sorting the index will sort by the order of the categories (recall that we created the index with CategoricalDtype(list('cab')), so the sorted order is cab).

In [155]: df2.sort_index()
Out[155]:
   A
B
c  4
a  0
a  1
a  5
b  2
b  3
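The two behaviours described above, sorting by category order and raising KeyError for a label outside the categories, can be checked together. This is a small sketch that rebuilds df2 from scratch; the label "z" is an arbitrary value chosen because it is not one of the categories.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({"A": range(6), "B": list("aabbca")})
df["B"] = df["B"].astype(CategoricalDtype(list("cab")))
df2 = df.set_index("B")

# Sorting follows the category order c, a, b rather than alphabetical order.
sorted_labels = list(df2.sort_index().index)
print(sorted_labels)

# Looking up a label that is not one of the categories raises KeyError.
try:
    df2.loc["z"]
    missing_raises = False
except KeyError:
    missing_raises = True
print(missing_raises)
```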

Groupby operations on the index will preserve the index nature as well.

In [156]: df2.groupby(level=0, observed=True).sum()
Out[156]:
   A
B
c  4
a  6
b  5

In [157]: df2.groupby(level=0, observed=True).sum().index
Out[157]: CategoricalIndex(['c', 'a', 'b'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B')

Reindexing operations will return a resulting index based on the type of the passed indexer. Passing a list will return a plain-old Index; indexing with a Categorical will return a CategoricalIndex, indexed according to the categories of the passed Categorical dtype. This allows one to arbitrarily index these even with values not in the categories, similarly to how you can reindex any pandas index.

In [158]: df3 = pd.DataFrame(
   .....:     {"A": np.arange(3), "B": pd.Series(list("abc")).astype("category")}
   .....: )
   .....:

In [159]: df3 = df3.set_index("B")

In [160]: df3
Out[160]:
   A
B
a  0
b  1
c  2

In [161]: df3.reindex(["a", "e"])
Out[161]:
     A
B
a  0.0
e  NaN

In [162]: df3.reindex(["a", "e"]).index
Out[162]: Index(['a', 'e'], dtype='object', name='B')

In [163]: df3.reindex(pd.Categorical(["a", "e"], categories=list("abe")))
Out[163]:
     A
B
a  0.0
e  NaN

In [164]: df3.reindex(pd.Categorical(["a", "e"], categories=list("abe"))).index
Out[164]: CategoricalIndex(['a', 'e'], categories=['a', 'b', 'e'], ordered=False, dtype='category', name='B')

Warning

Reshaping and Comparison operations on a CategoricalIndex must have the same categories or a TypeError will be raised.

In [165]: df4 = pd.DataFrame({"A": np.arange(2), "B": list("ba")})

In [166]: df4["B"] = df4["B"].astype(CategoricalDtype(list("ab")))

In [167]: df4 = df4.set_index("B")

In [168]: df4.index
Out[168]: CategoricalIndex(['b', 'a'], categories=['a', 'b'], ordered=False, dtype='category', name='B')

In [169]: df5 = pd.DataFrame({"A": np.arange(2), "B": list("bc")})

In [170]: df5["B"] = df5["B"].astype(CategoricalDtype(list("bc")))

In [171]: df5 = df5.set_index("B")

In [172]: df5.index
Out[172]: CategoricalIndex(['b', 'c'], categories=['b', 'c'], ordered=False, dtype='category', name='B')

In [173]: pd.concat([df4, df5])
Out[173]:
   A
B
b  0
a  1
b  0
c  1

RangeIndex#

RangeIndex is a sub-class of Index that provides the default index for all DataFrame and Series objects. RangeIndex is an optimized version of Index that can represent a monotonic ordered set. These are analogous to Python range types. A RangeIndex will always have an int64 dtype.

In [174]: idx = pd.RangeIndex(5)

In [175]: idx
Out[175]: RangeIndex(start=0, stop=5, step=1)

RangeIndex is the default index for all DataFrame and Series objects:

In [176]: ser = pd.Series([1, 2, 3])

In [177]: ser.index
Out[177]: RangeIndex(start=0, stop=3, step=1)

In [178]: df = pd.DataFrame([[1, 2], [3, 4]])

In [179]: df.index
Out[179]: RangeIndex(start=0, stop=2, step=1)

In [180]: df.columns
Out[180]: RangeIndex(start=0, stop=2, step=1)

A RangeIndex behaves similarly to an Index with an int64 dtype. Operations on a RangeIndex whose result cannot be represented by a RangeIndex, but should have an integer dtype, return an Index with int64 dtype. For example:

In [181]: idx[[0, 2]]
Out[181]: Index([0, 2], dtype='int64')
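To make the distinction concrete, the sketch below compares a selection that can still be described by start, stop, and step with one that cannot. The positions [0, 1, 3] are chosen on the assumption that unevenly spaced values can never be stored as a RangeIndex.

```python
import pandas as pd

idx = pd.RangeIndex(5)

# A plain slice is still describable by start/stop/step, so it stays a RangeIndex.
sliced = idx[1:4]

# Unevenly spaced positions cannot be expressed as a range, so the result
# is materialized as a regular integer Index.
picked = idx[[0, 1, 3]]

print(type(sliced).__name__, list(sliced))
print(type(picked).__name__, list(picked), picked.dtype)
```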

IntervalIndex#

IntervalIndex, together with its own dtype, IntervalDtype, and the Interval scalar type, provides first-class support in pandas for interval notation.

The IntervalIndex allows some unique indexing and is also used as a return type for the categories in cut() and qcut().

Indexing with anIntervalIndex#

An IntervalIndex can be used in Series and in DataFrame as the index.

In [182]: df = pd.DataFrame(
   .....:     {"A": [1, 2, 3, 4]}, index=pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4])
   .....: )
   .....:

In [183]: df
Out[183]:
        A
(0, 1]  1
(1, 2]  2
(2, 3]  3
(3, 4]  4

Label based indexing via .loc along the edges of an interval works as you would expect, selecting that particular interval.

In [184]: df.loc[2]
Out[184]:
A    2
Name: (1, 2], dtype: int64

In [185]: df.loc[[2, 3]]
Out[185]:
        A
(1, 2]  2
(2, 3]  3

If you select a label contained within an interval, this will also select the interval.

In [186]: df.loc[2.5]
Out[186]:
A    3
Name: (2, 3], dtype: int64

In [187]: df.loc[[2.5, 3.5]]
Out[187]:
        A
(2, 3]  3
(3, 4]  4

Selecting using an Interval will only return exact matches.

In [188]: df.loc[pd.Interval(1, 2)]
Out[188]:
A    2
Name: (1, 2], dtype: int64

Trying to select an Interval that is not exactly contained in the IntervalIndex will raise a KeyError.

In [189]: df.loc[pd.Interval(0.5, 2.5)]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[189], line 1
----> 1 df.loc[pd.Interval(0.5, 2.5)]
...
KeyError: Interval(0.5, 2.5, closed='right')

Selecting all Intervals that overlap a given Interval can be performed using the overlaps() method to create a boolean indexer.

In [190]: idxr = df.index.overlaps(pd.Interval(0.5, 2.5))

In [191]: idxr
Out[191]: array([ True,  True,  True, False])

In [192]: df[idxr]
Out[192]:
        A
(0, 1]  1
(1, 2]  2
(2, 3]  3
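The lookup rules above can be put side by side in one short script. This is a sketch that rebuilds the same frame and selects through the "A" column so that each lookup returns a scalar; the variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3, 4]}, index=pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4])
)

inside = df["A"].loc[2.5]              # scalar inside (2, 3]
edge = df["A"].loc[2]                  # right edge of the right-closed (1, 2]
exact = df["A"].loc[pd.Interval(1, 2)]  # exact interval match

# An Interval that is not itself an element of the index raises KeyError.
try:
    df.loc[pd.Interval(0.5, 2.5)]
    partial_raises = False
except KeyError:
    partial_raises = True

# overlaps() builds a boolean mask for every interval touching the query.
overlapping = df[df.index.overlaps(pd.Interval(0.5, 2.5))]

print(inside, edge, exact, partial_raises)
print(overlapping)
```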

Binning data with cut and qcut#

cut() and qcut() both return a Categorical object, and the bins they create are stored as an IntervalIndex in its .categories attribute.

In [193]: c = pd.cut(range(4), bins=2)

In [194]: c
Out[194]:
[(-0.003, 1.5], (-0.003, 1.5], (1.5, 3.0], (1.5, 3.0]]
Categories (2, interval[float64, right]): [(-0.003, 1.5] < (1.5, 3.0]]

In [195]: c.categories
Out[195]: IntervalIndex([(-0.003, 1.5], (1.5, 3.0]], dtype='interval[float64, right]')

cut() also accepts an IntervalIndex for its bins argument, which enables a useful pandas idiom. First, we call cut() with some data and bins set to a fixed number, to generate the bins. Then, we pass the values of .categories as the bins argument in subsequent calls to cut(), supplying new data which will be binned into the same bins.

In [196]: pd.cut([0, 3, 5, 1], bins=c.categories)
Out[196]:
[(-0.003, 1.5], (1.5, 3.0], NaN, (-0.003, 1.5]]
Categories (2, interval[float64, right]): [(-0.003, 1.5] < (1.5, 3.0]]

Any value which falls outside all bins will be assigned a NaN value.
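One way to see which bin each new value landed in is to inspect the integer codes of the resulting Categorical, where -1 marks a value that fell outside every bin. The sketch below reuses the data from the example above; the variable names are illustrative.

```python
import pandas as pd

# Derive the bins once from reference data.
reference = pd.cut(range(4), bins=2)
bins = reference.categories

# Reuse the same bins for new data; 5 lies outside both intervals.
binned = pd.cut([0, 3, 5, 1], bins=bins)

# Each code is the position of the bin in `bins`; -1 means "no bin" (NaN).
codes = binned.codes.tolist()
print(codes)
```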

Generating ranges of intervals#

If we need intervals on a regular frequency, we can use the interval_range() function to create an IntervalIndex using various combinations of start, end, and periods. The default frequency for interval_range is 1 for numeric intervals, and one calendar day for datetime-like intervals:

In [197]: pd.interval_range(start=0, end=5)
Out[197]: IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]], dtype='interval[int64, right]')

In [198]: pd.interval_range(start=pd.Timestamp("2017-01-01"), periods=4)
Out[198]:
IntervalIndex([(2017-01-01 00:00:00, 2017-01-02 00:00:00],
               (2017-01-02 00:00:00, 2017-01-03 00:00:00],
               (2017-01-03 00:00:00, 2017-01-04 00:00:00],
               (2017-01-04 00:00:00, 2017-01-05 00:00:00]],
              dtype='interval[datetime64[ns], right]')

In [199]: pd.interval_range(end=pd.Timedelta("3 days"), periods=3)
Out[199]:
IntervalIndex([(0 days 00:00:00, 1 days 00:00:00],
               (1 days 00:00:00, 2 days 00:00:00],
               (2 days 00:00:00, 3 days 00:00:00]],
              dtype='interval[timedelta64[ns], right]')

The freq parameter can be used to specify non-default frequencies, and can utilize a variety of frequency aliases with datetime-like intervals:

In [200]: pd.interval_range(start=0, periods=5, freq=1.5)
Out[200]: IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0], (6.0, 7.5]], dtype='interval[float64, right]')

In [201]: pd.interval_range(start=pd.Timestamp("2017-01-01"), periods=4, freq="W")
Out[201]:
IntervalIndex([(2017-01-01 00:00:00, 2017-01-08 00:00:00],
               (2017-01-08 00:00:00, 2017-01-15 00:00:00],
               (2017-01-15 00:00:00, 2017-01-22 00:00:00],
               (2017-01-22 00:00:00, 2017-01-29 00:00:00]],
              dtype='interval[datetime64[ns], right]')

In [202]: pd.interval_range(start=pd.Timedelta("0 days"), periods=3, freq="9h")
Out[202]:
IntervalIndex([(0 days 00:00:00, 0 days 09:00:00],
               (0 days 09:00:00, 0 days 18:00:00],
               (0 days 18:00:00, 1 days 03:00:00]],
              dtype='interval[timedelta64[ns], right]')

Additionally, the closed parameter can be used to specify which side(s) the intervals are closed on. Intervals are closed on the right side by default.

In [203]: pd.interval_range(start=0, end=4, closed="both")
Out[203]: IntervalIndex([[0, 1], [1, 2], [2, 3], [3, 4]], dtype='interval[int64, both]')

In [204]: pd.interval_range(start=0, end=4, closed="neither")
Out[204]: IntervalIndex([(0, 1), (1, 2), (2, 3), (3, 4)], dtype='interval[int64, neither]')

Specifying start, end, and periods will generate a range of evenly spaced intervals from start to end inclusively, with periods number of elements in the resulting IntervalIndex:

In [205]: pd.interval_range(start=0, end=6, periods=4)
Out[205]: IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0]], dtype='interval[float64, right]')

In [206]: pd.interval_range(pd.Timestamp("2018-01-01"), pd.Timestamp("2018-02-28"), periods=3)
Out[206]:
IntervalIndex([(2018-01-01 00:00:00, 2018-01-20 08:00:00],
               (2018-01-20 08:00:00, 2018-02-08 16:00:00],
               (2018-02-08 16:00:00, 2018-02-28 00:00:00]],
              dtype='interval[datetime64[ns], right]')
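The resulting IntervalIndex exposes its edges and closure through the left, right, and closed attributes, which gives a quick way to confirm how the range was divided. A brief sketch using the same arguments as above:

```python
import pandas as pd

# Four evenly spaced intervals covering 0 to 6.
evenly = pd.interval_range(start=0, end=6, periods=4)
print(list(evenly.left), list(evenly.right), evenly.closed)

# The same kind of range, but closed on both sides.
both = pd.interval_range(start=0, end=4, closed="both")
print(len(both), both.closed)
```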

Miscellaneous indexing FAQ#

Integer indexing#

Label-based indexing with integer axis labels is a thorny topic. It has been discussed heavily on mailing lists and among various members of the scientific Python community. In pandas, our general viewpoint is that labels matter more than integer locations. Therefore, with an integer axis index only label-based indexing is possible with the standard tools like .loc. In the following code, s[-1] raises a KeyError, and df.loc[-2:] treats -2 as a label rather than a position, so it returns every row:

In [207]: s = pd.Series(range(5))

In [208]: s[-1]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
...
ValueError: -1 is not in range

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[208], line 1
----> 1 s[-1]
...
KeyError: -1

In [209]: df = pd.DataFrame(np.random.randn(5, 4))

In [210]: df
Out[210]:
          0         1         2         3
0 -0.435772 -1.188928 -0.808286 -0.284634
1 -1.815703  1.347213 -0.243487  0.514704
2  1.162969 -0.287725 -0.179734  0.993962
3 -0.212673  0.909872 -0.733333 -0.349893
4  0.456434 -0.306735  0.553396  0.166221

In [211]: df.loc[-2:]
Out[211]:
          0         1         2         3
0 -0.435772 -1.188928 -0.808286 -0.284634
1 -1.815703  1.347213 -0.243487  0.514704
2  1.162969 -0.287725 -0.179734  0.993962
3 -0.212673  0.909872 -0.733333 -0.349893
4  0.456434 -0.306735  0.553396  0.166221

This deliberate decision was made to prevent ambiguities and subtle bugs (many users reported finding bugs when the API change was made to stop “falling back” on position-based indexing).
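When positions are what you actually want, .iloc is the explicit positional counterpart of .loc. The sketch below contrasts the two on an integer-labelled Series; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series(range(5))

# Positional access: the last element and the last two elements.
last_by_position = s.iloc[-1]
tail_by_position = s.iloc[-2:]

# Label access: -1 is not a label in the index, so this raises KeyError.
try:
    s.loc[-1]
    label_raises = False
except KeyError:
    label_raises = True

# A label slice starting at -2 begins before the first label, so every row matches.
tail_by_label = s.loc[-2:]

print(last_by_position, len(tail_by_position), label_raises, len(tail_by_label))
```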

Non-monotonic indexes require exact matches#

If the index of a Series or DataFrame is monotonically increasing or decreasing, then the bounds of a label-based slice can be outside the range of the index, much like slice indexing a normal Python list. Monotonicity of an index can be tested with the is_monotonic_increasing and is_monotonic_decreasing attributes.

In [212]: df = pd.DataFrame(index=[2, 3, 3, 4, 5], columns=["data"], data=list(range(5)))

In [213]: df.index.is_monotonic_increasing
Out[213]: True

# no rows 0 or 1, but still returns rows 2, 3 (both of them), and 4:
In [214]: df.loc[0:4, :]
Out[214]:
   data
2     0
3     1
3     2
4     3

# slice is outside the index, so an empty DataFrame is returned
In [215]: df.loc[13:15, :]
Out[215]:
Empty DataFrame
Columns: [data]
Index: []

On the other hand, if the index is not monotonic, then both slice bounds must be unique members of the index.

In [216]: df = pd.DataFrame(index=[2, 3, 1, 4, 3, 5], columns=["data"], data=list(range(6)))

In [217]: df.index.is_monotonic_increasing
Out[217]: False

# OK because 2 and 4 are in the index
In [218]: df.loc[2:4, :]
Out[218]:
   data
2     0
3     1
1     2
4     3

# 0 is not in the index
In [219]: df.loc[0:4, :]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
...
KeyError: 0

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[219], line 1
----> 1 df.loc[0:4, :]
...
KeyError: 0

# 3 is not a unique label
In [220]: df.loc[2:3, :]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[220], line 1
----> 1 df.loc[2:3, :]
...
KeyError: 'Cannot get right slice bound for non-unique label: 3'
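If the original row order does not matter, sorting the index first is one way to make such slices work again. The following sketch reuses the non-monotonic index from the example above; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"data": range(6)}, index=[2, 3, 1, 4, 3, 5])

# On the non-monotonic index, a bound that is not a label raises KeyError.
try:
    df.loc[0:4]
    unsorted_raises = False
except KeyError:
    unsorted_raises = True

# After sorting, the index is monotonic, so out-of-range and
# duplicated bounds are both acceptable.
ordered = df.sort_index()
window = ordered.loc[0:4]

print(unsorted_raises)
print(list(window.index))
```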

Index.is_monotonic_increasing and Index.is_monotonic_decreasing only check that an index is weakly monotonic. To check for strict monotonicity, you can combine one of those with the is_unique attribute.

In [221]: weakly_monotonic = pd.Index(["a", "b", "c", "c"])

In [222]: weakly_monotonic
Out[222]: Index(['a', 'b', 'c', 'c'], dtype='object')

In [223]: weakly_monotonic.is_monotonic_increasing
Out[223]: True

In [224]: weakly_monotonic.is_monotonic_increasing & weakly_monotonic.is_unique
Out[224]: False
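The combined check can be wrapped in a small function if it is needed in several places. The helper below is hypothetical (it is not a public pandas API); it simply combines the two attributes described above.

```python
import pandas as pd


def is_strictly_increasing(index):
    """Hypothetical helper: weakly increasing and free of duplicates."""
    return bool(index.is_monotonic_increasing and index.is_unique)


weak = pd.Index(["a", "b", "c", "c"])
strict = pd.Index(["a", "b", "c"])

print(is_strictly_increasing(weak))
print(is_strictly_increasing(strict))
```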

Endpoints are inclusive#

Compared with standard Python sequence slicing in which the slice endpoint is not inclusive, label-based slicing in pandas is inclusive. The primary reason for this is that it is often not possible to easily determine the “successor” or next element after a particular label in an index. For example, consider the following Series:

In [225]: s = pd.Series(np.random.randn(6), index=list("abcdef"))

In [226]: s
Out[226]:
a   -0.101684
b   -0.734907
c   -0.130121
d   -0.476046
e    0.759104
f    0.213379
dtype: float64

Suppose we wished to slice from c to e. Using integers, this would be accomplished as such:

In [227]: s[2:5]
Out[227]:
c   -0.130121
d   -0.476046
e    0.759104
dtype: float64

However, if you only had c and e, determining the next element in the index can be somewhat complicated. For example, the following does not work:

In [228]: s.loc['c':'e' + 1]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[228], line 1
----> 1 s.loc['c':'e' + 1]

TypeError: can only concatenate str (not "int") to str

A very common use case is to limit a time series to start and end at two specific dates. To enable this, we made the design choice to make label-based slicing include both endpoints:

In [229]: s.loc["c":"e"]
Out[229]:
c   -0.130121
d   -0.476046
e    0.759104
dtype: float64

This is most definitely a “practicality beats purity” sort of thing, but it is something to watch out for if you expect label-based slicing to behave exactly in the way that standard Python integer slicing works.
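The time series use case mentioned above might look like the following sketch. The dates and values are made up for illustration; the point is that both endpoint dates appear in the label-based result, while the equivalent positional slice needs a stop value one past the last wanted row.

```python
import pandas as pd

ts = pd.Series(range(7), index=pd.date_range("2020-01-01", periods=7, freq="D"))

# Label-based slice: both 2020-01-03 and 2020-01-05 are included.
window = ts.loc["2020-01-03":"2020-01-05"]

# Positional slice: the stop position 5 is excluded, as in plain Python.
positional = ts.iloc[2:5]

print(window)
print(window.equals(positional))
```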

Indexing potentially changes underlying Series dtype#

Different indexing operations can potentially change the dtype of a Series.

In [230]: series1 = pd.Series([1, 2, 3])

In [231]: series1.dtype
Out[231]: dtype('int64')

In [232]: res = series1.reindex([0, 4])

In [233]: res.dtype
Out[233]: dtype('float64')

In [234]: res
Out[234]:
0    1.0
4    NaN
dtype: float64

In [235]: series2 = pd.Series([True])

In [236]: series2.dtype
Out[236]: dtype('bool')

In [237]: res = series2.reindex_like(series1)

In [238]: res.dtype
Out[238]: dtype('O')

In [239]: res
Out[239]:
0    True
1     NaN
2     NaN
dtype: object

This is because the (re)indexing operations above silently insert NaNs and the dtype changes accordingly. This can cause some issues when using numpy ufuncs such as numpy.logical_and.

See GH 2388 for a more detailed discussion.
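Two common ways to keep an integer dtype through a reindex are to supply a fill_value, or to convert to the nullable "Int64" extension dtype beforehand. The sketch below shows both next to the default upcast; whether either is appropriate depends on what a missing entry should mean in your data.

```python
import pandas as pd

series1 = pd.Series([1, 2, 3])

# Default behaviour: the missing label 4 becomes NaN and the dtype is upcast.
upcast = series1.reindex([0, 4])

# Supplying an integer fill value avoids NaN, so the dtype stays int64.
filled = series1.reindex([0, 4], fill_value=0)

# The nullable Int64 dtype can hold a missing value without upcasting.
nullable = series1.astype("Int64").reindex([0, 4])

print(upcast.dtype, filled.dtype, nullable.dtype)
```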

