MultiIndex / advanced indexing#
This section covers indexing with a MultiIndex and other advanced indexing features.
See Indexing and Selecting Data for general indexing documentation.
Warning
Whether a copy or a reference is returned for a setting operation may depend on the context. This is sometimes called chained assignment and should be avoided. See Returning a View versus Copy.
See the cookbook for some advanced strategies.
Hierarchical indexing (MultiIndex)#
Hierarchical / multi-level indexing opens the door to quite sophisticated data analysis and manipulation, especially for working with higher-dimensional data. In essence, it enables you to store and manipulate data with an arbitrary number of dimensions in lower-dimensional data structures like Series (1d) and DataFrame (2d).
In this section, we will show what exactly we mean by “hierarchical” indexing and how it integrates with all of the pandas indexing functionality described above and in prior sections. Later, when discussing group by and pivoting and reshaping data, we’ll show non-trivial applications to illustrate how it aids in structuring data for analysis.
See the cookbook for some advanced strategies.
Creating a MultiIndex (hierarchical index) object#
The MultiIndex object is the hierarchical analogue of the standard Index object which typically stores the axis labels in pandas objects. You can think of MultiIndex as an array of tuples where each tuple is unique. A MultiIndex can be created from a list of arrays (using MultiIndex.from_arrays()), an array of tuples (using MultiIndex.from_tuples()), a crossed set of iterables (using MultiIndex.from_product()), or a DataFrame (using MultiIndex.from_frame()). The Index constructor will attempt to return a MultiIndex when it is passed a list of tuples. The following examples demonstrate different ways to initialize MultiIndexes.
In [1]: arrays = [
   ...:     ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
   ...:     ["one", "two", "one", "two", "one", "two", "one", "two"],
   ...: ]
   ...:

In [2]: tuples = list(zip(*arrays))

In [3]: tuples
Out[3]:
[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]

In [4]: index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])

In [5]: index
Out[5]:
MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])

In [6]: s = pd.Series(np.random.randn(8), index=index)

In [7]: s
Out[7]:
first  second
bar    one       0.469112
       two      -0.282863
baz    one      -1.509059
       two      -1.135632
foo    one       1.212112
       two      -0.173215
qux    one       0.119209
       two      -1.044236
dtype: float64
When you want every pairing of the elements in two iterables, it can be easier to use the MultiIndex.from_product() method:
In [8]: iterables = [["bar", "baz", "foo", "qux"], ["one", "two"]]

In [9]: pd.MultiIndex.from_product(iterables, names=["first", "second"])
Out[9]:
MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])
You can also construct a MultiIndex from a DataFrame directly, using the method MultiIndex.from_frame(). This is a complementary method to MultiIndex.to_frame().
In [10]: df = pd.DataFrame(
   ....:     [["bar", "one"], ["bar", "two"], ["foo", "one"], ["foo", "two"]],
   ....:     columns=["first", "second"],
   ....: )
   ....:

In [11]: pd.MultiIndex.from_frame(df)
Out[11]:
MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('foo', 'one'),
            ('foo', 'two')],
           names=['first', 'second'])
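To make the "complementary" relationship concrete, the following sketch round-trips a small frame through MultiIndex.from_frame() and MultiIndex.to_frame(). The variable names (frame, mi, round_trip) are illustrative only and are not part of the examples above.

```python
import pandas as pd

frame = pd.DataFrame(
    [["bar", "one"], ["bar", "two"], ["foo", "one"], ["foo", "two"]],
    columns=["first", "second"],
)

# The column labels become the level names of the MultiIndex.
mi = pd.MultiIndex.from_frame(frame)

# to_frame(index=False) turns the levels back into ordinary columns
# with a default integer index.
round_trip = mi.to_frame(index=False)

print(list(mi.names))
print(round_trip.equals(frame))
```

Passing index=False is a choice made for this sketch; by default to_frame() keeps the MultiIndex itself as the index of the returned frame.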
As a convenience, you can pass a list of arrays directly into Series or DataFrame to construct a MultiIndex automatically:
In [12]: arrays = [
   ....:     np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),
   ....:     np.array(["one", "two", "one", "two", "one", "two", "one", "two"]),
   ....: ]
   ....:

In [13]: s = pd.Series(np.random.randn(8), index=arrays)

In [14]: s
Out[14]:
bar  one   -0.861849
     two   -2.104569
baz  one   -0.494929
     two    1.071804
foo  one    0.721555
     two   -0.706771
qux  one   -1.039575
     two    0.271860
dtype: float64

In [15]: df = pd.DataFrame(np.random.randn(8, 4), index=arrays)

In [16]: df
Out[16]:
                0         1         2         3
bar one -0.424972  0.567020  0.276232 -1.087401
    two -0.673690  0.113648 -1.478427  0.524988
baz one  0.404705  0.577046 -1.715002 -1.039268
    two -0.370647 -1.157892 -1.344312  0.844885
foo one  1.075770 -0.109050  1.643563 -1.469388
    two  0.357021 -0.674600 -1.776904 -0.968914
qux one -1.294524  0.413738  0.276662 -0.472035
    two -0.013960 -0.362543 -0.006154 -0.923061
All of the MultiIndex constructors accept a names argument which stores string names for the levels themselves. If no names are provided, None will be assigned:
In [17]: df.index.names
Out[17]: FrozenList([None, None])
This index can back any axis of a pandas object, and the number of levels of the index is up to you:
In [18]: df = pd.DataFrame(np.random.randn(3, 8), index=["A", "B", "C"], columns=index)

In [19]: df
Out[19]:
first        bar                 baz  ...       foo       qux
second       one       two       one  ...       two       one       two
A       0.895717  0.805244 -1.206412  ...  1.340309 -1.170299 -0.226169
B       0.410835  0.813850  0.132003  ... -1.187678  1.130127 -1.436737
C      -1.413681  1.607920  1.024180  ... -2.211372  0.974466 -2.006747

[3 rows x 8 columns]

In [20]: pd.DataFrame(np.random.randn(6, 6), index=index[:6], columns=index[:6])
Out[20]:
first              bar                 baz                 foo
second             one       two       one       two       one       two
first second
bar   one    -0.410001 -0.078638  0.545952 -1.219217 -1.226825  0.769804
      two    -1.281247 -0.727707 -0.121306 -0.097883  0.695775  0.341734
baz   one     0.959726 -1.110336 -0.619976  0.149748 -0.732339  0.687738
      two     0.176444  0.403310 -0.154951  0.301624 -2.179861 -1.369849
foo   one    -0.954208  1.462696 -1.743161 -0.826591 -0.345352  1.314232
      two     0.690579  0.995761  2.396780  0.014871  3.357427 -0.317441
We’ve “sparsified” the higher levels of the indexes to make the console output a bit easier on the eyes. Note that how the index is displayed can be controlled using the multi_sparse option in pandas.set_option():
In [21]: with pd.option_context("display.multi_sparse", False):
   ....:     df
   ....:
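Because an expression inside a with block is not echoed by the console, the effect of the option is easier to see by capturing the text representation explicitly. The sketch below does that on a small Series; the names sparse_repr and dense_repr are illustrative.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
s = pd.Series(np.arange(4), index=index)

# Default display: repeated outer labels are left blank.
sparse_repr = repr(s)

# With multi_sparse disabled, every row shows its full label.
with pd.option_context("display.multi_sparse", False):
    dense_repr = repr(s)

print(sparse_repr)
print(dense_repr)
```

Only the display changes; the underlying index is identical in both cases.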
It’s worth keeping in mind that there’s nothing preventing you from using tuples as atomic labels on an axis:
In [22]: pd.Series(np.random.randn(8), index=tuples)
Out[22]:
(bar, one)   -1.236269
(bar, two)    0.896171
(baz, one)   -0.487602
(baz, two)   -0.082240
(foo, one)   -2.182937
(foo, two)    0.380396
(qux, one)    0.084844
(qux, two)    0.432390
dtype: float64
The reason that the MultiIndex matters is that it can allow you to do grouping, selection, and reshaping operations as we will describe below and in subsequent areas of the documentation. As you will see in later sections, you can find yourself working with hierarchically-indexed data without creating a MultiIndex explicitly yourself. However, when loading data from a file, you may wish to generate your own MultiIndex when preparing the data set.
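As a sketch of the file-loading case, the snippet below reads a small CSV (held in a string so the example is self-contained) and builds the MultiIndex in two ways: through the index_col argument of read_csv, and through set_index after loading. The CSV content and column names are made up for illustration.

```python
import io

import pandas as pd

csv_text = """first,second,value
bar,one,1
bar,two,2
baz,one,3
baz,two,4
"""

# Option 1: let the reader use the first two columns as index levels.
df_a = pd.read_csv(io.StringIO(csv_text), index_col=[0, 1])

# Option 2: load a flat frame, then promote the columns to index levels.
df_b = pd.read_csv(io.StringIO(csv_text)).set_index(["first", "second"])

print(df_a.index)
print(df_a.loc[("baz", "one"), "value"])
```

Both routes should give the same hierarchically indexed frame; which one is more convenient depends on whether the key columns need cleaning first.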
Reconstructing the level labels#
The method get_level_values() will return a vector of the labels for each location at a particular level:
In [23]: index.get_level_values(0)
Out[23]: Index(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], dtype='object', name='first')

In [24]: index.get_level_values("second")
Out[24]: Index(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'], dtype='object', name='second')
Basic indexing on axis with MultiIndex#
One of the important features of hierarchical indexing is that you can select data by a “partial” label identifying a subgroup in the data. Partial selection “drops” levels of the hierarchical index in the result in a completely analogous way to selecting a column in a regular DataFrame:
In [25]: df["bar"]
Out[25]:
second       one       two
A       0.895717  0.805244
B       0.410835  0.813850
C      -1.413681  1.607920

In [26]: df["bar", "one"]
Out[26]:
A    0.895717
B    0.410835
C   -1.413681
Name: (bar, one), dtype: float64

In [27]: df["bar"]["one"]
Out[27]:
A    0.895717
B    0.410835
C   -1.413681
Name: one, dtype: float64

In [28]: s["qux"]
Out[28]:
one   -1.039575
two    0.271860
dtype: float64
See Cross-section with hierarchical index for how to select on a deeper level.
Defined levels#
The MultiIndex keeps all the defined levels of an index, even if they are not actually used. When slicing an index, you may notice this. For example:
In [29]: df.columns.levels  # original MultiIndex
Out[29]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])

In [30]: df[["foo", "qux"]].columns.levels  # sliced
Out[30]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])
This is done to avoid a recomputation of the levels in order to make slicing highly performant. If you want to see only the used levels, you can use the get_level_values() method.
In [31]: df[["foo", "qux"]].columns.to_numpy()
Out[31]:
array([('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')],
      dtype=object)

# for a specific level
In [32]: df[["foo", "qux"]].columns.get_level_values(0)
Out[32]: Index(['foo', 'foo', 'qux', 'qux'], dtype='object', name='first')
To reconstruct the MultiIndex with only the used levels, the remove_unused_levels() method may be used.
In [33]: new_mi = df[["foo", "qux"]].columns.remove_unused_levels()

In [34]: new_mi.levels
Out[34]: FrozenList([['foo', 'qux'], ['one', 'two']])
Data alignment and using reindex#
Operations between differently-indexed objects having MultiIndex on the axes will work as you expect; data alignment will work the same as an Index of tuples:
In [35]: s + s[:-2]
Out[35]:
bar  one   -1.723698
     two   -4.209138
baz  one   -0.989859
     two    2.143608
foo  one    1.443110
     two   -1.413542
qux  one         NaN
     two         NaN
dtype: float64

In [36]: s + s[::2]
Out[36]:
bar  one   -1.723698
     two         NaN
baz  one   -0.989859
     two         NaN
foo  one    1.443110
     two         NaN
qux  one   -2.079150
     two         NaN
dtype: float64
The reindex() method of Series/DataFrames can be called with another MultiIndex, or even a list or array of tuples:
In [37]: s.reindex(index[:3])
Out[37]:
first  second
bar    one      -0.861849
       two      -2.104569
baz    one      -0.494929
dtype: float64

In [38]: s.reindex([("foo", "two"), ("bar", "one"), ("qux", "one"), ("baz", "one")])
Out[38]:
foo  two   -0.706771
bar  one   -0.861849
qux  one   -1.039575
baz  one   -0.494929
dtype: float64
Advanced indexing with hierarchical index#
Syntactically integrating MultiIndex in advanced indexing with .loc is a bit challenging, but we’ve made every effort to do so. In general, MultiIndex keys take the form of tuples. For example, the following works as you would expect:
In [39]: df = df.T

In [40]: df
Out[40]:
                     A         B         C
first second
bar   one     0.895717  0.410835 -1.413681
      two     0.805244  0.813850  1.607920
baz   one    -1.206412  0.132003  1.024180
      two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466
      two    -0.226169 -1.436737 -2.006747

In [41]: df.loc[("bar", "two")]
Out[41]:
A    0.805244
B    0.813850
C    1.607920
Name: (bar, two), dtype: float64
Note that df.loc['bar', 'two'] would also work in this example, but this shorthand notation can lead to ambiguity in general.
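The ambiguity can be illustrated with a small frame built only for this sketch (its values and names are not taken from the examples above). A pair of bare labels can be read either as a two-level row key or as a row label plus a column label, whereas an explicit tuple for the row key leaves no room for interpretation.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
df = pd.DataFrame(
    np.arange(12).reshape(4, 3), index=index, columns=["A", "B", "C"]
)

# Explicit: a tuple for the row key and a separate column indexer.
row = df.loc[("bar", "two"), :]     # Series over columns A, B, C
cell = df.loc[("bar", "two"), "A"]  # single value

# Shorthand: here "A" is not a label of the second row level, so the
# pair is resolved as (row label "bar", column "A").
col_slice = df.loc["bar", "A"]      # Series over the "second" level

print(row.tolist(), cell, col_slice.tolist())
```

Writing the row key as a tuple, as in the first two lookups, avoids relying on how the shorthand happens to be resolved.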
If you also want to index a specific column with .loc, you must use a tuple like this:
In [42]: df.loc[("bar", "two"), "A"]
Out[42]: 0.8052440253863785
You don’t have to specify all levels of the MultiIndex; you can pass only the first elements of the tuple. For example, you can use “partial” indexing to get all elements with bar in the first level as follows:
In [43]: df.loc["bar"]
Out[43]:
               A         B         C
second
one     0.895717  0.410835 -1.413681
two     0.805244  0.813850  1.607920
This is a shortcut for the slightly more verbose notation df.loc[('bar',),] (equivalent to df.loc['bar',] in this example).
“Partial” slicing also works quite nicely.
In [44]: df.loc["baz":"foo"]
Out[44]:
                     A         B         C
first second
baz   one    -1.206412  0.132003  1.024180
      two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
You can slice with a ‘range’ of values, by providing a slice of tuples.
In [45]: df.loc[("baz", "two"):("qux", "one")]
Out[45]:
                     A         B         C
first second
baz   two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466

In [46]: df.loc[("baz", "two"):"foo"]
Out[46]:
                     A         B         C
first second
baz   two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
Passing a list of labels or tuples works similarly to reindexing:
In [47]: df.loc[[("bar", "two"), ("qux", "one")]]
Out[47]:
                     A         B         C
first second
bar   two     0.805244  0.813850  1.607920
qux   one    -1.170299  1.130127  0.974466
Note
It is important to note that tuples and lists are not treated identically in pandas when it comes to indexing. Whereas a tuple is interpreted as one multi-level key, a list is used to specify several keys. Or in other words, tuples go horizontally (traversing levels), lists go vertically (scanning levels).
Importantly, a list of tuples indexes several complete MultiIndex keys, whereas a tuple of lists refers to several values within a level:
In [48]: s = pd.Series(
   ....:     [1, 2, 3, 4, 5, 6],
   ....:     index=pd.MultiIndex.from_product([["A", "B"], ["c", "d", "e"]]),
   ....: )
   ....:

In [49]: s.loc[[("A", "c"), ("B", "d")]]  # list of tuples
Out[49]:
A  c    1
B  d    5
dtype: int64

In [50]: s.loc[(["A", "B"], ["c", "d"])]  # tuple of lists
Out[50]:
A  c    1
   d    2
B  c    4
   d    5
dtype: int64
Using slicers#
You can slice a MultiIndex by providing multiple indexers.
You can provide any of the selectors as if you are indexing by label (see Selection by Label), including slices, lists of labels, labels, and boolean indexers.
You can use slice(None) to select all the contents of that level. You do not need to specify all the deeper levels; they will be implied as slice(None).
As usual, both sides of the slicers are included as this is label indexing.
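The points above can be sketched on a small three-level frame constructed just for this illustration. The slice on the first level includes both endpoints, the second level is selected in full with slice(None), and the third level is left out of the key entirely.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [["A0", "A1", "A2", "A3"], ["B0", "B1"], ["C0", "C1"]]
)
df = pd.DataFrame({"value": np.arange(16)}, index=index)

# Rows for A1 and A2 (both endpoints included), every B label,
# and implicitly every C label.
subset = df.loc[(slice("A1", "A2"), slice(None)), :]

first_level = subset.index.get_level_values(0).unique().tolist()
print(first_level, len(subset))
```

The column indexer (:) is spelled out so that the tuple is unambiguously applied to the rows.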
Warning
You should specify all axes in the .loc specifier, meaning the indexer for the index and for the columns. There are some ambiguous cases where the passed indexer could be misinterpreted as indexing both axes, rather than into, say, the MultiIndex for the rows.
You should do this:
df.loc[(slice("A1", "A3"), ...), :]  # noqa: E999
You should not do this:
df.loc[(slice("A1", "A3"), ...)]  # noqa: E999
In [51]: def mklbl(prefix, n):
   ....:     return ["%s%s" % (prefix, i) for i in range(n)]
   ....:

In [52]: miindex = pd.MultiIndex.from_product(
   ....:     [mklbl("A", 4), mklbl("B", 2), mklbl("C", 4), mklbl("D", 2)]
   ....: )
   ....:

In [53]: micolumns = pd.MultiIndex.from_tuples(
   ....:     [("a", "foo"), ("a", "bar"), ("b", "foo"), ("b", "bah")], names=["lvl0", "lvl1"]
   ....: )
   ....:

In [54]: dfmi = (
   ....:     pd.DataFrame(
   ....:         np.arange(len(miindex) * len(micolumns)).reshape(
   ....:             (len(miindex), len(micolumns))
   ....:         ),
   ....:         index=miindex,
   ....:         columns=micolumns,
   ....:     )
   ....:     .sort_index()
   ....:     .sort_index(axis=1)
   ....: )
   ....:

In [55]: dfmi
Out[55]:
lvl0           a         b
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0    9    8   11   10
         D1   13   12   15   14
      C2 D0   17   16   19   18
...          ...  ...  ...  ...
A3 B1 C1 D1  237  236  239  238
      C2 D0  241  240  243  242
         D1  245  244  247  246
      C3 D0  249  248  251  250
         D1  253  252  255  254

[64 rows x 4 columns]
Basic MultiIndex slicing using slices, lists, and labels.
In [56]: dfmi.loc[(slice("A1", "A3"), slice(None), ["C1", "C3"]), :]
Out[56]:
lvl0           a         b
lvl1         bar  foo  bah  foo
A1 B0 C1 D0   73   72   75   74
         D1   77   76   79   78
      C3 D0   89   88   91   90
         D1   93   92   95   94
   B1 C1 D0  105  104  107  106
...          ...  ...  ...  ...
A3 B0 C3 D1  221  220  223  222
   B1 C1 D0  233  232  235  234
         D1  237  236  239  238
      C3 D0  249  248  251  250
         D1  253  252  255  254

[24 rows x 4 columns]
You can use pandas.IndexSlice to facilitate a more natural syntax using :, rather than using slice(None).
In [57]: idx = pd.IndexSlice

In [58]: dfmi.loc[idx[:, :, ["C1", "C3"]], idx[:, "foo"]]
Out[58]:
lvl0           a    b
lvl1         foo  foo
A0 B0 C1 D0    8   10
         D1   12   14
      C3 D0   24   26
         D1   28   30
   B1 C1 D0   40   42
...          ...  ...
A3 B0 C3 D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

[32 rows x 2 columns]
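To see why the two spellings are interchangeable, the sketch below inspects what indexing pd.IndexSlice produces and checks that both forms select the same rows. The small frame is constructed only for this illustration.

```python
import pandas as pd

idx = pd.IndexSlice

# Indexing IndexSlice hands back the key it received, so each ":" arrives
# as a slice(None) object inside an ordinary tuple.
key = idx[:, :, ["C1"]]
print(key)

index = pd.MultiIndex.from_product([["A0", "A1"], ["B0", "B1"], ["C0", "C1"]])
df = pd.DataFrame({"value": range(8)}, index=index)

with_index_slice = df.loc[idx[:, :, ["C1"]], :]
with_plain_slices = df.loc[(slice(None), slice(None), ["C1"]), :]
print(with_index_slice.equals(with_plain_slices))
```

In other words, IndexSlice is a notational convenience; the indexer that reaches .loc is the same tuple of slices and lists.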
It is possible to perform quite complicated selections using this method on multiple axes at the same time.
In [59]: dfmi.loc["A1", (slice(None), "foo")]
Out[59]:
lvl0        a    b
lvl1      foo  foo
B0 C0 D0   64   66
      D1   68   70
   C1 D0   72   74
      D1   76   78
   C2 D0   80   82
...       ...  ...
B1 C1 D1  108  110
   C2 D0  112  114
      D1  116  118
   C3 D0  120  122
      D1  124  126

[16 rows x 2 columns]

In [60]: dfmi.loc[idx[:, :, ["C1", "C3"]], idx[:, "foo"]]
Out[60]:
lvl0           a    b
lvl1         foo  foo
A0 B0 C1 D0    8   10
         D1   12   14
      C3 D0   24   26
         D1   28   30
   B1 C1 D0   40   42
...          ...  ...
A3 B0 C3 D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

[32 rows x 2 columns]
Using a boolean indexer you can provide selection related to the values.
In [61]: mask = dfmi[("a", "foo")] > 200

In [62]: dfmi.loc[idx[mask, :, ["C1", "C3"]], idx[:, "foo"]]
Out[62]:
lvl0           a    b
lvl1         foo  foo
A3 B0 C1 D1  204  206
      C3 D0  216  218
         D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254
You can also specify the axis argument to .loc to interpret the passed slicers on a single axis.
In [63]: dfmi.loc(axis=0)[:, :, ["C1", "C3"]]
Out[63]:
lvl0           a         b
lvl1         bar  foo  bah  foo
A0 B0 C1 D0    9    8   11   10
         D1   13   12   15   14
      C3 D0   25   24   27   26
         D1   29   28   31   30
   B1 C1 D0   41   40   43   42
...          ...  ...  ...  ...
A3 B0 C3 D1  221  220  223  222
   B1 C1 D0  233  232  235  234
         D1  237  236  239  238
      C3 D0  249  248  251  250
         D1  253  252  255  254

[32 rows x 4 columns]
Furthermore, you can set the values using the following methods.
In [64]: df2 = dfmi.copy()

In [65]: df2.loc(axis=0)[:, :, ["C1", "C3"]] = -10

In [66]: df2
Out[66]:
lvl0           a         b
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0  -10  -10  -10  -10
         D1  -10  -10  -10  -10
      C2 D0   17   16   19   18
...          ...  ...  ...  ...
A3 B1 C1 D1  -10  -10  -10  -10
      C2 D0  241  240  243  242
         D1  245  244  247  246
      C3 D0  -10  -10  -10  -10
         D1  -10  -10  -10  -10

[64 rows x 4 columns]
You can use a right-hand-side of an alignable object as well.
In [67]: df2 = dfmi.copy()

In [68]: df2.loc[idx[:, :, ["C1", "C3"]], :] = df2 * 1000

In [69]: df2
Out[69]:
lvl0              a               b
lvl1            bar     foo     bah     foo
A0 B0 C0 D0       1       0       3       2
         D1       5       4       7       6
      C1 D0    9000    8000   11000   10000
         D1   13000   12000   15000   14000
      C2 D0      17      16      19      18
...             ...     ...     ...     ...
A3 B1 C1 D1  237000  236000  239000  238000
      C2 D0     241     240     243     242
         D1     245     244     247     246
      C3 D0  249000  248000  251000  250000
         D1  253000  252000  255000  254000

[64 rows x 4 columns]
Cross-section#
The xs() method of DataFrame additionally takes a level argument to make selecting data at a particular level of a MultiIndex easier.
In [70]: df
Out[70]:
                     A         B         C
first second
bar   one     0.895717  0.410835 -1.413681
      two     0.805244  0.813850  1.607920
baz   one    -1.206412  0.132003  1.024180
      two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466
      two    -0.226169 -1.436737 -2.006747

In [71]: df.xs("one", level="second")
Out[71]:
              A         B         C
first
bar    0.895717  0.410835 -1.413681
baz   -1.206412  0.132003  1.024180
foo    1.431256 -0.076467  0.875906
qux   -1.170299  1.130127  0.974466

# using the slicers
In [72]: df.loc[(slice(None), "one"), :]
Out[72]:
                     A         B         C
first second
bar   one     0.895717  0.410835 -1.413681
baz   one    -1.206412  0.132003  1.024180
foo   one     1.431256 -0.076467  0.875906
qux   one    -1.170299  1.130127  0.974466
You can also select on the columns with xs, by providing the axis argument.
In [73]: df = df.T

In [74]: df.xs("one", level="second", axis=1)
Out[74]:
first       bar       baz       foo       qux
A      0.895717 -1.206412  1.431256 -1.170299
B      0.410835  0.132003 -0.076467  1.130127
C     -1.413681  1.024180  0.875906  0.974466

# using the slicers
In [75]: df.loc[:, (slice(None), "one")]
Out[75]:
first        bar       baz       foo       qux
second       one       one       one       one
A       0.895717 -1.206412  1.431256 -1.170299
B       0.410835  0.132003 -0.076467  1.130127
C      -1.413681  1.024180  0.875906  0.974466
xs also allows selection with multiple keys.
In [76]: df.xs(("one", "bar"), level=("second", "first"), axis=1)
Out[76]:
first        bar
second       one
A       0.895717
B       0.410835
C      -1.413681

# using the slicers
In [77]: df.loc[:, ("bar", "one")]
Out[77]:
A    0.895717
B    0.410835
C   -1.413681
Name: (bar, one), dtype: float64
You can pass drop_level=False to xs to retain the level that was selected.
In [78]: df.xs("one", level="second", axis=1, drop_level=False)
Out[78]:
first        bar       baz       foo       qux
second       one       one       one       one
A       0.895717 -1.206412  1.431256 -1.170299
B       0.410835  0.132003 -0.076467  1.130127
C      -1.413681  1.024180  0.875906  0.974466
Compare the above with the result using drop_level=True (the default value).
In [79]: df.xs("one", level="second", axis=1, drop_level=True)
Out[79]:
first       bar       baz       foo       qux
A      0.895717 -1.206412  1.431256 -1.170299
B      0.410835  0.132003 -0.076467  1.130127
C     -1.413681  1.024180  0.875906  0.974466
Advanced reindexing and alignment#
Using the parameter level in the reindex() and align() methods of pandas objects is useful to broadcast values across a level. For instance:
In [80]: midx = pd.MultiIndex(
   ....:     levels=[["zero", "one"], ["x", "y"]], codes=[[1, 1, 0, 0], [1, 0, 1, 0]]
   ....: )
   ....:

In [81]: df = pd.DataFrame(np.random.randn(4, 2), index=midx)

In [82]: df
Out[82]:
               0         1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520

In [83]: df2 = df.groupby(level=0).mean()

In [84]: df2
Out[84]:
             0         1
one   1.060074 -0.109716
zero  1.271532  0.713416

In [85]: df2.reindex(df.index, level=0)
Out[85]:
               0         1
one  y  1.060074 -0.109716
     x  1.060074 -0.109716
zero y  1.271532  0.713416
     x  1.271532  0.713416

# aligning
In [86]: df_aligned, df2_aligned = df.align(df2, level=0)

In [87]: df_aligned
Out[87]:
               0         1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520

In [88]: df2_aligned
Out[88]:
               0         1
one  y  1.060074 -0.109716
     x  1.060074 -0.109716
zero y  1.271532  0.713416
     x  1.271532  0.713416
Swapping levels with swaplevel#
The swaplevel() method can switch the order of two levels:
In [89]: df[:5]
Out[89]:
               0         1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520

In [90]: df[:5].swaplevel(0, 1, axis=0)
Out[90]:
               0         1
y one   1.519970 -0.493662
x one   0.600178  0.274230
y zero  0.132885 -0.023688
x zero  2.410179  1.450520
Reordering levels with reorder_levels#
The reorder_levels() method generalizes the swaplevel method, allowing you to permute the hierarchical index levels in one step:
In [91]: df[:5].reorder_levels([1, 0], axis=0)
Out[91]:
               0         1
y one   1.519970 -0.493662
x one   0.600178  0.274230
y zero  0.132885 -0.023688
x zero  2.410179  1.450520
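With only two levels, reorder_levels and swaplevel look identical. The difference shows up with three or more levels, as in this sketch, where a single reorder_levels call achieves what would otherwise take two swaps. The index and its level names (outer, middle, inner) are made up for illustration.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("one", "x", 1), ("one", "y", 2), ("zero", "x", 1)],
    names=["outer", "middle", "inner"],
)
s = pd.Series([10, 20, 30], index=index)

# Move the innermost level to the front in a single call.
reordered = s.reorder_levels(["inner", "outer", "middle"])

# The same permutation expressed as two successive swaps.
swapped = s.swaplevel("middle", "inner").swaplevel("outer", "inner")

print(list(reordered.index.names))
print(reordered.index.equals(swapped.index))
```

Neither method sorts the data; only the order of the levels within each key changes.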
Renaming names of an Index or MultiIndex#
The rename() method is used to rename the labels of a MultiIndex, and is typically used to rename the columns of a DataFrame. The columns argument of rename allows a dictionary to be specified that includes only the columns you wish to rename.
In [92]: df.rename(columns={0: "col0", 1: "col1"})
Out[92]:
            col0      col1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520
This method can also be used to rename specific labels of the main index of the DataFrame.
In [93]: df.rename(index={"one": "two", "y": "z"})
Out[93]:
               0         1
two  z  1.519970 -0.493662
     x  0.600178  0.274230
zero z  0.132885 -0.023688
     x  2.410179  1.450520
The rename_axis() method is used to rename the name of an Index or MultiIndex. In particular, the names of the levels of a MultiIndex can be specified, which is useful if reset_index() is later used to move the values from the MultiIndex to a column.
In [94]: df.rename_axis(index=["abc", "def"])
Out[94]:
                 0         1
abc  def
one  y    1.519970 -0.493662
     x    0.600178  0.274230
zero y    0.132885 -0.023688
     x    2.410179  1.450520
Note that the columns of a DataFrame are an index, so that using rename_axis with the columns argument will change the name of that index.
In [95]: df.rename_axis(columns="Cols").columns
Out[95]: RangeIndex(start=0, stop=2, step=1, name='Cols')
Both rename and rename_axis support specifying a dictionary, Series or a mapping function to map labels/names to new values.
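As a sketch of the mapping forms, the example below applies a function and a dictionary to the labels with rename, and a dictionary to the level names with rename_axis. The frame is a stand-in with the same label structure as the one used above; its values are zeros because only the index matters here.

```python
import numpy as np
import pandas as pd

midx = pd.MultiIndex.from_product(
    [["one", "zero"], ["x", "y"]], names=["abc", "def"]
)
df = pd.DataFrame(np.zeros((4, 2)), index=midx)

# A function is applied to every label in every level.
upper = df.rename(index=str.upper)

# A dictionary only touches the labels it mentions.
partial = df.rename(index={"one": "two"})

# rename_axis maps the *names* of the levels, not the labels.
renamed_axis = df.rename_axis(index={"abc": "first"})

print(upper.index.tolist())
print(list(renamed_axis.index.names))
```

The distinction to keep in mind is that rename changes what the rows are called, while rename_axis changes what the levels are called.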
When working with an Index object directly, rather than via a DataFrame, Index.set_names() can be used to change the names.
In [96]: mi = pd.MultiIndex.from_product([[1, 2], ["a", "b"]], names=["x", "y"])

In [97]: mi.names
Out[97]: FrozenList(['x', 'y'])

In [98]: mi2 = mi.rename("new name", level=0)

In [99]: mi2
Out[99]:
MultiIndex([(1, 'a'),
            (1, 'b'),
            (2, 'a'),
            (2, 'b')],
           names=['new name', 'y'])
You cannot set the names of the MultiIndex via a level.
In [100]: mi.levels[0].name = "name via level"
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[100], line 1
----> 1 mi.levels[0].name = "name via level"

File ~/work/pandas/pandas/pandas/core/indexes/base.py:1697, in Index.name(self, value)
   1693 @name.setter
   1694 def name(self, value: Hashable) -> None:
   1695     if self._no_setting_name:
   1696         # Used in MultiIndex.levels to avoid silently ignoring name updates.
-> 1697         raise RuntimeError(
   1698             "Cannot set name on a level of a MultiIndex. Use "
   1699             "'MultiIndex.set_names' instead."
   1700         )
   1701     maybe_extract_name(value, None, type(self))
   1702     self._name = value

RuntimeError: Cannot set name on a level of a MultiIndex. Use 'MultiIndex.set_names' instead.
Use Index.set_names() instead.
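A minimal sketch of the supported route: set_names can rename one level, selected with the level argument, or all levels at once. It returns a new index and leaves the original untouched. The new names used here are arbitrary.

```python
import pandas as pd

mi = pd.MultiIndex.from_product([[1, 2], ["a", "b"]], names=["x", "y"])

# Rename a single level, chosen by position.
renamed_one = mi.set_names("first", level=0)

# Rename every level in one call.
renamed_all = mi.set_names(["first", "second"])

print(list(renamed_one.names))
print(list(renamed_all.names))
print(list(mi.names))  # the original index keeps its names
```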
Sorting a MultiIndex#
For MultiIndex-ed objects to be indexed and sliced effectively, they need to be sorted. As with any index, you can use sort_index().
In [101]: import random

In [102]: random.shuffle(tuples)

In [103]: s = pd.Series(np.random.randn(8), index=pd.MultiIndex.from_tuples(tuples))

In [104]: s
Out[104]:
baz  two    0.206053
qux  one   -0.251905
     two   -2.213588
bar  two    1.063327
     one    1.266143
foo  one    0.299368
baz  one   -0.863838
foo  two    0.408204
dtype: float64

In [105]: s.sort_index()
Out[105]:
bar  one    1.266143
     two    1.063327
baz  one   -0.863838
     two    0.206053
foo  one    0.299368
     two    0.408204
qux  one   -0.251905
     two   -2.213588
dtype: float64

In [106]: s.sort_index(level=0)
Out[106]:
bar  one    1.266143
     two    1.063327
baz  one   -0.863838
     two    0.206053
foo  one    0.299368
     two    0.408204
qux  one   -0.251905
     two   -2.213588
dtype: float64

In [107]: s.sort_index(level=1)
Out[107]:
bar  one    1.266143
baz  one   -0.863838
foo  one    0.299368
qux  one   -0.251905
bar  two    1.063327
baz  two    0.206053
foo  two    0.408204
qux  two   -2.213588
dtype: float64
You may also pass a level name to sort_index if the MultiIndex levels are named.
In [108]: s.index = s.index.set_names(["L1", "L2"])

In [109]: s.sort_index(level="L1")
Out[109]:
L1   L2
bar  one    1.266143
     two    1.063327
baz  one   -0.863838
     two    0.206053
foo  one    0.299368
     two    0.408204
qux  one   -0.251905
     two   -2.213588
dtype: float64

In [110]: s.sort_index(level="L2")
Out[110]:
L1   L2
bar  one    1.266143
baz  one   -0.863838
foo  one    0.299368
qux  one   -0.251905
bar  two    1.063327
baz  two    0.206053
foo  two    0.408204
qux  two   -2.213588
dtype: float64
On higher dimensional objects, you can sort any of the other axes by level if they have a MultiIndex:
In [111]: df.T.sort_index(level=1, axis=1)
Out[111]:
        one      zero       one      zero
          x         x         y         y
0  0.600178  2.410179  1.519970  0.132885
1  0.274230  1.450520 -0.493662 -0.023688
Indexing will work even if the data are not sorted, but will be rather inefficient (and show a PerformanceWarning). It will also return a copy of the data rather than a view:
In [112]: dfm = pd.DataFrame(
   .....:     {"jim": [0, 0, 1, 1], "joe": ["x", "x", "z", "y"], "jolie": np.random.rand(4)}
   .....: )
   .....:

In [113]: dfm = dfm.set_index(["jim", "joe"])

In [114]: dfm
Out[114]:
            jolie
jim joe
0   x    0.490671
    x    0.120248
1   z    0.537020
    y    0.110968

In [115]: dfm.loc[(1, 'z')]
Out[115]:
           jolie
jim joe
1   z    0.53702
Furthermore, if you try to index something that is not fully lexsorted, this can raise:
In [116]: dfm.loc[(0, 'y'):(1, 'z')]
---------------------------------------------------------------------------
UnsortedIndexError                        Traceback (most recent call last)
Cell In[116], line 1
----> 1 dfm.loc[(0, 'y'):(1, 'z')]

File ~/work/pandas/pandas/pandas/core/indexing.py:1192, in _LocationIndexer.__getitem__(self, key)
   1190 maybe_callable = com.apply_if_callable(key, self.obj)
   1191 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable)
-> 1192 return self._getitem_axis(maybe_callable, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexing.py:1412, in _LocIndexer._getitem_axis(self, key, axis)
   1410 if isinstance(key, slice):
   1411     self._validate_key(key, axis)
-> 1412     return self._get_slice_axis(key, axis=axis)
   1413 elif com.is_bool_indexer(key):
   1414     return self._getbool_axis(key, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexing.py:1444, in _LocIndexer._get_slice_axis(self, slice_obj, axis)
   1441     return obj.copy(deep=False)
   1443 labels = obj._get_axis(axis)
-> 1444 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step)
   1446 if isinstance(indexer, slice):
   1447     return self.obj._slice(indexer, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6708, in Index.slice_indexer(self, start, end, step)
   6664 def slice_indexer(
   6665     self,
   6666     start: Hashable | None = None,
   6667     end: Hashable | None = None,
   6668     step: int | None = None,
   6669 ) -> slice:
   6670     """
   6671     Compute the slice indexer for input labels and step.
   6672
   (...)
   6706     slice(1, 3, None)
   6707     """
-> 6708     start_slice, end_slice = self.slice_locs(start, end, step=step)
   6710     # return a slice
   6711     if not is_scalar(start_slice):

File ~/work/pandas/pandas/pandas/core/indexes/multi.py:2923, in MultiIndex.slice_locs(self, start, end, step)
   2871 """
   2872 For an ordered MultiIndex, compute the slice locations for input
   2873 labels.
   (...)
   2919                 sequence of such.
   2920 """
   2921 # This function adds nothing to its parent implementation (the magic
   2922 # happens in get_slice_bound method), but it adds meaningful doc.
-> 2923 return super().slice_locs(start, end, step)

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6934, in Index.slice_locs(self, start, end, step)
   6932 start_slice = None
   6933 if start is not None:
-> 6934     start_slice = self.get_slice_bound(start, "left")
   6935 if start_slice is None:
   6936     start_slice = 0

File ~/work/pandas/pandas/pandas/core/indexes/multi.py:2867, in MultiIndex.get_slice_bound(self, label, side)
   2865 if not isinstance(label, tuple):
   2866     label = (label,)
-> 2867 return self._partial_tup_index(label, side=side)

File ~/work/pandas/pandas/pandas/core/indexes/multi.py:2927, in MultiIndex._partial_tup_index(self, tup, side)
   2925 def _partial_tup_index(self, tup: tuple, side: Literal["left", "right"] = "left"):
   2926     if len(tup) > self._lexsort_depth:
-> 2927         raise UnsortedIndexError(
   2928             f"Key length ({len(tup)}) was greater than MultiIndex lexsort depth "
   2929             f"({self._lexsort_depth})"
   2930         )
   2932     n = len(tup)
   2933     start, end = 0, len(self)

UnsortedIndexError: 'Key length (2) was greater than MultiIndex lexsort depth (1)'
The is_monotonic_increasing property of a MultiIndex shows whether the index is sorted:
In [117]: dfm.index.is_monotonic_increasing
Out[117]: False

In [118]: dfm = dfm.sort_index()

In [119]: dfm
Out[119]:
            jolie
jim joe
0   x    0.490671
    x    0.120248
1   y    0.110968
    z    0.537020

In [120]: dfm.index.is_monotonic_increasing
Out[120]: True
And now selection works as expected.
In [121]: dfm.loc[(0, "y"):(1, "z")]
Out[121]:
            jolie
jim joe
1   y    0.110968
    z    0.537020
Take methods#
Similar to NumPy ndarrays, pandas Index, Series, and DataFrame also provide the take() method, which retrieves elements along a given axis at the given indices. The given indices must be either a list or an ndarray of integer index positions. take will also accept negative integers as relative positions to the end of the object.
In [122]: index = pd.Index(np.random.randint(0, 1000, 10))

In [123]: index
Out[123]: Index([214, 502, 712, 567, 786, 175, 993, 133, 758, 329], dtype='int64')

In [124]: positions = [0, 9, 3]

In [125]: index[positions]
Out[125]: Index([214, 329, 567], dtype='int64')

In [126]: index.take(positions)
Out[126]: Index([214, 329, 567], dtype='int64')

In [127]: ser = pd.Series(np.random.randn(10))

In [128]: ser.iloc[positions]
Out[128]:
0   -0.179666
9    1.824375
3    0.392149
dtype: float64

In [129]: ser.take(positions)
Out[129]:
0   -0.179666
9    1.824375
3    0.392149
dtype: float64
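The negative-position behaviour mentioned above is not exercised by the preceding session, so here is a short sketch with a hand-made Series. Positions are counted from the end, exactly as with .iloc.

```python
import pandas as pd

ser = pd.Series([10, 20, 30, 40, 50], index=list("abcde"))

# -1 is the last element, -2 the one before it.
last_two = ser.take([-1, -2])

print(last_two.tolist())
print(last_two.index.tolist())
print(last_two.equals(ser.iloc[[-1, -2]]))
```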
For DataFrames, the given indices should be a 1d list or ndarray that specifies row or column positions.
In [130]: frm = pd.DataFrame(np.random.randn(5, 3))

In [131]: frm.take([1, 4, 3])
Out[131]:
          0         1         2
1 -1.237881  0.106854 -1.276829
4  0.629675 -1.425966  1.857704
3  0.979542 -1.633678  0.615855

In [132]: frm.take([0, 2], axis=1)
Out[132]:
          0         2
0  0.595974  0.601544
1 -1.237881 -1.276829
2 -0.767101  1.499591
3  0.979542  0.615855
4  0.629675  1.857704
It is important to note that the take method on pandas objects is not intended to work on boolean indices and may return unexpected results.
In [133]: arr = np.random.randn(10)

In [134]: arr.take([False, False, True, True])
Out[134]: array([-1.1935, -1.1935,  0.6775,  0.6775])

In [135]: arr[[0, 1]]
Out[135]: array([-1.1935,  0.6775])

In [136]: ser = pd.Series(np.random.randn(10))

In [137]: ser.take([False, False, True, True])
Out[137]:
0    0.233141
0    0.233141
1   -0.223540
1   -0.223540
dtype: float64

In [138]: ser.iloc[[0, 1]]
Out[138]:
0    0.233141
1   -0.223540
dtype: float64
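When the selection really is a boolean mask, one option is to use ordinary boolean indexing, or to convert the mask into integer positions before calling take. The sketch below shows both on a small hand-made Series; np.flatnonzero is used here as one possible way to get the positions of the True entries.

```python
import numpy as np
import pandas as pd

ser = pd.Series([0.5, -1.0, 2.0, -3.0])
mask = ser > 0

# Boolean indexing interprets the mask as "keep where True".
by_mask = ser[mask]

# take needs integer positions, so convert the mask first.
positions = np.flatnonzero(mask.to_numpy())
by_take = ser.take(positions)

print(positions.tolist())
print(by_take.equals(by_mask))
```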
Finally, as a small note on performance, because the take method handles a narrower range of inputs, it can be a good deal faster than fancy indexing.
In [139]: arr = np.random.randn(10000, 5)

In [140]: indexer = np.arange(10000)

In [141]: random.shuffle(indexer)

In [142]: %timeit arr[indexer]
   .....: %timeit arr.take(indexer, axis=0)
   .....:
249 us +- 3.23 us per loop (mean +- std. dev. of 7 runs, 1,000 loops each)
74.4 us +- 1.42 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each)

In [143]: ser = pd.Series(arr[:, 0])

In [144]: %timeit ser.iloc[indexer]
   .....: %timeit ser.take(indexer)
   .....:
150 us +- 3.01 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each)
139 us +- 20.3 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each)
Index types#
We have discussed MultiIndex fairly extensively in the previous sections. Documentation about DatetimeIndex and PeriodIndex is found in the time series section of the user guide, and documentation about TimedeltaIndex is found in the time deltas section.
In the following sub-sections we will highlight some other index types.
CategoricalIndex#
CategoricalIndex is a type of index that is useful for supporting indexing with duplicates. It is a container around a Categorical and allows efficient indexing and storage of an index with a large number of duplicated elements.
In [145]: from pandas.api.types import CategoricalDtype
In [146]: df = pd.DataFrame({"A": np.arange(6), "B": list("aabbca")})
In [147]: df["B"] = df["B"].astype(CategoricalDtype(list("cab")))
In [148]: df
Out[148]:
   A  B
0  0  a
1  1  a
2  2  b
3  3  b
4  4  c
5  5  a
In [149]: df.dtypes
Out[149]:
A       int64
B    category
dtype: object
In [150]: df["B"].cat.categories
Out[150]: Index(['c', 'a', 'b'], dtype='object')
Setting the index will create a CategoricalIndex.
In [151]: df2 = df.set_index("B")
In [152]: df2.index
Out[152]: CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B')
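To get a rough feel for the storage benefit of a CategoricalIndex on heavily duplicated labels, the sketch below compares memory usage against a plain Index. The label values and the size of 100,000 are arbitrary choices for illustration, and the exact byte counts depend on the pandas version and string storage in use.

```python
import numpy as np
import pandas as pd

# 100,000 labels drawn from only three distinct values.
labels = np.random.default_rng(0).choice(["low", "medium", "high"], size=100_000)

plain = pd.Index(labels)
categorical = pd.CategoricalIndex(labels)

# deep=True also counts the memory held by the label values themselves.
plain_bytes = plain.memory_usage(deep=True)
categorical_bytes = categorical.memory_usage(deep=True)
print(plain_bytes, categorical_bytes)
```

The categorical version stores one small integer code per row plus the three categories, so it should come out much smaller.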
Indexing with __getitem__/.iloc/.loc works similarly to an Index with duplicates. The indexers must be in the category or the operation will raise a KeyError.
In [153]: df2.loc["a"]
Out[153]:
   A
B
a  0
a  1
a  5
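The KeyError mentioned above can be handled explicitly when a label may be missing. The sketch below rebuilds df2 from the earlier example; the helper rows_for is hypothetical and only for illustration.

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({"A": np.arange(6), "B": list("aabbca")})
df["B"] = df["B"].astype(CategoricalDtype(list("cab")))
df2 = df.set_index("B")


def rows_for(frame, label):
    """Return the rows for `label`, or an empty frame if the lookup raises KeyError."""
    try:
        # A list indexer always returns a DataFrame, even for a single label.
        return frame.loc[[label]]
    except KeyError:
        return frame.iloc[0:0]


found = rows_for(df2, "a")    # the three rows labelled "a"
missing = rows_for(df2, "z")  # "z" is not a category, so the lookup fails
```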
The CategoricalIndex is preserved after indexing:
In [154]: df2.loc["a"].index
Out[154]: CategoricalIndex(['a', 'a', 'a'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B')
Sorting the index will sort by the order of the categories (recall that we created the index with CategoricalDtype(list('cab')), so the sorted order is cab).
In [155]: df2.sort_index()
Out[155]:
   A
B
c  4
a  0
a  1
a  5
b  2
b  3
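To see that the sort order follows the categories and not the label values, the same labels can be given two different category orders. This is a small sketch; the names cab and abc are chosen for illustration.

```python
import pandas as pd

labels = list("aabbca")

# Same labels, two different category orders.
cab = pd.Series(range(6), index=pd.CategoricalIndex(labels, categories=list("cab")))
abc = pd.Series(range(6), index=pd.CategoricalIndex(labels, categories=list("abc")))

sorted_cab = cab.sort_index().index.tolist()
sorted_abc = abc.sort_index().index.tolist()
```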
Groupby operations on the index will preserve the index nature as well.
In [156]: df2.groupby(level=0, observed=True).sum()
Out[156]:
   A
B
c  4
a  6
b  5
In [157]: df2.groupby(level=0, observed=True).sum().index
Out[157]: CategoricalIndex(['c', 'a', 'b'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B')
Reindexing operations will return a resulting index based on the type of the passed indexer. Passing a list will return a plain Index; indexing with a Categorical will return a CategoricalIndex, indexed according to the categories of the passed Categorical dtype. This allows one to arbitrarily index these even with values not in the categories, similarly to how you can reindex any pandas index.
In [158]: df3 = pd.DataFrame(
   .....:     {"A": np.arange(3), "B": pd.Series(list("abc")).astype("category")}
   .....: )
   .....:
In [159]: df3 = df3.set_index("B")
In [160]: df3
Out[160]:
   A
B
a  0
b  1
c  2
In [161]: df3.reindex(["a", "e"])
Out[161]:
     A
B
a  0.0
e  NaN
In [162]: df3.reindex(["a", "e"]).index
Out[162]: Index(['a', 'e'], dtype='object', name='B')
In [163]: df3.reindex(pd.Categorical(["a", "e"], categories=list("abe")))
Out[163]:
     A
B
a  0.0
e  NaN
In [164]: df3.reindex(pd.Categorical(["a", "e"], categories=list("abe"))).index
Out[164]: CategoricalIndex(['a', 'e'], categories=['a', 'b', 'e'], ordered=False, dtype='category', name='B')
Warning
Reshaping and comparison operations on a CategoricalIndex must have the same categories or a TypeError will be raised.
In [165]: df4 = pd.DataFrame({"A": np.arange(2), "B": list("ba")})
In [166]: df4["B"] = df4["B"].astype(CategoricalDtype(list("ab")))
In [167]: df4 = df4.set_index("B")
In [168]: df4.index
Out[168]: CategoricalIndex(['b', 'a'], categories=['a', 'b'], ordered=False, dtype='category', name='B')
In [169]: df5 = pd.DataFrame({"A": np.arange(2), "B": list("bc")})
In [170]: df5["B"] = df5["B"].astype(CategoricalDtype(list("bc")))
In [171]: df5 = df5.set_index("B")
In [172]: df5.index
Out[172]: CategoricalIndex(['b', 'c'], categories=['b', 'c'], ordered=False, dtype='category', name='B')
In [173]: pd.concat([df4, df5])
Out[173]:
   A
B
b  0
a  1
b  0
c  1
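The sketch below illustrates both sides of this warning under recent pandas versions, as far as we can tell: comparing categoricals with different categories raises a TypeError, while concatenating frames indexed by them succeeds but no longer yields a CategoricalIndex. Aligning the categories first with union_categoricals and set_categories is one possible way to keep a categorical result; it is an assumption of this example, not something the text above prescribes.

```python
import pandas as pd
from pandas.api.types import union_categoricals

left = pd.Categorical(["b", "a"], categories=["a", "b"])
right = pd.Categorical(["b", "c"], categories=["b", "c"])

# Comparing categoricals whose categories differ raises TypeError.
try:
    left == right
    comparison_error = None
except TypeError as err:
    comparison_error = err

df4 = pd.DataFrame({"A": [0, 1]}, index=pd.CategoricalIndex(left, name="B"))
df5 = pd.DataFrame({"A": [0, 1]}, index=pd.CategoricalIndex(right, name="B"))

# Concatenation works, but the result index is not categorical any more.
combined = pd.concat([df4, df5])

# Give both indexes the same categories before concatenating.
shared = union_categoricals([left, right]).categories
aligned = pd.concat(
    [
        df4.set_axis(df4.index.set_categories(shared)),
        df5.set_axis(df5.index.set_categories(shared)),
    ]
)
```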
RangeIndex#
RangeIndex is a sub-class of Index that provides the default index for all DataFrame and Series objects. RangeIndex is an optimized version of Index that can represent a monotonic ordered set. These are analogous to Python range types. A RangeIndex will always have an int64 dtype.
In [174]: idx = pd.RangeIndex(5)
In [175]: idx
Out[175]: RangeIndex(start=0, stop=5, step=1)
RangeIndex is the default index for all DataFrame and Series objects:
In [176]: ser = pd.Series([1, 2, 3])
In [177]: ser.index
Out[177]: RangeIndex(start=0, stop=3, step=1)
In [178]: df = pd.DataFrame([[1, 2], [3, 4]])
In [179]: df.index
Out[179]: RangeIndex(start=0, stop=2, step=1)
In [180]: df.columns
Out[180]: RangeIndex(start=0, stop=2, step=1)
A RangeIndex behaves similarly to an Index with an int64 dtype. Operations on a RangeIndex whose result cannot be represented by a RangeIndex, but should have an integer dtype, return an Index with int64 dtype. For example:
In [181]: idx[[0, 2]]
Out[181]: Index([0, 2], dtype='int64')
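As a further sketch of the same rule, the example below contrasts operations whose result is still an evenly spaced range with one whose result is not. The variable names are illustrative, and exactly which operations stay a RangeIndex may vary between pandas versions.

```python
import pandas as pd

idx = pd.RangeIndex(5)

# A slice of a range is still a range.
sliced = idx[1:4]

# Multiplying by an integer keeps even spacing, so the result can stay a range.
doubled = idx * 2

# Positions 0, 1, 3 are not evenly spaced, so this cannot be a RangeIndex.
picked = idx[[0, 1, 3]]
```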
IntervalIndex#
IntervalIndex, together with its own dtype, IntervalDtype, and the Interval scalar type, provides first-class support in pandas for interval notation.
The IntervalIndex allows some unique indexing and is also used as a return type for the categories in cut() and qcut().
Indexing with an IntervalIndex#
An IntervalIndex can be used in Series and in DataFrame as the index.
In [182]: df = pd.DataFrame(
   .....:     {"A": [1, 2, 3, 4]}, index=pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4])
   .....: )
   .....:
In [183]: df
Out[183]:
        A
(0, 1]  1
(1, 2]  2
(2, 3]  3
(3, 4]  4
Label-based indexing via .loc along the edges of an interval works as you would expect, selecting that particular interval.
In [184]: df.loc[2]
Out[184]:
A    2
Name: (1, 2], dtype: int64
In [185]: df.loc[[2, 3]]
Out[185]:
        A
(1, 2]  2
(2, 3]  3
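Which edge belongs to an interval depends on the closed side. The sketch below rebuilds the frame above (closed on the right by default) and a variant closed on the left; df_left and the flag names are introduced here for illustration only.

```python
import pandas as pd

breaks = [0, 1, 2, 3, 4]
df = pd.DataFrame({"A": [1, 2, 3, 4]}, index=pd.IntervalIndex.from_breaks(breaks))
df_left = pd.DataFrame(
    {"A": [1, 2, 3, 4]}, index=pd.IntervalIndex.from_breaks(breaks, closed="left")
)

# Right-closed: 4 belongs to (3, 4], but 0 belongs to no interval.
right_edge_value = int(df.loc[4]["A"])
try:
    df.loc[0]
    zero_found_right_closed = True
except KeyError:
    zero_found_right_closed = False

# Left-closed: 0 belongs to [0, 1), but 4 belongs to no interval.
left_edge_value = int(df_left.loc[0]["A"])
try:
    df_left.loc[4]
    four_found_left_closed = True
except KeyError:
    four_found_left_closed = False
```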
If you select a label contained within an interval, this will also select the interval.
In [186]: df.loc[2.5]
Out[186]:
A    3
Name: (2, 3], dtype: int64
In [187]: df.loc[[2.5, 3.5]]
Out[187]:
        A
(2, 3]  3
(3, 4]  4
Selecting using an Interval will only return exact matches.
In [188]: df.loc[pd.Interval(1, 2)]
Out[188]:
A    2
Name: (1, 2], dtype: int64
Trying to select an Interval that is not exactly contained in the IntervalIndex will raise a KeyError.
In [189]: df.loc[pd.Interval(0.5, 2.5)]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[189], line 1
----> 1 df.loc[pd.Interval(0.5, 2.5)]
...
File ~/work/pandas/pandas/pandas/core/indexes/interval.py:679, in IntervalIndex.get_loc(self, key)
    677 matches = mask.sum()
    678 if matches == 0:
--> 679     raise KeyError(key)
    680 if matches == 1:
    681     return mask.argmax()
KeyError: Interval(0.5, 2.5, closed='right')
Selecting all Intervals that overlap a given Interval can be performed using the overlaps() method to create a boolean indexer.
In [190]: idxr = df.index.overlaps(pd.Interval(0.5, 2.5))
In [191]: idxr
Out[191]: array([ True,  True,  True, False])
In [192]: df[idxr]
Out[192]:
        A
(0, 1]  1
(1, 2]  2
(2, 3]  3
Binning data with cut and qcut#
cut() and qcut() both return a Categorical object, and the bins they create are stored as an IntervalIndex in its .categories attribute.
In [193]: c = pd.cut(range(4), bins=2)
In [194]: c
Out[194]:
[(-0.003, 1.5], (-0.003, 1.5], (1.5, 3.0], (1.5, 3.0]]
Categories (2, interval[float64, right]): [(-0.003, 1.5] < (1.5, 3.0]]
In [195]: c.categories
Out[195]: IntervalIndex([(-0.003, 1.5], (1.5, 3.0]], dtype='interval[float64, right]')
cut() also accepts an IntervalIndex for its bins argument, which enables a useful pandas idiom. First, we call cut() with some data and bins set to a fixed number, to generate the bins. Then, we pass the values of .categories as the bins argument in subsequent calls to cut(), supplying new data which will be binned into the same bins.
In [196]: pd.cut([0, 3, 5, 1], bins=c.categories)
Out[196]:
[(-0.003, 1.5], (1.5, 3.0], NaN, (-0.003, 1.5]]
Categories (2, interval[float64, right]): [(-0.003, 1.5] < (1.5, 3.0]]
Any value that falls outside all bins will be assigned a NaN value.
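When reusing bins this way, it can help to check which new values fell outside them. The sketch below repeats the calls above and inspects the result; the names reference, new_data, outside, and codes are illustrative.

```python
import pandas as pd

# Derive the bins once from reference data.
reference = pd.cut(range(4), bins=2)
bins = reference.categories

# Bin new data into the same intervals.
new_data = [0, 3, 5, 1]
binned = pd.cut(new_data, bins=bins)

# Values outside every bin are missing; their category code is -1.
outside = pd.isna(binned)
codes = binned.codes
```

Here only the value 5 lies outside the bins, so it is the single missing entry.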
Generating ranges of intervals#
If we need intervals on a regular frequency, we can use the interval_range() function to create an IntervalIndex using various combinations of start, end, and periods. The default frequency for interval_range is 1 for numeric intervals, and calendar day for datetime-like intervals:
In [197]: pd.interval_range(start=0, end=5)
Out[197]: IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]], dtype='interval[int64, right]')
In [198]: pd.interval_range(start=pd.Timestamp("2017-01-01"), periods=4)
Out[198]:
IntervalIndex([(2017-01-01 00:00:00, 2017-01-02 00:00:00],
               (2017-01-02 00:00:00, 2017-01-03 00:00:00],
               (2017-01-03 00:00:00, 2017-01-04 00:00:00],
               (2017-01-04 00:00:00, 2017-01-05 00:00:00]],
              dtype='interval[datetime64[ns], right]')
In [199]: pd.interval_range(end=pd.Timedelta("3 days"), periods=3)
Out[199]:
IntervalIndex([(0 days 00:00:00, 1 days 00:00:00],
               (1 days 00:00:00, 2 days 00:00:00],
               (2 days 00:00:00, 3 days 00:00:00]],
              dtype='interval[timedelta64[ns], right]')
The freq parameter can be used to specify non-default frequencies, and can utilize a variety of frequency aliases with datetime-like intervals:
In [200]: pd.interval_range(start=0, periods=5, freq=1.5)
Out[200]: IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0], (6.0, 7.5]], dtype='interval[float64, right]')
In [201]: pd.interval_range(start=pd.Timestamp("2017-01-01"), periods=4, freq="W")
Out[201]:
IntervalIndex([(2017-01-01 00:00:00, 2017-01-08 00:00:00],
               (2017-01-08 00:00:00, 2017-01-15 00:00:00],
               (2017-01-15 00:00:00, 2017-01-22 00:00:00],
               (2017-01-22 00:00:00, 2017-01-29 00:00:00]],
              dtype='interval[datetime64[ns], right]')
In [202]: pd.interval_range(start=pd.Timedelta("0 days"), periods=3, freq="9h")
Out[202]:
IntervalIndex([(0 days 00:00:00, 0 days 09:00:00],
               (0 days 09:00:00, 0 days 18:00:00],
               (0 days 18:00:00, 1 days 03:00:00]],
              dtype='interval[timedelta64[ns], right]')
Additionally, the closed parameter can be used to specify which side(s) the intervals are closed on. Intervals are closed on the right side by default.
In [203]: pd.interval_range(start=0, end=4, closed="both")
Out[203]: IntervalIndex([[0, 1], [1, 2], [2, 3], [3, 4]], dtype='interval[int64, both]')
In [204]: pd.interval_range(start=0, end=4, closed="neither")
Out[204]: IntervalIndex([(0, 1), (1, 2), (2, 3), (3, 4)], dtype='interval[int64, neither]')
Specifying start, end, and periods will generate a range of evenly spaced intervals from start to end inclusively, with periods number of elements in the resulting IntervalIndex:
In [205]: pd.interval_range(start=0, end=6, periods=4)
Out[205]: IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0]], dtype='interval[float64, right]')
In [206]: pd.interval_range(pd.Timestamp("2018-01-01"), pd.Timestamp("2018-02-28"), periods=3)
Out[206]:
IntervalIndex([(2018-01-01 00:00:00, 2018-01-20 08:00:00],
               (2018-01-20 08:00:00, 2018-02-08 16:00:00],
               (2018-02-08 16:00:00, 2018-02-28 00:00:00]],
              dtype='interval[datetime64[ns], right]')
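The result of such a call can be checked through the left, right, and length attributes of the IntervalIndex. The following is a small sketch using the numeric example above; the variable names are illustrative.

```python
import pandas as pd

intervals = pd.interval_range(start=0, end=6, periods=4)

# Every interval has the same width: (end - start) / periods.
widths = intervals.length

# The breaks run from start to end, but membership still follows the
# closed side: with the default closed="right", 0 itself is excluded.
first, last = intervals[0], intervals[-1]
start_is_member = 0 in first
end_is_member = 6 in last
```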
Miscellaneous indexing FAQ#
Integer indexing#
Label-based indexing with integer axis labels is a thorny topic. It has been discussed heavily on mailing lists and among various members of the scientific Python community. In pandas, our general viewpoint is that labels matter more than integer locations. Therefore, with an integer axis index only label-based indexing is possible with the standard tools like .loc. In the following code, s[-1] raises a KeyError, and df.loc[-2:] is treated as a label slice rather than as a request for the last two rows:
In [207]: s = pd.Series(range(5))
In [208]: s[-1]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File ~/work/pandas/pandas/pandas/core/indexes/range.py:413, in RangeIndex.get_loc(self, key)
    412 try:
--> 413     return self._range.index(new_key)
    414 except ValueError as err:
ValueError: -1 is not in range
The above exception was the direct cause of the following exception:
KeyError                                  Traceback (most recent call last)
Cell In[208], line 1
----> 1 s[-1]
...
File ~/work/pandas/pandas/pandas/core/indexes/range.py:415, in RangeIndex.get_loc(self, key)
    413     return self._range.index(new_key)
    414 except ValueError as err:
--> 415     raise KeyError(key) from err
    416 if isinstance(key, Hashable):
    417     raise KeyError(key)
KeyError: -1
In [209]: df = pd.DataFrame(np.random.randn(5, 4))
In [210]: df
Out[210]:
          0         1         2         3
0 -0.435772 -1.188928 -0.808286 -0.284634
1 -1.815703  1.347213 -0.243487  0.514704
2  1.162969 -0.287725 -0.179734  0.993962
3 -0.212673  0.909872 -0.733333 -0.349893
4  0.456434 -0.306735  0.553396  0.166221
In [211]: df.loc[-2:]
Out[211]:
          0         1         2         3
0 -0.435772 -1.188928 -0.808286 -0.284634
1 -1.815703  1.347213 -0.243487  0.514704
2  1.162969 -0.287725 -0.179734  0.993962
3 -0.212673  0.909872 -0.733333 -0.349893
4  0.456434 -0.306735  0.553396  0.166221
This deliberate decision was made to prevent ambiguities and subtle bugs (many users reported finding bugs when the API change was made to stop “falling back” on position-based indexing).
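When a position is what you actually mean, .iloc states that explicitly. The sketch below uses a made-up Series with integer labels 10, 20, 30 so that labels and positions cannot be confused.

```python
import pandas as pd

s = pd.Series(["a", "b", "c"], index=[10, 20, 30])

# .loc always looks up labels.
by_label = s.loc[10]

# .iloc always looks up positions, including negative ones.
by_position = s.iloc[-1]

# -1 is not a label, so the label-based lookup fails instead of
# silently falling back to "last element".
try:
    s.loc[-1]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True
```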
Non-monotonic indexes require exact matches#
If the index of a Series or DataFrame is monotonically increasing or decreasing, then the bounds of a label-based slice can be outside the range of the index, much like slice indexing a normal Python list. Monotonicity of an index can be tested with the is_monotonic_increasing and is_monotonic_decreasing attributes.
In [212]: df = pd.DataFrame(index=[2, 3, 3, 4, 5], columns=["data"], data=list(range(5)))
In [213]: df.index.is_monotonic_increasing
Out[213]: True
# no rows 0 or 1, but still returns rows 2, 3 (both of them), and 4:
In [214]: df.loc[0:4, :]
Out[214]:
   data
2     0
3     1
3     2
4     3
# the slice is outside the index, so an empty DataFrame is returned
In [215]: df.loc[13:15, :]
Out[215]:
Empty DataFrame
Columns: [data]
Index: []
On the other hand, if the index is not monotonic, then both slice bounds must be unique members of the index.
In [216]: df = pd.DataFrame(index=[2, 3, 1, 4, 3, 5], columns=["data"], data=list(range(6)))
In [217]: df.index.is_monotonic_increasing
Out[217]: False
# OK because 2 and 4 are in the index
In [218]: df.loc[2:4, :]
Out[218]:
   data
2     0
3     1
1     2
4     3
# 0 is not in the index
In [219]: df.loc[0:4, :]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[219], line 1
----> 1 df.loc[0:4, :]
...
File ~/work/pandas/pandas/pandas/core/indexes/base.py:3819, in Index.get_loc(self, key)
-> 3819     raise KeyError(key) from err
KeyError: 0
# 3 is not a unique label
In [220]: df.loc[2:3, :]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[220], line 1
----> 1 df.loc[2:3, :]
...
File ~/work/pandas/pandas/pandas/core/indexes/base.py:6867, in Index.get_slice_bound(self, label, side)
-> 6867     raise KeyError(
   6868         f"Cannot get {side} slice bound for non-unique "
   6869         f"label: {repr(original_label)}"
   6870     )
KeyError: 'Cannot get right slice bound for non-unique label: 3'
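One way around both errors, if the row order does not matter, is to sort the index first so that it becomes monotonic. This is a sketch of that workaround using the same frame; sorting is a suggestion of this example, not a requirement stated above.

```python
import pandas as pd

df = pd.DataFrame(index=[2, 3, 1, 4, 3, 5], columns=["data"], data=list(range(6)))

# On the unsorted index, a bound that is not in the index raises KeyError.
try:
    df.loc[0:4, :]
    unsorted_ok = True
except KeyError:
    unsorted_ok = False

# After sorting, the index is monotonic, so out-of-range and
# duplicated bounds are both accepted.
ordered = df.sort_index()
subset = ordered.loc[0:4, :]
up_to_three = ordered.loc[2:3, :]
```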
Index.is_monotonic_increasing and Index.is_monotonic_decreasing only check that an index is weakly monotonic. To check for strict monotonicity, you can combine one of those with the is_unique attribute.
In [221]: weakly_monotonic = pd.Index(["a", "b", "c", "c"])
In [222]: weakly_monotonic
Out[222]: Index(['a', 'b', 'c', 'c'], dtype='object')
In [223]: weakly_monotonic.is_monotonic_increasing
Out[223]: True
In [224]: weakly_monotonic.is_monotonic_increasing & weakly_monotonic.is_unique
Out[224]: False
Endpoints are inclusive#
Compared with standard Python sequence slicing in which the slice endpoint is not inclusive, label-based slicing in pandas is inclusive. The primary reason for this is that it is often not possible to easily determine the “successor” or next element after a particular label in an index. For example, consider the following Series:
In [225]: s = pd.Series(np.random.randn(6), index=list("abcdef"))
In [226]: s
Out[226]:
a   -0.101684
b   -0.734907
c   -0.130121
d   -0.476046
e    0.759104
f    0.213379
dtype: float64
Suppose we wished to slice from c to e. Using integers, this would be accomplished as such:
In [227]: s[2:5]
Out[227]:
c   -0.130121
d   -0.476046
e    0.759104
dtype: float64
However, if you only had c and e, determining the next element in the index can be somewhat complicated. For example, the following does not work:
In [228]: s.loc['c':'e' + 1]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[228], line 1
----> 1 s.loc['c':'e' + 1]
TypeError: can only concatenate str (not "int") to str
A very common use case is to limit a time series to start and end at two specific dates. To enable this, we made the design choice to make label-based slicing include both endpoints:
In [229]: s.loc["c":"e"]
Out[229]:
c   -0.130121
d   -0.476046
e    0.759104
dtype: float64
This is most definitely a “practicality beats purity” sort of thing, but it is something to watch out for if you expect label-based slicing to behave exactly in the way that standard Python integer slicing works.
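The time series use case mentioned above can be sketched as follows. The dates and values are invented for illustration; the point is only that the label slice includes its end date, while the equivalent positional slice needs a stop that is one past the last row.

```python
import pandas as pd

ts = pd.Series(range(10), index=pd.date_range("2024-01-01", periods=10, freq="D"))

# Label-based: both 2024-01-03 and 2024-01-05 are included.
by_label = ts.loc["2024-01-03":"2024-01-05"]

# Position-based: the stop position 5 is excluded, as in plain Python.
by_position = ts.iloc[2:5]
```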
Indexing potentially changes underlying Series dtype#
Different indexing operations can potentially change the dtype of a Series.
In [230]: series1 = pd.Series([1, 2, 3])
In [231]: series1.dtype
Out[231]: dtype('int64')
In [232]: res = series1.reindex([0, 4])
In [233]: res.dtype
Out[233]: dtype('float64')
In [234]: res
Out[234]:
0    1.0
4    NaN
dtype: float64
In [235]: series2 = pd.Series([True])
In [236]: series2.dtype
Out[236]: dtype('bool')
In [237]: res = series2.reindex_like(series1)
In [238]: res.dtype
Out[238]: dtype('O')
In [239]: res
Out[239]:
0    True
1     NaN
2     NaN
dtype: object
This is because the (re)indexing operations above silently insert NaNs and the dtype changes accordingly. This can cause some issues when using numpy ufuncs such as numpy.logical_and.
See GH 2388 for a more detailed discussion.
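Two possible ways to keep the dtype stable across a reindex are sketched below: passing an explicit fill_value, or starting from one of the nullable extension dtypes ("Int64", "boolean"). Neither is mentioned in the text above; they are offered here as options to consider, not as the recommended fix.

```python
import pandas as pd

# Default behaviour: the inserted NaN forces int64 -> float64.
plain = pd.Series([1, 2, 3]).reindex([0, 4])

# Option 1: supply a fill value of the right type, so no NaN is inserted.
filled = pd.Series([True]).reindex([0, 1, 2], fill_value=False)

# Option 2: nullable dtypes can hold missing values without changing dtype.
nullable_int = pd.Series([1, 2, 3], dtype="Int64").reindex([0, 4])
nullable_bool = pd.Series([True], dtype="boolean").reindex([0, 1, 2])
```

With fill_value the missing rows get a real value, whereas the nullable dtypes keep them marked as missing.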