MultiIndex / advanced indexing#
This section covers indexing with a MultiIndex and other advanced indexing features.
See Indexing and Selecting Data for general indexing documentation.
Warning
Whether a copy or a reference is returned for a setting operation may depend on the context. This is sometimes called chained assignment and should be avoided. See Returning a View versus Copy.
See the cookbook for some advanced strategies.
Hierarchical indexing (MultiIndex)#
Hierarchical / multi-level indexing is very exciting as it opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you to store and manipulate data with an arbitrary number of dimensions in lower dimensional data structures like Series (1d) and DataFrame (2d).
In this section, we will show what exactly we mean by “hierarchical” indexing and how it integrates with all of the pandas indexing functionality described above and in prior sections. Later, when discussing group by and pivoting and reshaping data, we’ll show non-trivial applications to illustrate how it aids in structuring data for analysis.
Creating a MultiIndex (hierarchical index) object#
The MultiIndex object is the hierarchical analogue of the standard Index object which typically stores the axis labels in pandas objects. You can think of MultiIndex as an array of tuples where each tuple is unique. A MultiIndex can be created from a list of arrays (using MultiIndex.from_arrays()), an array of tuples (using MultiIndex.from_tuples()), a crossed set of iterables (using MultiIndex.from_product()), or a DataFrame (using MultiIndex.from_frame()). The Index constructor will attempt to return a MultiIndex when it is passed a list of tuples. The following examples demonstrate different ways to initialize MultiIndexes.
In [1]: arrays = [
   ...:     ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
   ...:     ["one", "two", "one", "two", "one", "two", "one", "two"],
   ...: ]
   ...:

In [2]: tuples = list(zip(*arrays))

In [3]: tuples
Out[3]:
[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]

In [4]: index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])

In [5]: index
Out[5]:
MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])

In [6]: s = pd.Series(np.random.randn(8), index=index)

In [7]: s
Out[7]:
first  second
bar    one       0.469112
       two      -0.282863
baz    one      -1.509059
       two      -1.135632
foo    one       1.212112
       two      -0.173215
qux    one       0.119209
       two      -1.044236
dtype: float64
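The remark above about the Index constructor is not covered by the session, so here is a minimal sketch of it. The variable names are made up for illustration, and tupleize_cols=False is shown only as one way to keep tuples as single labels.

```python
import pandas as pd

tuples = [("bar", "one"), ("bar", "two"), ("baz", "one")]

# Passing a list of tuples to the plain Index constructor yields a MultiIndex.
idx = pd.Index(tuples)
print(type(idx).__name__, idx.nlevels)

# With tupleize_cols=False each tuple stays a single (atomic) label instead.
flat = pd.Index(tuples, tupleize_cols=False)
print(type(flat).__name__, flat.nlevels)
```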
When you want every pairing of the elements in two iterables, it can be easier to use the MultiIndex.from_product() method:
In [8]: iterables = [["bar", "baz", "foo", "qux"], ["one", "two"]]

In [9]: pd.MultiIndex.from_product(iterables, names=["first", "second"])
Out[9]:
MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])
You can also construct a MultiIndex from a DataFrame directly, using the method MultiIndex.from_frame(). This is a complementary method to MultiIndex.to_frame().
In [10]: df = pd.DataFrame(
   ....:     [["bar", "one"], ["bar", "two"], ["foo", "one"], ["foo", "two"]],
   ....:     columns=["first", "second"],
   ....: )
   ....:

In [11]: pd.MultiIndex.from_frame(df)
Out[11]:
MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('foo', 'one'),
            ('foo', 'two')],
           names=['first', 'second'])
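To illustrate why from_frame() and to_frame() are described as complementary, the following sketch round-trips an index through a DataFrame. The names mi, frame and roundtrip are illustrative only.

```python
import pandas as pd

mi = pd.MultiIndex.from_product(
    [["bar", "foo"], ["one", "two"]], names=["first", "second"]
)

# to_frame(index=False) turns each level into an ordinary column.
frame = mi.to_frame(index=False)

# from_frame() rebuilds a MultiIndex, taking level names from the column names.
roundtrip = pd.MultiIndex.from_frame(frame)
print(roundtrip.equals(mi))
```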
As a convenience, you can pass a list of arrays directly into Series or DataFrame to construct a MultiIndex automatically:
In [12]: arrays = [
   ....:     np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),
   ....:     np.array(["one", "two", "one", "two", "one", "two", "one", "two"]),
   ....: ]
   ....:

In [13]: s = pd.Series(np.random.randn(8), index=arrays)

In [14]: s
Out[14]:
bar  one   -0.861849
     two   -2.104569
baz  one   -0.494929
     two    1.071804
foo  one    0.721555
     two   -0.706771
qux  one   -1.039575
     two    0.271860
dtype: float64

In [15]: df = pd.DataFrame(np.random.randn(8, 4), index=arrays)

In [16]: df
Out[16]:
                0         1         2         3
bar one -0.424972  0.567020  0.276232 -1.087401
    two -0.673690  0.113648 -1.478427  0.524988
baz one  0.404705  0.577046 -1.715002 -1.039268
    two -0.370647 -1.157892 -1.344312  0.844885
foo one  1.075770 -0.109050  1.643563 -1.469388
    two  0.357021 -0.674600 -1.776904 -0.968914
qux one -1.294524  0.413738  0.276662 -0.472035
    two -0.013960 -0.362543 -0.006154 -0.923061
All of the MultiIndex constructors accept a names argument which stores string names for the levels themselves. If no names are provided, None will be assigned:
In [17]: df.index.names
Out[17]: FrozenList([None, None])
This index can back any axis of a pandas object, and the number of levels of the index is up to you:
In [18]: df = pd.DataFrame(np.random.randn(3, 8), index=["A", "B", "C"], columns=index)

In [19]: df
Out[19]:
first        bar                 baz  ...       foo       qux
second       one       two       one  ...       two       one       two
A       0.895717  0.805244 -1.206412  ...  1.340309 -1.170299 -0.226169
B       0.410835  0.813850  0.132003  ... -1.187678  1.130127 -1.436737
C      -1.413681  1.607920  1.024180  ... -2.211372  0.974466 -2.006747

[3 rows x 8 columns]

In [20]: pd.DataFrame(np.random.randn(6, 6), index=index[:6], columns=index[:6])
Out[20]:
first              bar                 baz                 foo
second             one       two       one       two       one       two
first second
bar   one    -0.410001 -0.078638  0.545952 -1.219217 -1.226825  0.769804
      two    -1.281247 -0.727707 -0.121306 -0.097883  0.695775  0.341734
baz   one     0.959726 -1.110336 -0.619976  0.149748 -0.732339  0.687738
      two     0.176444  0.403310 -0.154951  0.301624 -2.179861 -1.369849
foo   one    -0.954208  1.462696 -1.743161 -0.826591 -0.345352  1.314232
      two     0.690579  0.995761  2.396780  0.014871  3.357427 -0.317441
We’ve “sparsified” the higher levels of the indexes to make the console output a bit easier on the eyes. Note that how the index is displayed can be controlled using the display.multi_sparse option, set with pandas.set_option() or temporarily with pandas.option_context():
In [21]: with pd.option_context("display.multi_sparse", False):
   ....:     df
   ....:
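The session above does not display any output, so the following sketch compares the two renderings on a small frame. The frame and the names sparse_repr and dense_repr are made up for illustration.

```python
import pandas as pd

df_small = pd.DataFrame(
    {"x": [1, 2, 3, 4]},
    index=pd.MultiIndex.from_product([["bar", "baz"], ["one", "two"]]),
)

# Default rendering: repeated outer labels are blanked out ("sparsified").
sparse_repr = repr(df_small)

# With multi_sparse disabled, every row shows its full set of labels.
with pd.option_context("display.multi_sparse", False):
    dense_repr = repr(df_small)

print(sparse_repr)
print(dense_repr)
```

The option only changes how the index is printed; the underlying MultiIndex is the same in both cases.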
It’s worth keeping in mind that there’s nothing preventing you from using tuples as atomic labels on an axis:
In [22]: pd.Series(np.random.randn(8), index=tuples)
Out[22]:
(bar, one)   -1.236269
(bar, two)    0.896171
(baz, one)   -0.487602
(baz, two)   -0.082240
(foo, one)   -2.182937
(foo, two)    0.380396
(qux, one)    0.084844
(qux, two)    0.432390
dtype: float64
The reason that the MultiIndex matters is that it can allow you to do grouping, selection, and reshaping operations as we will describe below and in subsequent areas of the documentation. As you will see in later sections, you can find yourself working with hierarchically-indexed data without creating a MultiIndex explicitly yourself. However, when loading data from a file, you may wish to generate your own MultiIndex when preparing the data set.
Reconstructing the level labels#
The method get_level_values() will return a vector of the labels for each location at a particular level:
In [23]: index.get_level_values(0)
Out[23]: Index(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], dtype='object', name='first')

In [24]: index.get_level_values("second")
Out[24]: Index(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'], dtype='object', name='second')
Basic indexing on axis with MultiIndex#
One of the important features of hierarchical indexing is that you can select data by a “partial” label identifying a subgroup in the data. Partial selection “drops” levels of the hierarchical index in the result in a completely analogous way to selecting a column in a regular DataFrame:
In [25]: df["bar"]
Out[25]:
second       one       two
A       0.895717  0.805244
B       0.410835  0.813850
C      -1.413681  1.607920

In [26]: df["bar", "one"]
Out[26]:
A    0.895717
B    0.410835
C   -1.413681
Name: (bar, one), dtype: float64

In [27]: df["bar"]["one"]
Out[27]:
A    0.895717
B    0.410835
C   -1.413681
Name: one, dtype: float64

In [28]: s["qux"]
Out[28]:
one   -1.039575
two    0.271860
dtype: float64
See Cross-section with hierarchical index for how to select on a deeper level.
Defined levels#
The MultiIndex keeps all the defined levels of an index, even if they are not actually used. When slicing an index, you may notice this. For example:
In [29]: df.columns.levels  # original MultiIndex
Out[29]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])

In [30]: df[["foo", "qux"]].columns.levels  # sliced
Out[30]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])
This is done to avoid a recomputation of the levels in order to make slicing highly performant. If you want to see only the used levels, you can use the get_level_values() method.
In [31]: df[["foo", "qux"]].columns.to_numpy()
Out[31]:
array([('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')],
      dtype=object)

# for a specific level
In [32]: df[["foo", "qux"]].columns.get_level_values(0)
Out[32]: Index(['foo', 'foo', 'qux', 'qux'], dtype='object', name='first')
To reconstruct the MultiIndex with only the used levels, the remove_unused_levels() method may be used.
In [33]: new_mi = df[["foo", "qux"]].columns.remove_unused_levels()

In [34]: new_mi.levels
Out[34]: FrozenList([['foo', 'qux'], ['one', 'two']])
Data alignment and using reindex#
Operations between differently-indexed objects having MultiIndex on the axes will work as you expect; data alignment will work the same as an Index of tuples:
In [35]: s + s[:-2]
Out[35]:
bar  one   -1.723698
     two   -4.209138
baz  one   -0.989859
     two    2.143608
foo  one    1.443110
     two   -1.413542
qux  one         NaN
     two         NaN
dtype: float64

In [36]: s + s[::2]
Out[36]:
bar  one   -1.723698
     two         NaN
baz  one   -0.989859
     two         NaN
foo  one    1.443110
     two         NaN
qux  one   -2.079150
     two         NaN
dtype: float64
The reindex() method of Series/DataFrames can be called with another MultiIndex, or even a list or array of tuples:
In [37]: s.reindex(index[:3])
Out[37]:
first  second
bar    one      -0.861849
       two      -2.104569
baz    one      -0.494929
dtype: float64

In [38]: s.reindex([("foo", "two"), ("bar", "one"), ("qux", "one"), ("baz", "one")])
Out[38]:
foo  two   -0.706771
bar  one   -0.861849
qux  one   -1.039575
baz  one   -0.494929
dtype: float64
Advanced indexing with hierarchical index#
Syntactically integrating MultiIndex in advanced indexing with .loc is a bit challenging, but we’ve made every effort to do so. In general, MultiIndex keys take the form of tuples. For example, the following works as you would expect:
In [39]: df = df.T

In [40]: df
Out[40]:
                     A         B         C
first second
bar   one     0.895717  0.410835 -1.413681
      two     0.805244  0.813850  1.607920
baz   one    -1.206412  0.132003  1.024180
      two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466
      two    -0.226169 -1.436737 -2.006747

In [41]: df.loc[("bar", "two")]
Out[41]:
A    0.805244
B    0.813850
C    1.607920
Name: (bar, two), dtype: float64
Note that df.loc['bar', 'two'] would also work in this example, but this shorthand notation can lead to ambiguity in general.
If you also want to index a specific column with .loc, you must use a tuple like this:
In [42]: df.loc[("bar", "two"), "A"]
Out[42]: 0.8052440253863785
You don’t have to specify all levels of the MultiIndex: passing only the first elements of the tuple is enough. For example, you can use “partial” indexing to get all elements with bar in the first level as follows:
In [43]: df.loc["bar"]
Out[43]:
               A         B         C
second
one     0.895717  0.410835 -1.413681
two     0.805244  0.813850  1.607920
This is a shortcut for the slightly more verbose notation df.loc[('bar',),] (equivalent to df.loc['bar',] in this example).
“Partial” slicing also works quite nicely.
In [44]: df.loc["baz":"foo"]
Out[44]:
                     A         B         C
first second
baz   one    -1.206412  0.132003  1.024180
      two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
You can slice with a ‘range’ of values, by providing a slice of tuples.
In [45]: df.loc[("baz", "two"):("qux", "one")]
Out[45]:
                     A         B         C
first second
baz   two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466

In [46]: df.loc[("baz", "two"):"foo"]
Out[46]:
                     A         B         C
first second
baz   two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
Passing a list of labels or tuples works similarly to reindexing:
In [47]: df.loc[[("bar", "two"), ("qux", "one")]]
Out[47]:
                     A         B         C
first second
bar   two     0.805244  0.813850  1.607920
qux   one    -1.170299  1.130127  0.974466
Note
It is important to note that tuples and lists are not treated identically in pandas when it comes to indexing. Whereas a tuple is interpreted as one multi-level key, a list is used to specify several keys. Or in other words, tuples go horizontally (traversing levels), lists go vertically (scanning levels).
Importantly, a list of tuples indexes several complete MultiIndex keys, whereas a tuple of lists refers to several values within a level:
In [48]: s = pd.Series(
   ....:     [1, 2, 3, 4, 5, 6],
   ....:     index=pd.MultiIndex.from_product([["A", "B"], ["c", "d", "e"]]),
   ....: )
   ....:

In [49]: s.loc[[("A", "c"), ("B", "d")]]  # list of tuples
Out[49]:
A  c    1
B  d    5
dtype: int64

In [50]: s.loc[(["A", "B"], ["c", "d"])]  # tuple of lists
Out[50]:
A  c    1
   d    2
B  c    4
   d    5
dtype: int64
Using slicers#
You can slice a MultiIndex by providing multiple indexers.
You can provide any of the selectors as if you are indexing by label, see Selection by Label, including slices, lists of labels, labels, and boolean indexers.
You can use slice(None) to select all the contents of that level. You do not need to specify all the deeper levels, they will be implied as slice(None).
As usual, both sides of the slicers are included as this is label indexing.
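The two rules above (omitted deeper levels are treated as slice(None), and label slices include both endpoints) can be sketched on a small frame. The three-level index and the column name v are made up for illustration.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["A0", "A1", "A2"], ["B0", "B1"], ["C0", "C1"]]
)
frame = pd.DataFrame({"v": range(12)}, index=index)

# All three levels spelled out; slice(None) selects everything on the last level.
explicit = frame.loc[(slice("A0", "A1"), ["B0"], slice(None)), :]

# The deepest level is omitted and is implied as slice(None).
implied = frame.loc[(slice("A0", "A1"), ["B0"]), :]

# "A1" is part of the result because label slices include both endpoints.
print(explicit)
print(explicit.equals(implied))
```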
Warning
You should specify all axes in the .loc specifier, meaning the indexer for the index and for the columns. There are some ambiguous cases where the passed indexer could be misinterpreted as indexing both axes, rather than into, say, the MultiIndex for the rows.
You should do this:
df.loc[(slice("A1", "A3"), ...), :]  # noqa: E999
You should not do this:
df.loc[(slice("A1", "A3"), ...)]  # noqa: E999
In [51]: def mklbl(prefix, n):
   ....:     return ["%s%s" % (prefix, i) for i in range(n)]
   ....:

In [52]: miindex = pd.MultiIndex.from_product(
   ....:     [mklbl("A", 4), mklbl("B", 2), mklbl("C", 4), mklbl("D", 2)]
   ....: )
   ....:

In [53]: micolumns = pd.MultiIndex.from_tuples(
   ....:     [("a", "foo"), ("a", "bar"), ("b", "foo"), ("b", "bah")], names=["lvl0", "lvl1"]
   ....: )
   ....:

In [54]: dfmi = (
   ....:     pd.DataFrame(
   ....:         np.arange(len(miindex) * len(micolumns)).reshape(
   ....:             (len(miindex), len(micolumns))
   ....:         ),
   ....:         index=miindex,
   ....:         columns=micolumns,
   ....:     )
   ....:     .sort_index()
   ....:     .sort_index(axis=1)
   ....: )
   ....:

In [55]: dfmi
Out[55]:
lvl0           a         b
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0    9    8   11   10
         D1   13   12   15   14
      C2 D0   17   16   19   18
...          ...  ...  ...  ...
A3 B1 C1 D1  237  236  239  238
      C2 D0  241  240  243  242
         D1  245  244  247  246
      C3 D0  249  248  251  250
         D1  253  252  255  254

[64 rows x 4 columns]
Basic MultiIndex slicing using slices, lists, and labels.
In [56]: dfmi.loc[(slice("A1", "A3"), slice(None), ["C1", "C3"]), :]
Out[56]:
lvl0           a         b
lvl1         bar  foo  bah  foo
A1 B0 C1 D0   73   72   75   74
         D1   77   76   79   78
      C3 D0   89   88   91   90
         D1   93   92   95   94
   B1 C1 D0  105  104  107  106
...          ...  ...  ...  ...
A3 B0 C3 D1  221  220  223  222
   B1 C1 D0  233  232  235  234
         D1  237  236  239  238
      C3 D0  249  248  251  250
         D1  253  252  255  254

[24 rows x 4 columns]
You can use pandas.IndexSlice to facilitate a more natural syntax using :, rather than using slice(None).
In [57]: idx = pd.IndexSlice

In [58]: dfmi.loc[idx[:, :, ["C1", "C3"]], idx[:, "foo"]]
Out[58]:
lvl0           a    b
lvl1         foo  foo
A0 B0 C1 D0    8   10
         D1   12   14
      C3 D0   24   26
         D1   28   30
   B1 C1 D0   40   42
...          ...  ...
A3 B0 C3 D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

[32 rows x 2 columns]
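As a small sketch of what pandas.IndexSlice produces: indexing it simply returns the key it was given, so idx[:, "foo"] is the same tuple you would otherwise build by hand with slice(None). The frame below is a made-up two-row example.

```python
import pandas as pd

idx = pd.IndexSlice

# IndexSlice hands back the key it receives, with ":" turned into slice(None).
key = idx[:, "foo"]
print(key)

columns = pd.MultiIndex.from_tuples(
    [("a", "bar"), ("a", "foo"), ("b", "bah"), ("b", "foo")],
    names=["lvl0", "lvl1"],
)
frame = pd.DataFrame([[1, 0, 3, 2], [5, 4, 7, 6]], columns=columns)

# Both spellings select the same columns.
via_idx = frame.loc[:, idx[:, "foo"]]
via_slice = frame.loc[:, (slice(None), "foo")]
print(via_idx.equals(via_slice))
```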
It is possible to perform quite complicated selections using this method on multiple axes at the same time.
In [59]: dfmi.loc["A1", (slice(None), "foo")]
Out[59]:
lvl0        a    b
lvl1      foo  foo
B0 C0 D0   64   66
      D1   68   70
   C1 D0   72   74
      D1   76   78
   C2 D0   80   82
...       ...  ...
B1 C1 D1  108  110
   C2 D0  112  114
      D1  116  118
   C3 D0  120  122
      D1  124  126

[16 rows x 2 columns]

In [60]: dfmi.loc[idx[:, :, ["C1", "C3"]], idx[:, "foo"]]
Out[60]:
lvl0           a    b
lvl1         foo  foo
A0 B0 C1 D0    8   10
         D1   12   14
      C3 D0   24   26
         D1   28   30
   B1 C1 D0   40   42
...          ...  ...
A3 B0 C3 D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

[32 rows x 2 columns]
Using a boolean indexer you can provide selection related to the values.
In [61]: mask = dfmi[("a", "foo")] > 200

In [62]: dfmi.loc[idx[mask, :, ["C1", "C3"]], idx[:, "foo"]]
Out[62]:
lvl0           a    b
lvl1         foo  foo
A3 B0 C1 D1  204  206
      C3 D0  216  218
         D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254
You can also specify the axis argument to .loc to interpret the passed slicers on a single axis.
In [63]: dfmi.loc(axis=0)[:, :, ["C1", "C3"]]
Out[63]:
lvl0           a         b
lvl1         bar  foo  bah  foo
A0 B0 C1 D0    9    8   11   10
         D1   13   12   15   14
      C3 D0   25   24   27   26
         D1   29   28   31   30
   B1 C1 D0   41   40   43   42
...          ...  ...  ...  ...
A3 B0 C3 D1  221  220  223  222
   B1 C1 D0  233  232  235  234
         D1  237  236  239  238
      C3 D0  249  248  251  250
         D1  253  252  255  254

[32 rows x 4 columns]
Furthermore, you can set the values using the following methods.
In [64]: df2 = dfmi.copy()

In [65]: df2.loc(axis=0)[:, :, ["C1", "C3"]] = -10

In [66]: df2
Out[66]:
lvl0           a         b
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0  -10  -10  -10  -10
         D1  -10  -10  -10  -10
      C2 D0   17   16   19   18
...          ...  ...  ...  ...
A3 B1 C1 D1  -10  -10  -10  -10
      C2 D0  241  240  243  242
         D1  245  244  247  246
      C3 D0  -10  -10  -10  -10
         D1  -10  -10  -10  -10

[64 rows x 4 columns]
You can use a right-hand-side of an alignable object as well.
In [67]: df2 = dfmi.copy()

In [68]: df2.loc[idx[:, :, ["C1", "C3"]], :] = df2 * 1000

In [69]: df2
Out[69]:
lvl0              a               b
lvl1            bar     foo     bah     foo
A0 B0 C0 D0       1       0       3       2
         D1       5       4       7       6
      C1 D0    9000    8000   11000   10000
         D1   13000   12000   15000   14000
      C2 D0      17      16      19      18
...             ...     ...     ...     ...
A3 B1 C1 D1  237000  236000  239000  238000
      C2 D0     241     240     243     242
         D1     245     244     247     246
      C3 D0  249000  248000  251000  250000
         D1  253000  252000  255000  254000

[64 rows x 4 columns]
Cross-section#
The xs() method of DataFrame additionally takes a level argument to make selecting data at a particular level of a MultiIndex easier.
In [70]: df
Out[70]:
                     A         B         C
first second
bar   one     0.895717  0.410835 -1.413681
      two     0.805244  0.813850  1.607920
baz   one    -1.206412  0.132003  1.024180
      two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466
      two    -0.226169 -1.436737 -2.006747

In [71]: df.xs("one", level="second")
Out[71]:
              A         B         C
first
bar    0.895717  0.410835 -1.413681
baz   -1.206412  0.132003  1.024180
foo    1.431256 -0.076467  0.875906
qux   -1.170299  1.130127  0.974466

# using the slicers
In [72]: df.loc[(slice(None), "one"), :]
Out[72]:
                     A         B         C
first second
bar   one     0.895717  0.410835 -1.413681
baz   one    -1.206412  0.132003  1.024180
foo   one     1.431256 -0.076467  0.875906
qux   one    -1.170299  1.130127  0.974466
You can also select on the columns with xs, by providing the axis argument.
In [73]: df = df.T

In [74]: df.xs("one", level="second", axis=1)
Out[74]:
first       bar       baz       foo       qux
A      0.895717 -1.206412  1.431256 -1.170299
B      0.410835  0.132003 -0.076467  1.130127
C     -1.413681  1.024180  0.875906  0.974466

# using the slicers
In [75]: df.loc[:, (slice(None), "one")]
Out[75]:
first        bar       baz       foo       qux
second       one       one       one       one
A       0.895717 -1.206412  1.431256 -1.170299
B       0.410835  0.132003 -0.076467  1.130127
C      -1.413681  1.024180  0.875906  0.974466
xs also allows selection with multiple keys.
In [76]: df.xs(("one", "bar"), level=("second", "first"), axis=1)
Out[76]:
first        bar
second       one
A       0.895717
B       0.410835
C      -1.413681

# using the slicers
In [77]: df.loc[:, ("bar", "one")]
Out[77]:
A    0.895717
B    0.410835
C   -1.413681
Name: (bar, one), dtype: float64
You can pass drop_level=False to xs to retain the level that was selected.
In [78]: df.xs("one", level="second", axis=1, drop_level=False)
Out[78]:
first        bar       baz       foo       qux
second       one       one       one       one
A       0.895717 -1.206412  1.431256 -1.170299
B       0.410835  0.132003 -0.076467  1.130127
C      -1.413681  1.024180  0.875906  0.974466
Compare the above with the result using drop_level=True (the default value).
In [79]: df.xs("one", level="second", axis=1, drop_level=True)
Out[79]:
first       bar       baz       foo       qux
A      0.895717 -1.206412  1.431256 -1.170299
B      0.410835  0.132003 -0.076467  1.130127
C     -1.413681  1.024180  0.875906  0.974466
Advanced reindexing and alignment#
Using the parameter level in the reindex() and align() methods of pandas objects is useful to broadcast values across a level. For instance:
In [80]: midx = pd.MultiIndex(
   ....:     levels=[["zero", "one"], ["x", "y"]], codes=[[1, 1, 0, 0], [1, 0, 1, 0]]
   ....: )
   ....:

In [81]: df = pd.DataFrame(np.random.randn(4, 2), index=midx)

In [82]: df
Out[82]:
               0         1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520

In [83]: df2 = df.groupby(level=0).mean()

In [84]: df2
Out[84]:
             0         1
one   1.060074 -0.109716
zero  1.271532  0.713416

In [85]: df2.reindex(df.index, level=0)
Out[85]:
               0         1
one  y  1.060074 -0.109716
     x  1.060074 -0.109716
zero y  1.271532  0.713416
     x  1.271532  0.713416

# aligning
In [86]: df_aligned, df2_aligned = df.align(df2, level=0)

In [87]: df_aligned
Out[87]:
               0         1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520

In [88]: df2_aligned
Out[88]:
               0         1
one  y  1.060074 -0.109716
     x  1.060074 -0.109716
zero y  1.271532  0.713416
     x  1.271532  0.713416
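One possible use of this broadcasting is subtracting each group's mean from its rows. The sketch below uses fixed numbers instead of random data so the result is easy to follow; the column name v and the variable names are illustrative only.

```python
import pandas as pd

midx = pd.MultiIndex.from_product([["one", "zero"], ["x", "y"]])
data = pd.DataFrame({"v": [1.0, 3.0, 10.0, 20.0]}, index=midx)

# One mean per outer label: "one" and "zero".
group_means = data.groupby(level=0).mean()

# Broadcast each mean back to every row that shares the outer label.
broadcast = group_means.reindex(data.index, level=0)

# Row-wise deviation from the group mean.
demeaned = data - broadcast
print(demeaned)
```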
Swapping levels with swaplevel#
The swaplevel() method can switch the order of two levels:
In [89]: df[:5]
Out[89]:
               0         1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520

In [90]: df[:5].swaplevel(0, 1, axis=0)
Out[90]:
               0         1
y one   1.519970 -0.493662
x one   0.600178  0.274230
y zero  0.132885 -0.023688
x zero  2.410179  1.450520
Reordering levels with reorder_levels#
The reorder_levels() method generalizes the swaplevel method, allowing you to permute the hierarchical index levels in one step:
In [91]: df[:5].reorder_levels([1, 0], axis=0)
Out[91]:
               0         1
y one   1.519970 -0.493662
x one   0.600178  0.274230
y zero  0.132885 -0.023688
x zero  2.410179  1.450520
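With only two levels the two methods look identical, so here is a sketch with three levels, where reorder_levels can express a rotation that a single swap cannot. The level names outer, middle and inner are made up for illustration.

```python
import pandas as pd

midx = pd.MultiIndex.from_product(
    [["a", "b"], ["x", "y"], [1, 2]], names=["outer", "middle", "inner"]
)
ser3 = pd.Series(range(8), index=midx)

# Swapping two levels is one particular permutation...
swapped = ser3.swaplevel("outer", "inner")
reordered = ser3.reorder_levels(["inner", "middle", "outer"])
print(swapped.equals(reordered))

# ...while reorder_levels accepts any permutation, by position or by name.
rotated = ser3.reorder_levels([2, 0, 1])
print(list(rotated.index.names))
```

Neither method sorts the data; the rows stay in their original order and only the level order of each key changes.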
Renaming names of an Index or MultiIndex#
The rename() method is used to rename the labels of a MultiIndex, and is typically used to rename the columns of a DataFrame. The columns argument of rename allows a dictionary to be specified that includes only the columns you wish to rename.
In [92]: df.rename(columns={0: "col0", 1: "col1"})
Out[92]:
            col0      col1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520
This method can also be used to rename specific labels of the main index of the DataFrame.
In [93]: df.rename(index={"one": "two", "y": "z"})
Out[93]:
               0         1
two  z  1.519970 -0.493662
     x  0.600178  0.274230
zero z  0.132885 -0.023688
     x  2.410179  1.450520
The rename_axis() method is used to rename the name of an Index or MultiIndex. In particular, the names of the levels of a MultiIndex can be specified, which is useful if reset_index() is later used to move the values from the MultiIndex to a column.
In [94]: df.rename_axis(index=["abc", "def"])
Out[94]:
                 0         1
abc  def
one  y    1.519970 -0.493662
     x    0.600178  0.274230
zero y    0.132885 -0.023688
     x    2.410179  1.450520
Note that the columns of a DataFrame are an index, so that using rename_axis with the columns argument will change the name of that index.
In [95]: df.rename_axis(columns="Cols").columns
Out[95]: RangeIndex(start=0, stop=2, step=1, name='Cols')
Both rename and rename_axis support specifying a dictionary, Series or a mapping function to map labels/names to new values.
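A short sketch of the dictionary and function forms follows. The frame, its level names and the variable names are hypothetical and chosen only for illustration.

```python
import pandas as pd

midx = pd.MultiIndex.from_product(
    [["one", "zero"], ["x", "y"]], names=["first", "second"]
)
frame = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]}, index=midx)

# rename maps labels: here a function is applied to every column label.
upper_cols = frame.rename(columns=str.upper)

# rename_axis maps names: a dict renames only the level names it mentions.
renamed_level = frame.rename_axis(index={"first": "outer"})

# A function is applied to every level name.
upper_names = frame.rename_axis(index=str.upper)

print(list(upper_cols.columns))
print(list(renamed_level.index.names))
print(list(upper_names.index.names))
```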
When working with an Index object directly, rather than via a DataFrame, Index.set_names() can be used to change the names.
In [96]: mi = pd.MultiIndex.from_product([[1, 2], ["a", "b"]], names=["x", "y"])

In [97]: mi.names
Out[97]: FrozenList(['x', 'y'])

In [98]: mi2 = mi.rename("new name", level=0)

In [99]: mi2
Out[99]:
MultiIndex([(1, 'a'),
            (1, 'b'),
            (2, 'a'),
            (2, 'b')],
           names=['new name', 'y'])
You cannot set the names of the MultiIndex via a level.
In [100]: mi.levels[0].name = "name via level"
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[100], line 1
----> 1 mi.levels[0].name = "name via level"

File ~/work/pandas/pandas/pandas/core/indexes/base.py:1690, in Index.name(self, value)
   1686 @name.setter
   1687 def name(self, value: Hashable) -> None:
   1688     if self._no_setting_name:
   1689         # Used in MultiIndex.levels to avoid silently ignoring name updates.
-> 1690         raise RuntimeError(
   1691             "Cannot set name on a level of a MultiIndex. Use "
   1692             "'MultiIndex.set_names' instead."
   1693         )
   1694     maybe_extract_name(value, None, type(self))
   1695     self._name = value

RuntimeError: Cannot set name on a level of a MultiIndex. Use 'MultiIndex.set_names' instead.
Use Index.set_names() instead.
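As a minimal sketch of the recommended alternative, set_names() can rename one level or all levels and returns a new index. The replacement names used here are arbitrary.

```python
import pandas as pd

mi = pd.MultiIndex.from_product([[1, 2], ["a", "b"]], names=["x", "y"])

# Rename a single level by position (a level name works too).
renamed_one = mi.set_names("new name", level=0)

# Rename every level at once by passing one name per level.
renamed_all = mi.set_names(["p", "q"])

print(list(renamed_one.names))
print(list(renamed_all.names))
print(list(mi.names))  # the original index is left unchanged
```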
Sorting a MultiIndex#
For MultiIndex-ed objects to be indexed and sliced effectively, they need to be sorted. As with any index, you can use sort_index().
In [101]: import random

In [102]: random.shuffle(tuples)

In [103]: s = pd.Series(np.random.randn(8), index=pd.MultiIndex.from_tuples(tuples))

In [104]: s
Out[104]:
baz  one    0.206053
bar  one   -0.251905
baz  two   -2.213588
qux  two    1.063327
bar  two    1.266143
qux  one    0.299368
foo  two   -0.863838
     one    0.408204
dtype: float64

In [105]: s.sort_index()
Out[105]:
bar  one   -0.251905
     two    1.266143
baz  one    0.206053
     two   -2.213588
foo  one    0.408204
     two   -0.863838
qux  one    0.299368
     two    1.063327
dtype: float64

In [106]: s.sort_index(level=0)
Out[106]:
bar  one   -0.251905
     two    1.266143
baz  one    0.206053
     two   -2.213588
foo  one    0.408204
     two   -0.863838
qux  one    0.299368
     two    1.063327
dtype: float64

In [107]: s.sort_index(level=1)
Out[107]:
bar  one   -0.251905
baz  one    0.206053
foo  one    0.408204
qux  one    0.299368
bar  two    1.266143
baz  two   -2.213588
foo  two   -0.863838
qux  two    1.063327
dtype: float64
You may also pass a level name to sort_index if the MultiIndex levels are named.
In [108]: s.index = s.index.set_names(["L1", "L2"])

In [109]: s.sort_index(level="L1")
Out[109]:
L1   L2
bar  one   -0.251905
     two    1.266143
baz  one    0.206053
     two   -2.213588
foo  one    0.408204
     two   -0.863838
qux  one    0.299368
     two    1.063327
dtype: float64

In [110]: s.sort_index(level="L2")
Out[110]:
L1   L2
bar  one   -0.251905
baz  one    0.206053
foo  one    0.408204
qux  one    0.299368
bar  two    1.266143
baz  two   -2.213588
foo  two   -0.863838
qux  two    1.063327
dtype: float64
On higher dimensional objects, you can sort any of the other axes by level if they have a MultiIndex:
In [111]: df.T.sort_index(level=1, axis=1)
Out[111]:
        one      zero       one      zero
          x         x         y         y
0  0.600178  2.410179  1.519970  0.132885
1  0.274230  1.450520 -0.493662 -0.023688
Indexing will work even if the data are not sorted, but will be rather inefficient (and show a PerformanceWarning). It will also return a copy of the data rather than a view:
In [112]: dfm = pd.DataFrame(
   .....:     {"jim": [0, 0, 1, 1], "joe": ["x", "x", "z", "y"], "jolie": np.random.rand(4)}
   .....: )
   .....:

In [113]: dfm = dfm.set_index(["jim", "joe"])

In [114]: dfm
Out[114]:
            jolie
jim joe
0   x    0.490671
    x    0.120248
1   z    0.537020
    y    0.110968

In [115]: dfm.loc[(1, 'z')]
Out[115]:
           jolie
jim joe
1   z    0.53702
Furthermore, if you try to index something that is not fully lexsorted, this can raise:
In [116]: dfm.loc[(0, 'y'):(1, 'z')]
---------------------------------------------------------------------------
UnsortedIndexError                        Traceback (most recent call last)
Cell In[116], line 1
----> 1 dfm.loc[(0, 'y'):(1, 'z')]

File ~/work/pandas/pandas/pandas/core/indexing.py:1191, in _LocationIndexer.__getitem__(self, key)
   1189 maybe_callable = com.apply_if_callable(key, self.obj)
   1190 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable)
-> 1191 return self._getitem_axis(maybe_callable, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexing.py:1411, in _LocIndexer._getitem_axis(self, key, axis)
   1409 if isinstance(key, slice):
   1410     self._validate_key(key, axis)
-> 1411     return self._get_slice_axis(key, axis=axis)
   1412 elif com.is_bool_indexer(key):
   1413     return self._getbool_axis(key, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexing.py:1443, in _LocIndexer._get_slice_axis(self, slice_obj, axis)
   1440     return obj.copy(deep=False)
   1442 labels = obj._get_axis(axis)
-> 1443 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step)
   1445 if isinstance(indexer, slice):
   1446     return self.obj._slice(indexer, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6662, in Index.slice_indexer(self, start, end, step)
   6618 def slice_indexer(
   6619     self,
   6620     start: Hashable | None = None,
   6621     end: Hashable | None = None,
   6622     step: int | None = None,
   6623 ) -> slice:
   6624     """
   6625     Compute the slice indexer for input labels and step.
   6626
   (...)
   6660     slice(1, 3, None)
   6661     """
-> 6662     start_slice, end_slice = self.slice_locs(start, end, step=step)
   6664     # return a slice
   6665     if not is_scalar(start_slice):

File ~/work/pandas/pandas/pandas/core/indexes/multi.py:2904, in MultiIndex.slice_locs(self, start, end, step)
   2852 """
   2853 For an ordered MultiIndex, compute the slice locations for input
   2854 labels.
   (...)
   2900                           sequence of such.
   2901 """
   2902 # This function adds nothing to its parent implementation (the magic
   2903 # happens in get_slice_bound method), but it adds meaningful doc.
-> 2904 return super().slice_locs(start, end, step)

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6879, in Index.slice_locs(self, start, end, step)
   6877 start_slice = None
   6878 if start is not None:
-> 6879     start_slice = self.get_slice_bound(start, "left")
   6880 if start_slice is None:
   6881     start_slice = 0

File ~/work/pandas/pandas/pandas/core/indexes/multi.py:2848, in MultiIndex.get_slice_bound(self, label, side)
   2846 if not isinstance(label, tuple):
   2847     label = (label,)
-> 2848 return self._partial_tup_index(label, side=side)

File ~/work/pandas/pandas/pandas/core/indexes/multi.py:2908, in MultiIndex._partial_tup_index(self, tup, side)
   2906 def _partial_tup_index(self, tup: tuple, side: Literal["left", "right"] = "left"):
   2907     if len(tup) > self._lexsort_depth:
-> 2908         raise UnsortedIndexError(
   2909             f"Key length ({len(tup)}) was greater than MultiIndex lexsort depth "
   2910             f"({self._lexsort_depth})"
   2911         )
   2913     n = len(tup)
   2914     start, end = 0, len(self)

UnsortedIndexError: 'Key length (2) was greater than MultiIndex lexsort depth (1)'
The is_monotonic_increasing attribute of a MultiIndex shows if the index is sorted:
In [117]: dfm.index.is_monotonic_increasing
Out[117]: False

In [118]: dfm = dfm.sort_index()

In [119]: dfm
Out[119]:
            jolie
jim joe
0   x    0.490671
    x    0.120248
1   y    0.110968
    z    0.537020

In [120]: dfm.index.is_monotonic_increasing
Out[120]: True
And now selection works as expected.
In [121]: dfm.loc[(0, "y"):(1, "z")]
Out[121]:
            jolie
jim joe
1   y    0.110968
    z    0.537020
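Putting these steps together, one defensive pattern is to check the index before slicing and sort only when needed. The sketch below rebuilds the same index with fixed values in place of random ones; the flag name raised is illustrative, and the try/except is there only to show the failure mode.

```python
import pandas as pd
from pandas.errors import UnsortedIndexError

dfm = pd.DataFrame(
    {"jim": [0, 0, 1, 1], "joe": ["x", "x", "z", "y"], "jolie": [0.1, 0.2, 0.3, 0.4]}
).set_index(["jim", "joe"])

# Slicing with full-depth keys on an index that is not fully lexsorted raises.
try:
    dfm.loc[(0, "y"):(1, "z")]
    raised = False
except UnsortedIndexError:
    raised = True

# Sort only when needed, then slice.
if not dfm.index.is_monotonic_increasing:
    dfm = dfm.sort_index()

result = dfm.loc[(0, "y"):(1, "z")]
print(raised)
print(result)
```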
Take methods#
Similar to NumPy ndarrays, pandas Index, Series, and DataFrame also provide the take() method that retrieves elements along a given axis at the given indices. The given indices must be either a list or an ndarray of integer index positions. take will also accept negative integers as relative positions to the end of the object.
In [122]: index = pd.Index(np.random.randint(0, 1000, 10))

In [123]: index
Out[123]: Index([214, 502, 712, 567, 786, 175, 993, 133, 758, 329], dtype='int64')

In [124]: positions = [0, 9, 3]

In [125]: index[positions]
Out[125]: Index([214, 329, 567], dtype='int64')

In [126]: index.take(positions)
Out[126]: Index([214, 329, 567], dtype='int64')

In [127]: ser = pd.Series(np.random.randn(10))

In [128]: ser.iloc[positions]
Out[128]:
0   -0.179666
9    1.824375
3    0.392149
dtype: float64

In [129]: ser.take(positions)
Out[129]:
0   -0.179666
9    1.824375
3    0.392149
dtype: float64
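The session above uses only non-negative positions, so here is a small sketch of the negative case mentioned in the text. The data and names are made up for illustration.

```python
import pandas as pd

ser = pd.Series([10, 20, 30, 40], index=list("abcd"))

# -1 refers to the last position, just as in NumPy or Python lists.
picked = ser.take([-1, 0])
print(picked)

frm = pd.DataFrame({"p": [1, 2, 3], "q": [4, 5, 6]})

# The same applies along the column axis.
last_col = frm.take([-1], axis=1)
print(last_col)
```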
For DataFrames, the given indices should be a 1d list or ndarray that specifies row or column positions.
In [130]: frm = pd.DataFrame(np.random.randn(5, 3))

In [131]: frm.take([1, 4, 3])
Out[131]:
          0         1         2
1 -1.237881  0.106854 -1.276829
4  0.629675 -1.425966  1.857704
3  0.979542 -1.633678  0.615855

In [132]: frm.take([0, 2], axis=1)
Out[132]:
          0         2
0  0.595974  0.601544
1 -1.237881 -1.276829
2 -0.767101  1.499591
3  0.979542  0.615855
4  0.629675  1.857704
It is important to note that the take method on pandas objects is not intended to work on boolean indices and may return unexpected results.
In [133]: arr = np.random.randn(10)

In [134]: arr.take([False, False, True, True])
Out[134]: array([-1.1935, -1.1935,  0.6775,  0.6775])

In [135]: arr[[0, 1]]
Out[135]: array([-1.1935,  0.6775])

In [136]: ser = pd.Series(np.random.randn(10))

In [137]: ser.take([False, False, True, True])
Out[137]:
0    0.233141
0    0.233141
1   -0.223540
1   -0.223540
dtype: float64

In [138]: ser.iloc[[0, 1]]
Out[138]:
0    0.233141
1   -0.223540
dtype: float64
Finally, as a small note on performance, because the take method handles a narrower range of inputs, it can be a good deal faster than fancy indexing.
In [139]: arr = np.random.randn(10000, 5)

In [140]: indexer = np.arange(10000)

In [141]: random.shuffle(indexer)

In [142]: %timeit arr[indexer]
   .....: %timeit arr.take(indexer, axis=0)
   .....:
262 us +- 15.4 us per loop (mean +- std. dev. of 7 runs, 1,000 loops each)
75.7 us +- 3.63 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each)

In [143]: ser = pd.Series(arr[:, 0])

In [144]: %timeit ser.iloc[indexer]
   .....: %timeit ser.take(indexer)
   .....:
141 us +- 6.06 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each)
140 us +- 7.41 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each)
Index types#
We have discussed MultiIndex in the previous sections pretty extensively. Documentation about DatetimeIndex and PeriodIndex is shown here, and documentation about TimedeltaIndex is found here.
In the following sub-sections we will highlight some other index types.
CategoricalIndex#
CategoricalIndex is a type of index that is useful for supporting indexing with duplicates. This is a container around a Categorical and allows efficient indexing and storage of an index with a large number of duplicated elements.

In [145]: from pandas.api.types import CategoricalDtype

In [146]: df = pd.DataFrame({"A": np.arange(6), "B": list("aabbca")})

In [147]: df["B"] = df["B"].astype(CategoricalDtype(list("cab")))

In [148]: df
Out[148]:
   A  B
0  0  a
1  1  a
2  2  b
3  3  b
4  4  c
5  5  a

In [149]: df.dtypes
Out[149]:
A       int64
B    category
dtype: object

In [150]: df["B"].cat.categories
Out[150]: Index(['c', 'a', 'b'], dtype='object')
Setting the index will create a CategoricalIndex.

In [151]: df2 = df.set_index("B")

In [152]: df2.index
Out[152]: CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B')
Indexing with __getitem__/.iloc/.loc works similarly to an Index with duplicates. The indexers must be among the categories or the operation will raise a KeyError.

In [153]: df2.loc["a"]
Out[153]:
   A
B
a  0
a  1
a  5
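A minimal sketch of the failure case follows. It rebuilds the same frame as above so that it runs on its own, and the label "z" is simply an arbitrary value that is not one of the categories.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({"A": range(6), "B": list("aabbca")})
df["B"] = df["B"].astype(CategoricalDtype(list("cab")))
df2 = df.set_index("B")

present = df2.loc["a"]   # "a" is one of the categories

try:
    df2.loc["z"]         # "z" is not a category
    raised = False
except KeyError:
    raised = True

print(len(present), raised)
```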
The CategoricalIndex is preserved after indexing:

In [154]: df2.loc["a"].index
Out[154]: CategoricalIndex(['a', 'a', 'a'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B')
Sorting the index will sort by the order of the categories (recall that we created the index with CategoricalDtype(list('cab')), so the sorted order is cab).

In [155]: df2.sort_index()
Out[155]:
   A
B
c  4
a  0
a  1
a  5
b  2
b  3
Groupby operations on the index will preserve the index nature as well.
In [156]: df2.groupby(level=0, observed=True).sum()
Out[156]:
   A
B
c  4
a  6
b  5

In [157]: df2.groupby(level=0, observed=True).sum().index
Out[157]: CategoricalIndex(['c', 'a', 'b'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B')
Reindexing operations will return a resulting index based on the type of the passed indexer. Passing a list will return a plain-old Index; indexing with a Categorical will return a CategoricalIndex, indexed according to the categories of the passed Categorical dtype. This allows one to arbitrarily index these even with values not in the categories, similarly to how you can reindex any pandas index.

In [158]: df3 = pd.DataFrame(
   .....:     {"A": np.arange(3), "B": pd.Series(list("abc")).astype("category")}
   .....: )
   .....:

In [159]: df3 = df3.set_index("B")

In [160]: df3
Out[160]:
   A
B
a  0
b  1
c  2

In [161]: df3.reindex(["a", "e"])
Out[161]:
     A
B
a  0.0
e  NaN

In [162]: df3.reindex(["a", "e"]).index
Out[162]: Index(['a', 'e'], dtype='object', name='B')

In [163]: df3.reindex(pd.Categorical(["a", "e"], categories=list("abe")))
Out[163]:
     A
B
a  0.0
e  NaN

In [164]: df3.reindex(pd.Categorical(["a", "e"], categories=list("abe"))).index
Out[164]: CategoricalIndex(['a', 'e'], categories=['a', 'b', 'e'], ordered=False, dtype='category', name='B')
Warning
Reshaping and Comparison operations on a CategoricalIndex must have the same categories or a TypeError will be raised.

In [165]: df4 = pd.DataFrame({"A": np.arange(2), "B": list("ba")})

In [166]: df4["B"] = df4["B"].astype(CategoricalDtype(list("ab")))

In [167]: df4 = df4.set_index("B")

In [168]: df4.index
Out[168]: CategoricalIndex(['b', 'a'], categories=['a', 'b'], ordered=False, dtype='category', name='B')

In [169]: df5 = pd.DataFrame({"A": np.arange(2), "B": list("bc")})

In [170]: df5["B"] = df5["B"].astype(CategoricalDtype(list("bc")))

In [171]: df5 = df5.set_index("B")

In [172]: df5.index
Out[172]: CategoricalIndex(['b', 'c'], categories=['b', 'c'], ordered=False, dtype='category', name='B')

In [173]: pd.concat([df4, df5])
Out[173]:
   A
B
b  0
a  1
b  0
c  1
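The comparison case mentioned in the warning can be sketched as follows. The index values mirror df4 and df5 above, while the variable names left, right and same are only for this example. The behaviour shown is what recent pandas versions are expected to do; older releases may word the error differently.

```python
import pandas as pd

left = pd.CategoricalIndex(["b", "a"], categories=["a", "b"])
right = pd.CategoricalIndex(["b", "c"], categories=["b", "c"])

try:
    left == right            # categories differ
    message = None
except TypeError as exc:
    message = str(exc)

# With identical categories the comparison is elementwise
same = pd.CategoricalIndex(["a", "a"], categories=["a", "b"])
result = left == same

print(message)
print(result)
```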
RangeIndex#
RangeIndex is a sub-class of Index that provides the default index for all DataFrame and Series objects. RangeIndex is an optimized version of Index that can represent a monotonic ordered set. These are analogous to Python range types. A RangeIndex will always have an int64 dtype.

In [174]: idx = pd.RangeIndex(5)

In [175]: idx
Out[175]: RangeIndex(start=0, stop=5, step=1)
RangeIndex is the default index for all DataFrame and Series objects:

In [176]: ser = pd.Series([1, 2, 3])

In [177]: ser.index
Out[177]: RangeIndex(start=0, stop=3, step=1)

In [178]: df = pd.DataFrame([[1, 2], [3, 4]])

In [179]: df.index
Out[179]: RangeIndex(start=0, stop=2, step=1)

In [180]: df.columns
Out[180]: RangeIndex(start=0, stop=2, step=1)
A RangeIndex will behave similarly to an Index with an int64 dtype. Operations on a RangeIndex whose result cannot be represented by a RangeIndex, but should have an integer dtype, will be converted to an Index with int64. For example:

In [181]: idx[[0, 2]]
Out[181]: Index([0, 2], dtype='int64')
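As a small sketch of the distinction, a slice of a RangeIndex can still be described by start, stop and step, whereas an irregular selection cannot. The positions [0, 1, 3] are chosen here precisely because they are not evenly spaced.

```python
import pandas as pd

idx = pd.RangeIndex(5)

sliced = idx[1:4]          # still expressible as start/stop/step
picked = idx[[0, 1, 3]]    # irregular spacing, not expressible as a range

print(type(sliced).__name__, type(picked).__name__, picked.dtype)
```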
IntervalIndex#
IntervalIndex, together with its own dtype, IntervalDtype, as well as the Interval scalar type, allow first-class support in pandas for interval notation.
The IntervalIndex allows some unique indexing and is also used as a return type for the categories in cut() and qcut().
Indexing with an IntervalIndex#
An IntervalIndex can be used in Series and in DataFrame as the index.

In [182]: df = pd.DataFrame(
   .....:     {"A": [1, 2, 3, 4]}, index=pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4])
   .....: )
   .....:

In [183]: df
Out[183]:
        A
(0, 1]  1
(1, 2]  2
(2, 3]  3
(3, 4]  4
Label based indexing via .loc along the edges of an interval works as you would expect, selecting that particular interval.

In [184]: df.loc[2]
Out[184]:
A    2
Name: (1, 2], dtype: int64

In [185]: df.loc[[2, 3]]
Out[185]:
        A
(1, 2]  2
(2, 3]  3
If you select a label contained within an interval, this will also select the interval.

In [186]: df.loc[2.5]
Out[186]:
A    3
Name: (2, 3], dtype: int64

In [187]: df.loc[[2.5, 3.5]]
Out[187]:
        A
(2, 3]  3
(3, 4]  4
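These lookups come down to interval membership. Because the intervals built by from_breaks are closed on the right by default, the label 2 belongs to (1, 2] and not to (2, 3]. A short sketch using the index alone, with variable names chosen for the example:

```python
import pandas as pd

idx = pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4])  # right-closed by default

edge = idx.get_loc(2)        # 2 is the right edge of (1, 2]
inside = idx.get_loc(2.5)    # 2.5 lies inside (2, 3]
mask = idx.contains(2)       # elementwise membership test

print(edge, inside, mask)
```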
Selecting using an Interval will only return exact matches.

In [188]: df.loc[pd.Interval(1, 2)]
Out[188]:
A    2
Name: (1, 2], dtype: int64
Trying to select an Interval that is not exactly contained in the IntervalIndex will raise a KeyError.

In [189]: df.loc[pd.Interval(0.5, 2.5)]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[189], line 1
----> 1 df.loc[pd.Interval(0.5, 2.5)]
...
KeyError: Interval(0.5, 2.5, closed='right')
Selecting all Intervals that overlap a given Interval can be performed using the overlaps() method to create a boolean indexer.

In [190]: idxr = df.index.overlaps(pd.Interval(0.5, 2.5))

In [191]: idxr
Out[191]: array([ True,  True,  True, False])

In [192]: df[idxr]
Out[192]:
        A
(0, 1]  1
(1, 2]  2
(2, 3]  3
Binning data with cut and qcut#
cut() and qcut() both return a Categorical object, and the bins they create are stored as an IntervalIndex in its .categories attribute.

In [193]: c = pd.cut(range(4), bins=2)

In [194]: c
Out[194]:
[(-0.003, 1.5], (-0.003, 1.5], (1.5, 3.0], (1.5, 3.0]]
Categories (2, interval[float64, right]): [(-0.003, 1.5] < (1.5, 3.0]]

In [195]: c.categories
Out[195]: IntervalIndex([(-0.003, 1.5], (1.5, 3.0]], dtype='interval[float64, right]')
cut() also accepts an IntervalIndex for its bins argument, which enables a useful pandas idiom. First, we call cut() with some data and bins set to a fixed number, to generate the bins. Then, we pass the values of .categories as the bins argument in subsequent calls to cut(), supplying new data which will be binned into the same bins.

In [196]: pd.cut([0, 3, 5, 1], bins=c.categories)
Out[196]:
[(-0.003, 1.5], (1.5, 3.0], NaN, (-0.003, 1.5]]
Categories (2, interval[float64, right]): [(-0.003, 1.5] < (1.5, 3.0]]
Any value which falls outside all bins will be assigned a NaN value.
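The idiom can be sketched end to end as follows. The reference and new values are invented for the example. A value outside every bin shows up as NaN, which the Categorical stores with the code -1.

```python
import pandas as pd

reference = [1, 2, 3, 4, 5, 6]           # hypothetical data used to derive the bins
binned = pd.cut(reference, bins=3)
bins = binned.categories                  # an IntervalIndex

new_values = [2, 5, 10]                   # 10 lies outside every bin
rebinned = pd.cut(new_values, bins=bins)

print(rebinned)
print(rebinned.codes)
```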
Generating ranges of intervals#
If we need intervals on a regular frequency, we can use the interval_range() function to create an IntervalIndex using various combinations of start, end, and periods. The default frequency for interval_range is 1 for numeric intervals, and calendar day for datetime-like intervals:

In [197]: pd.interval_range(start=0, end=5)
Out[197]: IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]], dtype='interval[int64, right]')

In [198]: pd.interval_range(start=pd.Timestamp("2017-01-01"), periods=4)
Out[198]:
IntervalIndex([(2017-01-01 00:00:00, 2017-01-02 00:00:00],
               (2017-01-02 00:00:00, 2017-01-03 00:00:00],
               (2017-01-03 00:00:00, 2017-01-04 00:00:00],
               (2017-01-04 00:00:00, 2017-01-05 00:00:00]],
              dtype='interval[datetime64[ns], right]')

In [199]: pd.interval_range(end=pd.Timedelta("3 days"), periods=3)
Out[199]:
IntervalIndex([(0 days 00:00:00, 1 days 00:00:00],
               (1 days 00:00:00, 2 days 00:00:00],
               (2 days 00:00:00, 3 days 00:00:00]],
              dtype='interval[timedelta64[ns], right]')
The freq parameter can be used to specify non-default frequencies, and can utilize a variety of frequency aliases with datetime-like intervals:

In [200]: pd.interval_range(start=0, periods=5, freq=1.5)
Out[200]: IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0], (6.0, 7.5]], dtype='interval[float64, right]')

In [201]: pd.interval_range(start=pd.Timestamp("2017-01-01"), periods=4, freq="W")
Out[201]:
IntervalIndex([(2017-01-01 00:00:00, 2017-01-08 00:00:00],
               (2017-01-08 00:00:00, 2017-01-15 00:00:00],
               (2017-01-15 00:00:00, 2017-01-22 00:00:00],
               (2017-01-22 00:00:00, 2017-01-29 00:00:00]],
              dtype='interval[datetime64[ns], right]')

In [202]: pd.interval_range(start=pd.Timedelta("0 days"), periods=3, freq="9h")
Out[202]:
IntervalIndex([(0 days 00:00:00, 0 days 09:00:00],
               (0 days 09:00:00, 0 days 18:00:00],
               (0 days 18:00:00, 1 days 03:00:00]],
              dtype='interval[timedelta64[ns], right]')
Additionally, the closed parameter can be used to specify which side(s) the intervals are closed on. Intervals are closed on the right side by default.

In [203]: pd.interval_range(start=0, end=4, closed="both")
Out[203]: IntervalIndex([[0, 1], [1, 2], [2, 3], [3, 4]], dtype='interval[int64, both]')

In [204]: pd.interval_range(start=0, end=4, closed="neither")
Out[204]: IntervalIndex([(0, 1), (1, 2), (2, 3), (3, 4)], dtype='interval[int64, neither]')
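The practical effect of closed is on which endpoints count as members of an interval. A minimal sketch using the in operator on the first interval of each range (the variable names are only for illustration):

```python
import pandas as pd

right_closed = pd.interval_range(start=0, end=4)                 # default closed="right"
left_closed = pd.interval_range(start=0, end=4, closed="left")

first_right = right_closed[0]   # (0, 1]
first_left = left_closed[0]     # [0, 1)

print(0 in first_right, 1 in first_right)
print(0 in first_left, 1 in first_left)
```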
Specifying start, end, and periods will generate a range of evenly spaced intervals from start to end inclusively, with periods number of elements in the resulting IntervalIndex:

In [205]: pd.interval_range(start=0, end=6, periods=4)
Out[205]: IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0]], dtype='interval[float64, right]')

In [206]: pd.interval_range(pd.Timestamp("2018-01-01"), pd.Timestamp("2018-02-28"), periods=3)
Out[206]:
IntervalIndex([(2018-01-01 00:00:00, 2018-01-20 08:00:00],
               (2018-01-20 08:00:00, 2018-02-08 16:00:00],
               (2018-02-08 16:00:00, 2018-02-28 00:00:00]],
              dtype='interval[datetime64[ns], right]')
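One way to confirm the even spacing, sketched here for the numeric case, is to inspect the length, left and right attributes of the resulting index:

```python
import pandas as pd

idx = pd.interval_range(start=0, end=6, periods=4)

widths = idx.length   # width of each interval

print(list(widths))
print(idx.left[0], idx.right[-1])
```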
Miscellaneous indexing FAQ#
Integer indexing#
Label-based indexing with integer axis labels is a thorny topic. It has been discussed heavily on mailing lists and among various members of the scientific Python community. In pandas, our general viewpoint is that labels matter more than integer locations. Therefore, with an integer axis index only label-based indexing is possible with the standard tools like .loc. The following code will generate exceptions:

In [207]: s = pd.Series(range(5))

In [208]: s[-1]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
...
ValueError: -1 is not in range

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[208], line 1
----> 1 s[-1]
...
KeyError: -1

In [209]: df = pd.DataFrame(np.random.randn(5, 4))

In [210]: df
Out[210]:
          0         1         2         3
0 -0.435772 -1.188928 -0.808286 -0.284634
1 -1.815703  1.347213 -0.243487  0.514704
2  1.162969 -0.287725 -0.179734  0.993962
3 -0.212673  0.909872 -0.733333 -0.349893
4  0.456434 -0.306735  0.553396  0.166221

In [211]: df.loc[-2:]
Out[211]:
          0         1         2         3
0 -0.435772 -1.188928 -0.808286 -0.284634
1 -1.815703  1.347213 -0.243487  0.514704
2  1.162969 -0.287725 -0.179734  0.993962
3 -0.212673  0.909872 -0.733333 -0.349893
4  0.456434 -0.306735  0.553396  0.166221
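When the last element by position is what is wanted, .iloc is the explicit positional tool. The sketch below contrasts it with a label lookup for -1, which fails because no such label exists in the default index:

```python
import pandas as pd

s = pd.Series(range(5))

last = s.iloc[-1]      # positional: the final element

try:
    s.loc[-1]          # label-based: there is no label -1
    raised = False
except KeyError:
    raised = True

print(last, raised)
```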
This deliberate decision was made to prevent ambiguities and subtle bugs (many users reported finding bugs when the API change was made to stop “falling back” on position-based indexing).
Non-monotonic indexes require exact matches#
If the index of a Series or DataFrame is monotonically increasing or decreasing, then the bounds of a label-based slice can be outside the range of the index, much like slice indexing a normal Python list. Monotonicity of an index can be tested with the is_monotonic_increasing and is_monotonic_decreasing attributes.

In [212]: df = pd.DataFrame(index=[2, 3, 3, 4, 5], columns=["data"], data=list(range(5)))

In [213]: df.index.is_monotonic_increasing
Out[213]: True

# no rows 0 or 1, but still returns rows 2, 3 (both of them), and 4:
In [214]: df.loc[0:4, :]
Out[214]:
   data
2     0
3     1
3     2
4     3

# slice is outside the index, so an empty DataFrame is returned
In [215]: df.loc[13:15, :]
Out[215]:
Empty DataFrame
Columns: [data]
Index: []
On the other hand, if the index is not monotonic, then both slice bounds must be unique members of the index.

In [216]: df = pd.DataFrame(index=[2, 3, 1, 4, 3, 5], columns=["data"], data=list(range(6)))

In [217]: df.index.is_monotonic_increasing
Out[217]: False

# OK because 2 and 4 are in the index
In [218]: df.loc[2:4, :]
Out[218]:
   data
2     0
3     1
1     2
4     3
# 0 is not in the index
In [219]: df.loc[0:4, :]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
...
KeyError: 0

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[219], line 1
----> 1 df.loc[0:4, :]
...
KeyError: 0

# 3 is not a unique label
In [220]: df.loc[2:3, :]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[220], line 1
----> 1 df.loc[2:3, :]
...
KeyError: 'Cannot get right slice bound for non-unique label: 3'
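The two failures can be reproduced compactly by catching the exceptions, as in the sketch below. It also shows one possible workaround, which is an assumption about what a reader may want and not something the text above prescribes: sorting the index first makes it monotonic, after which out-of-range and duplicated bounds are accepted.

```python
import pandas as pd

df = pd.DataFrame({"data": range(6)}, index=[2, 3, 1, 4, 3, 5])

ok = df.loc[2:4]   # both bounds are present and unique

errors = {}
for start, stop in [(0, 4), (2, 3)]:
    try:
        df.loc[start:stop]
    except KeyError as exc:
        errors[(start, stop)] = str(exc)

# After sorting, the index is monotonic and the same kind of slice works
sorted_df = df.sort_index()
after_sort = sorted_df.loc[0:3]

print(errors)
print(after_sort)
```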
Index.is_monotonic_increasing and Index.is_monotonic_decreasing only check that an index is weakly monotonic. To check for strict monotonicity, you can combine one of those with the is_unique attribute.

In [221]: weakly_monotonic = pd.Index(["a", "b", "c", "c"])

In [222]: weakly_monotonic
Out[222]: Index(['a', 'b', 'c', 'c'], dtype='object')

In [223]: weakly_monotonic.is_monotonic_increasing
Out[223]: True

In [224]: weakly_monotonic.is_monotonic_increasing & weakly_monotonic.is_unique
Out[224]: False
Endpoints are inclusive#
Compared with standard Python sequence slicing in which the slice endpoint is not inclusive, label-based slicing in pandas is inclusive. The primary reason for this is that it is often not possible to easily determine the “successor” or next element after a particular label in an index. For example, consider the following Series:

In [225]: s = pd.Series(np.random.randn(6), index=list("abcdef"))

In [226]: s
Out[226]:
a   -0.101684
b   -0.734907
c   -0.130121
d   -0.476046
e    0.759104
f    0.213379
dtype: float64
Suppose we wished to slice from c to e. Using integers, this would be accomplished as such:

In [227]: s[2:5]
Out[227]:
c   -0.130121
d   -0.476046
e    0.759104
dtype: float64
However, if you only had c and e, determining the next element in the index can be somewhat complicated. For example, the following does not work:

In [228]: s.loc['c':'e' + 1]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[228], line 1
----> 1 s.loc['c':'e' + 1]

TypeError: can only concatenate str (not "int") to str
A very common use case is to limit a time series to start and end at two specific dates. To enable this, we made the design choice to make label-based slicing include both endpoints:

In [229]: s.loc["c":"e"]
Out[229]:
c   -0.130121
d   -0.476046
e    0.759104
dtype: float64
This is most definitely a “practicality beats purity” sort of thing, but it is something to watch out for if you expect label-based slicing to behave exactly in the way that standard Python integer slicing works.
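The time series use case mentioned above might look like the following sketch. The dates and values are invented for the example. The label slice includes both end dates, while the equivalent positional slice needs a stop that is one past the last wanted position.

```python
import pandas as pd

ts = pd.Series(range(5), index=pd.date_range("2024-01-01", periods=5, freq="D"))

by_label = ts.loc["2024-01-02":"2024-01-04"]   # both endpoints included
by_position = ts.iloc[1:4]                     # stop position 4 is excluded

print(by_label)
print(by_label.equals(by_position))
```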
Indexing potentially changes underlying Series dtype#
Different indexing operations can potentially change the dtype of a Series.

In [230]: series1 = pd.Series([1, 2, 3])

In [231]: series1.dtype
Out[231]: dtype('int64')

In [232]: res = series1.reindex([0, 4])

In [233]: res.dtype
Out[233]: dtype('float64')

In [234]: res
Out[234]:
0    1.0
4    NaN
dtype: float64

In [235]: series2 = pd.Series([True])

In [236]: series2.dtype
Out[236]: dtype('bool')

In [237]: res = series2.reindex_like(series1)

In [238]: res.dtype
Out[238]: dtype('O')

In [239]: res
Out[239]:
0    True
1     NaN
2     NaN
dtype: object
This is because the (re)indexing operations above silently insert NaNs and the dtype changes accordingly. This can cause some issues when using numpy ufuncs such as numpy.logical_and.
See GH 2388 for a more detailed discussion.
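Where the dtype needs to be kept, one option is to pass a fill_value of the matching type to reindex so that no NaN is inserted. This is a sketch of that approach and not the only way to handle the issue; the fill values 0 and False are arbitrary choices for the example.

```python
import pandas as pd

series1 = pd.Series([1, 2, 3])

default = series1.reindex([0, 4])                # NaN inserted, dtype becomes float
filled = series1.reindex([0, 4], fill_value=0)   # no NaN, integer dtype kept

flags = pd.Series([True])
flags_filled = flags.reindex([0, 1, 2], fill_value=False)   # stays boolean

print(default.dtype, filled.dtype, flags_filled.dtype)
```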