Movatterモバイル変換


[0]ホーム

URL:


Skip to main content
Ctrl+K

Reshaping and pivot tables#

pandas provides methods for manipulating aSeries andDataFrame to alter therepresentation of the data for further data processing or data summarization.

pivot() andpivot_table()#

../_images/reshaping_pivot.png

pivot()#

Data is often stored in so-called “stacked” or “record” format. In a “record” or “wide” format,typically there is one row for each subject. In the “stacked” or “long” format there aremultiple rows for each subject where applicable.

In [1]:data={   ...:"value":range(12),   ...:"variable":["A"]*3+["B"]*3+["C"]*3+["D"]*3,   ...:"date":pd.to_datetime(["2020-01-03","2020-01-04","2020-01-05"]*4)   ...:}   ...:In [2]:df=pd.DataFrame(data)

To perform time series operations with each unique variable, a betterrepresentation would be where thecolumns are the unique variables and anindex of dates identifies individual observations. To reshape the data intothis form, we use theDataFrame.pivot() method (also implemented as atop level functionpivot()):

In [3]:pivoted=df.pivot(index="date",columns="variable",values="value")In [4]:pivotedOut[4]:variable    A  B  C   Ddate2020-01-03  0  3  6   92020-01-04  1  4  7  102020-01-05  2  5  8  11

If thevalues argument is omitted, and the inputDataFrame has more thanone column of values which are not used as column or index inputs topivot(),then the resulting “pivoted”DataFrame will havehierarchical columns whose topmost level indicates the respective valuecolumn:

In [5]:df["value2"]=df["value"]*2In [6]:pivoted=df.pivot(index="date",columns="variable")In [7]:pivotedOut[7]:           value           value2variable       A  B  C   D      A   B   C   Ddate2020-01-03     0  3  6   9      0   6  12  182020-01-04     1  4  7  10      2   8  14  202020-01-05     2  5  8  11      4  10  16  22

You can then select subsets from the pivotedDataFrame:

In [8]:pivoted["value2"]Out[8]:variable    A   B   C   Ddate2020-01-03  0   6  12  182020-01-04  2   8  14  202020-01-05  4  10  16  22

Note that this returns a view on the underlying data in the case where the dataare homogeneously-typed.

Note

pivot() can only handle unique rows specified byindex andcolumns.If you data contains duplicates, usepivot_table().

pivot_table()#

Whilepivot() provides general purpose pivoting with variousdata types, pandas also providespivot_table() orpivot_table()for pivoting with aggregation of numeric data.

The functionpivot_table() can be used to create spreadsheet-stylepivot tables. See thecookbook for some advancedstrategies.

In [9]:importdatetimeIn [10]:df=pd.DataFrame(   ....:{   ....:"A":["one","one","two","three"]*6,   ....:"B":["A","B","C"]*8,   ....:"C":["foo","foo","foo","bar","bar","bar"]*4,   ....:"D":np.random.randn(24),   ....:"E":np.random.randn(24),   ....:"F":[datetime.datetime(2013,i,1)foriinrange(1,13)]   ....:+[datetime.datetime(2013,i,15)foriinrange(1,13)],   ....:}   ....:)   ....:In [11]:dfOut[11]:        A  B    C         D         E          F0     one  A  foo  0.469112  0.404705 2013-01-011     one  B  foo -0.282863  0.577046 2013-02-012     two  C  foo -1.509059 -1.715002 2013-03-013   three  A  bar -1.135632 -1.039268 2013-04-014     one  B  bar  1.212112 -0.370647 2013-05-01..    ... ..  ...       ...       ...        ...19  three  B  foo -1.087401 -0.472035 2013-08-1520    one  C  foo -0.673690 -0.013960 2013-09-1521    one  A  bar  0.113648 -0.362543 2013-10-1522    two  B  bar -1.478427 -0.006154 2013-11-1523  three  C  bar  0.524988 -0.923061 2013-12-15[24 rows x 6 columns]In [12]:pd.pivot_table(df,values="D",index=["A","B"],columns=["C"])Out[12]:C             bar       fooA     Bone   A -0.995460  0.595334      B  0.393570 -0.494817      C  0.196903 -0.767769three A -0.431886       NaN      B       NaN -1.065818      C  0.798396       NaNtwo   A       NaN  0.197720      B -0.986678       NaN      C       NaN -1.274317In [13]:pd.pivot_table(   ....:df,values=["D","E"],   ....:index=["B"],   ....:columns=["A","C"],   ....:aggfunc="sum",   ....:)   ....:Out[13]:          D                      ...         EA       one               three  ...     three      twoC       bar       foo       bar  ...       foo      bar       fooB                                ...A -1.990921  1.190667 -0.863772  ...       NaN      NaN -1.067650B  0.787140 -0.989634       NaN  ...  0.372851  1.63741       NaNC  0.393806 -1.535539  1.596791  ...       NaN      NaN -3.491906[3 rows x 12 columns]In [14]:pd.pivot_table(   ....:df,values="E",   ....:index=["B","C"],   ....:columns=["A"],   ....:aggfunc=["sum","mean"],   ....:)   ....:Out[14]:            sum                          meanA           one     three       two       one     three       twoB CA bar -0.471593 -2.008182       NaN -0.235796 -1.004091       NaN  foo  0.761726       NaN -1.067650  0.380863       NaN -0.533825B bar -1.665170       NaN  1.637410 -0.832585       NaN  0.818705  foo -0.097554  0.372851       NaN -0.048777  0.186425       NaNC bar -0.744154 -2.392449       NaN -0.372077 -1.196224       NaN  foo  1.061810       NaN -3.491906  0.530905       NaN -1.745953

The result is aDataFrame potentially having aMultiIndex on theindex or column. If thevalues column name is not given, the pivot tablewill include all of the data in an additional level of hierarchy in the columns:

In [15]:pd.pivot_table(df[["A","B","C","D","E"]],index=["A","B"],columns=["C"])Out[15]:                D                   EC             bar       foo       bar       fooA     Bone   A -0.995460  0.595334 -0.235796  0.380863      B  0.393570 -0.494817 -0.832585 -0.048777      C  0.196903 -0.767769 -0.372077  0.530905three A -0.431886       NaN -1.004091       NaN      B       NaN -1.065818       NaN  0.186425      C  0.798396       NaN -1.196224       NaNtwo   A       NaN  0.197720       NaN -0.533825      B -0.986678       NaN  0.818705       NaN      C       NaN -1.274317       NaN -1.745953

Also, you can useGrouper forindex andcolumns keywords. For detail ofGrouper, seeGrouping with a Grouper specification.

In [16]:pd.pivot_table(df,values="D",index=pd.Grouper(freq="ME",key="F"),columns="C")Out[16]:C                bar       fooF2013-01-31       NaN  0.5953342013-02-28       NaN -0.4948172013-03-31       NaN -1.2743172013-04-30 -0.431886       NaN2013-05-31  0.393570       NaN2013-06-30  0.196903       NaN2013-07-31       NaN  0.1977202013-08-31       NaN -1.0658182013-09-30       NaN -0.7677692013-10-31 -0.995460       NaN2013-11-30 -0.986678       NaN2013-12-31  0.798396       NaN

Adding margins#

Passingmargins=True topivot_table() will add a row and column with anAll label with partial group aggregates across the categories on therows and columns:

In [17]:table=df.pivot_table(   ....:index=["A","B"],   ....:columns="C",   ....:values=["D","E"],   ....:margins=True,   ....:aggfunc="std"   ....:)   ....:In [18]:tableOut[18]:                D                             EC             bar       foo       All       bar       foo       AllA     Bone   A  1.568517  0.178504  1.293926  0.179247  0.033718  0.371275      B  1.157593  0.299748  0.860059  0.653280  0.885047  0.779837      C  0.523425  0.133049  0.638297  1.111310  0.770555  0.938819three A  0.995247       NaN  0.995247  0.049748       NaN  0.049748      B       NaN  0.030522  0.030522       NaN  0.931203  0.931203      C  0.386657       NaN  0.386657  0.386312       NaN  0.386312two   A       NaN  0.111032  0.111032       NaN  1.146201  1.146201      B  0.695438       NaN  0.695438  1.166526       NaN  1.166526      C       NaN  0.331975  0.331975       NaN  0.043771  0.043771All      1.014073  0.713941  0.871016  0.881376  0.984017  0.923568

Additionally, you can callDataFrame.stack() to display a pivoted DataFrameas having a multi-level index:

In [19]:table.stack(future_stack=True)Out[19]:                  D         EA   B Cone A bar  1.568517  0.179247      foo  0.178504  0.033718      All  1.293926  0.371275    B bar  1.157593  0.653280      foo  0.299748  0.885047...             ...       ...two C foo  0.331975  0.043771      All  0.331975  0.043771All   bar  1.014073  0.881376      foo  0.713941  0.984017      All  0.871016  0.923568[30 rows x 2 columns]

stack() andunstack()#

../_images/reshaping_stack.png

Closely related to thepivot() method are the relatedstack() andunstack() methods available onSeries andDataFrame. These methods are designed to work together withMultiIndex objects (see the section onhierarchical indexing).

  • stack(): “pivot” a level of the (possibly hierarchical) column labels,returning aDataFrame with an index with a new inner-most level of rowlabels.

  • unstack(): (inverse operation ofstack()) “pivot” a level of the(possibly hierarchical) row index to the column axis, producing a reshapedDataFrame with a new inner-most level of column labels.

../_images/reshaping_unstack.png
In [20]:tuples=[   ....:["bar","bar","baz","baz","foo","foo","qux","qux"],   ....:["one","two","one","two","one","two","one","two"],   ....:]   ....:In [21]:index=pd.MultiIndex.from_arrays(tuples,names=["first","second"])In [22]:df=pd.DataFrame(np.random.randn(8,2),index=index,columns=["A","B"])In [23]:df2=df[:4]In [24]:df2Out[24]:                     A         Bfirst secondbar   one     0.895717  0.805244      two    -1.206412  2.565646baz   one     1.431256  1.340309      two    -1.170299 -0.226169

Thestack() function “compresses” a level in theDataFrame columns toproduce either:

If the columns have aMultiIndex, you can choose which level to stack. Thestacked level becomes the new lowest level in aMultiIndex on the columns:

In [25]:stacked=df2.stack(future_stack=True)In [26]:stackedOut[26]:first  secondbar    one     A    0.895717               B    0.805244       two     A   -1.206412               B    2.565646baz    one     A    1.431256               B    1.340309       two     A   -1.170299               B   -0.226169dtype: float64

With a “stacked”DataFrame orSeries (having aMultiIndex as theindex), the inverse operation ofstack() isunstack(), which by defaultunstacks thelast level:

In [27]:stacked.unstack()Out[27]:                     A         Bfirst secondbar   one     0.895717  0.805244      two    -1.206412  2.565646baz   one     1.431256  1.340309      two    -1.170299 -0.226169In [28]:stacked.unstack(1)Out[28]:second        one       twofirstbar   A  0.895717 -1.206412      B  0.805244  2.565646baz   A  1.431256 -1.170299      B  1.340309 -0.226169In [29]:stacked.unstack(0)Out[29]:first          bar       bazsecondone    A  0.895717  1.431256       B  0.805244  1.340309two    A -1.206412 -1.170299       B  2.565646 -0.226169
../_images/reshaping_unstack_1.png

If the indexes have names, you can use the level names instead of specifyingthe level numbers:

In [30]:stacked.unstack("second")Out[30]:second        one       twofirstbar   A  0.895717 -1.206412      B  0.805244  2.565646baz   A  1.431256 -1.170299      B  1.340309 -0.226169
../_images/reshaping_unstack_0.png

Notice that thestack() andunstack() methods implicitly sort the indexlevels involved. Hence a call tostack() and thenunstack(), or vice versa,will result in asorted copy of the originalDataFrame orSeries:

In [31]:index=pd.MultiIndex.from_product([[2,1],["a","b"]])In [32]:df=pd.DataFrame(np.random.randn(4),index=index,columns=["A"])In [33]:dfOut[33]:            A2 a -1.413681  b  1.6079201 a  1.024180  b  0.569605In [34]:all(df.unstack().stack(future_stack=True)==df.sort_index())Out[34]:True

Multiple levels#

You may also stack or unstack more than one level at a time by passing a listof levels, in which case the end result is as if each level in the list wereprocessed individually.

In [35]:columns=pd.MultiIndex.from_tuples(   ....:[   ....:("A","cat","long"),   ....:("B","cat","long"),   ....:("A","dog","short"),   ....:("B","dog","short"),   ....:],   ....:names=["exp","animal","hair_length"],   ....:)   ....:In [36]:df=pd.DataFrame(np.random.randn(4,4),columns=columns)In [37]:dfOut[37]:exp                 A         B         A         Banimal            cat       cat       dog       doghair_length      long      long     short     short0            0.875906 -2.211372  0.974466 -2.0067471           -0.410001 -0.078638  0.545952 -1.2192172           -1.226825  0.769804 -1.281247 -0.7277073           -0.121306 -0.097883  0.695775  0.341734In [38]:df.stack(level=["animal","hair_length"],future_stack=True)Out[38]:exp                          A         B  animal hair_length0 cat    long         0.875906 -2.211372  dog    short        0.974466 -2.0067471 cat    long        -0.410001 -0.078638  dog    short        0.545952 -1.2192172 cat    long        -1.226825  0.769804  dog    short       -1.281247 -0.7277073 cat    long        -0.121306 -0.097883  dog    short        0.695775  0.341734

The list of levels can contain either level names or level numbers butnot a mixture of the two.

# df.stack(level=['animal', 'hair_length'], future_stack=True)# from above is equivalent to:In [39]:df.stack(level=[1,2],future_stack=True)Out[39]:exp                          A         B  animal hair_length0 cat    long         0.875906 -2.211372  dog    short        0.974466 -2.0067471 cat    long        -0.410001 -0.078638  dog    short        0.545952 -1.2192172 cat    long        -1.226825  0.769804  dog    short       -1.281247 -0.7277073 cat    long        -0.121306 -0.097883  dog    short        0.695775  0.341734

Missing data#

Unstacking can result in missing values if subgroups do not have the sameset of labels. By default, missing values will be replaced with the defaultfill value for that data type.

In [40]:columns=pd.MultiIndex.from_tuples(   ....:[   ....:("A","cat"),   ....:("B","dog"),   ....:("B","cat"),   ....:("A","dog"),   ....:],   ....:names=["exp","animal"],   ....:)   ....:In [41]:index=pd.MultiIndex.from_product(   ....:[("bar","baz","foo","qux"),("one","two")],names=["first","second"]   ....:)   ....:In [42]:df=pd.DataFrame(np.random.randn(8,4),index=index,columns=columns)In [43]:df3=df.iloc[[0,1,4,7],[1,2]]In [44]:df3Out[44]:exp                  Banimal             dog       catfirst secondbar   one    -1.110336 -0.619976      two     0.687738  0.176444foo   one     1.314232  0.690579qux   two     0.380396  0.084844In [45]:df3.unstack()Out[45]:exp            Banimal       dog                 catsecond       one       two       one       twofirstbar    -1.110336  0.687738 -0.619976  0.176444foo     1.314232       NaN  0.690579       NaNqux          NaN  0.380396       NaN  0.084844

The missing value can be filled with a specific value with thefill_value argument.

In [46]:df3.unstack(fill_value=-1e9)Out[46]:exp                Banimal           dog                         catsecond           one           two           one           twofirstbar    -1.110336e+00  6.877384e-01 -6.199759e-01  1.764443e-01foo     1.314232e+00 -1.000000e+09  6.905793e-01 -1.000000e+09qux    -1.000000e+09  3.803956e-01 -1.000000e+09  8.484421e-02

melt() andwide_to_long()#

../_images/reshaping_melt.png

The top-levelmelt() function and the correspondingDataFrame.melt()are useful to massage aDataFrame into a format where one or more columnsareidentifier variables, while all other columns, consideredmeasuredvariables, are “unpivoted” to the row axis, leaving just two non-identifiercolumns, “variable” and “value”. The names of those columns can be customizedby supplying thevar_name andvalue_name parameters.

In [47]:cheese=pd.DataFrame(   ....:{   ....:"first":["John","Mary"],   ....:"last":["Doe","Bo"],   ....:"height":[5.5,6.0],   ....:"weight":[130,150],   ....:}   ....:)   ....:In [48]:cheeseOut[48]:  first last  height  weight0  John  Doe     5.5     1301  Mary   Bo     6.0     150In [49]:cheese.melt(id_vars=["first","last"])Out[49]:  first last variable  value0  John  Doe   height    5.51  Mary   Bo   height    6.02  John  Doe   weight  130.03  Mary   Bo   weight  150.0In [50]:cheese.melt(id_vars=["first","last"],var_name="quantity")Out[50]:  first last quantity  value0  John  Doe   height    5.51  Mary   Bo   height    6.02  John  Doe   weight  130.03  Mary   Bo   weight  150.0

When transforming a DataFrame usingmelt(), the index will be ignored.The original index values can be kept by setting theignore_index=False parameter toFalse (default isTrue).ignore_index=False will however duplicate index values.

In [51]:index=pd.MultiIndex.from_tuples([("person","A"),("person","B")])In [52]:cheese=pd.DataFrame(   ....:{   ....:"first":["John","Mary"],   ....:"last":["Doe","Bo"],   ....:"height":[5.5,6.0],   ....:"weight":[130,150],   ....:},   ....:index=index,   ....:)   ....:In [53]:cheeseOut[53]:         first last  height  weightperson A  John  Doe     5.5     130       B  Mary   Bo     6.0     150In [54]:cheese.melt(id_vars=["first","last"])Out[54]:  first last variable  value0  John  Doe   height    5.51  Mary   Bo   height    6.02  John  Doe   weight  130.03  Mary   Bo   weight  150.0In [55]:cheese.melt(id_vars=["first","last"],ignore_index=False)Out[55]:         first last variable  valueperson A  John  Doe   height    5.5       B  Mary   Bo   height    6.0       A  John  Doe   weight  130.0       B  Mary   Bo   weight  150.0

wide_to_long() is similar tomelt() with more customization forcolumn matching.

In [56]:dft=pd.DataFrame(   ....:{   ....:"A1970":{0:"a",1:"b",2:"c"},   ....:"A1980":{0:"d",1:"e",2:"f"},   ....:"B1970":{0:2.5,1:1.2,2:0.7},   ....:"B1980":{0:3.2,1:1.3,2:0.1},   ....:"X":dict(zip(range(3),np.random.randn(3))),   ....:}   ....:)   ....:In [57]:dft["id"]=dft.indexIn [58]:dftOut[58]:  A1970 A1980  B1970  B1980         X  id0     a     d    2.5    3.2  1.519970   01     b     e    1.2    1.3 -0.493662   12     c     f    0.7    0.1  0.600178   2In [59]:pd.wide_to_long(dft,["A","B"],i="id",j="year")Out[59]:                X  A    Bid year0  1970  1.519970  a  2.51  1970 -0.493662  b  1.22  1970  0.600178  c  0.70  1980  1.519970  d  3.21  1980 -0.493662  e  1.32  1980  0.600178  f  0.1

get_dummies() andfrom_dummies()#

To convert categorical variables of aSeries into a “dummy” or “indicator”,get_dummies() creates a newDataFrame with columns of the uniquevariables and the values representing the presence of those variables per row.

In [60]:df=pd.DataFrame({"key":list("bbacab"),"data1":range(6)})In [61]:pd.get_dummies(df["key"])Out[61]:       a      b      c0  False   True  False1  False   True  False2   True  False  False3  False  False   True4   True  False  False5  False   True  FalseIn [62]:df["key"].str.get_dummies()Out[62]:   a  b  c0  0  1  01  0  1  02  1  0  03  0  0  14  1  0  05  0  1  0

prefix adds a prefix to the the column names which is useful for merging the resultwith the originalDataFrame:

In [63]:dummies=pd.get_dummies(df["key"],prefix="key")In [64]:dummiesOut[64]:   key_a  key_b  key_c0  False   True  False1  False   True  False2   True  False  False3  False  False   True4   True  False  False5  False   True  FalseIn [65]:df[["data1"]].join(dummies)Out[65]:   data1  key_a  key_b  key_c0      0  False   True  False1      1  False   True  False2      2   True  False  False3      3  False  False   True4      4   True  False  False5      5  False   True  False

This function is often used along with discretization functions likecut():

In [66]:values=np.random.randn(10)In [67]:valuesOut[67]:array([ 0.2742,  0.1329, -0.0237,  2.4102,  1.4505,  0.2061, -0.2519,       -2.2136,  1.0633,  1.2661])In [68]:bins=[0,0.2,0.4,0.6,0.8,1]In [69]:pd.get_dummies(pd.cut(values,bins))Out[69]:   (0.0, 0.2]  (0.2, 0.4]  (0.4, 0.6]  (0.6, 0.8]  (0.8, 1.0]0       False        True       False       False       False1        True       False       False       False       False2       False       False       False       False       False3       False       False       False       False       False4       False       False       False       False       False5       False        True       False       False       False6       False       False       False       False       False7       False       False       False       False       False8       False       False       False       False       False9       False       False       False       False       False

get_dummies() also accepts aDataFrame. By default,object,string,orcategorical type columns are encoded as dummy variables with other columns unaltered.

In [70]:df=pd.DataFrame({"A":["a","b","a"],"B":["c","c","b"],"C":[1,2,3]})In [71]:pd.get_dummies(df)Out[71]:   C    A_a    A_b    B_b    B_c0  1   True  False  False   True1  2  False   True  False   True2  3   True  False   True  False

Specifying thecolumns keyword will encode a column of any type.

In [72]:pd.get_dummies(df,columns=["A"])Out[72]:   B  C    A_a    A_b0  c  1   True  False1  c  2  False   True2  b  3   True  False

As with theSeries version, you can pass values for theprefix andprefix_sep. By default the column name is used as the prefix and_ asthe prefix separator. You can specifyprefix andprefix_sep in 3 ways:

  • string: Use the same value forprefix orprefix_sep for each columnto be encoded.

  • list: Must be the same length as the number of columns being encoded.

  • dict: Mapping column name to prefix.

In [73]:simple=pd.get_dummies(df,prefix="new_prefix")In [74]:simpleOut[74]:   C  new_prefix_a  new_prefix_b  new_prefix_b  new_prefix_c0  1          True         False         False          True1  2         False          True         False          True2  3          True         False          True         FalseIn [75]:from_list=pd.get_dummies(df,prefix=["from_A","from_B"])In [76]:from_listOut[76]:   C  from_A_a  from_A_b  from_B_b  from_B_c0  1      True     False     False      True1  2     False      True     False      True2  3      True     False      True     FalseIn [77]:from_dict=pd.get_dummies(df,prefix={"B":"from_B","A":"from_A"})In [78]:from_dictOut[78]:   C  from_A_a  from_A_b  from_B_b  from_B_c0  1      True     False     False      True1  2     False      True     False      True2  3      True     False      True     False

To avoid collinearity when feeding the result to statistical models,specifydrop_first=True.

In [79]:s=pd.Series(list("abcaa"))In [80]:pd.get_dummies(s)Out[80]:       a      b      c0   True  False  False1  False   True  False2  False  False   True3   True  False  False4   True  False  FalseIn [81]:pd.get_dummies(s,drop_first=True)Out[81]:       b      c0  False  False1   True  False2  False   True3  False  False4  False  False

When a column contains only one level, it will be omitted in the result.

In [82]:df=pd.DataFrame({"A":list("aaaaa"),"B":list("ababc")})In [83]:pd.get_dummies(df)Out[83]:    A_a    B_a    B_b    B_c0  True   True  False  False1  True  False   True  False2  True   True  False  False3  True  False   True  False4  True  False  False   TrueIn [84]:pd.get_dummies(df,drop_first=True)Out[84]:     B_b    B_c0  False  False1   True  False2  False  False3   True  False4  False   True

The values can be cast to a different type using thedtype argument.

In [85]:df=pd.DataFrame({"A":list("abc"),"B":[1.1,2.2,3.3]})In [86]:pd.get_dummies(df,dtype=np.float32).dtypesOut[86]:B      float64A_a    float32A_b    float32A_c    float32dtype: object

Added in version 1.5.0.

from_dummies() converts the output ofget_dummies() back intoaSeries of categorical values from indicator values.

In [87]:df=pd.DataFrame({"prefix_a":[0,1,0],"prefix_b":[1,0,1]})In [88]:dfOut[88]:   prefix_a  prefix_b0         0         11         1         02         0         1In [89]:pd.from_dummies(df,sep="_")Out[89]:  prefix0      b1      a2      b

Dummy coded data only requiresk-1 categories to be included, in this casethe last category is the default category. The default category can be modified withdefault_category.

In [90]:df=pd.DataFrame({"prefix_a":[0,1,0]})In [91]:dfOut[91]:   prefix_a0         01         12         0In [92]:pd.from_dummies(df,sep="_",default_category="b")Out[92]:  prefix0      b1      a2      b

explode()#

For aDataFrame column with nested, list-like values,explode() will transformeach list-like value to a separate row. The resultingIndex will be duplicated correspondingto the index label from the original row:

In [93]:keys=["panda1","panda2","panda3"]In [94]:values=[["eats","shoots"],["shoots","leaves"],["eats","leaves"]]In [95]:df=pd.DataFrame({"keys":keys,"values":values})In [96]:dfOut[96]:     keys            values0  panda1    [eats, shoots]1  panda2  [shoots, leaves]2  panda3    [eats, leaves]In [97]:df["values"].explode()Out[97]:0      eats0    shoots1    shoots1    leaves2      eats2    leavesName: values, dtype: object

DataFrame.explode can also explode the column in theDataFrame.

In [98]:df.explode("values")Out[98]:     keys  values0  panda1    eats0  panda1  shoots1  panda2  shoots1  panda2  leaves2  panda3    eats2  panda3  leaves

Series.explode() will replace empty lists with a missing value indicator and preserve scalar entries.

In [99]:s=pd.Series([[1,2,3],"foo",[],["a","b"]])In [100]:sOut[100]:0    [1, 2, 3]1          foo2           []3       [a, b]dtype: objectIn [101]:s.explode()Out[101]:0      10      20      31    foo2    NaN3      a3      bdtype: object

A comma-separated string value can be split into individual values in a list and then exploded to a new row.

In [102]:df=pd.DataFrame([{"var1":"a,b,c","var2":1},{"var1":"d,e,f","var2":2}])In [103]:df.assign(var1=df.var1.str.split(",")).explode("var1")Out[103]:  var1  var20    a     10    b     10    c     11    d     21    e     21    f     2

crosstab()#

Usecrosstab() to compute a cross-tabulation of two (or more)factors. By defaultcrosstab() computes a frequency table of the factorsunless an array of values and an aggregation function are passed.

AnySeries passed will have their name attributes used unless row or columnnames for the cross-tabulation are specified

In [104]:a=np.array(["foo","foo","bar","bar","foo","foo"],dtype=object)In [105]:b=np.array(["one","one","two","one","two","one"],dtype=object)In [106]:c=np.array(["dull","dull","shiny","dull","dull","shiny"],dtype=object)In [107]:pd.crosstab(a,[b,c],rownames=["a"],colnames=["b","c"])Out[107]:b    one        twoc   dull shiny dull shinyabar    1     0    0     1foo    2     1    1     0

Ifcrosstab() receives only twoSeries, it will provide a frequency table.

In [108]:df=pd.DataFrame(   .....:{"A":[1,2,2,2,2],"B":[3,3,4,4,4],"C":[1,1,np.nan,1,1]}   .....:)   .....:In [109]:dfOut[109]:   A  B    C0  1  3  1.01  2  3  1.02  2  4  NaN3  2  4  1.04  2  4  1.0In [110]:pd.crosstab(df["A"],df["B"])Out[110]:B  3  4A1  1  02  1  3

crosstab() can also summarize toCategorical data.

In [111]:foo=pd.Categorical(["a","b"],categories=["a","b","c"])In [112]:bar=pd.Categorical(["d","e"],categories=["d","e","f"])In [113]:pd.crosstab(foo,bar)Out[113]:col_0  d  erow_0a      1  0b      0  1

ForCategorical data, to includeall of data categories even if the actual data doesnot contain any instances of a particular category, usedropna=False.

In [114]:pd.crosstab(foo,bar,dropna=False)Out[114]:col_0  d  e  frow_0a      1  0  0b      0  1  0c      0  0  0

Normalization#

Frequency tables can also be normalized to show percentages rather than countsusing thenormalize argument:

In [115]:pd.crosstab(df["A"],df["B"],normalize=True)Out[115]:B    3    4A1  0.2  0.02  0.2  0.6

normalize can also normalize values within each row or within each column:

In [116]:pd.crosstab(df["A"],df["B"],normalize="columns")Out[116]:B    3    4A1  0.5  0.02  0.5  1.0

crosstab() can also accept a thirdSeries and an aggregation function(aggfunc) that will be applied to the values of the thirdSeries withineach group defined by the first twoSeries:

In [117]:pd.crosstab(df["A"],df["B"],values=df["C"],aggfunc="sum")Out[117]:B    3    4A1  1.0  NaN2  1.0  2.0

Adding margins#

margins=True will add a row and column with anAll label with partial group aggregatesacross the categories on the rows and columns:

In [118]:pd.crosstab(   .....:df["A"],df["B"],values=df["C"],aggfunc="sum",normalize=True,margins=True   .....:)   .....:Out[118]:B       3    4   AllA1    0.25  0.0  0.252    0.25  0.5  0.75All  0.50  0.5  1.00

cut()#

Thecut() function computes groupings for the values of the inputarray and is often used to transform continuous variables to discrete orcategorical variables:

An integerbins will form equal-width bins.

In [119]:ages=np.array([10,15,13,12,23,25,28,59,60])In [120]:pd.cut(ages,bins=3)Out[120]:[(9.95, 26.667], (9.95, 26.667], (9.95, 26.667], (9.95, 26.667], (9.95, 26.667], (9.95, 26.667], (26.667, 43.333], (43.333, 60.0], (43.333, 60.0]]Categories (3, interval[float64, right]): [(9.95, 26.667] < (26.667, 43.333] < (43.333, 60.0]]

A list of ordered bin edges will assign an interval for each variable.

In [121]:pd.cut(ages,bins=[0,18,35,70])Out[121]:[(0, 18], (0, 18], (0, 18], (0, 18], (18, 35], (18, 35], (18, 35], (35, 70], (35, 70]]Categories (3, interval[int64, right]): [(0, 18] < (18, 35] < (35, 70]]

If thebins keyword is anIntervalIndex, then these will beused to bin the passed data.

In [122]:pd.cut(ages,bins=pd.IntervalIndex.from_breaks([0,40,70]))Out[122]:[(0, 40], (0, 40], (0, 40], (0, 40], (0, 40], (0, 40], (0, 40], (40, 70], (40, 70]]Categories (2, interval[int64, right]): [(0, 40] < (40, 70]]

factorize()#

factorize() encodes 1 dimensional values into integer labels. Missing valuesare encoded as-1.

In [123]:x=pd.Series(["A","A",np.nan,"B",3.14,np.inf])In [124]:xOut[124]:0       A1       A2     NaN3       B4    3.145     infdtype: objectIn [125]:labels,uniques=pd.factorize(x)In [126]:labelsOut[126]:array([ 0,  0, -1,  1,  2,  3])In [127]:uniquesOut[127]:Index(['A', 'B', 3.14, inf], dtype='object')

Categorical will similarly encode 1 dimensional values for furthercategorical operations

In [128]:pd.Categorical(x)Out[128]:['A', 'A', NaN, 'B', 3.14, inf]Categories (4, object): [3.14, inf, 'A', 'B']

[8]ページ先頭

©2009-2025 Movatter.jp