Comparison with R / R libraries

Since pandas aims to provide a lot of the data manipulation and analysis functionality that people use R for, this page was started to provide a more detailed look at the R language and its many third party libraries as they relate to pandas. In comparisons with R and CRAN libraries, we care about the following things:

  • Functionality / flexibility: what can/cannot be done with each tool
  • Performance: how fast are operations. Hard numbers/benchmarks are preferable
  • Ease-of-use: Is one tool easier/harder to use (you may have to be the judge of this, given side-by-side code comparisons)

This page is also here to offer a bit of a translation guide for users of these R packages.

For transfer of DataFrame objects from pandas to R, one option is to use HDF5 files; see External Compatibility for an example.

Quick Reference

We’ll start off with a quick reference guide pairing some common R operations using dplyr with pandas equivalents.

Querying, Filtering, Sampling

R                                   pandas
dim(df)                             df.shape
head(df)                            df.head()
slice(df, 1:10)                     df.iloc[:10]
filter(df, col1 == 1, col2 == 1)    df.query('col1 == 1 & col2 == 1')
df[df$col1 == 1 & df$col2 == 1,]    df[(df.col1 == 1) & (df.col2 == 1)]
select(df, col1, col2)              df[['col1', 'col2']]
select(df, col1:col3)               df.loc[:, 'col1':'col3']
select(df, -(col1:col3))            df.drop(cols_to_drop, axis=1) but see [1]
distinct(select(df, col1))          df[['col1']].drop_duplicates()
distinct(select(df, col1, col2))    df[['col1', 'col2']].drop_duplicates()
sample_n(df, 10)                    df.sample(n=10)
sample_frac(df, 0.01)               df.sample(frac=0.01)

[1] R’s shorthand for a subrange of columns (select(df, col1:col3)) can be approached cleanly in pandas if you have the list of columns, for example df[cols[1:3]] or df.drop(cols[1:3], axis=1), but doing this by column name is a bit messy.
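
One possible way to work with a column subrange by name is to let a .loc label slice produce the column labels first and then reuse them. This is only a sketch; the frame and its column names (id, col1 to col4) are made up for illustration.

```python
import pandas as pd

# Hypothetical frame; the column names are illustrative only.
df = pd.DataFrame({"id": [1, 2], "col1": [1, 2], "col2": [3, 4],
                   "col3": [5, 6], "col4": [7, 8]})

# Label slices with .loc include both endpoints.
span = df.loc[:, "col1":"col3"].columns

kept = df[span]                   # roughly select(df, col1:col3)
dropped = df.drop(columns=span)   # roughly select(df, -(col1:col3))

print(list(kept.columns))
print(list(dropped.columns))
```

The columns= keyword of drop is equivalent to passing axis=1.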

Sorting

R                          pandas
arrange(df, col1, col2)    df.sort_values(['col1', 'col2'])
arrange(df, desc(col1))    df.sort_values('col1', ascending=False)

Transforming

R                            pandas
select(df, col_one = col1)   df.rename(columns={'col1': 'col_one'})['col_one']
rename(df, col_one = col1)   df.rename(columns={'col1': 'col_one'})
mutate(df, c = a - b)        df.assign(c=df.a - df.b)

Grouping and Summarizing

R                                                  pandas
summary(df)                                        df.describe()
gdf <- group_by(df, col1)                          gdf = df.groupby('col1')
summarise(gdf, avg = mean(col1, na.rm = TRUE))     df.groupby('col1').agg({'col1': 'mean'})
summarise(gdf, total = sum(col1))                  df.groupby('col1').sum()
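
When the summary columns should carry their own names, as avg and total do in the summarise calls above, named aggregation is one option. The sketch below assumes a pandas version that supports it (0.25 or later) and uses a hypothetical value column col2 rather than the grouping column itself.

```python
import numpy as np
import pandas as pd

# Hypothetical data: col1 is the grouping key, col2 holds the values.
df = pd.DataFrame({"col1": ["a", "a", "b", "b"],
                   "col2": [1.0, np.nan, 3.0, 5.0]})

# Each keyword becomes an output column: name=(input column, function).
# mean and sum skip NaN by default, similar to na.rm=TRUE in R.
summary = df.groupby("col1").agg(avg=("col2", "mean"),
                                 total=("col2", "sum"))
print(summary)
```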

Base R

Slicing with R’s c

R makes it easy to access data.frame columns by name

    df <- data.frame(a=rnorm(5), b=rnorm(5), c=rnorm(5), d=rnorm(5), e=rnorm(5))
    df[, c("a", "c", "e")]

or by integer location

    df <- data.frame(matrix(rnorm(1000), ncol=100))
    df[, c(1:10, 25:30, 40, 50:100)]

Selecting multiple columns by name in pandas is straightforward

    In [1]: df = pd.DataFrame(np.random.randn(10, 3), columns=list('abc'))

    In [2]: df[['a', 'c']]
    Out[2]:
              a         c
    0 -1.039575 -0.424972
    1  0.567020 -1.087401
    2 -0.673690 -1.478427
    3  0.524988  0.577046
    4 -1.715002 -0.370647
    5 -1.157892  0.844885
    6  1.075770  1.643563
    7 -1.469388 -0.674600
    8 -1.776904 -1.294524
    9  0.413738 -0.472035

    In [3]: df.loc[:, ['a', 'c']]
    Out[3]:
              a         c
    0 -1.039575 -0.424972
    1  0.567020 -1.087401
    2 -0.673690 -1.478427
    3  0.524988  0.577046
    4 -1.715002 -0.370647
    5 -1.157892  0.844885
    6  1.075770  1.643563
    7 -1.469388 -0.674600
    8 -1.776904 -1.294524
    9  0.413738 -0.472035

Selecting multiple noncontiguous columns by integer location can be achieved with a combination of the iloc indexer attribute and numpy.r_.

    In [4]: named = list('abcdefg')

    In [5]: n = 30

    In [6]: columns = named + np.arange(len(named), n).tolist()

    In [7]: df = pd.DataFrame(np.random.randn(n, n), columns=columns)

    In [8]: df.iloc[:, np.r_[:10, 24:30]]
    Out[8]:
               a         b         c         d         e         f         g  \
    0  -0.013960 -0.362543 -0.006154 -0.923061  0.895717  0.805244 -1.206412
    1   0.545952 -1.219217 -1.226825  0.769804 -1.281247 -0.727707 -0.121306
    2   2.396780  0.014871  3.357427 -0.317441 -1.236269  0.896171 -0.487602
    3  -0.988387  0.094055  1.262731  1.289997  0.082423 -0.055758  0.536580
    4  -1.340896  1.846883 -1.328865  1.682706 -1.717693  0.888782  0.228440
    5   0.464000  0.227371 -0.496922  0.306389 -2.290613 -1.134623 -1.561819
    6  -0.507516 -0.230096  0.394500 -1.934370 -1.652499  1.488753 -0.896484
    ..       ...       ...       ...       ...       ...       ...       ...
    23 -0.083272 -0.273955 -0.772369 -1.242807 -0.386336 -0.182486  0.164816
    24  2.071413 -1.364763  1.122066  0.066847  1.751987  0.419071 -1.118283
    25  0.036609  0.359986  1.211905  0.850427  1.554957 -0.888463 -1.508808
    26 -1.179240  0.238923  1.756671 -0.747571  0.543625 -0.159609 -0.051458
    27  0.025645  0.932436 -1.694531 -0.182236 -1.072710  0.466764 -0.072673
    28  0.439086  0.812684 -0.128932 -0.142506 -1.137207  0.462001 -0.159466
    29 -0.909806 -0.312006  0.383630 -0.631606  1.321415 -0.004799 -2.008210

               7         8         9        24        25        26        27  \
    0   2.565646  1.431256  1.340309  0.875906 -2.211372  0.974466 -2.006747
    1  -0.097883  0.695775  0.341734 -1.743161 -0.826591 -0.345352  1.314232
    2  -0.082240 -2.182937  0.380396  1.266143  0.299368 -0.863838  0.408204
    3  -0.489682  0.369374 -0.034571  0.221471 -0.744471  0.758527  1.729689
    4   0.901805  1.171216  0.520260  0.650776 -1.461665 -1.137707 -0.891060
    5  -0.260838  0.281957  1.523962 -0.008434  1.952541 -1.056652  0.533946
    6   0.576897  1.146000  1.487349  2.015523 -1.833722  1.771740 -0.670027
    ..       ...       ...       ...       ...       ...       ...       ...
    23  0.065624  0.307665 -1.898358  1.389045 -0.873585 -0.699862  0.812477
    24  1.010694  0.877138 -0.611561 -1.040389 -0.796211  0.241596  0.385922
    25 -0.617855  0.536164  2.175585  1.872601 -2.513465 -0.139184  0.810491
    26  0.937882  0.617547  0.287918 -1.584814  0.307941  1.809049  0.296237
    27 -0.026233 -0.051744  0.001402  0.150664 -3.060395  0.040268  0.066091
    28 -1.788308  0.753604  0.918071  0.922729  0.869610  0.364726 -0.226101
    29 -0.481634 -2.056211 -2.106095  0.039227  0.211283  1.440190 -0.989193

              28        29
    0  -0.410001 -0.078638
    1   0.690579  0.995761
    2  -1.048089 -0.025747
    3  -0.964980 -0.845696
    4  -0.693921  1.613616
    5  -1.226970  0.040403
    6   0.049307 -0.521493
    ..       ...       ...
    23 -0.469503  1.142702
    24 -0.486078  0.433042
    25  0.571599 -0.000676
    26 -0.143550  0.289401
    27 -0.192862  1.979055
    28 -0.657647 -0.952699
    29  0.313335 -0.399709

    [30 rows x 16 columns]
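
To see what numpy.r_ contributes, it can help to look at the index array it builds on a small case. The sketch below uses a made-up ten-column frame; note that, unlike R’s c(1:10, ...), the positions are zero-based and slice ends are exclusive.

```python
import numpy as np
import pandas as pd

# np.r_ concatenates slices and scalars into one integer array.
positions = np.r_[:3, 5, 8:10]
print(positions)

# Hypothetical frame with ten columns named a..j.
df = pd.DataFrame(np.zeros((2, 10)), columns=list("abcdefghij"))
subset = df.iloc[:, positions]
print(list(subset.columns))
```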

aggregate

In R you may want to split data into subsets and compute the mean for each. Using a data.frame called df and splitting it into groups by1 and by2:

    df <- data.frame(
      v1 = c(1,3,5,7,8,3,5,NA,4,5,7,9),
      v2 = c(11,33,55,77,88,33,55,NA,44,55,77,99),
      by1 = c("red", "blue", 1, 2, NA, "big", 1, 2, "red", 1, NA, 12),
      by2 = c("wet", "dry", 99, 95, NA, "damp", 95, 99, "red", 99, NA, NA))
    aggregate(x=df[, c("v1", "v2")], by=list(df$by1, df$by2), FUN = mean)

The groupby() method is similar to the base R aggregate function.

    In [9]: df = pd.DataFrame({
       ...:     'v1': [1, 3, 5, 7, 8, 3, 5, np.nan, 4, 5, 7, 9],
       ...:     'v2': [11, 33, 55, 77, 88, 33, 55, np.nan, 44, 55, 77, 99],
       ...:     'by1': ["red", "blue", 1, 2, np.nan, "big", 1, 2, "red", 1, np.nan, 12],
       ...:     'by2': ["wet", "dry", 99, 95, np.nan, "damp", 95, 99, "red", 99, np.nan,
       ...:             np.nan]
       ...: })
       ...:

    In [10]: g = df.groupby(['by1', 'by2'])

    In [11]: g[['v1', 'v2']].mean()
    Out[11]:
                v1    v2
    by1  by2
    1    95    5.0  55.0
         99    5.0  55.0
    2    95    7.0  77.0
         99    NaN   NaN
    big  damp  3.0  33.0
    blue dry   3.0  33.0
    red  red   4.0  44.0
         wet   1.0  11.0

For more details and examples see the groupby documentation.

match / %in%

A common way to select data in R is using %in%, which is defined using the function match. The operator %in% is used to return a logical vector indicating if there is a match or not:

    s <- 0:4
    s %in% c(2,4)

The isin() method is similar to the R %in% operator:

    In [12]: s = pd.Series(np.arange(5), dtype=np.float32)

    In [13]: s.isin([2, 4])
    Out[13]:
    0    False
    1    False
    2     True
    3    False
    4     True
    dtype: bool

The match function returns a vector of the positions of matches of its first argument in its second:

    s <- 0:4
    match(s, c(2,4))

The pd.match() function can be used to replicate this. Note that the returned positions are zero-based, whereas R’s match is one-based:

    In [14]: s = pd.Series(np.arange(5), dtype=np.float32)

    In [15]: pd.Series(pd.match(s, [2, 4], np.nan))
    Out[15]:
    0    NaN
    1    NaN
    2    0.0
    3    NaN
    4    1.0
    dtype: float64

For more details and examples see the reshaping documentation.
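
pd.match is not available in recent pandas releases. One possible substitute, sketched here as an assumption rather than an official replacement, is Index.get_indexer, which returns the zero-based position of each value in a lookup index and -1 where there is no match.

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3, 4])
table = pd.Index([2, 4])          # the values to match against

# Position of each element of s within table; -1 marks "no match".
positions = table.get_indexer(s)

# Turn the -1 sentinel into NaN to mirror the NA that R's match returns.
matched = pd.Series(positions, index=s.index).where(positions >= 0)
print(matched)
```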

tapply

tapply is similar to aggregate, but data can be in a ragged array, since the subclass sizes are possibly irregular. Using a data.frame called baseball, and retrieving information based on the array team:

    baseball <-
      data.frame(team = gl(5, 5,
                 labels = paste("Team", LETTERS[1:5])),
                 player = sample(letters, 25),
                 batting.average = runif(25, .200, .400))

    tapply(baseball$batting.average, baseball$team, max)

In pandas we may use the pivot_table() method to handle this:

    In [16]: import random

    In [17]: import string

    In [18]: baseball = pd.DataFrame({
       ....:     'team': ["team %d" % (x + 1) for x in range(5)] * 5,
       ....:     'player': random.sample(list(string.ascii_lowercase), 25),
       ....:     'batting avg': np.random.uniform(.200, .400, 25)
       ....: })
       ....:

    In [19]: baseball.pivot_table(values='batting avg', columns='team', aggfunc=np.max)
    Out[19]:
    team
    team 1    0.394457
    team 2    0.395730
    team 3    0.343015
    team 4    0.388863
    team 5    0.377379
    Name: batting avg, dtype: float64

For more details and examples see the reshaping documentation.
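
A per-group maximum like the tapply call above can also be written as a plain groupby reduction. The following is a small sketch with hand-picked, hypothetical batting averages so that the result is easy to check by eye.

```python
import pandas as pd

# Hypothetical data; the values are chosen for illustration only.
baseball = pd.DataFrame({
    "team": ["team 1", "team 1", "team 2", "team 2"],
    "batting avg": [0.250, 0.310, 0.275, 0.390],
})

# Split by team, take the maximum of one column in each group.
best = baseball.groupby("team")["batting avg"].max()
print(best)
```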

subset

New in version 0.13.

The query() method is similar to the base R subset function. In R you might want to get the rows of a data.frame where one column’s values are less than or equal to another column’s values:

    df <- data.frame(a=rnorm(10), b=rnorm(10))
    subset(df, a <= b)
    df[df$a <= df$b,]  # note the comma

In pandas, there are a few ways to perform subsetting. You can use query() or pass an expression as if it were an index/slice as well as standard boolean indexing:

    In [20]: df = pd.DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})

    In [21]: df.query('a <= b')
    Out[21]:
              a         b
    0 -1.003455 -0.990738
    1  0.083515  0.548796
    3 -0.524392  0.904400
    4 -0.837804  0.746374
    8 -0.507219  0.245479

    In [22]: df[df.a <= df.b]
    Out[22]:
              a         b
    0 -1.003455 -0.990738
    1  0.083515  0.548796
    3 -0.524392  0.904400
    4 -0.837804  0.746374
    8 -0.507219  0.245479

    In [23]: df.loc[df.a <= df.b]
    Out[23]:
              a         b
    0 -1.003455 -0.990738
    1  0.083515  0.548796
    3 -0.524392  0.904400
    4 -0.837804  0.746374
    8 -0.507219  0.245479

For more details and examples see the query documentation.
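
A query expression can also refer to ordinary Python variables by prefixing them with @, which is handy when the cutoff is not a column. A minimal sketch with made-up data and a hypothetical threshold variable:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 5, 3], "b": [2, 4, 3]})
threshold = 2  # hypothetical cutoff, defined outside the frame

# Column names are used bare; @threshold looks up the Python variable.
result = df.query("a <= b and a > @threshold")
print(result)
```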

with

New in version 0.13.

An expression using a data.frame called df in R with the columns a and b would be evaluated using with like so:

    df <- data.frame(a=rnorm(10), b=rnorm(10))
    with(df, a + b)
    df$a + df$b  # same as the previous expression

In pandas the equivalent expression, using the eval() method, would be:

    In [24]: df = pd.DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})

    In [25]: df.eval('a + b')
    Out[25]:
    0   -0.920205
    1   -0.860236
    2    1.154370
    3    0.188140
    4   -1.163718
    5    0.001397
    6   -0.825694
    7   -1.138198
    8   -1.708034
    9    1.148616
    dtype: float64

    In [26]: df.a + df.b  # same as the previous expression
    Out[26]:
    0   -0.920205
    1   -0.860236
    2    1.154370
    3    0.188140
    4   -1.163718
    5    0.001397
    6   -0.825694
    7   -1.138198
    8   -1.708034
    9    1.148616
    dtype: float64

In certain cases eval() will be much faster than evaluation in pure Python. For more details and examples see the eval documentation.

plyr

plyr is an R library for the split-apply-combine strategy for data analysis. The functions revolve around three data structures in R: a for arrays, l for lists, and d for data.frame. The table below shows how these data structures could be mapped in Python.

R             Python
array         list
lists         dictionary or list of objects
data.frame    dataframe

ddply

An expression using a data.frame called df in R where you want to summarize x by month and week:

    require(plyr)
    df <- data.frame(
      x = runif(120, 1, 168),
      y = runif(120, 7, 334),
      z = runif(120, 1.7, 20.7),
      month = rep(c(5,6,7,8),30),
      week = sample(1:4, 120, TRUE)
    )

    ddply(df, .(month, week), summarize,
          mean = round(mean(x), 2),
          sd = round(sd(x), 2))

In pandas the equivalent expression, using the groupby() method, would be:

    In [27]: df = pd.DataFrame({
       ....:     'x': np.random.uniform(1., 168., 120),
       ....:     'y': np.random.uniform(7., 334., 120),
       ....:     'z': np.random.uniform(1.7, 20.7, 120),
       ....:     'month': [5, 6, 7, 8] * 30,
       ....:     'week': np.random.randint(1, 4, 120)
       ....: })
       ....:

    In [28]: grouped = df.groupby(['month', 'week'])

    In [29]: grouped['x'].agg([np.mean, np.std])
    Out[29]:
                      mean        std
    month week
    5     1      71.840596  52.886392
          2      71.904794  55.786805
          3      89.845632  49.892367
    6     1      97.730877  52.442172
          2      93.369836  47.178389
          3      96.592088  58.773744
    7     1      59.255715  43.442336
          2      69.634012  28.607369
          3      84.510992  59.761096
    8     1     104.787666  31.745437
          2      69.717872  53.747188
          3      79.892221  52.950459

For more details and examples see the groupby documentation.
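
The ddply call rounds its summaries and returns a flat data.frame, while the groupby result keeps full precision and a hierarchical index. A sketch of how the output could be brought closer to the R shape, using a tiny hypothetical data set:

```python
import pandas as pd

# Hypothetical data, small enough to verify by hand.
df = pd.DataFrame({"month": [5, 5, 6, 6],
                   "week": [1, 1, 1, 2],
                   "x": [1.0, 2.0, 4.0, 10.0]})

summary = (df.groupby(["month", "week"])["x"]
             .agg(["mean", "std"])   # one column per statistic
             .round(2)               # like round(..., 2) in the R call
             .reset_index())         # group keys back to ordinary columns
print(summary)
```

Groups with a single observation get NaN for the standard deviation, since the sample standard deviation is undefined there.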

reshape / reshape2

melt.array

An expression using a 3 dimensional array called a in R where you want to melt it into a data.frame:

    a <- array(c(1:23, NA), c(2,3,4))
    data.frame(melt(a))

In Python, since a is a NumPy array, you can iterate over its indices and values with np.ndenumerate in a list comprehension.

    In [30]: a = np.array(list(range(1, 24)) + [np.nan]).reshape(2, 3, 4)

    In [31]: pd.DataFrame([tuple(list(x) + [val]) for x, val in np.ndenumerate(a)])
    Out[31]:
        0  1  2     3
    0   0  0  0   1.0
    1   0  0  1   2.0
    2   0  0  2   3.0
    3   0  0  3   4.0
    4   0  1  0   5.0
    5   0  1  1   6.0
    6   0  1  2   7.0
    .. .. .. ..   ...
    17  1  1  1  18.0
    18  1  1  2  19.0
    19  1  1  3  20.0
    20  1  2  0  21.0
    21  1  2  1  22.0
    22  1  2  2  23.0
    23  1  2  3   NaN

    [24 rows x 4 columns]
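
For larger arrays, a vectorized variant may be preferable to a Python-level loop. The sketch below builds the same index/value layout with np.indices; the column names dim0, dim1, dim2 and value are hypothetical labels chosen for readability.

```python
import numpy as np
import pandas as pd

a = np.arange(1, 25, dtype=float).reshape(2, 3, 4)
a[-1, -1, -1] = np.nan   # last cell is missing, as in the example above

# np.indices yields one index grid per axis; flatten each grid and
# transpose so that every row holds the (i, j, k) position of one cell.
idx = np.indices(a.shape).reshape(a.ndim, -1).T

melted = pd.DataFrame(idx, columns=["dim0", "dim1", "dim2"])
melted["value"] = a.ravel()   # row-major order matches the index rows
print(melted.head())
```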

melt.list

An expression using a list called a in R where you want to melt it into a data.frame:

    a <- as.list(c(1:4, NA))
    data.frame(melt(a))

In Python, this list would be a list of tuples, so the DataFrame() constructor would convert it to a dataframe as required.

    In [32]: a = list(enumerate(list(range(1, 5)) + [np.nan]))

    In [33]: pd.DataFrame(a)
    Out[33]:
       0    1
    0  0  1.0
    1  1  2.0
    2  2  3.0
    3  3  4.0
    4  4  NaN

For more details and examples see the Intro to Data Structures documentation.

melt.data.frame

An expression using a data.frame called cheese in R where you want to reshape the data.frame:

    cheese <- data.frame(
      first = c('John', 'Mary'),
      last = c('Doe', 'Bo'),
      height = c(5.5, 6.0),
      weight = c(130, 150)
    )
    melt(cheese, id=c("first", "last"))

In Python, the melt() method is the R equivalent:

    In [34]: cheese = pd.DataFrame({'first': ['John', 'Mary'],
       ....:                        'last': ['Doe', 'Bo'],
       ....:                        'height': [5.5, 6.0],
       ....:                        'weight': [130, 150]})
       ....:

    In [35]: pd.melt(cheese, id_vars=['first', 'last'])
    Out[35]:
      first last variable  value
    0  John  Doe   height    5.5
    1  Mary   Bo   height    6.0
    2  John  Doe   weight  130.0
    3  Mary   Bo   weight  150.0

    In [36]: cheese.set_index(['first', 'last']).stack()  # alternative way
    Out[36]:
    first  last
    John   Doe   height      5.5
                 weight    130.0
    Mary   Bo    height      6.0
                 weight    150.0
    dtype: float64

For more details and examples see the reshaping documentation.

cast

In R, acast is used with a data.frame called df to cast it into a higher dimensional array:

    df <- data.frame(
      x = runif(12, 1, 168),
      y = runif(12, 7, 334),
      z = runif(12, 1.7, 20.7),
      month = rep(c(5,6,7),4),
      week = rep(c(1,2), 6)
    )

    mdf <- melt(df, id=c("month", "week"))
    acast(mdf, week ~ month ~ variable, mean)

In Python the best way is to make use of pivot_table():

    In [37]: df = pd.DataFrame({
       ....:     'x': np.random.uniform(1., 168., 12),
       ....:     'y': np.random.uniform(7., 334., 12),
       ....:     'z': np.random.uniform(1.7, 20.7, 12),
       ....:     'month': [5, 6, 7] * 4,
       ....:     'week': [1, 2] * 6
       ....: })
       ....:

    In [38]: mdf = pd.melt(df, id_vars=['month', 'week'])

    In [39]: pd.pivot_table(mdf, values='value', index=['variable', 'week'],
       ....:                columns=['month'], aggfunc=np.mean)
       ....:
    Out[39]:
    month                   5           6           7
    variable week
    x        1     114.001700  132.227290   65.808204
             2     124.669553  147.495706   82.882820
    y        1     225.636630  301.864228   91.706834
             2      57.692665  215.851669  218.004383
    z        1      17.793871    7.124644   17.679823
             2      15.068355   13.873974    9.394966

Similarly for dcast, which uses a data.frame called df in R to aggregate information based on Animal and FeedType:

    df <- data.frame(
      Animal = c('Animal1', 'Animal2', 'Animal3', 'Animal2', 'Animal1',
                 'Animal2', 'Animal3'),
      FeedType = c('A', 'B', 'A', 'A', 'B', 'B', 'A'),
      Amount = c(10, 7, 4, 2, 5, 6, 2)
    )

    dcast(df, Animal ~ FeedType, sum, fill=NaN)
    # Alternative method using base R
    with(df, tapply(Amount, list(Animal, FeedType), sum))

Python can approach this in two different ways. Firstly, similar to above, using pivot_table():

    In [40]: df = pd.DataFrame({
       ....:     'Animal': ['Animal1', 'Animal2', 'Animal3', 'Animal2', 'Animal1',
       ....:                'Animal2', 'Animal3'],
       ....:     'FeedType': ['A', 'B', 'A', 'A', 'B', 'B', 'A'],
       ....:     'Amount': [10, 7, 4, 2, 5, 6, 2],
       ....: })
       ....:

    In [41]: df.pivot_table(values='Amount', index='Animal', columns='FeedType', aggfunc='sum')
    Out[41]:
    FeedType     A     B
    Animal
    Animal1   10.0   5.0
    Animal2    2.0  13.0
    Animal3    6.0   NaN

The second approach is to use the groupby() method:

    In [42]: df.groupby(['Animal', 'FeedType'])['Amount'].sum()
    Out[42]:
    Animal   FeedType
    Animal1  A           10
             B            5
    Animal2  A            2
             B           13
    Animal3  A            6
    Name: Amount, dtype: int64

For more details and examples see the reshaping documentation or the groupby documentation.
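
The groupby() result is in long form, with one row per (Animal, FeedType) pair. If the wide layout that dcast produces is wanted, one way to get there is to unstack the FeedType level, as in this sketch using the same data:

```python
import pandas as pd

df = pd.DataFrame({
    "Animal": ["Animal1", "Animal2", "Animal3", "Animal2", "Animal1",
               "Animal2", "Animal3"],
    "FeedType": ["A", "B", "A", "A", "B", "B", "A"],
    "Amount": [10, 7, 4, 2, 5, 6, 2],
})

# Sum per (Animal, FeedType), then pivot the FeedType level into columns.
# Combinations that never occur (Animal3, B) come out as NaN.
wide = df.groupby(["Animal", "FeedType"])["Amount"].sum().unstack("FeedType")
print(wide)
```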

factor

New in version 0.15.

pandas has a data type for categorical data.

    cut(c(1,2,3,4,5,6), 3)
    factor(c(1,2,3,2,2,3))

In pandas this is accomplished with pd.cut and astype("category"):

    In [43]: pd.cut(pd.Series([1, 2, 3, 4, 5, 6]), 3)
    Out[43]:
    0    (0.995, 2.667]
    1    (0.995, 2.667]
    2    (2.667, 4.333]
    3    (2.667, 4.333]
    4        (4.333, 6]
    5        (4.333, 6]
    dtype: category
    Categories (3, object): [(0.995, 2.667] < (2.667, 4.333] < (4.333, 6]]

    In [44]: pd.Series([1, 2, 3, 2, 2, 3]).astype("category")
    Out[44]:
    0    1
    1    2
    2    3
    3    2
    4    2
    5    3
    dtype: category
    Categories (3, int64): [1, 2, 3]

For more details and examples see categorical introduction and the API documentation. There is also documentation regarding the differences to R’s factor.
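
Much like levels() and as.integer() on an R factor, a categorical Series exposes its categories and integer codes through the .cat accessor, and pd.cut accepts custom bin labels. A short sketch; the labels low, mid and high are arbitrary names chosen for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 2, 2, 3]).astype("category")
print(s.cat.categories.tolist())   # the distinct levels
print(s.cat.codes.tolist())        # zero-based integer code per element

# Three equal-width bins with hypothetical labels instead of intervals.
binned = pd.cut(pd.Series([1, 2, 3, 4, 5, 6]), 3,
                labels=["low", "mid", "high"])
print(binned.tolist())
```

Note that the codes are zero-based, whereas R’s integer representation of a factor starts at 1.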
