Since pandas aims to provide much of the data manipulation and analysis functionality that people use R for, this page offers a more detailed look at the R language and its many third-party libraries as they relate to pandas. In comparisons with R and CRAN libraries, we care about the following things:
- Functionality / flexibility: what can/cannot be done with each tool
- Performance: how fast are operations. Hard numbers/benchmarks are preferable
- Ease-of-use: Is one tool easier/harder to use (you may have to be the judge of this, given side-by-side code comparisons)
This page is also here to offer a bit of a translation guide for users of these R packages.
For transfer of DataFrame objects from pandas to R, one option is to use HDF5 files; see External Compatibility for an example.
We’ll start off with a quick reference guide pairing some common R operations using dplyr with pandas equivalents.
| R | pandas |
|---|---|
| dim(df) | df.shape |
| head(df) | df.head() |
| slice(df, 1:10) | df.iloc[:10] |
| filter(df, col1 == 1, col2 == 1) | df.query('col1 == 1 & col2 == 1') |
| df[df$col1 == 1 & df$col2 == 1,] | df[(df.col1 == 1) & (df.col2 == 1)] |
| select(df, col1, col2) | df[['col1', 'col2']] |
| select(df, col1:col3) | df.loc[:, 'col1':'col3'] |
| select(df, -(col1:col3)) | df.drop(cols_to_drop, axis=1) but see [1] |
| distinct(select(df, col1)) | df[['col1']].drop_duplicates() |
| distinct(select(df, col1, col2)) | df[['col1', 'col2']].drop_duplicates() |
| sample_n(df, 10) | df.sample(n=10) |
| sample_frac(df, 0.01) | df.sample(frac=0.01) |
| [1] | R’s shorthand for a subrange of columns (select(df, col1:col3)) can be approached cleanly in pandas if you have the list of columns, for example df[cols[1:3]] or df.drop(cols[1:3]), but doing this by column name is a bit messy. |
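As the footnote suggests, dropping a contiguous range of columns by name takes an extra step in pandas. One workaround (a sketch, using a hypothetical frame with columns col1..col5) is to let .loc resolve the label range and then hand those columns to drop():

```python
import numpy as np
import pandas as pd

# Hypothetical frame; only the column labels matter for this example.
df = pd.DataFrame(np.random.randn(3, 5),
                  columns=['col1', 'col2', 'col3', 'col4', 'col5'])

# .loc resolves the label slice 'col1':'col3' (inclusive on both ends),
# and drop() removes exactly those columns -- the pandas analogue of
# R's select(df, -(col1:col3)).
dropped = df.drop(df.loc[:, 'col1':'col3'].columns, axis=1)
print(list(dropped.columns))  # ['col4', 'col5']
```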
| R | pandas |
|---|---|
| arrange(df, col1, col2) | df.sort_values(['col1', 'col2']) |
| arrange(df, desc(col1)) | df.sort_values('col1', ascending=False) |
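dplyr mixes ascending and descending keys freely, as in arrange(df, col1, desc(col2)); sort_values() accepts a list for ascending to do the same. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 2], 'col2': [7, 9, 8]})

# col1 ascending, col2 descending -- equivalent to
# arrange(df, col1, desc(col2)) in dplyr.
out = df.sort_values(['col1', 'col2'], ascending=[True, False])
print(out['col2'].tolist())  # [9, 7, 8]
```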
| R | pandas |
|---|---|
| select(df, col_one = col1) | df.rename(columns={'col1': 'col_one'})['col_one'] |
| rename(df, col_one = col1) | df.rename(columns={'col1': 'col_one'}) |
| mutate(df, c = a - b) | df.assign(c=df.a - df.b) |
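Like mutate(), assign() returns a new frame rather than modifying the original; passing a callable lets it work mid-chain, when no variable refers to the intermediate frame. A sketch with made-up columns:

```python
import pandas as pd

df = pd.DataFrame({'a': [5, 8], 'b': [2, 3]})

# The callable receives the (possibly intermediate) DataFrame,
# similar in spirit to mutate(df, c = a - b).
out = df.assign(c=lambda d: d.a - d.b)
print(out['c'].tolist())   # [3, 5]
print('c' in df.columns)   # False -- assign returns a copy
```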
| R | pandas |
|---|---|
| summary(df) | df.describe() |
| gdf <- group_by(df, col1) | gdf = df.groupby('col1') |
| summarise(gdf, avg = mean(col1, na.rm = TRUE)) | df.groupby('col1').agg({'col1': 'mean'}) |
| summarise(gdf, total = sum(col1)) | df.groupby('col1').sum() |
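The summarise() rows above can be tried end to end on a small frame; agg() with a list of function names produces one column per statistic, much like calling summarise() with several named arguments. A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'b'], 'col2': [1.0, 3.0, 5.0]})

# One output column per aggregation, analogous to
# summarise(gdf, avg = mean(col2), total = sum(col2)) in dplyr.
out = df.groupby('col1')['col2'].agg(['mean', 'sum'])
print(out.loc['a', 'mean'])  # 2.0
print(out.loc['b', 'sum'])   # 5.0
```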
## c

R makes it easy to access data.frame columns by name
    df <- data.frame(a=rnorm(5), b=rnorm(5), c=rnorm(5), d=rnorm(5), e=rnorm(5))
    df[, c("a", "c", "e")]
or by integer location
    df <- data.frame(matrix(rnorm(1000), ncol=100))
    df[, c(1:10, 25:30, 40, 50:100)]
Selecting multiple columns by name in pandas is straightforward
    In [1]: df = pd.DataFrame(np.random.randn(10, 3), columns=list('abc'))

    In [2]: df[['a', 'c']]
    Out[2]:
              a         c
    0 -1.039575 -0.424972
    1  0.567020 -1.087401
    2 -0.673690 -1.478427
    3  0.524988  0.577046
    4 -1.715002 -0.370647
    5 -1.157892  0.844885
    6  1.075770  1.643563
    7 -1.469388 -0.674600
    8 -1.776904 -1.294524
    9  0.413738 -0.472035

    In [3]: df.loc[:, ['a', 'c']]
    Out[3]:
              a         c
    0 -1.039575 -0.424972
    1  0.567020 -1.087401
    2 -0.673690 -1.478427
    3  0.524988  0.577046
    4 -1.715002 -0.370647
    5 -1.157892  0.844885
    6  1.075770  1.643563
    7 -1.469388 -0.674600
    8 -1.776904 -1.294524
    9  0.413738 -0.472035
Selecting multiple non-contiguous columns by integer location can be achieved with a combination of the iloc indexer attribute and numpy.r_.
    In [4]: named = list('abcdefg')

    In [5]: n = 30

    In [6]: columns = named + np.arange(len(named), n).tolist()

    In [7]: df = pd.DataFrame(np.random.randn(n, n), columns=columns)

    In [8]: df.iloc[:, np.r_[:10, 24:30]]
    Out[8]:
                a         b         c         d         e         f         g  \
    0   -0.013960 -0.362543 -0.006154 -0.923061  0.895717  0.805244 -1.206412
    1    0.545952 -1.219217 -1.226825  0.769804 -1.281247 -0.727707 -0.121306
    2    2.396780  0.014871  3.357427 -0.317441 -1.236269  0.896171 -0.487602
    3   -0.988387  0.094055  1.262731  1.289997  0.082423 -0.055758  0.536580
    4   -1.340896  1.846883 -1.328865  1.682706 -1.717693  0.888782  0.228440
    5    0.464000  0.227371 -0.496922  0.306389 -2.290613 -1.134623 -1.561819
    6   -0.507516 -0.230096  0.394500 -1.934370 -1.652499  1.488753 -0.896484
    ..        ...       ...       ...       ...       ...       ...       ...
    23  -0.083272 -0.273955 -0.772369 -1.242807 -0.386336 -0.182486  0.164816
    24   2.071413 -1.364763  1.122066  0.066847  1.751987  0.419071 -1.118283
    25   0.036609  0.359986  1.211905  0.850427  1.554957 -0.888463 -1.508808
    26  -1.179240  0.238923  1.756671 -0.747571  0.543625 -0.159609 -0.051458
    27   0.025645  0.932436 -1.694531 -0.182236 -1.072710  0.466764 -0.072673
    28   0.439086  0.812684 -0.128932 -0.142506 -1.137207  0.462001 -0.159466
    29  -0.909806 -0.312006  0.383630 -0.631606  1.321415 -0.004799 -2.008210

                7         8         9        24        25        26        27  \
    0    2.565646  1.431256  1.340309  0.875906 -2.211372  0.974466 -2.006747
    1   -0.097883  0.695775  0.341734 -1.743161 -0.826591 -0.345352  1.314232
    2   -0.082240 -2.182937  0.380396  1.266143  0.299368 -0.863838  0.408204
    3   -0.489682  0.369374 -0.034571  0.221471 -0.744471  0.758527  1.729689
    4    0.901805  1.171216  0.520260  0.650776 -1.461665 -1.137707 -0.891060
    5   -0.260838  0.281957  1.523962 -0.008434  1.952541 -1.056652  0.533946
    6    0.576897  1.146000  1.487349  2.015523 -1.833722  1.771740 -0.670027
    ..        ...       ...       ...       ...       ...       ...       ...
    23   0.065624  0.307665 -1.898358  1.389045 -0.873585 -0.699862  0.812477
    24   1.010694  0.877138 -0.611561 -1.040389 -0.796211  0.241596  0.385922
    25  -0.617855  0.536164  2.175585  1.872601 -2.513465 -0.139184  0.810491
    26   0.937882  0.617547  0.287918 -1.584814  0.307941  1.809049  0.296237
    27  -0.026233 -0.051744  0.001402  0.150664 -3.060395  0.040268  0.066091
    28  -1.788308  0.753604  0.918071  0.922729  0.869610  0.364726 -0.226101
    29  -0.481634 -2.056211 -2.106095  0.039227  0.211283  1.440190 -0.989193

               28        29
    0   -0.410001 -0.078638
    1    0.690579  0.995761
    2   -1.048089 -0.025747
    3   -0.964980 -0.845696
    4   -0.693921  1.613616
    5   -1.226970  0.040403
    6    0.049307 -0.521493
    ..        ...       ...
    23  -0.469503  1.142702
    24  -0.486078  0.433042
    25   0.571599 -0.000676
    26  -0.143550  0.289401
    27  -0.192862  1.979055
    28  -0.657647 -0.952699
    29   0.313335 -0.399709

    [30 rows x 16 columns]
## aggregate

In R you may want to split data into subsets and compute the mean for each. Using a data.frame called df and splitting it into groups by1 and by2:
    df <- data.frame(
      v1 = c(1, 3, 5, 7, 8, 3, 5, NA, 4, 5, 7, 9),
      v2 = c(11, 33, 55, 77, 88, 33, 55, NA, 44, 55, 77, 99),
      by1 = c("red", "blue", 1, 2, NA, "big", 1, 2, "red", 1, NA, 12),
      by2 = c("wet", "dry", 99, 95, NA, "damp", 95, 99, "red", 99, NA, NA))
    aggregate(x=df[, c("v1", "v2")], by=list(df$by1, df$by2), FUN = mean)
The groupby() method is similar to the base R aggregate function.
    In [9]: df = pd.DataFrame({
       ...:     'v1': [1, 3, 5, 7, 8, 3, 5, np.nan, 4, 5, 7, 9],
       ...:     'v2': [11, 33, 55, 77, 88, 33, 55, np.nan, 44, 55, 77, 99],
       ...:     'by1': ["red", "blue", 1, 2, np.nan, "big", 1, 2, "red", 1, np.nan, 12],
       ...:     'by2': ["wet", "dry", 99, 95, np.nan, "damp", 95, 99, "red", 99, np.nan,
       ...:             np.nan]
       ...: })

    In [10]: g = df.groupby(['by1', 'by2'])

    In [11]: g[['v1', 'v2']].mean()
    Out[11]:
                v1    v2
    by1  by2
    1    95    5.0  55.0
         99    5.0  55.0
    2    95    7.0  77.0
         99    NaN   NaN
    big  damp  3.0  33.0
    blue dry   3.0  33.0
    red  red   4.0  44.0
         wet   1.0  11.0
For more details and examples see the groupby documentation.
## match / %in%

A common way to select data in R is using %in%, which is defined using the function match. The operator %in% is used to return a logical vector indicating if there is a match or not:
    s <- 0:4
    s %in% c(2, 4)
The isin() method is similar to the R %in% operator:
    In [12]: s = pd.Series(np.arange(5), dtype=np.float32)

    In [13]: s.isin([2, 4])
    Out[13]:
    0    False
    1    False
    2     True
    3    False
    4     True
    dtype: bool
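R's negated form, !(s %in% c(2, 4)), maps to the ~ operator on the boolean Series returned by isin():

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5))

# ~ inverts the boolean mask -- the pandas analogue of !(s %in% c(2, 4)).
mask = ~s.isin([2, 4])
print(mask.tolist())  # [True, True, False, True, False]
```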
The match function returns a vector of the positions of matches of its first argument in its second:
    s <- 0:4
    match(s, c(2, 4))
The pd.match function can be used to replicate this:
    In [14]: s = pd.Series(np.arange(5), dtype=np.float32)

    In [15]: pd.Series(pd.match(s, [2, 4], np.nan))
    Out[15]:
    0    NaN
    1    NaN
    2    0.0
    3    NaN
    4    1.0
    dtype: float64
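Note that pd.match was removed in later pandas releases; a similar result can be obtained with Index.get_indexer, which returns -1 (rather than NA) for values with no match. A sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5))

# get_indexer returns the position of each value of `s` in the index,
# with -1 marking "no match" (where R's match would return NA).
positions = pd.Index([2, 4]).get_indexer(s)
print(positions.tolist())  # [-1, -1, 0, -1, 1]

# Convert the -1 sentinels to NaN to mirror R's output:
result = pd.Series(positions).where(positions >= 0)
```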
For more details and examples see the reshaping documentation.
## tapply

tapply is similar to aggregate, but data can be in a ragged array, since the subclass sizes are possibly irregular. Using a data.frame called baseball, and retrieving information based on the array team:
    baseball <- data.frame(
      team = gl(5, 5, labels = paste("Team", LETTERS[1:5])),
      player = sample(letters, 25),
      batting.average = runif(25, .200, .400))
    tapply(baseball$batting.average, baseball$team, max)
In pandas we may use the pivot_table() method to handle this:
    In [16]: import random

    In [17]: import string

    In [18]: baseball = pd.DataFrame({
       ....:     'team': ["team %d" % (x + 1) for x in range(5)] * 5,
       ....:     'player': random.sample(list(string.ascii_lowercase), 25),
       ....:     'batting avg': np.random.uniform(.200, .400, 25)
       ....: })

    In [19]: baseball.pivot_table(values='batting avg', columns='team', aggfunc=np.max)
    Out[19]:
    team
    team 1    0.394457
    team 2    0.395730
    team 3    0.343015
    team 4    0.388863
    team 5    0.377379
    Name: batting avg, dtype: float64
For more details and examples see the reshaping documentation.
## subset

New in version 0.13.
The query() method is similar to the base R subset function. In R you might want to get the rows of a data.frame where one column’s values are less than another column’s values:
    df <- data.frame(a=rnorm(10), b=rnorm(10))
    subset(df, a <= b)
    df[df$a <= df$b, ]  # note the comma
In pandas, there are a few ways to perform subsetting. You can use query() or pass an expression as if it were an index/slice, as well as standard boolean indexing:
    In [20]: df = pd.DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})

    In [21]: df.query('a <= b')
    Out[21]:
              a         b
    0 -1.003455 -0.990738
    1  0.083515  0.548796
    3 -0.524392  0.904400
    4 -0.837804  0.746374
    8 -0.507219  0.245479

    In [22]: df[df.a <= df.b]
    Out[22]:
              a         b
    0 -1.003455 -0.990738
    1  0.083515  0.548796
    3 -0.524392  0.904400
    4 -0.837804  0.746374
    8 -0.507219  0.245479

    In [23]: df.loc[df.a <= df.b]
    Out[23]:
              a         b
    0 -1.003455 -0.990738
    1  0.083515  0.548796
    3 -0.524392  0.904400
    4 -0.837804  0.746374
    8 -0.507219  0.245479
For more details and examples see the query documentation.
## with

New in version 0.13.
An expression using a data.frame called df in R with the columns a and b would be evaluated using with like so:
    df <- data.frame(a=rnorm(10), b=rnorm(10))
    with(df, a + b)
    df$a + df$b  # same as the previous expression
In pandas the equivalent expression, using the eval() method, would be:
    In [24]: df = pd.DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})

    In [25]: df.eval('a + b')
    Out[25]:
    0   -0.920205
    1   -0.860236
    2    1.154370
    3    0.188140
    4   -1.163718
    5    0.001397
    6   -0.825694
    7   -1.138198
    8   -1.708034
    9    1.148616
    dtype: float64

    In [26]: df.a + df.b  # same as the previous expression
    Out[26]:
    0   -0.920205
    1   -0.860236
    2    1.154370
    3    0.188140
    4   -1.163718
    5    0.001397
    6   -0.825694
    7   -1.138198
    8   -1.708034
    9    1.148616
    dtype: float64
In certain cases eval() will be much faster than evaluation in pure Python. For more details and examples see the eval documentation.
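eval() can also create a column from an expression string, comparable to within(df, c <- a + b) in base R; by default it returns a new DataFrame and leaves the original untouched. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# The assignment form of eval() returns a new frame with the extra
# column, roughly like within(df, c <- a + b) in base R.
out = df.eval('c = a + b')
print(out['c'].tolist())  # [4, 6]
print('c' in df.columns)  # False -- original frame is unchanged
```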
plyr is an R library for the split-apply-combine strategy for data analysis. The functions revolve around three data structures in R: a for arrays, l for lists, and d for data.frame. The table below shows how these data structures could be mapped in Python.
| R | Python |
|---|---|
| array | list |
| lists | dictionary or list of objects |
| data.frame | DataFrame |
## ddply

An expression using a data.frame called df in R where you want to summarize x by month:
    require(plyr)
    df <- data.frame(
      x = runif(120, 1, 168),
      y = runif(120, 7, 334),
      z = runif(120, 1.7, 20.7),
      month = rep(c(5, 6, 7, 8), 30),
      week = sample(1:4, 120, TRUE))
    ddply(df, .(month, week), summarize,
          mean = round(mean(x), 2),
          sd = round(sd(x), 2))
In pandas the equivalent expression, using the groupby() method, would be:
    In [27]: df = pd.DataFrame({
       ....:     'x': np.random.uniform(1., 168., 120),
       ....:     'y': np.random.uniform(7., 334., 120),
       ....:     'z': np.random.uniform(1.7, 20.7, 120),
       ....:     'month': [5, 6, 7, 8] * 30,
       ....:     'week': np.random.randint(1, 4, 120)
       ....: })

    In [28]: grouped = df.groupby(['month', 'week'])

    In [29]: grouped['x'].agg([np.mean, np.std])
    Out[29]:
                      mean        std
    month week
    5     1      71.840596  52.886392
          2      71.904794  55.786805
          3      89.845632  49.892367
    6     1      97.730877  52.442172
          2      93.369836  47.178389
          3      96.592088  58.773744
    7     1      59.255715  43.442336
          2      69.634012  28.607369
          3      84.510992  59.761096
    8     1     104.787666  31.745437
          2      69.717872  53.747188
          3      79.892221  52.950459
For more details and examples see the groupby documentation.
## melt.array

An expression using a 3-dimensional array called a in R where you want to melt it into a data.frame:
    a <- array(c(1:23, NA), c(2, 3, 4))
    data.frame(melt(a))
In Python, since a here is a NumPy array, you can simply use a list comprehension over np.ndenumerate.
    In [30]: a = np.array(list(range(1, 24)) + [np.NAN]).reshape(2, 3, 4)

    In [31]: pd.DataFrame([tuple(list(x) + [val]) for x, val in np.ndenumerate(a)])
    Out[31]:
         0  1  2     3
    0    0  0  0   1.0
    1    0  0  1   2.0
    2    0  0  2   3.0
    3    0  0  3   4.0
    4    0  1  0   5.0
    5    0  1  1   6.0
    6    0  1  2   7.0
    ..  .. .. ..   ...
    17   1  1  1  18.0
    18   1  1  2  19.0
    19   1  1  3  20.0
    20   1  2  0  21.0
    21   1  2  1  22.0
    22   1  2  2  23.0
    23   1  2  3   NaN

    [24 rows x 4 columns]
## melt.list

An expression using a list called a in R where you want to melt it into a data.frame:
    a <- as.list(c(1:4, NA))
    data.frame(melt(a))
In Python, this list would be a list of tuples, so the DataFrame() constructor can convert it to a DataFrame as required.
    In [32]: a = list(enumerate(list(range(1, 5)) + [np.NAN]))

    In [33]: pd.DataFrame(a)
    Out[33]:
       0    1
    0  0  1.0
    1  1  2.0
    2  2  3.0
    3  3  4.0
    4  4  NaN
For more details and examples see the Intro to Data Structures documentation.
## melt.data.frame

An expression using a data.frame called cheese in R where you want to reshape the data.frame:
    cheese <- data.frame(
      first = c('John', 'Mary'),
      last = c('Doe', 'Bo'),
      height = c(5.5, 6.0),
      weight = c(130, 150))
    melt(cheese, id=c("first", "last"))
In Python, the melt() function is the R equivalent:
    In [34]: cheese = pd.DataFrame({'first': ['John', 'Mary'],
       ....:                        'last': ['Doe', 'Bo'],
       ....:                        'height': [5.5, 6.0],
       ....:                        'weight': [130, 150]})

    In [35]: pd.melt(cheese, id_vars=['first', 'last'])
    Out[35]:
      first last variable  value
    0  John  Doe   height    5.5
    1  Mary   Bo   height    6.0
    2  John  Doe   weight  130.0
    3  Mary   Bo   weight  150.0

    In [36]: cheese.set_index(['first', 'last']).stack()  # alternative way
    Out[36]:
    first  last
    John   Doe   height      5.5
                 weight    130.0
    Mary   Bo    height      6.0
                 weight    150.0
    dtype: float64
For more details and examples see the reshaping documentation.
## cast

In R, acast is an expression using a data.frame called df to cast into a higher dimensional array:
    df <- data.frame(
      x = runif(12, 1, 168),
      y = runif(12, 7, 334),
      z = runif(12, 1.7, 20.7),
      month = rep(c(5, 6, 7), 4),
      week = rep(c(1, 2), 6))
    mdf <- melt(df, id=c("month", "week"))
    acast(mdf, week ~ month ~ variable, mean)
In Python the best way is to make use of pivot_table():
    In [37]: df = pd.DataFrame({
       ....:     'x': np.random.uniform(1., 168., 12),
       ....:     'y': np.random.uniform(7., 334., 12),
       ....:     'z': np.random.uniform(1.7, 20.7, 12),
       ....:     'month': [5, 6, 7] * 4,
       ....:     'week': [1, 2] * 6
       ....: })

    In [38]: mdf = pd.melt(df, id_vars=['month', 'week'])

    In [39]: pd.pivot_table(mdf, values='value', index=['variable', 'week'],
       ....:                columns=['month'], aggfunc=np.mean)
    Out[39]:
    month                  5           6           7
    variable week
    x        1    114.001700  132.227290   65.808204
             2    124.669553  147.495706   82.882820
    y        1    225.636630  301.864228   91.706834
             2     57.692665  215.851669  218.004383
    z        1     17.793871    7.124644   17.679823
             2     15.068355   13.873974    9.394966
Similarly for dcast, which uses a data.frame called df in R to aggregate information based on Animal and FeedType:
    df <- data.frame(
      Animal = c('Animal1', 'Animal2', 'Animal3', 'Animal2', 'Animal1',
                 'Animal2', 'Animal3'),
      FeedType = c('A', 'B', 'A', 'A', 'B', 'B', 'A'),
      Amount = c(10, 7, 4, 2, 5, 6, 2))

    dcast(df, Animal ~ FeedType, sum, fill=NaN)
    # Alternative method using base R
    with(df, tapply(Amount, list(Animal, FeedType), sum))
Python can approach this in two different ways. Firstly, similar to above, using pivot_table():
    In [40]: df = pd.DataFrame({
       ....:     'Animal': ['Animal1', 'Animal2', 'Animal3', 'Animal2', 'Animal1',
       ....:                'Animal2', 'Animal3'],
       ....:     'FeedType': ['A', 'B', 'A', 'A', 'B', 'B', 'A'],
       ....:     'Amount': [10, 7, 4, 2, 5, 6, 2],
       ....: })

    In [41]: df.pivot_table(values='Amount', index='Animal', columns='FeedType', aggfunc='sum')
    Out[41]:
    FeedType     A     B
    Animal
    Animal1   10.0   5.0
    Animal2    2.0  13.0
    Animal3    6.0   NaN
The second approach is to use the groupby() method:
    In [42]: df.groupby(['Animal', 'FeedType'])['Amount'].sum()
    Out[42]:
    Animal   FeedType
    Animal1  A           10
             B            5
    Animal2  A            2
             B           13
    Animal3  A            6
    Name: Amount, dtype: int64
For more details and examples see the reshaping documentation or the groupby documentation.
## factor

New in version 0.15.
pandas has a data type for categorical data.
    cut(c(1, 2, 3, 4, 5, 6), 3)
    factor(c(1, 2, 3, 2, 2, 3))
In pandas this is accomplished with pd.cut and astype("category"):
    In [43]: pd.cut(pd.Series([1, 2, 3, 4, 5, 6]), 3)
    Out[43]:
    0    (0.995, 2.667]
    1    (0.995, 2.667]
    2    (2.667, 4.333]
    3    (2.667, 4.333]
    4        (4.333, 6]
    5        (4.333, 6]
    dtype: category
    Categories (3, object): [(0.995, 2.667] < (2.667, 4.333] < (4.333, 6]]

    In [44]: pd.Series([1, 2, 3, 2, 2, 3]).astype("category")
    Out[44]:
    0    1
    1    2
    2    3
    3    2
    4    2
    5    3
    dtype: category
    Categories (3, int64): [1, 2, 3]
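R's cut() also accepts a labels= argument naming the levels; pd.cut supports the same keyword, producing labeled categories rather than interval notation. A sketch:

```python
import pandas as pd

# labels= names the three bins, as cut(..., labels=...) does in R.
s = pd.cut(pd.Series([1, 2, 3, 4, 5, 6]), 3,
           labels=['small', 'medium', 'large'])
print(s.tolist())  # ['small', 'small', 'medium', 'medium', 'large', 'large']
```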
For more details and examples see the categorical introduction and the API documentation. There is also documentation on the differences from R’s factor.