
Comparison with R / R libraries#

Since pandas aims to provide much of the data manipulation and analysis functionality that people use R for, this page was started to provide a more detailed look at the R language and its many third-party libraries as they relate to pandas. In comparisons with R and CRAN libraries, we care about the following things:

  • Functionality / flexibility: what can/cannot be done with each tool

  • Performance: how fast are operations. Hard numbers/benchmarks are preferable

  • Ease-of-use: Is one tool easier/harder to use (you may have to be the judge of this, given side-by-side code comparisons)

This page is also here to offer a bit of a translation guide for users of these R packages.

Quick reference#

We’ll start off with a quick reference guide pairing some common R operations using dplyr with pandas equivalents.

Querying, filtering, sampling#

| R                                 | pandas                                     |
|-----------------------------------|--------------------------------------------|
| dim(df)                           | df.shape                                   |
| head(df)                          | df.head()                                  |
| slice(df, 1:10)                   | df.iloc[:10]                               |
| filter(df, col1 == 1, col2 == 1)  | df.query('col1 == 1 & col2 == 1')          |
| df[df$col1 == 1 & df$col2 == 1, ] | df[(df.col1 == 1) & (df.col2 == 1)]        |
| select(df, col1, col2)            | df[['col1', 'col2']]                       |
| select(df, col1:col3)             | df.loc[:, 'col1':'col3']                   |
| select(df, -(col1:col3))          | df.drop(cols_to_drop, axis=1) but see [1]  |
| distinct(select(df, col1))        | df[['col1']].drop_duplicates()             |
| distinct(select(df, col1, col2))  | df[['col1', 'col2']].drop_duplicates()     |
| sample_n(df, 10)                  | df.sample(n=10)                            |
| sample_frac(df, 0.01)             | df.sample(frac=0.01)                       |

[1] R’s shorthand for a subrange of columns (select(df, col1:col3)) can be approached cleanly in pandas if you have the list of columns, for example df[cols[1:3]] or df.drop(cols[1:3]), but doing this by column name is a bit messy.
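If the columns really must be addressed by name, one hedged sketch (the column names col1..col5 below are made up for illustration) is to translate the name range into positions with Index.get_loc and then drop that positional span:

```python
import numpy as np
import pandas as pd

# Illustrative frame; the column names are hypothetical.
df = pd.DataFrame(np.arange(20).reshape(4, 5),
                  columns=["col1", "col2", "col3", "col4", "col5"])

# Emulate R's select(df, -(col1:col3)): find the positional span of the
# name range, then drop those columns.
start = df.columns.get_loc("col1")
stop = df.columns.get_loc("col3") + 1
result = df.drop(columns=df.columns[start:stop])
print(result.columns.tolist())  # ['col4', 'col5']
```

This assumes the name range maps to a contiguous block of columns, which is the same assumption R's col1:col3 shorthand makes.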

Sorting#

| R                        | pandas                                   |
|--------------------------|------------------------------------------|
| arrange(df, col1, col2)  | df.sort_values(['col1', 'col2'])         |
| arrange(df, desc(col1))  | df.sort_values('col1', ascending=False)  |
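One pattern the table does not show: R mixes sort directions with desc() inside a single arrange() call. In pandas the same effect comes from passing a list to ascending=, one flag per sort key (the tiny frame below is illustrative only):

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 1, 2], "col2": [5, 9, 7]})

# arrange(df, col1, desc(col2)) -> ascending col1, descending col2
out = df.sort_values(["col1", "col2"], ascending=[True, False])
print(out["col2"].tolist())  # [9, 5, 7]
```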

Transforming#

| R                          | pandas                                            |
|----------------------------|---------------------------------------------------|
| select(df, col_one = col1) | df.rename(columns={'col1': 'col_one'})['col_one'] |
| rename(df, col_one = col1) | df.rename(columns={'col1': 'col_one'})            |
| mutate(df, c = a - b)      | df.assign(c=df['a'] - df['b'])                    |
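A note on mutate(): in R, later arguments can refer to columns created earlier in the same call. assign() supports the same chaining when the values are callables, which are evaluated left to right against the updated frame (the example data here is made up):

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 5], "b": [1, 2]})

# mutate(df, c = a - b, d = c * 2): the callable for 'd' sees the
# freshly created 'c' column.
out = df.assign(c=lambda d: d["a"] - d["b"], d=lambda d: d["c"] * 2)
print(out["d"].tolist())  # [4, 6]
```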

Grouping and summarizing#

| R                                            | pandas                                    |
|----------------------------------------------|-------------------------------------------|
| summary(df)                                  | df.describe()                             |
| gdf <- group_by(df, col1)                    | gdf = df.groupby('col1')                  |
| summarise(gdf, avg = mean(col1, na.rm=TRUE)) | df.groupby('col1').agg({'col1': 'mean'})  |
| summarise(gdf, total = sum(col1))            | df.groupby('col1').sum()                  |
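dplyr's summarise() also names the result column (avg = ...). pandas mirrors that with named aggregation, where each keyword becomes an output column (the small frame below is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"col1": ["a", "a", "b"], "col2": [1.0, 3.0, 5.0]})

# summarise(gdf, avg = mean(col2)) -> named aggregation
out = df.groupby("col1").agg(avg=("col2", "mean"))
print(out["avg"].tolist())  # [2.0, 5.0]
```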

Base R#

Slicing with R’s c#

R makes it easy to access data.frame columns by name

```r
df <- data.frame(a = rnorm(5), b = rnorm(5), c = rnorm(5), d = rnorm(5), e = rnorm(5))
df[, c("a", "c", "e")]
```

or by integer location

```r
df <- data.frame(matrix(rnorm(1000), ncol = 100))
df[, c(1:10, 25:30, 40, 50:100)]
```

Selecting multiple columns by name in pandas is straightforward

```python
In [1]: df = pd.DataFrame(np.random.randn(10, 3), columns=list("abc"))

In [2]: df[["a", "c"]]
Out[2]:
          a         c
0  0.469112 -1.509059
1 -1.135632 -0.173215
2  0.119209 -0.861849
3 -2.104569  1.071804
4  0.721555 -1.039575
5  0.271860  0.567020
6  0.276232 -0.673690
7  0.113648  0.524988
8  0.404705 -1.715002
9 -1.039268 -1.157892

In [3]: df.loc[:, ["a", "c"]]
Out[3]:
          a         c
0  0.469112 -1.509059
1 -1.135632 -0.173215
2  0.119209 -0.861849
3 -2.104569  1.071804
4  0.721555 -1.039575
5  0.271860  0.567020
6  0.276232 -0.673690
7  0.113648  0.524988
8  0.404705 -1.715002
9 -1.039268 -1.157892
```

Selecting multiple noncontiguous columns by integer location can be achieved with a combination of the iloc indexer attribute and numpy.r_.

```python
In [4]: named = list("abcdefg")

In [5]: n = 30

In [6]: columns = named + np.arange(len(named), n).tolist()

In [7]: df = pd.DataFrame(np.random.randn(n, n), columns=columns)

In [8]: df.iloc[:, np.r_[:10, 24:30]]
Out[8]:
           a         b         c  ...        27        28        29
0  -1.344312  0.844885  1.075770  ...  0.813850  0.132003 -0.827317
1  -0.076467 -1.187678  1.130127  ...  0.149748 -0.732339  0.687738
2   0.176444  0.403310 -0.154951  ... -0.493662  0.600178  0.274230
3   0.132885 -0.023688  2.410179  ...  0.109121  1.126203 -0.977349
4   1.474071 -0.064034 -1.282782  ... -0.858447  0.306996 -0.028665
..       ...       ...       ...  ...       ...       ...       ...
25  1.492125 -0.068190  0.681456  ...  0.428572  0.880609  0.487645
26  0.725238  0.624607 -0.141185  ...  1.008500  1.424017  0.717110
27  1.262419  1.950057  0.301038  ...  1.007824  2.826008  1.458383
28 -1.585746 -0.899734  0.921494  ...  0.577223 -1.088417  0.326687
29 -0.986248  0.169729 -1.158091  ... -2.013086 -1.602549  0.333109

[30 rows x 16 columns]
```

aggregate#

In R you may want to split data into subsets and compute the mean for each. Using a data.frame called df and splitting it into groups by1 and by2:

```r
df <- data.frame(
  v1 = c(1, 3, 5, 7, 8, 3, 5, NA, 4, 5, 7, 9),
  v2 = c(11, 33, 55, 77, 88, 33, 55, NA, 44, 55, 77, 99),
  by1 = c("red", "blue", 1, 2, NA, "big", 1, 2, "red", 1, NA, 12),
  by2 = c("wet", "dry", 99, 95, NA, "damp", 95, 99, "red", 99, NA, NA)
)
aggregate(x = df[, c("v1", "v2")], by = list(df$by1, df$by2), FUN = mean)
```

The groupby() method is similar to the base R aggregate function.

```python
In [9]: df = pd.DataFrame(
   ...:     {
   ...:         "v1": [1, 3, 5, 7, 8, 3, 5, np.nan, 4, 5, 7, 9],
   ...:         "v2": [11, 33, 55, 77, 88, 33, 55, np.nan, 44, 55, 77, 99],
   ...:         "by1": ["red", "blue", 1, 2, np.nan, "big", 1, 2, "red", 1, np.nan, 12],
   ...:         "by2": ["wet", "dry", 99, 95, np.nan, "damp", 95, 99, "red", 99, np.nan, np.nan],
   ...:     }
   ...: )

In [10]: g = df.groupby(["by1", "by2"])

In [11]: g[["v1", "v2"]].mean()
Out[11]:
            v1    v2
by1  by2
1    95    5.0  55.0
     99    5.0  55.0
2    95    7.0  77.0
     99    NaN   NaN
big  damp  3.0  33.0
blue dry   3.0  33.0
red  red   4.0  44.0
     wet   1.0  11.0
```

For more details and examples see the groupby documentation.

match / %in%#

A common way to select data in R is using %in%, which is defined using the function match. The operator %in% is used to return a logical vector indicating if there is a match or not:

```r
s <- 0:4
s %in% c(2, 4)
```

The isin() method is similar to the R %in% operator:

```python
In [12]: s = pd.Series(np.arange(5), dtype=np.float32)

In [13]: s.isin([2, 4])
Out[13]:
0    False
1    False
2     True
3    False
4     True
dtype: bool
```

The match function returns a vector of the positions of matches of its first argument in its second:

```r
s <- 0:4
match(s, c(2, 4))
```
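isin() only answers the membership question; for the positions that R's match returns, one approximate pandas analogue (a sketch, not a drop-in replacement) is Index.get_indexer, which reports 0-based positions and -1 where R would give NA:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5))  # 0, 1, 2, 3, 4

# match(s, c(2, 4)) in R gives NA NA 1 NA 2 (1-based);
# get_indexer gives the 0-based equivalent with -1 for "no match".
positions = pd.Index([2, 4]).get_indexer(s)
print(positions.tolist())  # [-1, -1, 0, -1, 1]
```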

For more details and examples see the reshaping documentation.

tapply#

tapply is similar to aggregate, but data can be in a ragged array, since the subclass sizes are possibly irregular. Using a data.frame called baseball, and retrieving information based on the array team:

```r
baseball <- data.frame(
  team = gl(5, 5, labels = paste("Team", LETTERS[1:5])),
  player = sample(letters, 25),
  batting.average = runif(25, .200, .400)
)
tapply(baseball$batting.average, baseball$team, max)
```

In pandas we may use the pivot_table() method to handle this:

```python
In [14]: import random

In [15]: import string

In [16]: baseball = pd.DataFrame(
   ....:     {
   ....:         "team": ["team %d" % (x + 1) for x in range(5)] * 5,
   ....:         "player": random.sample(list(string.ascii_lowercase), 25),
   ....:         "batting avg": np.random.uniform(0.200, 0.400, 25),
   ....:     }
   ....: )

In [17]: baseball.pivot_table(values="batting avg", columns="team", aggfunc="max")
Out[17]:
team           team 1    team 2    team 3    team 4    team 5
batting avg  0.352134  0.295327  0.397191  0.394457  0.396194
```

For more details and examples see the reshaping documentation.

subset#

The query() method is similar to the base R subset function. In R you might want to get the rows of a data.frame where one column’s values are less than another column’s values:

```r
df <- data.frame(a = rnorm(10), b = rnorm(10))
subset(df, a <= b)
df[df$a <= df$b, ]  # note the comma
```

In pandas, there are a few ways to perform subsetting. You can use query() or pass an expression as if it were an index/slice, as well as standard boolean indexing:

```python
In [18]: df = pd.DataFrame({"a": np.random.randn(10), "b": np.random.randn(10)})

In [19]: df.query("a <= b")
Out[19]:
          a         b
1  0.174950  0.552887
2 -0.023167  0.148084
3 -0.495291 -0.300218
4 -0.860736  0.197378
5 -1.134146  1.720780
7 -0.290098  0.083515
8  0.238636  0.946550

In [20]: df[df["a"] <= df["b"]]
Out[20]:
          a         b
1  0.174950  0.552887
2 -0.023167  0.148084
3 -0.495291 -0.300218
4 -0.860736  0.197378
5 -1.134146  1.720780
7 -0.290098  0.083515
8  0.238636  0.946550

In [21]: df.loc[df["a"] <= df["b"]]
Out[21]:
          a         b
1  0.174950  0.552887
2 -0.023167  0.148084
3 -0.495291 -0.300218
4 -0.860736  0.197378
5 -1.134146  1.720780
7 -0.290098  0.083515
8  0.238636  0.946550
```

For more details and examples see the query documentation.

with#

An expression using a data.frame called df in R with the columns a and b would be evaluated using with like so:

```r
df <- data.frame(a = rnorm(10), b = rnorm(10))
with(df, a + b)
df$a + df$b  # same as the previous expression

```

In pandas the equivalent expression, using the eval() method, would be:

```python
In [22]: df = pd.DataFrame({"a": np.random.randn(10), "b": np.random.randn(10)})

In [23]: df.eval("a + b")
Out[23]:
0   -0.091430
1   -2.483890
2   -0.252728
3   -0.626444
4   -0.261740
5    2.149503
6   -0.332214
7    0.799331
8   -2.377245
9    2.104677
dtype: float64

In [24]: df["a"] + df["b"]  # same as the previous expression
Out[24]:
0   -0.091430
1   -2.483890
2   -0.252728
3   -0.626444
4   -0.261740
5    2.149503
6   -0.332214
7    0.799331
8   -2.377245
9    2.104677
dtype: float64
```

In certain cases eval() will be much faster than evaluation in pure Python. For more details and examples see the eval documentation.

plyr#

plyr is an R library for the split-apply-combine strategy for data analysis. The functions revolve around three data structures in R: a for arrays, l for lists, and d for data.frame. The table below shows how these data structures could be mapped in Python.

| R          | pandas                        |
|------------|-------------------------------|
| array      | list                          |
| lists      | dictionary or list of objects |
| data.frame | dataframe                     |

ddply#

An expression using a data.frame called df in R where you want to summarize x by month:

```r
require(plyr)
df <- data.frame(
  x = runif(120, 1, 168),
  y = runif(120, 7, 334),
  z = runif(120, 1.7, 20.7),
  month = rep(c(5, 6, 7, 8), 30),
  week = sample(1:4, 120, TRUE)
)
ddply(df, .(month, week), summarize, mean = round(mean(x), 2), sd = round(sd(x), 2))
```

In pandas the equivalent expression, using the groupby() method, would be:

```python
In [25]: df = pd.DataFrame(
   ....:     {
   ....:         "x": np.random.uniform(1.0, 168.0, 120),
   ....:         "y": np.random.uniform(7.0, 334.0, 120),
   ....:         "z": np.random.uniform(1.7, 20.7, 120),
   ....:         "month": [5, 6, 7, 8] * 30,
   ....:         "week": np.random.randint(1, 4, 120),
   ....:     }
   ....: )

In [26]: grouped = df.groupby(["month", "week"])

In [27]: grouped["x"].agg(["mean", "std"])
Out[27]:
                  mean        std
month week
5     1      63.653367  40.601965
      2      78.126605  53.342400
      3      92.091886  57.630110
6     1      81.747070  54.339218
      2      70.971205  54.687287
      3     100.968344  54.010081
7     1      61.576332  38.844274
      2      61.733510  48.209013
      3      71.688795  37.595638
8     1      62.741922  34.618153
      2      91.774627  49.790202
      3      73.936856  60.773900
```

For more details and examples see the groupby documentation.

reshape / reshape2#

meltarray#

An expression using a three-dimensional array called a in R where you want to melt it into a data.frame:

```r
a <- array(c(1:23, NA), c(2, 3, 4))
data.frame(melt(a))
```

In Python, since a is a NumPy array, you can simply use a list comprehension over np.ndenumerate:

```python
In [28]: a = np.array(list(range(1, 24)) + [np.nan]).reshape(2, 3, 4)

In [29]: pd.DataFrame([tuple(list(x) + [val]) for x, val in np.ndenumerate(a)])
Out[29]:
    0  1  2     3
0   0  0  0   1.0
1   0  0  1   2.0
2   0  0  2   3.0
3   0  0  3   4.0
4   0  1  0   5.0
.. .. .. ..   ...
19  1  1  3  20.0
20  1  2  0  21.0
21  1  2  1  22.0
22  1  2  2  23.0
23  1  2  3   NaN

[24 rows x 4 columns]
```

meltlist#

An expression using a list called a in R where you want to melt it into a data.frame:

```r
a <- as.list(c(1:4, NA))
data.frame(melt(a))
```

In Python, this list would be a list of tuples, so the DataFrame() constructor would convert it to a dataframe as required.

```python
In [30]: a = list(enumerate(list(range(1, 5)) + [np.nan]))

In [31]: pd.DataFrame(a)
Out[31]:
   0    1
0  0  1.0
1  1  2.0
2  2  3.0
3  3  4.0
4  4  NaN
```

For more details and examples see the Intro to Data Structures documentation.

meltdf#

An expression using a data.frame called cheese in R where you want to reshape the data.frame:

```r
cheese <- data.frame(
  first = c('John', 'Mary'),
  last = c('Doe', 'Bo'),
  height = c(5.5, 6.0),
  weight = c(130, 150)
)
melt(cheese, id = c("first", "last"))
```

In Python, the melt() method is the R equivalent:

```python
In [32]: cheese = pd.DataFrame(
   ....:     {
   ....:         "first": ["John", "Mary"],
   ....:         "last": ["Doe", "Bo"],
   ....:         "height": [5.5, 6.0],
   ....:         "weight": [130, 150],
   ....:     }
   ....: )

In [33]: pd.melt(cheese, id_vars=["first", "last"])
Out[33]:
  first last variable  value
0  John  Doe   height    5.5
1  Mary   Bo   height    6.0
2  John  Doe   weight  130.0
3  Mary   Bo   weight  150.0

In [34]: cheese.set_index(["first", "last"]).stack(future_stack=True)  # alternative way
Out[34]:
first  last
John   Doe   height      5.5
             weight    130.0
Mary   Bo    height      6.0
             weight    150.0
dtype: float64
```

For more details and examples see the reshaping documentation.

cast#

In R, acast is an expression using a data.frame called df to cast into a higher-dimensional array:

```r
df <- data.frame(
  x = runif(12, 1, 168),
  y = runif(12, 7, 334),
  z = runif(12, 1.7, 20.7),
  month = rep(c(5, 6, 7), 4),
  week = rep(c(1, 2), 6)
)
mdf <- melt(df, id = c("month", "week"))
acast(mdf, week ~ month ~ variable, mean)
```

In Python the best way is to make use of pivot_table():

```python
In [35]: df = pd.DataFrame(
   ....:     {
   ....:         "x": np.random.uniform(1.0, 168.0, 12),
   ....:         "y": np.random.uniform(7.0, 334.0, 12),
   ....:         "z": np.random.uniform(1.7, 20.7, 12),
   ....:         "month": [5, 6, 7] * 4,
   ....:         "week": [1, 2] * 6,
   ....:     }
   ....: )

In [36]: mdf = pd.melt(df, id_vars=["month", "week"])

In [37]: pd.pivot_table(
   ....:     mdf,
   ....:     values="value",
   ....:     index=["variable", "week"],
   ....:     columns=["month"],
   ....:     aggfunc="mean",
   ....: )
Out[37]:
month                  5           6           7
variable week
x        1     93.888747   98.762034   55.219673
         2     94.391427   38.112932   83.942781
y        1     94.306912  279.454811  227.840449
         2     87.392662  193.028166  173.899260
z        1     11.016009   10.079307   16.170549
         2      8.476111   17.638509   19.003494
```

Similarly for dcast, which uses a data.frame called df in R to aggregate information based on Animal and FeedType:

```r
df <- data.frame(
  Animal = c('Animal1', 'Animal2', 'Animal3', 'Animal2', 'Animal1', 'Animal2', 'Animal3'),
  FeedType = c('A', 'B', 'A', 'A', 'B', 'B', 'A'),
  Amount = c(10, 7, 4, 2, 5, 6, 2)
)
dcast(df, Animal ~ FeedType, sum, fill=NaN)
# Alternative method using base R
with(df, tapply(Amount, list(Animal, FeedType), sum))
```

Python can approach this in two different ways. Firstly, similar to above, using pivot_table():

```python
In [38]: df = pd.DataFrame(
   ....:     {
   ....:         "Animal": [
   ....:             "Animal1",
   ....:             "Animal2",
   ....:             "Animal3",
   ....:             "Animal2",
   ....:             "Animal1",
   ....:             "Animal2",
   ....:             "Animal3",
   ....:         ],
   ....:         "FeedType": ["A", "B", "A", "A", "B", "B", "A"],
   ....:         "Amount": [10, 7, 4, 2, 5, 6, 2],
   ....:     }
   ....: )

In [39]: df.pivot_table(values="Amount", index="Animal", columns="FeedType", aggfunc="sum")
Out[39]:
FeedType     A     B
Animal
Animal1   10.0   5.0
Animal2    2.0  13.0
Animal3    6.0   NaN
```

The second approach is to use the groupby() method:

```python
In [40]: df.groupby(["Animal", "FeedType"])["Amount"].sum()
Out[40]:
Animal   FeedType
Animal1  A           10
         B            5
Animal2  A            2
         B           13
Animal3  A            6
Name: Amount, dtype: int64
```

For more details and examples see the reshaping documentation or the groupby documentation.

factor#

pandas has a data type for categorical data.

```r
cut(c(1, 2, 3, 4, 5, 6), 3)
factor(c(1, 2, 3, 2, 2, 3))
```

In pandas this is accomplished with pd.cut and astype("category"):

```python
In [41]: pd.cut(pd.Series([1, 2, 3, 4, 5, 6]), 3)
Out[41]:
0    (0.995, 2.667]
1    (0.995, 2.667]
2    (2.667, 4.333]
3    (2.667, 4.333]
4      (4.333, 6.0]
5      (4.333, 6.0]
dtype: category
Categories (3, interval[float64, right]): [(0.995, 2.667] < (2.667, 4.333] < (4.333, 6.0]]

In [42]: pd.Series([1, 2, 3, 2, 2, 3]).astype("category")
Out[42]:
0    1
1    2
2    3
3    2
4    2
5    3
dtype: category
Categories (3, int64): [1, 2, 3]
```

For more details and examples see the categorical introduction and the API documentation. There is also documentation on the differences relative to R’s factor.
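One difference worth a sketch: R’s factor(x, levels = ..., ordered = TRUE) fixes both the level set and their order up front. In pandas the analogous move is casting to an explicit CategoricalDtype (the level names below are made up for illustration):

```python
import pandas as pd

s = pd.Series(["low", "high", "medium", "low"]).astype(
    pd.CategoricalDtype(categories=["low", "medium", "high"], ordered=True)
)

# Ordered categories support comparisons and min/max, like an ordered factor.
print(s.min())  # 'low'
```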

