tidyfst 1.8.1

Example 1: Basic usage

Source: vignettes/example1_intro.Rmd

Use tidyfst just like dplyr

This part of the vignette follows dplyr’s vignette at https://dplyr.tidyverse.org/articles/dplyr.html. We’ll try to reproduce all of its results. First, load the needed packages.

library(tidyfst)
library(nycflights13)
library(data.table)
data.table(flights)

Filter rows with filter_dt()

filter_dt(flights, month == 1 & day == 1)

Note that commas cannot be used to separate conditions. This means filter_dt(flights, month == 1, day == 1) would return an error; combine the conditions with & as shown above.

Arrange rows with arrange_dt()

arrange_dt(flights, year, month, day)

Use - (the minus sign) to order a column in descending order:

arrange_dt(flights, -arr_delay)

Select columns with select_dt()

select_dt(flights, year, month, day)

select_dt(flights, year:day) and select_dt(flights, -(year:day)) are not supported. But I have added a feature to select columns with a regular expression, which means you can write:

select_dt(flights, "^dep")

Renaming works almost the same as in dplyr:

select_dt(flights, tail_num = tailnum)
rename_dt(flights, tail_num = tailnum)

Add new columns with mutate_dt()

mutate_dt(flights,
  gain = arr_delay - dep_delay,
  speed = distance / air_time * 60)

However, a column that has just been created cannot be used in the same call, so please split such steps. The following code would not work:

mutate_dt(flights,
  gain = arr_delay - dep_delay,
  gain_per_hour = gain / (air_time / 60))

Instead, use:

mutate_dt(flights, gain = arr_delay - dep_delay) %>%
  mutate_dt(gain_per_hour = gain / (air_time / 60))

If you only want to keep the new variables, use transmute_dt():

transmute_dt(flights, gain = arr_delay - dep_delay)

Summarise values with summarise_dt()

summarise_dt(flights, delay = mean(dep_delay, na.rm = TRUE))

Randomly sample rows with sample_n_dt() and sample_frac_dt()

sample_n_dt(flights, 10)
sample_frac_dt(flights, 0.01)

Grouped operations

For the dplyr code below:

by_tailnum <- group_by(flights, tailnum)
delay <- summarise(by_tailnum,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE))
delay <- filter(delay, count > 20, dist < 2000)

We could get it via:

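flights %>%
  summarise_dt(
    count = .N,
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE),
    by = tailnum) %>%
  # same filter as in the dplyr code, combined with & instead of a comma
  filter_dt(count > 20 & dist < 2000)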

summarise_dt (or summarize_dt) has a parameter “by” that specifies the grouping variables. We could find the number of planes and the number of flights that go to each possible destination:

# the dplyr syntax:
# destinations <- group_by(flights, dest)
# summarise(destinations,
#   planes = n_distinct(tailnum),
#   flights = n()
# )
summarise_dt(flights, planes = uniqueN(tailnum), flights = .N, by = dest) %>%
  arrange_dt(dest)

If you need to group by many variables, use:

# the dplyr syntax:
# daily <- group_by(flights, year, month, day)
# (per_day   <- summarise(daily, flights = n()))
flights %>%
  summarise_dt(by = .(year, month, day), flights = .N)

# (per_month <- summarise(per_day, flights = sum(flights)))
flights %>%
  summarise_dt(by = .(year, month, day), flights = .N) %>%
  summarise_dt(by = .(year, month), flights = sum(flights))

# (per_year  <- summarise(per_month, flights = sum(flights)))
flights %>%
  summarise_dt(by = .(year, month, day), flights = .N) %>%
  summarise_dt(by = .(year, month), flights = sum(flights)) %>%
  summarise_dt(by = .(year), flights = sum(flights))

Comparison with data.table syntax

tidyfst provides a tidy syntax for data.table. By design, tidyfst never runs faster than the analogous data.table code. Nevertheless, it lets dplyr users gain data.table’s computational performance right away and guides them to learn more about data.table for speed. Below, we compare the syntax of tidyfst and data.table (referring to Introduction to data.table). This shows how the two differ and lets users choose the one they prefer. Ideally, tidyfst will lead even more users to learn about data.table and its wonderful features, so as to design more extensions for tidyfst in the future.
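To make the idea of "analogous code" concrete, here is a minimal sketch that runs the same grouped mean through both interfaces and checks that the results agree. The table toy and its columns grp and val are made up for illustration and are not part of the vignette; the sketch only assumes the summarise_dt(..., by = ) form used earlier.

```r
library(data.table)
library(tidyfst)

# Hypothetical toy data, used only for illustration
toy <- data.table(grp = c("a", "a", "b", "b"), val = c(1, 3, 10, 20))

# data.table syntax: grouped mean
res_dt <- toy[, .(avg = mean(val)), by = grp]

# tidyfst syntax: the same computation through summarise_dt()
res_fst <- summarise_dt(toy, avg = mean(val), by = grp)

# Compare as plain data frames so that only names and values are checked
same <- isTRUE(all.equal(as.data.frame(res_dt), as.data.frame(res_fst)))
print(same)
```

The comparison says nothing about speed; it only illustrates that the tidyfst call is meant as another way of writing the data.table expression.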

Data

Because we want a more stable data source, here we’ll use the flights data from the nycflights13 package loaded above.

Subset rows

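# data.table
# nycflights13 ships flights as a tibble; convert it so the [i, j, by] syntax applies
flights <- as.data.table(flights)
head(flights[origin == "JFK" & month == 6L])
flights[1:2]
flights[order(origin, -dest)]

# tidyfst
flights %>% filter_dt(origin == "JFK" & month == 6L) %>% head()
flights %>% slice_dt(1:2)
flights %>% arrange_dt(origin, -dest)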

Select column(s)

# data.table
flights[, list(arr_delay)]
flights[, .(arr_delay, dep_delay)]
flights[, .(delay_arr = arr_delay, delay_dep = dep_delay)]

# tidyfst
flights %>% select_dt(arr_delay)
flights %>% select_dt(arr_delay, dep_delay)
flights %>% transmute_dt(delay_arr = arr_delay, delay_dep = dep_delay)

Mixed computation

# data.table
flights[, sum((arr_delay + dep_delay) < 0)]
flights[origin == "JFK" & month == 6L,
        .(m_arr = mean(arr_delay), m_dep = mean(dep_delay))]
flights[origin == "JFK" & month == 6L, length(dest)]
flights[origin == "JFK" & month == 6L, .N]

# tidyfst
flights %>% summarise_dt(sum((arr_delay + dep_delay) < 0))
flights %>%
  filter_dt(origin == "JFK" & month == 6L) %>%
  summarise_dt(m_arr = mean(arr_delay), m_dep = mean(dep_delay))
flights %>% filter_dt(origin == "JFK" & month == 6L) %>% nrow()
flights %>% filter_dt(origin == "JFK" & month == 6L) %>% count_dt()
flights %>% filter_dt(origin == "JFK" & month == 6L) %>% summarise_dt(.N)

The examples above show that tidyfst still lets you use data.table helpers such as .N.
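As a small self-contained sketch of the same point, the snippet below uses two data.table helpers, .N and uniqueN(), inside summarise_dt(). The table toy and its values are hypothetical stand-ins for flights; both helpers already appear in the calls earlier in this vignette.

```r
library(data.table)
library(tidyfst)

# Hypothetical toy data: four flights on three planes to two destinations
toy <- data.table(
  dest    = c("BOS", "BOS", "BOS", "ORD"),
  tailnum = c("N1", "N1", "N2", "N3")
)

# .N counts the rows per group, uniqueN() counts distinct values per group
res <- summarise_dt(toy, flights = .N, planes = uniqueN(tailnum), by = dest)
print(res)
```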

Refer to columns by names

# data.table
flights[, c("arr_delay", "dep_delay")]
select_cols = c("arr_delay", "dep_delay")
flights[, ..select_cols]
flights[, select_cols, with = FALSE]
flights[, !c("arr_delay", "dep_delay")]
flights[, -c("arr_delay", "dep_delay")]
# returns year, month and day
flights[, year:day]
# returns day, month and year
flights[, day:year]
# returns all columns except year, month and day
flights[, -(year:day)]
flights[, !(year:day)]

# tidyfst
flights %>% select_dt(c("arr_delay", "dep_delay"))
select_cols = c("arr_delay", "dep_delay")
flights %>% select_dt(cols = select_cols)
flights %>% select_dt(-arr_delay, -dep_delay)
flights %>% select_dt(year:day)
flights %>% select_dt(day:year)
flights %>% select_dt(-(year:day))
flights %>% select_dt(!(year:day))

Aggregations

# data.table
flights[, .N, by = .(origin)]
flights[carrier == "AA", .N, by = origin]
flights[carrier == "AA", .N, by = .(origin, dest)]
flights[carrier == "AA",
        .(mean(arr_delay), mean(dep_delay)),
        by = .(origin, dest, month)]

# tidyfst
flights %>% count_dt(origin)  # sort by default
flights %>% filter_dt(carrier == "AA") %>% count_dt(origin)
flights %>% filter_dt(carrier == "AA") %>% count_dt(origin, dest)
flights %>%
  filter_dt(carrier == "AA") %>%
  summarise_dt(mean(arr_delay), mean(dep_delay),
               by = .(origin, dest, month))

Note that keyby is currently not used in tidyfst. This feature might be included in the future for better performance in order-independent tasks. Moreover, count_dt sorts its result automatically by the counted number; this can be controlled with the “sort” parameter.

# data.table
flights[carrier == "AA", .N, by = .(origin, dest)][order(origin, -dest)]
flights[, .N, .(dep_delay > 0, arr_delay > 0)]

# tidyfst
flights %>%
  filter_dt(carrier == "AA") %>%
  count_dt(origin, dest, sort = FALSE) %>%
  arrange_dt(origin, -dest)
flights %>% summarise_dt(.N, by = .(dep_delay > 0, arr_delay > 0))

Now let’s try a more complex example:

# data.table
flights[carrier == "AA",
        lapply(.SD, mean),
        by = .(origin, dest, month),
        .SDcols = c("arr_delay", "dep_delay")]

# tidyfst
flights %>%
  filter_dt(carrier == "AA") %>%
  group_dt(
    by = .(origin, dest, month),
    at_dt("_delay", summarise_dt, mean))

Let me explain what happens here, especially in group_dt. First, filter by the condition carrier == "AA", then group by three variables: origin, dest and month. Last, summarise the columns with “_delay” in their names by taking the mean of each of them. This is a very creative design: it utilizes .SD in data.table and upgrades the group_by function in dplyr (you never need to ungroup now, just put the grouped operations inside group_dt). You can also pipe inside the group_dt function. Let’s play with it a little further:

flights %>%
  filter_dt(carrier == "AA") %>%
  group_dt(
    by = .(origin, dest, month),
    at_dt("_delay", summarise_dt, mean) %>%
      mutate_dt(sum = dep_delay + arr_delay))

However, I don’t recommend doing this if you don’t actually need it for the grouped computation (just start another pipe after group_dt). Now let’s end with some easy examples:

# data.table
flights[, head(.SD, 2), by = month]

# tidyfst
flights %>% group_dt(by = month, head(2))
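Returning to the advice about keeping ungrouped steps out of the grouped call, one way to read it is sketched below: do the grouped summary first, then continue the pipe on its result. This is only an illustration under stated assumptions. It swaps in summarise_dt(..., by = ) for the group_dt()/at_dt() combination, and the table toy and the column total_delay are hypothetical names, not from the vignette.

```r
library(data.table)
library(tidyfst)

# Hypothetical toy data standing in for the delay columns of flights
toy <- data.table(
  origin    = c("JFK", "JFK", "LGA", "LGA"),
  arr_delay = c(10, 20, 5, 15),
  dep_delay = c(2, 4, 6, 8)
)

res <- toy %>%
  # grouped step: one row of mean delays per origin
  summarise_dt(
    arr_delay = mean(arr_delay),
    dep_delay = mean(dep_delay),
    by = origin
  ) %>%
  # ungrouped step: runs on the summarised table, outside any grouping
  mutate_dt(total_delay = dep_delay + arr_delay)

print(res)
```

Because the row-wise addition does not depend on the groups, it does not need to run once per group.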

Deep inside, tidyfst is born from dplyr and data.table, and uses stringr to make flexible APIs, so as to bring their strengths into full play.

