# chunked

Chunkwise Text-file Processing for 'dplyr'
R is a great tool, but processing data in large text files is cumbersome.
`chunked` helps you to process large text files with `dplyr` while loading only a part of the data in memory. It builds on the excellent R package `LaF`.

Processing commands are written in dplyr syntax, and `chunked` (using `LaF`) will take care that the data is processed chunk by chunk, using far less memory than otherwise. `chunked` is useful for `select`-ing columns, `mutate`-ing columns and `filter`-ing rows. It is less helpful in `group`-ing and `summarize`-ation of large text files. It can be used in data pre-processing.
## Install

`chunked` can be installed with:
```r
install.packages('chunked')
```
the beta version with:
```r
install.packages('chunked', repos = c('https://cran.rstudio.com', 'https://edwindj.github.io/drat'))
```
and the development version with:
```r
devtools::install_github('edwindj/chunked')
```
Enjoy! Feedback is welcome…
## Usage

The most common use case is processing a large text file: select or add columns, filter it, and write the result back to a text file.
read_chunkwise("./large_file_in.csv",chunk_size=5000) %>% select(col1,col2,col5) %>% filter(col1>10) %>% mutate(col6=col1+col2) %>% write_chunkwise("./large_file_out.csv")
`chunked` will process the above statement in chunks of 5000 records. This is different from, for example, `read.csv`, which reads all data into memory before processing it.
Another option is to use `chunked` as a preprocessing step before adding the data to a database:
con<-DBI::dbConnect(RSQLite::SQLite(),'test.db',create=TRUE)db<-dbplyr::src_dbi(con)tbl<- read_chunkwise("./large_file_in.csv",chunk_size=5000) %>% select(col1,col2,col5) %>% filter(col1>10) %>% mutate(col6=col1+col2) %>% write_chunkwise(dbplyr::src_dbi(db),'my_large_table')# tbl now points to the table in sqlite.
`chunked` can also be used to export a database table chunkwise to a text file. Note, however, that in that case the processing takes place in the database and the chunkwise restrictions only apply to the writing.
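As an illustration of that export path, here is a minimal sketch reusing the `tbl` handle created above. The `filter` step and the `col6` threshold are hypothetical; the filtering is assumed to be translated to SQL by dbplyr, so only the writing happens chunk by chunk:

```r
# 'tbl' is the database table created above
tbl %>%
  filter(col6 > 20) %>%                   # assumed to run inside the database
  read_chunkwise(chunk_size = 5000) %>%   # fetch the result chunk by chunk
  write_chunkwise("./large_file_filtered.csv")
```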
## Lazy processing

`chunked` will not start processing until `collect` or `write_chunkwise` is called.
data_chunks<- read_chunkwise("./large_file_in.csv",chunk_size=5000) %>% select(col1,col3)# won't start processing untilcollect(data_chunks)# orwrite_chunkwise(data_chunks,"test.csv")# orwrite_chunkwise(data_chunks,db,"test")
Syntax completion of variables of a chunkwise file in RStudio works like a charm…
## dplyr verbs

`chunked` implements the following dplyr verbs (see the join sketch after the list):
- `filter`
- `select`
- `rename`
- `mutate`
- `mutate_each`
- `transmute`
- `do`
- `tbl_vars`
- `inner_join`
- `left_join`
- `semi_join`
- `anti_join`
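As a sketch of how the join verbs can be used, assuming each chunk is joined separately against a small in-memory table (the `lookup` data frame and its columns below are hypothetical):

```r
library(chunked)
library(dplyr)

# hypothetical lookup table; it must fit in memory,
# since every chunk is joined against it on its own
lookup <- data.frame(col1 = c(11, 12), label = c("a", "b"))

read_chunkwise("./large_file_in.csv", chunk_size = 5000) %>%
  inner_join(lookup, by = "col1") %>%
  write_chunkwise("./large_file_out.csv")
```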
Since data is processed in chunks, some dplyr verbs are not implemented:
- `arrange`
- `right_join`
- `full_join`
`summarize` and `group_by` are implemented but generate a warning: they operate on each chunk, not on the whole data set. However, this still makes it easy to process a large file, by repeatedly aggregating the resulting data:
tmp<- tempfile()write.csv(iris,tmp,row.names=FALSE,quote=FALSE)iris_cw<- read_chunkwise(tmp,chunk_size=30)# read in chunks of 30 rows for this exampleiris_cw %>% group_by(Species) %>%# group in each chunk summarise(m= mean(Sepal.Width)# and summarize in each chunk ,w= n() ) %>%as.data.frame %>%# since each Species has 50 records, results will be in multiple chunks group_by(Species) %>%# group the results from the chunk summarise(m= weighted.mean(m,w))# and summarize it again