Chunkwise Text-file Processing for 'dplyr'

edwindj/chunked

R is a great tool, but processing data in large text files is cumbersome. chunked helps you to process large text files with dplyr while loading only a part of the data in memory. It builds on the excellent R package LaF.

Processing commands are written in dplyr syntax, and chunked (using LaF) takes care that the file is processed chunk by chunk, taking far less memory than otherwise. chunked is useful for select-ing columns, mutate-ing columns and filter-ing rows. It is less helpful in group-ing and summarize-ation of large text files. It can be used in data pre-processing.

Install

‘chunked’ can be installed with

install.packages('chunked')

the beta version with:

install.packages('chunked',repos=c('https://cran.rstudio.com','https://edwindj.github.io/drat'))

and the development version with:

devtools::install_github('edwindj/chunked')

Enjoy! Feedback is welcome…

Usage

Text file -> process -> text file

The most common case is processing a large text file: select or add columns, filter it, and write the result back to a text file.

    library(dplyr)    # provides the verbs and the %>% pipe
    library(chunked)

    read_chunkwise("./large_file_in.csv", chunk_size = 5000) %>%
      select(col1, col2, col5) %>%
      filter(col1 > 10) %>%
      mutate(col6 = col1 + col2) %>%
      write_chunkwise("./large_file_out.csv")

chunked will process the above statement in chunks of 5000 records. This is different from, for example, read.csv, which reads all data into memory before processing it.
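To make the contrast with read.csv concrete, here is a minimal base R sketch of the general idea of chunkwise processing. It is not how chunked is implemented (chunked builds on LaF); `process_in_chunks` is a hypothetical helper written only for this illustration, and it assumes a simple comma-separated file with an unquoted header line.

```r
# Hypothetical helper, not part of chunked: read a CSV file block by block,
# apply `fun` to each block and append the result to an output file.
process_in_chunks <- function(path_in, path_out, chunk_size, fun) {
  con <- file(path_in, open = "r")
  on.exit(close(con))

  # The header is read once and reused for every chunk.
  header <- strsplit(readLines(con, n = 1), ",")[[1]]

  first <- TRUE
  repeat {
    # Only `chunk_size` lines are held in memory at any time.
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break

    chunk <- read.csv(text = lines, header = FALSE, col.names = header)
    out <- fun(chunk)

    # Write the header with the first chunk, append all later chunks.
    write.table(out, path_out, sep = ",", row.names = FALSE,
                col.names = first, append = !first, quote = FALSE)
    first <- FALSE
  }
  invisible(path_out)
}

# Usage on a small example file, in chunks of 30 rows.
tmp_in  <- tempfile(fileext = ".csv")
tmp_out <- tempfile(fileext = ".csv")
write.csv(iris, tmp_in, row.names = FALSE, quote = FALSE)

process_in_chunks(
  tmp_in, tmp_out, chunk_size = 30,
  fun = function(d) d[d$Sepal.Width > 3, c("Sepal.Width", "Species")]
)

result <- read.csv(tmp_out)
```

Row-wise operations such as selecting columns and filtering rows give the same result whether they are applied per chunk or to the whole file, which is why they fit this style of processing well.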

Text file -> process -> database

Another option is to use chunked as a preprocessing step before adding the data to a database.

    con <- DBI::dbConnect(RSQLite::SQLite(), 'test.db', create = TRUE)
    db <- dbplyr::src_dbi(con)

    tbl <-
      read_chunkwise("./large_file_in.csv", chunk_size = 5000) %>%
      select(col1, col2, col5) %>%
      filter(col1 > 10) %>%
      mutate(col6 = col1 + col2) %>%
      write_chunkwise(db, 'my_large_table')  # db is already a src_dbi object

    # tbl now points to the table in sqlite.

Db -> process -> Text file

chunked can be used to export chunkwise to a text file. Note, however, that in that case processing takes place in the database and the chunkwise restrictions only apply to the writing.

Lazy processing

chunked will not start processing until collect or write_chunkwise is called.

    data_chunks <-
      read_chunkwise("./large_file_in.csv", chunk_size = 5000) %>%
      select(col1, col3)

    # won't start processing until
    collect(data_chunks)
    # or
    write_chunkwise(data_chunks, "test.csv")
    # or
    write_chunkwise(data_chunks, db, "test")

Syntax completion of variables of a chunkwise file in RStudio works like a charm…

Dplyr verbs

chunked implements the following dplyr verbs:

  • filter
  • select
  • rename
  • mutate
  • mutate_each
  • transmute
  • do
  • tbl_vars
  • inner_join
  • left_join
  • semi_join
  • anti_join

Since data is processed in chunks, some dplyr verbs are not implemented:

  • arrange
  • right_join
  • full_join
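One way to see why some joins fit chunkwise processing and others do not is to emulate chunks with plain dplyr. The sketch below does not use chunked at all: the data frames `x` and `y` are made up for the illustration, and `split` stands in for reading a file in chunks of two rows.

```r
library(dplyr)

# Made-up example data: x plays the role of the large file, y of a small lookup table.
x <- data.frame(id = c(1, 2, 3, 4, 5, 6), value = letters[1:6])
y <- data.frame(id = c(2, 4, 7), label = c("two", "four", "seven"))

# Emulate chunkwise reading of x in chunks of 2 rows.
chunks <- split(x, ceiling(seq_len(nrow(x)) / 2))

# left_join per chunk, then stack the chunks: same rows as joining x as a whole.
left_chunked <- bind_rows(lapply(chunks, left_join, y = y, by = "id"))
left_whole   <- left_join(x, y, by = "id")

# right_join per chunk: every chunk returns all rows of y, so rows of y
# are repeated across chunks instead of appearing once.
right_chunked <- bind_rows(lapply(chunks, right_join, y = y, by = "id"))
right_whole   <- right_join(x, y, by = "id")

nrow(left_chunked) == nrow(left_whole)    # same number of rows
nrow(right_chunked) == nrow(right_whole)  # differs: y is repeated per chunk
```

A join that only needs to decide the fate of each row of the chunked table (inner, left, semi, anti) can be evaluated per chunk, whereas right_join, full_join and arrange need to see all chunks before the result is known.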

summarize and group_by are implemented but generate a warning: they operate on each chunk and not on the whole data set. However, this makes it easier to process a large file by repeatedly aggregating the resulting data.

  • summarize
  • group_by

    tmp <- tempfile()
    write.csv(iris, tmp, row.names = FALSE, quote = FALSE)
    iris_cw <- read_chunkwise(tmp, chunk_size = 30)  # read in chunks of 30 rows for this example

    iris_cw %>%
      group_by(Species) %>%               # group in each chunk
      summarise( m = mean(Sepal.Width)    # and summarize in each chunk
               , w = n()
               ) %>%
      as.data.frame %>%                   # since each Species has 50 records, results will be in multiple chunks
      group_by(Species) %>%               # group the results from the chunk
      summarise(m = weighted.mean(m, w))  # and summarize it again
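The two-stage aggregation can be checked without chunked by emulating the chunks in plain dplyr. This is a sketch for illustration only: `split` stands in for chunkwise reading in chunks of 30 rows, and the names `per_chunk`, `two_stage` and `direct` are introduced here.

```r
library(dplyr)

# Emulate reading iris in chunks of 30 rows.
chunks <- split(iris, ceiling(seq_len(nrow(iris)) / 30))

# Stage 1: per-chunk mean and row count for each Species.
per_chunk <- bind_rows(lapply(chunks, function(chunk) {
  chunk %>%
    group_by(Species) %>%
    summarise(m = mean(Sepal.Width), w = n())
}))

# Stage 2: combine the per-chunk means, weighted by the per-chunk counts.
two_stage <- per_chunk %>%
  group_by(Species) %>%
  summarise(m = weighted.mean(m, w))

# Reference: the mean computed on the whole data set at once.
direct <- iris %>%
  group_by(Species) %>%
  summarise(m = mean(Sepal.Width))

all.equal(two_stage$m, direct$m)
```

Keeping the count `w` next to the mean is what makes the second stage exact. Statistics that cannot be recombined from per-chunk results in this way, such as the median, need a different approach.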
