Tricks to manage the available memory in an R session

Question 1

What tricks do people use to manage the available memory of an interactive R session? I use the functions below [based on postings by Petr Pikal and David Hinds to the r-help list in 2004] to list (and/or sort) the largest objects and to occasionally rm() some of them. But by far the most effective solution was ... to run under 64-bit Linux with ample memory.

Any other nice tricks folks want to share? One per post, please.

# improved list of objects
.ls.objects <- function (pos = 1, pattern, order.by,
                        decreasing=FALSE, head=FALSE, n=5) {
    napply <- function(names, fn) sapply(names, function(x)
                                         fn(get(x, pos = pos)))
    names <- ls(pos = pos, pattern = pattern)
    obj.class <- napply(names, function(x) as.character(class(x))[1])
    obj.mode <- napply(names, mode)
    obj.type <- ifelse(is.na(obj.class), obj.mode, obj.class)
    obj.size <- napply(names, object.size)
    obj.dim <- t(napply(names, function(x)
                        as.numeric(dim(x))[1:2]))
    vec <- is.na(obj.dim)[, 1] & (obj.type != "function")
    obj.dim[vec, 1] <- napply(names, length)[vec]
    out <- data.frame(obj.type, obj.size, obj.dim)
    names(out) <- c("Type", "Size", "Rows", "Columns")
    if (!missing(order.by))
        out <- out[order(out[[order.by]], decreasing=decreasing), ]
    if (head)
        out <- head(out, n)
    out
}

# shorthand
lsos <- function(..., n=10) {
    .ls.objects(..., order.by="Size", decreasing=TRUE, head=TRUE, n=n)
}

Question 2

Note, I do NOT doubt it, but what's the use of that? I am pretty new to memory problems in R, but I have been experiencing some lately (that's why I was searching for this post :) so I am just starting with all this. How does this help my daily work?

Question 3

If you want to see the objects within a function, you have to use lsos(pos = environment()); otherwise it will only show global variables. To write to standard error: write.table(lsos(pos=environment()), stderr(), quote=FALSE, sep='\t')
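The same idea can be sketched without the full lsos() helper: pass the function's own environment to ls() and get(). The name local_sizes and its variables are made up for illustration; this is a minimal sketch, not a replacement for the helper above.

```r
# Hypothetical helper: report the sizes of the objects local to a function call.
local_sizes <- function() {
  big   <- numeric(1e5)        # about 800 kB of doubles
  small <- 1:10
  env   <- environment()       # the frame of this call, not the global environment
  sizes <- sapply(ls(env), function(nm) {
    as.numeric(object.size(get(nm, envir = env)))
  })
  sort(sizes, decreasing = TRUE)
}

sizes <- local_sizes()
print(sizes)
```

The result is a named numeric vector of sizes in bytes, largest first.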

Question 4

Why 64-bit linux and not 64-bit Windows? Does the choice of OS make a non-trivial difference when I have 32GB of ram to use?

Question 5

@pepsimax: This has been packaged in the multilevelPSA package. The package is designed for something else, but you can use the function from there without loading the package by saying requireNamespace("multilevelPSA"); multilevelPSA::lsos(...). Or in the Dmisc package (not on CRAN).

Question 6

If the data set is of a manageable size, I usually go to RStudio > Environment > Grid View. Here you can see and sort all items in your current environment by size.

Question 7

Ensure you record your work in a reproducible script. From time to time, reopen R, then source() your script. You'll clean out anything you're no longer using, and as an added benefit will have tested your code.

Question 8

My strategy is to break my scripts up along the lines of load.R and do.R, where load.R may take quite some time to load in data from files or a database, and does any bare minimum pre-processing/merging of that data. The last line of load.R is something to save the workspace state. Then do.R is my scratchpad whereby I build out my analysis functions. I frequently reload do.R (with or without reloading the workspace state from load.R as needed).
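A minimal sketch of that split, shown in one block for brevity. The file location, the stand-in data and the object names are assumptions; in practice the two halves would live in separate load.R and do.R files.

```r
# --- load.R (hypothetical): slow import, minimal preprocessing, then save ---
cache_file <- file.path(tempdir(), "loaded_state.RData")          # assumed location
raw <- data.frame(id = 1:5, value = c(2.5, 3.1, 4.8, 1.2, 3.3))  # stand-in for a slow import
raw <- raw[!is.na(raw$value), ]                                   # bare-minimum cleaning
save(raw, file = cache_file)   # save.image(cache_file) would save the whole workspace instead

# --- do.R (hypothetical): scratchpad that starts from the saved state ---
load(cache_file)               # restores 'raw' without repeating the import
value_mean <- mean(raw$value)
print(value_mean)
```

Saving only the named objects rather than the whole workspace keeps the cached file small and makes it explicit what do.R depends on.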

Question 9

That's a good technique. When files are run in a certain order like that, I often prefix them with a number: 1-load.r, 2-explore.r, 3-model.r. That way it's obvious to others that there is some order present.

Question 10

I can't back this idea up enough. I've taught R to a few people and this is one of the first things I say. This also applies to any language where development incorporates a REPL and a file being edited (e.g. Python). rm(list=ls()) and source() works too, but re-opening is better (packages cleared too).

Question 11

The fact that the top-voted answer involves restarting R is the worst criticism of R possible.

Question 12

@MartínBel that only removes objects created in the global environment. It does not unload packages or S4 objects or many other things.

Question 13

I use the data.table package. With its := operator you can:

Add columns by reference
Modify subsets of existing columns by reference, and by group by reference
Delete columns by reference

None of these operations copy the (potentially large) data.table at all, not even once.

Aggregation is also particularly fast because data.table uses much less working memory.
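The operations listed above can be sketched as follows, assuming the data.table package is installed (the block is skipped otherwise). The table contents and the helper add_flag are made up for illustration.

```r
# Assumes the data.table package is installed; the example is skipped otherwise.
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)
  dt <- data.table(id  = 1:6,
                   grp = rep(c("a", "b"), each = 3),
                   x   = c(1, 2, 3, 10, 20, 30))

  dt[, y := x * 2]                      # add a column by reference
  dt[x > 5, y := 0]                     # modify a subset of an existing column
  dt[, grp_mean := mean(x), by = grp]   # add a column by group
  dt[, y := NULL]                       # delete a column

  # A function can modify the table in place: no copy and no reassignment needed.
  add_flag <- function(d) {
    d[, flag := x > 5]
    invisible(NULL)
  }
  add_flag(dt)
  print(names(dt))
}
```

Note that none of the statements assign the result back to dt; the table is changed where it lives.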


Question 14

Saw this on a twitter post and think it's an awesome function by Dirk! Following on from JD Long's answer, I would do this for user-friendly reading:

# improved list of objects
.ls.objects <- function (pos = 1, pattern, order.by,
                        decreasing=FALSE, head=FALSE, n=5) {
    napply <- function(names, fn) sapply(names, function(x)
                                         fn(get(x, pos = pos)))
    names <- ls(pos = pos, pattern = pattern)
    obj.class <- napply(names, function(x) as.character(class(x))[1])
    obj.mode <- napply(names, mode)
    obj.type <- ifelse(is.na(obj.class), obj.mode, obj.class)
    obj.prettysize <- napply(names, function(x) {
                           format(utils::object.size(x), units = "auto") })
    obj.size <- napply(names, object.size)
    obj.dim <- t(napply(names, function(x)
                        as.numeric(dim(x))[1:2]))
    vec <- is.na(obj.dim)[, 1] & (obj.type != "function")
    obj.dim[vec, 1] <- napply(names, length)[vec]
    out <- data.frame(obj.type, obj.size, obj.prettysize, obj.dim)
    names(out) <- c("Type", "Size", "PrettySize", "Length/Rows", "Columns")
    if (!missing(order.by))
        out <- out[order(out[[order.by]], decreasing=decreasing), ]
    if (head)
        out <- head(out, n)
    out
}

# shorthand
lsos <- function(..., n=10) {
    .ls.objects(..., order.by="Size", decreasing=TRUE, head=TRUE, n=n)
}

lsos()

Which results in something like the following:

                       Type   Size PrettySize Length/Rows Columns
pca.res                 PCA 790128   771.6 Kb           7      NA
DF               data.frame 271040   264.7 Kb         669      50
factor.AgeGender   factanal  12888    12.6 Kb          12      NA
dates            data.frame   9016     8.8 Kb         669       2
sd.                 numeric   3808     3.7 Kb          51      NA
napply             function   2256     2.2 Kb          NA      NA
lsos               function   1944     1.9 Kb          NA      NA
load               loadings   1768     1.7 Kb          12       2
ind.sup             integer    448  448 bytes         102      NA
x                 character     96   96 bytes           1      NA

NOTE: The main part I added was (again, adapted from JD's answer) :

obj.prettysize <- napply(names, function(x) {
                       format(utils::object.size(x), units = "auto") })

Question 15

Can this function be added to dplyr or some other key package?

Question 16

Worth noting that (at least with base-3.3.2) capture.output is not necessary anymore, and obj.prettysize <- napply(names, function(x) {format(utils::object.size(x), units = "auto") }) produces clean output. In fact, not removing it produces unwanted quotes in the output, i.e. [1] "792.5 Mb" instead of 792.5 Mb.

Question 17

@Nutle Excellent, I've updated the code accordingly :)

Question 18

I'd also change obj.class <- napply(names, function(x) as.character(class(x))[1]) to obj.class <- napply(names, function(x) class(x)[1]), since class always returns a vector of characters now (base-3.5.0).

Question 19

Any idea how to point the improved list of objects to a specific environment?

Question 20

I make aggressive use of the subset parameter, selecting only the required variables, when passing dataframes to the data= argument of regression functions. It does result in some errors if I forget to add variables to both the formula and the select= vector, but it still saves a lot of time due to decreased copying of objects and reduces the memory footprint significantly. Say I have 4 million records with 110 variables (and I do). Example:

# library(rms); library(Hmisc) for the cph and rcs functions
Mayo.PrCr.rbc.mdl <- cph(formula = Surv(surv.yr, death) ~ age + Sex + nsmkr + rcs(Mayo, 4) +
                                     rcs(PrCr.rat, 3) +  rbc.cat * Sex,
     data = subset(set1HLI,  gdlab2 & HIVfinal == "Negative",
                           select = c("surv.yr", "death", "PrCr.rat", "Mayo",
                                      "age", "Sex", "nsmkr", "rbc.cat")
   )            )

By way of setting context and the strategy: the gdlab2 variable is a logical vector that was constructed for subjects in a dataset that had all normal or almost normal values for a bunch of laboratory tests, and HIVfinal was a character vector that summarized preliminary and confirmatory testing for HIV.
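The same pattern can be tried on a built-in data set. This is only a sketch: mtcars and lm() stand in for the large data set and cph(), and keep_rows plays the role of gdlab2. How much memory is saved depends on the data and on the modelling function.

```r
# Built-in mtcars stands in for a large data set.
keep_rows <- mtcars$am == 1      # logical row filter, analogous to gdlab2 in the post

# Pass only the rows and columns the formula needs to the modelling function.
fit <- lm(mpg ~ wt + hp,
          data = subset(mtcars, keep_rows,
                        select = c("mpg", "wt", "hp")))

print(dim(model.frame(fit)))
```

If a variable is added to the formula but not to select=, the call fails with an "object not found" error, which is the mistake the post warns about.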

Question 21

I love Dirk's .ls.objects() script but I kept squinting to count characters in the size column. So I did some ugly hacks to make it present with pretty formatting for the size:

.ls.objects <- function (pos = 1, pattern, order.by,
                        decreasing=FALSE, head=FALSE, n=5) {
    napply <- function(names, fn) sapply(names, function(x)
                                         fn(get(x, pos = pos)))
    names <- ls(pos = pos, pattern = pattern)
    obj.class <- napply(names, function(x) as.character(class(x))[1])
    obj.mode <- napply(names, mode)
    obj.type <- ifelse(is.na(obj.class), obj.mode, obj.class)
    obj.size <- napply(names, object.size)
    obj.prettysize <- sapply(obj.size, function(r) prettyNum(r, big.mark = ",") )
    obj.dim <- t(napply(names, function(x)
                        as.numeric(dim(x))[1:2]))
    vec <- is.na(obj.dim)[, 1] & (obj.type != "function")
    obj.dim[vec, 1] <- napply(names, length)[vec]
    out <- data.frame(obj.type, obj.size, obj.prettysize, obj.dim)
    names(out) <- c("Type", "Size", "PrettySize", "Rows", "Columns")
    if (!missing(order.by))
        out <- out[order(out[[order.by]], decreasing=decreasing), ]
    # after sorting on the numeric size, show only the formatted size column
    out <- out[c("Type", "PrettySize", "Rows", "Columns")]
    names(out) <- c("Type", "Size", "Rows", "Columns")
    if (head)
        out <- head(out, n)
    out
}

Question 22

That's a good trick.

One other suggestion is to use memory efficient objects wherever possible: for instance, use a matrix instead of a data.frame.
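A quick way to check this on your own data is to compare object.size() for both representations. The sketch below uses made-up random data; for purely numeric data the size difference itself is often modest, so measure before converting.

```r
set.seed(1)
m  <- matrix(rnorm(1e4 * 10), ncol = 10)   # 10,000 x 10 numeric matrix
df <- as.data.frame(m)                     # the same values stored as a data.frame

print(object.size(m),  units = "Kb")
print(object.size(df), units = "Kb")
```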

This doesn't really address memory management, but one important function that isn't widely known is memory.limit(). You can increase the default using this command, memory.limit(size=2500), where the size is in MB. As Dirk mentioned, you need to be using 64-bit in order to take real advantage of this.

Question 23

Isn't this only applicable to Windows?

Question 24

> memory.limit()
[1] Inf
Warning message:
'memory.limit()' is Windows-specific

Question 25

Does using a tibble instead of a data.frame save even more memory?

Question 26

I quite like the improved objects function developed by Dirk. Much of the time though, a more basic output with the object name and size is sufficient for me. Here's a simpler function with a similar objective. Memory use can be ordered alphabetically or by size, can be limited to a certain number of objects, and can be ordered ascending or descending. Also, I often work with data that are 1GB+, so the function changes units accordingly.

showMemoryUse <- function(sort="size", decreasing=FALSE, limit) {

  env <- parent.frame()   # the environment the function was called from
  objectList <- ls(env)

  oneKB <- 1024
  oneMB <- 1048576
  oneGB <- 1073741824

  # look each name up in the caller's environment, so local variables here cannot shadow it
  memoryUse <- sapply(objectList, function(x) as.numeric(object.size(get(x, envir = env))))

  memListing <- sapply(memoryUse, function(size) {
        if (size >= oneGB) return(paste(round(size/oneGB,2), "GB"))
        else if (size >= oneMB) return(paste(round(size/oneMB,2), "MB"))
        else if (size >= oneKB) return(paste(round(size/oneKB,2), "kB"))
        else return(paste(size, "bytes"))
      })

  memListing <- data.frame(objectName=names(memListing),memorySize=memListing,row.names=NULL)

  if (sort=="alphabetical") memListing <- memListing[order(memListing$objectName,decreasing=decreasing),]
  else memListing <- memListing[order(memoryUse,decreasing=decreasing),] #will run if sort not specified or "size"

  if(!missing(limit)) memListing <- head(memListing, limit)   # no NA rows if limit exceeds the object count

  print(memListing, row.names=FALSE)
  return(invisible(memListing))
}

And here is some example output:

> showMemoryUse(decreasing=TRUE, limit=5)
      objectName memorySize
       coherData  713.75 MB
 spec.pgram_mine  149.63 kB
       stoch.reg  145.88 kB
      describeBy    82.5 kB
      lmBandpass   68.41 kB

Question 27

I never save an R workspace. I use import scripts and data scripts and output any especially large data objects that I don't want to recreate often to files. This way I always start with a fresh workspace and don't need to clean out large objects. That is a very nice function though.

Question 28

Unfortunately I did not have time to test it extensively, but here is a memory tip that I have not seen before. For me the required memory was reduced by more than 50%. When you read data into R with, for example, read.csv, the resulting objects require a certain amount of memory. After this you can save them with
save(list = ls(), file = "Destinationfile")
The next time you open R you can use
load("Destinationfile")
Now the memory usage might have decreased. It would be nice if anyone could confirm whether this produces similar results with a different dataset.

Question 29

yes, I experienced the same. The memory usage drops even to 30% in my case. 1.5GB memory used, saved to .RData (~30MB). New session after loading .RData uses less than 500MB of memory.

Question 30

I tried with 2 datasets (100MB and 2.7GB) loaded into data.table using fread, then saved to .RData. The RData files were indeed about 70% smaller, but after re-loading, the memory used was exactly the same. I was hoping this trick would reduce the memory footprint... am I missing something?

Question 31

@NoviceProg I don't think that you are missing something, but it is a trick, I guess it will not work for all situations. In my case the memory after re loading was actually reduced as described.

Question 32

@NoviceProg A couple things. First, fread, following data.table's credo is probably more memory efficient in loading files than is read.csv. Second, the memory savings people are noting here primarily have to do with the memory size of the R process (which expands to hold objects and retracts when garbage collection takes place). However, garbage collection does not always release all of the RAM back to the OS. Stopping the R session and loading the item from where it has been stored will release as much RAM as is possible... but if the overhead was small to begin with ... no gain.
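The part that happens inside R can be observed with gc(), which runs a collection and reports how many cells are in use. This is a minimal sketch with a made-up object; it shows R's own accounting only and says nothing about how much RAM the operating system gets back, which is the point made above.

```r
x <- numeric(1e7)                        # roughly 80 MB of doubles
used_with_x <- gc()["Vcells", "used"]    # vector cells in use while x exists

rm(x)                                    # drop the only reference to the vector
used_after <- gc()["Vcells", "used"]     # collect again and read the new count

print(c(with_x = used_with_x, after_rm = used_after))
```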

Question 33

To further illustrate the common strategy of frequent restarts, we can use littler, which allows us to run simple expressions directly from the command line. Here is an example I sometimes use to time different BLAS for a simple crossprod.

 r -e'N<-3*10^3; M<-matrix(rnorm(N*N),ncol=N); print(system.time(crossprod(M)))'

Likewise,

 r -lMatrix -e'example(spMatrix)'

loads the Matrix package (via the --packages | -l switch) and runs the examples of the spMatrix function. As r always starts 'fresh', this method is also a good test during package development.

Last but not least, r also works great for automated batch mode in scripts using the '#!/usr/bin/r' shebang header. Rscript is an alternative where littler is unavailable (e.g. on Windows).

Question 34

You pose a general question and then accept your own quite particular answer, even though a number of other contributions were ranked much higher?

Question 35

For both speed and memory purposes, when building a large data frame via some complex series of steps, I'll periodically flush it (the in-progress data set being built) to disk, appending to anything that came before, and then restart it. This way the intermediate steps are only working on smallish data frames (which is good as, e.g., rbind slows down considerably with larger objects). The entire data set can be read back in at the end of the process, when all the intermediate objects have been removed.

dfinal <- NULL
first <- TRUE
tempfile <- "dfinal_temp.csv"
for( i in bigloop ) {
    if( !i %% 10000 ) {
        message( i, "; flushing to disk..." )
        # row.names=FALSE keeps the appended chunks free of repeated row names
        write.table( dfinal, file=tempfile, append=!first, col.names=first, row.names=FALSE )
        first <- FALSE
        dfinal <- NULL   # nuke it
    }

    # ... complex operations here that add data to 'dfinal' data frame
}
message( "Loop done; flushing to disk and re-reading entire data set..." )
# append=!first also covers loops that never reached a flush
write.table( dfinal, file=tempfile, append=!first, col.names=first, row.names=FALSE )
dfinal <- read.table( tempfile, header=TRUE )

Question 36

Just to note that the data.table package's tables() seems to be a pretty good replacement for Dirk's .ls.objects() custom function (detailed in earlier answers), although just for data.frames/tables and not e.g. matrices, arrays, lists.

Question 37

this does not list any data.frames so it is not that great

Question 38

I'm fortunate and my large data sets are saved by the instrument in "chunks" (subsets) of roughly 100 MB (32bit binary). Thus I can do pre-processing steps (deleting uninformative parts, downsampling) sequentially before fusing the data set.
Calling gc() "by hand" can help if the size of the data gets close to available memory.
Sometimes a different algorithm needs much less memory.
Sometimes there's a trade-off between vectorization and memory use.
Compare: split & lapply vs. a for loop.
For the sake of fast & easy data analysis, I often work first with a small random subset (sample()) of the data. Once the data analysis script/.Rnw is finished, the data analysis code and the complete data go to the calculation server for overnight / over-weekend / ... calculation.
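The random-subset step from the last tip might look like the sketch below. The data frame, the subset size and the analyse() function are placeholders invented for the example.

```r
set.seed(42)
full <- data.frame(id = seq_len(1e5), value = rnorm(1e5))  # stand-in for the full data set

idx <- sample(nrow(full), size = 1000)   # 1,000 row indices drawn without replacement
dev <- full[idx, ]                       # small subset for developing the script

# Hypothetical analysis step, written and debugged against 'dev' first.
analyse <- function(d) c(n = nrow(d), mean = mean(d$value))
print(analyse(dev))
```

Once the script works on dev, the same analyse() call can be pointed at the complete data on the larger machine.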

Question 39

The use of environments instead of lists to handle collections of objects which occupy a significant amount of working memory.

The reason: each time an element of a list structure is modified, the whole list is temporarily duplicated. This becomes an issue if the storage requirement of the list is about half the available working memory, because then data has to be swapped to the slow hard disk. Environments, on the other hand, aren't subject to this behaviour and they can be treated similarly to lists.

Here is an example:

get.data <- function(x)
{
  # get some data based on x
  return(paste("data from",x))
}

collect.data <- function(i,x,env)
{
  # get some data
  data <- get.data(x[[i]])
  # store data into environment
  element.name <- paste("V",i,sep="")
  env[[element.name]] <- data
  return(NULL)
}

better.list <- new.env()
filenames <- c("file1","file2","file3")
lapply(seq_along(filenames),collect.data,x=filenames,env=better.list)

# read/write access
print(better.list[["V1"]])
better.list[["V2"]] <- "testdata"
# number of list elements
length(ls(better.list))

In conjunction with structures such as big.matrix or data.table, which allow for altering their content in place, very efficient memory usage can be achieved.

Question 40

This is no longer true: from Hadley's Advanced R, "Changes to R 3.1.0 have made this use [of environments] substantially less important because modifying a list no longer makes a deep copy."

Question 41

The ll function in the gdata package can show the memory usage of each object as well.

gdata::ll(unit='MB')

Question 42

Not on my system: R version 3.1.1 (2014-07-10), x86_64-pc-linux-gnu (64-bit), gdata_2.13.3, gtools_3.4.1.

Question 43

You are right. I tested it once, and it was ordered by chance!

Question 44

please modify the function to use Gb, Mb

Question 45

If you really want to avoid the leaks, you should avoid creating any big objects in the global environment.

What I usually do is to have a function that does the job and returns NULL; all data is read and manipulated in this function or in others that it calls.
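A minimal sketch of that pattern, with a made-up job: the large object exists only inside the function, the small result goes to a file, and nothing big is returned to the global environment. The function name, the output path and the summary computed are assumptions for illustration.

```r
# Hypothetical job: all large intermediates live only inside the function.
run_job <- function(n, out_file) {
  big <- rnorm(n)                                   # large temporary object
  result <- data.frame(mean = mean(big), sd = sd(big))
  write.csv(result, out_file, row.names = FALSE)    # persist only the small result
  invisible(NULL)                                   # nothing large is returned
}

out_file <- file.path(tempdir(), "job_result.csv")
run_job(1e6, out_file)
print(read.csv(out_file))
```

When run_job() returns, big is no longer referenced and becomes eligible for garbage collection.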

Question 46

With only 4GB of RAM (running Windows 10, so make that about 2 or more realistically 1GB) I've had to be real careful with the allocation.

I use data.table almost exclusively.

The 'fread' function allows you to subset information by field names on import; only import the fields that are actually needed to begin with. If you're using base R read, null the spurious columns immediately after import.
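For the base R route, a small sketch follows. The file and column names are invented. Option 1 is the "null it right after import" approach described above; Option 2, skipping the column at read time with colClasses = "NULL", is an addition that is not in the post. With data.table, the corresponding mechanism is the select argument of fread.

```r
# Write a small stand-in CSV so the example is self-contained.
csv_file <- file.path(tempdir(), "wide.csv")
write.csv(data.frame(id = 1:3, keep = c(0.1, 0.2, 0.3), junk = c("a", "b", "c")),
          csv_file, row.names = FALSE)

# Option 1: read everything, then null the unwanted column straight away.
d1 <- read.csv(csv_file)
d1$junk <- NULL

# Option 2 (an addition, not from the post): skip the column while reading.
d2 <- read.csv(csv_file, colClasses = c("integer", "numeric", "NULL"))

print(names(d1))
print(names(d2))
```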

As 42- suggests, wherever possible I will then subset within the columns immediately after importing the information.

I frequently rm() objects from the environment as soon as they're no longer needed, e.g. on the next line after using them to subset something else, and call gc().

'fread' and 'fwrite' from data.table can be very fast by comparison with base R reads and writes.

As kpierce8 suggests, I almost always fwrite everything out of the environment and fread it back in, even with thousands / hundreds of thousands of tiny files to get through. This not only keeps the environment 'clean' and keeps the memory allocation low but, possibly due to the severe lack of RAM available, R has a propensity for frequently crashing on my computer; really frequently. Having the information backed up on the drive itself as the code progresses through various stages means I don't have to start right from the beginning if it crashes.

As of 2017, I think the fastest SSDs are running around a few GB per second through the M2 port. I have a really basic 50GB Kingston V300 (550MB/s) SSD that I use as my primary disk (has Windows and R on it). I keep all the bulk information on a cheap 500GB WD platter. I move the data sets to the SSD when I start working on them. This, combined with 'fread'ing and 'fwrite'ing everything has been working out great. I've tried using 'ff' but prefer the former. 4K read/write speeds can create issues with this though; backing up a quarter of a million 1k files (250MBs worth) from the SSD to the platter can take hours. As far as I'm aware, there isn't any R package available yet that can automatically optimise the 'chunkification' process; e.g. look at how much RAM a user has, test the read/write speeds of the RAM / all the drives connected and then suggest an optimal 'chunkification' protocol. This could produce some significant workflow improvements / resource optimisations; e.g. split it to ... MB for the ram -> split it to ... MB for the SSD -> split it to ... MB on the platter -> split it to ... MB on the tape. It could sample data sets beforehand to give it a more realistic gauge stick to work from.

A lot of the problems I've worked on in R involve forming combination and permutation pairs, triples etc, which only makes having limited RAM more of a limitation as they will often at least exponentially expand at some point. This has made me focus a lot of attention on the quality as opposed to quantity of information going into them to begin with, rather than trying to clean it up afterwards, and on the sequence of operations in preparing the information to begin with (starting with the simplest operation and increasing the complexity); e.g. subset, then merge / join, then form combinations / permutations etc.

There do seem to be some benefits to using base R read and write in some instances. For instance, the error detection within 'fread' is so good it can be difficult trying to get really messy information into R to begin with to clean it up. Base R also seems to be a lot easier if you're using Linux. Base R seems to work fine in Linux; Windows 10 uses ~20GB of disc space whereas Ubuntu only needs a few GB, and the RAM needed with Ubuntu is slightly lower. But I've noticed large quantities of warnings and errors when installing third party packages in (L)Ubuntu. I wouldn't recommend drifting too far away from (L)Ubuntu or other stock distributions with Linux as you can lose so much overall compatibility it renders the process almost pointless (I think 'unity' is due to be cancelled in Ubuntu as of 2017). I realise this won't go down well with some Linux users but some of the custom distributions are borderline pointless beyond novelty (I've spent years using Linux alone).

Hopefully some of that might help others out.

Question 47

This is a newer answer to this excellent old question. From Hadley's Advanced R:

install.packages("pryr")
library(pryr)

object_size(1:10)
## 88 B
object_size(mean)
## 832 B
object_size(mtcars)
## 6.74 kB

(http://adv-r.had.co.nz/memory.html)

Question 48

This adds nothing to the above, but is written in the simple and heavily commented style that I like. It yields a table with the objects ordered by size, but without some of the detail given in the examples above:

#Find the objects
MemoryObjects = ls()
#Create an array
MemoryAssessmentTable=array(NA,dim=c(length(MemoryObjects),2))
#Name the columns
colnames(MemoryAssessmentTable)=c("object","bytes")
#Define the first column as the objects
MemoryAssessmentTable[,1]=MemoryObjects
#Define a function to determine size
MemoryAssessmentFunction=function(x){object.size(get(x))}
#Apply the function to the objects
MemoryAssessmentTable[,2]=t(t(sapply(MemoryAssessmentTable[,1],MemoryAssessmentFunction)))
#Produce a table with the largest objects first
noquote(MemoryAssessmentTable[rev(order(as.numeric(MemoryAssessmentTable[,2]))),])

Question 49

As well as the more general memory management techniques given in the answers above, I always try to reduce the size of my objects as far as possible. For example, I work with very large but very sparse matrices, in other words matrices where most values are zero. Using the 'Matrix' package (capitalisation important) I was able to reduce my average object sizes from ~2GB to ~200MB as simply as:

my.matrix <- Matrix(my.matrix)

The Matrix package includes data formats that can be used exactly like a regular matrix (no need to change your other code) but are able to store sparse data much more efficiently, whether loaded into memory or saved to disk.
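To see the effect on a toy case, the sketch below builds a mostly-zero dense matrix and converts it. The dimensions and sparsity are made up; actual savings depend on how sparse your data really is.

```r
library(Matrix)   # distributed with R as a recommended package

set.seed(1)
dense <- matrix(0, nrow = 1000, ncol = 1000)
dense[sample(length(dense), 1000)] <- 1      # about 0.1% of the entries are non-zero

sparse <- Matrix(dense, sparse = TRUE)       # sparse representation of the same values

print(object.size(dense),  units = "Mb")
print(object.size(sparse), units = "Mb")
```

Only the non-zero entries and their positions are stored in the sparse object, which is where the reduction comes from.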

Additionally, the raw files I receive are in 'long' format where each data point has variables x, y, z, i. It is much more efficient to transform the data into an x * y * z dimension array with only variable i.
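One way to do that conversion is matrix indexing, sketched below on invented data. It assumes x, y and z are already 1-based integer indices and that most (x, y, z) combinations are present; for very sparse coordinates the array could end up larger than the long table.

```r
# Long format: one row per data point, with coordinates x, y, z and value i.
long <- expand.grid(x = 1:3, y = 1:2, z = 1:4)
long$i <- seq_len(nrow(long)) / 10

# Allocate the array, then fill it by indexing with an (x, y, z) matrix.
arr <- array(NA_real_, dim = c(max(long$x), max(long$y), max(long$z)))
arr[cbind(long$x, long$y, long$z)] <- long$i

print(dim(arr))
print(object.size(long), units = "Kb")
print(object.size(arr),  units = "Kb")
```

The array stores only the values of i; the coordinates are implied by position.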

Know your data and use a bit of common sense.

Question 50

If you are working on Linux and want to use several processes and only have to do read operations on one or more large objects, use makeForkCluster instead of makePSOCKcluster. This also saves you the time of sending the large object to the other processes.

Question 51

I really appreciate some of the answers above. Following @hadley and @Dirk, who suggest closing R, issuing source() and using the command line, I came up with a solution that worked very well for me. I had to deal with hundreds of mass spectra, each occupying around 20 Mb of memory, so I used two R scripts, as follows:

First a wrapper:

#!/usr/bin/Rscript --vanilla --default-packages=utils

for(l in 1:length(fdir)) {

   for(k in 1:length(fds)) {
     system(paste("Rscript runConsensus.r", l, k))
   }
}

With this script I basically control what my main script, runConsensus.r, does, and I write the data answer for the output. With this, each time the wrapper calls the script, it seems that R is reopened and the memory is freed.

Hope it helps.

Question 52

Tip for dealing with objects requiring heavy intermediate calculation: When using objects that require a lot of heavy calculation and intermediate steps to create, I often find it useful to write a chunk of code with the function to create the object, and then a separate chunk of code that gives me the option either to generate and save the object as an rds file, or load it externally from an rds file I have already previously saved. This is especially easy to do in R Markdown using the following code-chunk structure.

```{r Create OBJECT}
COMPLICATED.FUNCTION <- function(...) {
  # Do heavy calculations needing lots of memory
  # Output OBJECT
}
```

```{r Generate or load OBJECT}
LOAD <- TRUE
SAVE <- TRUE
#NOTE: Set LOAD to TRUE if you want to load saved file
#NOTE: Set LOAD to FALSE if you want to generate the object from scratch
#NOTE: Set SAVE to TRUE if you want to save the object externally

if(LOAD) {
  OBJECT <- readRDS(file = 'MySavedObject.rds')
} else {
  OBJECT <- COMPLICATED.FUNCTION(x, y, z)
  if (SAVE) { saveRDS(file = 'MySavedObject.rds', object = OBJECT) }
}
```

With this code structure, all I need to do is to change LOAD depending on whether I want to generate the object, or load it directly from an existing saved file. (Of course, I have to generate it and save it the first time, but after this I have the option of loading it.) Setting LOAD <- TRUE bypasses use of my complicated function and avoids all of the heavy computation therein. This method still requires enough memory to store the object of interest, but it saves you from having to calculate it each time you run your code. For objects that require a lot of heavy calculation of intermediate steps (e.g., for calculations involving loops over large arrays) this can save a substantial amount of time and computation.

Question 53

Running

for (i in 1:10)
    gc(reset = TRUE)

from time to time also helps R to free unused but still not released memory.

Question 54

What does the for loop do here? There's no i in the gc call.

Question 55

@qqq it is there just to avoid copy-pasting gc(reset = TRUE) nine times

Question 56

But why would you run it 9 times? (curious, not critical)

Question 57

You can also get some benefit from using knitr and putting your script in Rmd chunks.

I usually divide the code into different chunks and select which one will save a checkpoint to cache or to an RDS file.

There you can set a chunk to be saved to "cache", or you can decide whether or not to run a particular chunk. In this way, in a first run you can process only "part 1", in another execution you can select only "part 2", etc.

Example:

part1
```{r corpus, warning=FALSE, cache=TRUE, message=FALSE, eval=TRUE}
corpusTw <- corpus(twitter)  # build the corpus
```
part2
```{r trigrams, warning=FALSE, cache=TRUE, message=FALSE, eval=FALSE}
dfmTw <- dfm(corpusTw, verbose=TRUE, removeTwitter=TRUE, ngrams=3)
```

As a side effect, this also could save you some headaches in terms of reproducibility :)

Question 58

Based on @Dirk's and @Tony's answer I have made a slight update. The result was outputting [1] before the pretty size values, so I took out the capture.output, which solved the problem:

.ls.objects <- function (pos = 1, pattern, order.by,
                     decreasing=FALSE, head=FALSE, n=5) {
    napply <- function(names, fn) sapply(names, function(x)
        fn(get(x, pos = pos)))
    names <- ls(pos = pos, pattern = pattern)
    obj.class <- napply(names, function(x) as.character(class(x))[1])
    obj.mode <- napply(names, mode)
    obj.type <- ifelse(is.na(obj.class), obj.mode, obj.class)
    obj.prettysize <- napply(names, function(x) {
        format(utils::object.size(x),  units = "auto") })
    obj.size <- napply(names, utils::object.size)
    obj.dim <- t(napply(names, function(x)
        as.numeric(dim(x))[1:2]))
    vec <- is.na(obj.dim)[, 1] & (obj.type != "function")
    obj.dim[vec, 1] <- napply(names, length)[vec]
    out <- data.frame(obj.type, obj.size, obj.prettysize, obj.dim)
    names(out) <- c("Type", "Size", "PrettySize", "Rows", "Columns")
    if (!missing(order.by))
        out <- out[order(out[[order.by]], decreasing=decreasing), ]
    if (head)
        out <- head(out, n)
    return(out)
}

# shorthand
lsos <- function(..., n=10) {
    .ls.objects(..., order.by="Size", decreasing=TRUE, head=TRUE, n=n)
}

lsos()

Question 59

I try to keep the amount of objects small when working in a larger project with a lot of intermediate steps. So instead of creating many unique objects called

dataframe -> step1 -> step2 -> step3 -> result

raster -> multipliedRast -> meanRastF -> sqrtRast -> resultRast

I work with temporary objects that I call temp.

dataframe -> temp -> temp -> temp -> result

Which leaves me with fewer intermediate files and a better overview.

raster  <- raster('file.tif')
temp <- raster * 10
temp <- mean(temp)
resultRast <- sqrt(temp)

To save more memory I can simply remove temp when it is no longer needed.

rm(temp)

If I need several intermediate files, I use temp1, temp2, temp3.

For testing I use test, test2, ...

Question 60

rm(list=ls()) is a great way to keep you honest and keep things reproducible.

Question 61

No, there is a fairly well established consensus that that is not a good recommendation. See e.g. this often-quoted tweet / statement. I just start from many fresh R processes at the command line, which has the same effect and zero risk of accidentally deleting hours or weeks of work in another long-lived session.

Accepted answer (Question 7 above): hadley, answered 2009-08-31 16:09:59Z, score 215, recommended by R Language Collective.

Comments on the accepted answer (Questions 8 to 12 above): Josh Reich (2009-09-01), hadley (2009-09-04), Vince (2010-06-15), sds (2013-07-15), hadley (2013-12-19).
