
I'm using R for data mining purposes. I connected it to Elasticsearch and retrieved a dataset of Shakespeare's Complete Works.

library("elastic")connect()maxi <- count(index = 'shakespeare')s <- Search(index = 'shakespeare',size=maxi)dat <- s$hits$hits[[1]]$`_source`$text_entryfor (i in 2:maxi) {  dat <- c(dat , s$hits$hits[[i]]$`_source`$text_entry)}rm(s)

Since I only want the dialogue, I loop over the hits to extract just that field. The object 's' is around 250 Mb and 'dat' only 10 Mb.

After that I want to build a tf-idf matrix, but apparently I can't, since it uses too much memory (I have 4 GB of RAM). Here is my code:

library("tm")myCorpus <- Corpus(VectorSource(dat))myCorpus <- tm_map(myCorpus, content_transformer(tolower),lazy = TRUE)myCorpus <- tm_map(myCorpus, content_transformer(removeNumbers),lazy = TRUE)myCorpus <- tm_map(myCorpus, content_transformer(removePunctuation),lazy = TRUE)myCorpus <- tm_map(myCorpus, content_transformer(removeWords), stopwords("en"),lazy = TRUE)myTdm <- TermDocumentMatrix(myCorpus,control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE)))

myCorpus is around 400 Mb.
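
(Those sizes can be checked directly; a trivial sketch:)

format(object.size(dat), units = "Mb")       # the raw dialogue vector, ~10 Mb
format(object.size(myCorpus), units = "Mb")  # the tm corpus, ~400 Mb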

But then I do:

> m <- as.matrix(myTdm)
Error in vector(typeof(x$v), nr * nc) : vector size cannot be NA
In addition: Warning message:
In nr * nc : NAs produced by integer overflow
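
The overflow can be seen without calling as.matrix() at all, by computing the size the dense matrix would need; a quick sketch (the dimensions match the inspect() output in the EDIT below):

nr <- nrow(myTdm)                          # 27,227 terms
nc <- ncol(myTdm)                          # 111,396 documents
cells <- as.numeric(nr) * as.numeric(nc)   # ~3.03e9 cells; nr * nc on integers overflows
cells > .Machine$integer.max               # TRUE, hence the NA in vector(typeof(x$v), nr * nc)
cells * 8 / 1024^3                         # ~22.6 Gb for a dense matrix of doubles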

Any ideas? Is the dataset too big for R?

EDIT:

removeSparseTerms doesn't work well; with sparse = 0.95 it leaves 0 terms:

inspect(myTdm)
<<TermDocumentMatrix (terms: 27227, documents: 111396)>>
Non-/sparse entries: 410689/3032568203
Sparsity           : 100%
Maximal term length: 37
Weighting          : term frequency (tf)
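
For what it's worth, with this many one-line documents a sparse value of 0.95 is extremely aggressive: a term must appear in roughly 5% of all 111,396 documents to survive, which almost no word does. A rough check, assuming the usual meaning of removeSparseTerms' sparse argument:

n_docs <- 111396
(1 - 0.95)  * n_docs   # a term must appear in roughly 5,570 documents to be kept
(1 - 0.999) * n_docs   # with sparse = 0.999 only about 111 documents are needed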
asked Jul 9, 2015 at 10:09 by EricJ

1 Answer


A term document matrix will, in general, contain lots of zeros; many terms will only appear in one document. The tm library stores term document matrices as sparse matrices, which are a space-efficient way of storing this type of matrix. (You can read more about the storage format used by tm here: http://127.0.0.1:19303/library/slam/html/matrix.html)
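
A minimal sketch of what that sparse (simple triplet) representation looks like for myTdm, using the i/j/v fields that the slam package exposes:

str(myTdm$i)   # row (term) indices of the non-zero entries
str(myTdm$j)   # column (document) indices of the non-zero entries
str(myTdm$v)   # the non-zero values themselves (~410k of them here)
# Only the non-zero cells are stored, so the object stays a few Mb
# instead of the ~22 Gb a dense 27227 x 111396 matrix of doubles would need.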

When you try to convert to a regular matrix, this is a lot less space efficient and makes R run out of memory. You can use removeSparseTerms before you convert to a matrix, to try to make the full matrix small enough to work with.
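
A sketch of that approach; the 0.999 value is only a starting point (see the comments below):

# Drop terms absent from more than 99.9% of documents, then densify
# only the much smaller matrix that remains
myTdmSmall <- removeSparseTerms(myTdm, sparse = 0.999)
myTdmSmall                    # check how many terms survive first
m <- as.matrix(myTdmSmall)    # this should now fit in 4 GB of RAM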

I'm pretty sure this is what is happening but it's hard to know for sure without being able to run your code on your machine.

answered Jul 9, 2015 at 10:36 by Mhairi McNeill

7 Comments

Now the problem is that with sparse 0.95 it only leaves like 10 words, so it's too small.
I've often used sparse = 0.999 to get a decent number of words. Try a few different values.
I've noticed in your edit that you seem to have 111,396 documents. Is that right for Shakespeare's complete works? I reckon that some of those documents are empty or almost empty, which will cause problems.
Tomorrow I'll check that, and I'll get back. thanks!
Ah, that's a shame. It is going to be tricky if you make each document a sentence, as it will lead to very big term-document matrices. Have you tried using the bigmemory package? cran.r-project.org/web/packages/bigmemory/index.html I've never used it, but I think it's meant for this kind of situation.
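
If every term needs to be kept, another option (not mentioned in the thread above, just a common pattern) is to hand the triplet representation to the Matrix package and keep working with a sparse matrix instead of ever densifying; a hedged sketch:

library(Matrix)
# Re-pack the slam triplet storage as a dgCMatrix: same data,
# still only ~410k stored values, never a dense 27227 x 111396 allocation
m_sparse <- sparseMatrix(i = myTdm$i,
                         j = myTdm$j,
                         x = myTdm$v,
                         dims = c(myTdm$nrow, myTdm$ncol),
                         dimnames = myTdm$dimnames)
dim(m_sparse)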