
I'm using R for data mining purposes. I connected it to Elasticsearch and retrieved a dataset of Shakespeare's Complete Works.

library("elastic")connect()maxi <- count(index = 'shakespeare')s <- Search(index = 'shakespeare',size=maxi)dat <- s$hits$hits[[1]]$`_source`$text_entryfor (i in 2:maxi) {  dat <- c(dat , s$hits$hits[[i]]$`_source`$text_entry)}rm(s)

Since I only want the dialogue, I loop over the hits to extract just that field. The object 's' is around 250 Mb and 'dat' only 10 Mb.

After that I want to build a tf-idf matrix, but apparently I can't, since it uses too much memory (I have 4 GB of RAM). Here is my code:

library("tm")myCorpus <- Corpus(VectorSource(dat))myCorpus <- tm_map(myCorpus, content_transformer(tolower),lazy = TRUE)myCorpus <- tm_map(myCorpus, content_transformer(removeNumbers),lazy = TRUE)myCorpus <- tm_map(myCorpus, content_transformer(removePunctuation),lazy = TRUE)myCorpus <- tm_map(myCorpus, content_transformer(removeWords), stopwords("en"),lazy = TRUE)myTdm <- TermDocumentMatrix(myCorpus,control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE)))

myCorpus is around 400 Mb.
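
(Those sizes can be checked directly; a trivial sketch:)

format(object.size(dat), units = "Mb")       # the raw dialogue vector, ~10 Mb
format(object.size(myCorpus), units = "Mb")  # the tm corpus, ~400 Mb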

But then I do:

> m <- as.matrix(myTdm)
Error in vector(typeof(x$v), nr * nc) : vector size cannot be NA
In addition: Warning message:
In nr * nc : NAs produced by integer overflow
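
The overflow can be seen without calling as.matrix() at all, by computing the size the dense matrix would need; a quick sketch (the dimensions match the inspect() output in the EDIT below):

nr <- nrow(myTdm)                          # 27,227 terms
nc <- ncol(myTdm)                          # 111,396 documents
cells <- as.numeric(nr) * as.numeric(nc)   # ~3.03e9 cells; nr * nc on integers overflows
cells > .Machine$integer.max               # TRUE, hence the NA in vector(typeof(x$v), nr * nc)
cells * 8 / 1024^3                         # ~22.6 Gb for a dense matrix of doubles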

Any ideas? Is the dataset too big for R?

EDIT:

removeSparseTerms doesn't work well; with sparse = 0.95 it leaves 0 terms:

inspect(myTdm)
<<TermDocumentMatrix (terms: 27227, documents: 111396)>>
Non-/sparse entries: 410689/3032568203
Sparsity           : 100%
Maximal term length: 37
Weighting          : term frequency (tf)
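
For what it's worth, with this many one-line documents a sparse value of 0.95 is extremely aggressive: a term must appear in roughly 5% of all 111,396 documents to survive, which almost no word does. A rough check, assuming the usual meaning of removeSparseTerms' sparse argument:

n_docs <- 111396
(1 - 0.95)  * n_docs   # a term must appear in roughly 5,570 documents to be kept
(1 - 0.999) * n_docs   # with sparse = 0.999 only about 111 documents are needed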
asked Jul 9, 2015 at 10:09 by EricJ

1 Answer


A term document matrix will, in general, contain lots of zeros; many terms will only appear in one document. The tm library stores term document matrices as sparse matrices, which are a space-efficient way of storing this type of matrix. (You can read more about the storage format used by tm here: http://127.0.0.1:19303/library/slam/html/matrix.html)
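
A minimal sketch of what that sparse (simple triplet) representation looks like for myTdm, using the i/j/v fields that the slam package exposes:

str(myTdm$i)   # row (term) indices of the non-zero entries
str(myTdm$j)   # column (document) indices of the non-zero entries
str(myTdm$v)   # the non-zero values themselves (~410k of them here)
# Only the non-zero cells are stored, so the object stays a few Mb
# instead of the ~22 Gb a dense 27227 x 111396 matrix of doubles would need.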

When you try to convert to a regular matrix, this is a lot less space efficient and makes R run out of memory. You can use removeSparseTerms before you convert to a matrix, to try to make the full matrix small enough to work with.
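
A sketch of that approach; the 0.999 value is only a starting point (see the comments below):

# Drop terms absent from more than 99.9% of documents, then densify
# only the much smaller matrix that remains
myTdmSmall <- removeSparseTerms(myTdm, sparse = 0.999)
myTdmSmall                    # check how many terms survive first
m <- as.matrix(myTdmSmall)    # this should now fit in 4 GB of RAM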

I'm pretty sure this is what is happening but it's hard to know for sure without being able to run your code on your machine.

answered Jul 9, 2015 at 10:36 by Mhairi McNeill

7 Comments

Now the problem is that with sparse 0.95 it only leaves like 10 words, so it's too small.
I've often used sparse = 0.999 to get a decent number of words. Try a few different values.
I've noticed in your edit that you seem to have 111,396 documents. Is that right for Shakespeare's complete works? I reckon that some of those documents are empty or almost empty, which will cause problems.
Tomorrow I'll check that, and I'll get back. thanks!
Ah, that's a shame. It is going to be tricky if you make each document a sentence, as it will lead to very big term-document matrices. Have you tried using the bigmemory package? cran.r-project.org/web/packages/bigmemory/index.html I've never used it, but I think it's meant for this kind of situation.
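
If every term needs to be kept, another option (not mentioned in the thread above, just a common pattern) is to hand the triplet representation to the Matrix package and keep working with a sparse matrix instead of ever densifying; a hedged sketch:

library(Matrix)
# Re-pack the slam triplet storage as a dgCMatrix: same data,
# still only ~410k stored values, never a dense 27227 x 111396 allocation
m_sparse <- sparseMatrix(i = myTdm$i,
                         j = myTdm$j,
                         x = myTdm$v,
                         dims = c(myTdm$nrow, myTdm$ncol),
                         dimnames = myTdm$dimnames)
dim(m_sparse)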