R and data mining: not enough memory?

EricJ

I'm using R for data mining purposes. I connected it to Elasticsearch and retrieved a dataset of Shakespeare's complete works.

library("elastic")
connect()
maxi <- count(index = 'shakespeare')
s <- Search(index = 'shakespeare',size=maxi)

dat <- s$hits$hits[[1]]$`_source`$text_entry
for (i in 2:maxi) {
  dat <- c(dat , s$hits$hits[[i]]$`_source`$text_entry)
}
rm(s)

Since I only want the dialogue, I use a for loop to extract just that field. The object 's' is around 250 MB, while 'dat' is only about 10 MB.
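The same extraction can also be done without growing a vector inside a loop; here is a sketch, assuming s$hits$hits has the structure used above:

dat <- vapply(s$hits$hits,
              function(hit) hit$`_source`$text_entry,   # one dialogue line per hit
              FUN.VALUE = character(1))
rm(s)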

After that I want to build a tf-idf matrix, but apparently I can't, since it uses too much memory (I have 4 GB of RAM). Here is my code:

library("tm")
myCorpus <- Corpus(VectorSource(dat))
myCorpus <- tm_map(myCorpus, content_transformer(tolower),lazy = TRUE)
myCorpus <- tm_map(myCorpus, content_transformer(removeNumbers),lazy = TRUE)
myCorpus <- tm_map(myCorpus, content_transformer(removePunctuation),lazy = TRUE)
myCorpus <- tm_map(myCorpus, content_transformer(removeWords), stopwords("en"),lazy = TRUE)
myTdm <- TermDocumentMatrix(myCorpus,control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE)))

myCorpus is around 400 MB.
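For reference, these sizes can be checked in the session with base R's object.size:

format(object.size(myCorpus), units = "Mb")   # reports the in-memory size of the corpus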

But then I do:

> m <- as.matrix(myTdm)
Error in vector(typeof(x$v), nr * nc) : vector size cannot be NA
In addition: Warning message:
In nr * nc : NAs produced by integer overflow

Any ideas? Is the dataset too much for R?

EDIT:

removeSparseTerms doesn't work well here: with sparse = 0.95 it leaves 0 terms:

inspect(myTdm)
<<TermDocumentMatrix (terms: 27227, documents: 111396)>>
Non-/sparse entries: 410689/3032568203
Sparsity           : 100%
Maximal term length: 37
Weighting          : term frequency (tf)
Mhairi McNeill

A term-document matrix will, in general, contain lots of zeros; many terms appear in only one document. The tm package stores term-document matrices as sparse matrices, which are a space-efficient way of storing this type of matrix. (The underlying storage format comes from the slam package; see ?slam::simple_triplet_matrix for details.)
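Many summaries can be computed directly on that sparse representation without ever building the dense matrix. As a sketch (row_sums comes from the slam package, which tm uses internally):

library("slam")
term_totals <- row_sums(myTdm)                    # total weight per term, computed on the sparse matrix
head(sort(term_totals, decreasing = TRUE), 20)    # the 20 highest-weighted terms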

When you try to convert to a regular matrix, that representation is far less space efficient, and it is what makes R run out of memory. You can use removeSparseTerms before you convert to a matrix, to try to make the full matrix small enough to work with.
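To see why the conversion fails, plug in the dimensions reported by inspect(myTdm); and since sparse = 0.95 turned out to be far too aggressive for 111396 documents, a threshold much closer to 1 is worth trying. A sketch (the 0.999 value is only an illustration, not a recommendation):

27227 * 111396                  # ~3.03e9 cells; as an integer product this exceeds
                                # .Machine$integer.max (~2.15e9), hence the NA inside as.matrix()
27227 * 111396 * 8 / 1024^3     # ~22.6 GB for a dense numeric matrix, far more than 4 GB of RAM

# sparse = 0.95 keeps only terms occurring in at least ~5% of the 111396 documents
# (about 5570 of them), which is why nothing survives. A value closer to 1 is gentler:
smallTdm <- removeSparseTerms(myTdm, sparse = 0.999)   # keep terms in at least ~0.1% of documents
inspect(smallTdm)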

I'm pretty sure this is what is happening, but it's hard to know for sure without being able to run your code on your machine.
