How does the removeSparseTerms in R work?

London guy Published at Dev

London guy

I am using the removeSparseTerms method in R and it required a threshold value to be input. I also read that the higher the value, the more will be the number of terms retained in the returned matrix.

How does this method work and what is the logic behind it? I understand the concept of sparseness but does this threshold indicate how many documents should a term be present it, or some other ratio, etc?

Ken Benoit

In the sense of the sparse argument to removeSparseTerms(), sparsity refers to the threshold of relative document frequency for a term, above which the term will be removed. Relative document frequency here means a proportion. As the help page for the command states (although not very clearly), sparsity is smaller as it approaches 1.0. (Note that sparsity cannot take values of 0 or 1.0, only values in between.)

For example, if you set sparse = 0.99 as the argument to removeSparseTerms(), then this will remove only terms that are more sparse than 0.99. The exact interpretation for sparse = 0.99 is that for term $j$, you will retain all terms for which $df_j > N * (1 - 0.99)$, where $N$ is the number of documents -- in this case probably all terms will be retained (see example below).

Near the other extreme, if sparse = .01, then only terms that appear in (nearly) every document will be retained. (Of course this depends on the number of terms and the number of documents, and in natural language, common words like "the" are likely to occur in every document and hence never be "sparse".)

An example of the sparsity threshold of 0.99, where a term that occurs at most in (first example) less than 0.01 documents, and (second example) just over 0.01 documents:

> # second term occurs in just 1 of 101 documents
> myTdm1 <- as.DocumentTermMatrix(slam::as.simple_triplet_matrix(matrix(c(rep(1, 101), rep(1,1), rep(0, 100)), ncol=2)), 
+                                weighting = weightTf)
> removeSparseTerms(myTdm1, .99)
<<DocumentTermMatrix (documents: 101, terms: 1)>>
Non-/sparse entries: 101/0
Sparsity           : 0%
Maximal term length: 2
Weighting          : term frequency (tf)
> 
> # second term occurs in 2 of 101 documents
> myTdm2 <- as.DocumentTermMatrix(slam::as.simple_triplet_matrix(matrix(c(rep(1, 101), rep(1,2), rep(0, 99)), ncol=2)), 
+                                weighting = weightTf)
> removeSparseTerms(myTdm2, .99)
<<DocumentTermMatrix (documents: 101, terms: 2)>>
Non-/sparse entries: 103/99
Sparsity           : 49%
Maximal term length: 2
Weighting          : term frequency (tf)

Here are a few additional examples with actual text and terms:

> myText <- c("the quick brown furry fox jumped over a second furry brown fox",
              "the sparse brown furry matrix",
              "the quick matrix")

> require(tm)
> myVCorpus <- VCorpus(VectorSource(myText))
> myTdm <- DocumentTermMatrix(myVCorpus)
> as.matrix(myTdm)
    Terms
Docs brown fox furry jumped matrix over quick second sparse the
   1     2   2     2      1      0    1     1      1      0   1
   2     1   0     1      0      1    0     0      0      1   1
   3     0   0     0      0      1    0     1      0      0   1
> as.matrix(removeSparseTerms(myTdm, .01))
    Terms
Docs the
   1   1
   2   1
   3   1
> as.matrix(removeSparseTerms(myTdm, .99))
    Terms
Docs brown fox furry jumped matrix over quick second sparse the
   1     2   2     2      1      0    1     1      1      0   1
   2     1   0     1      0      1    0     0      0      1   1
   3     0   0     0      0      1    0     1      0      0   1
> as.matrix(removeSparseTerms(myTdm, .5))
    Terms
Docs brown furry matrix quick the
   1     2     2      0     1   1
   2     1     1      1     0   1
   3     0     0      1     1   1

In the last example with sparse = 0.34, only terms occurring in two-thirds of the documents were retained.

An alternative approach for trimming terms from document-term matrixes based on a document frequency is the text analysis package quanteda. The same functionality here refers not to sparsity but rather directly to the document frequency of terms (as in tf-idf).

> require(quanteda)
> myDfm <- dfm(myText, verbose = FALSE)
> docfreq(myDfm)
     a  brown    fox  furry jumped matrix   over  quick second sparse    the 
     1      2      1      2      1      2      1      2      1      1      3 
> dfm_trim(myDfm, minDoc = 2)
Features occurring in fewer than 2 documents: 6 
Document-feature matrix of: 3 documents, 5 features.
3 x 5 sparse Matrix of class "dfmSparse"
       features
docs    brown furry the matrix quick
  text1     2     2   1      0     1
  text2     1     1   1      1     0
  text3     0     0   1      1     1

This usage seems much more straightforward to me.

Collected from the Internet

Please contact [email protected] to delete if infringement.

edited at2021-02-16

Comments

0 comments

From Dev

Related Related

Article

How does the removeSparseTerms in R work?

How does the removeSparseTerms in R work?

Equivalent of R's removeSparseTerms in Python

How does the '|' operator work in R?

How does the `[<-` function work in R?

how does [[ ]] work in for loop in r?

How does this R code work?

How does character comparison work in R?

how does \r (carriage return) work in Python

R, How does while (TRUE) work?

How does the command "plot(qnorm)" work in R?

In R, how does group_by in dplyr work?

How does cut with breaks work in R

R, How does while (TRUE) work?

In R, how does double percent work when used on a matrix?

How to create a js checkbox in flexdashboard? shinyTree does not work (R, shiny)

How does the sgolay function work in Matlab R2013a?

How does the function argument work in R's 'combn'?

How does R's system.time work?

How does the output of PHP functions work using print_r?

How does the function argument work in R's 'combn'?

how does this fork() work

How does a WebHook work?

How does CocoaPods work

How does GetLastActivePopup work?

how does the groovyc work?

How does cin work?

How does `[]=` work in Ruby?

How does decltype work?

How does this ❤ code work?

how does isNan( ) work?