Equivalent of R's removeSparseTerms in Python

AnirudhJ

We are working on a data mining project and have used the removeSparseTerms function in the tm package in R for reducing the features of our document term matrix.

However, we are looking to port the code to Python. Is there a function in sklearn, nltk, or some other package that provides the same functionality?

Thanks!

omerbp

If your data is plain text, you can use CountVectorizer in order to get this job done.

For example:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=2)
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
vectorizer = vectorizer.fit(corpus)
print(vectorizer.vocabulary_)
# prints a mapping such as {'this': 4, 'is': 2, 'the': 3, 'document': 0, 'first': 1}
# (key order may differ)
X = vectorizer.transform(corpus)

Now X is the document-term matrix. (If you are into information retrieval, you may also want to consider Tf-idf term weighting.)

CountVectorizer gives you the document-term matrix in just a few lines.

Regarding the sparsity - you can control these parameters:

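  • min_df - the minimum document frequency allowed for a term in the document-term matrix (an integer is an absolute document count, a float is a proportion of the documents).
  • max_features - the maximum number of features allowed in the document-term matrix (the most frequent terms across the corpus are kept).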
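As a rough illustration of these two parameters, here is a small sketch that reuses the corpus from above. The variable names (by_df, by_count) are only for this example, and it assumes scikit-learn is installed.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# A float min_df is a proportion: keep terms found in at least 50% of the documents.
by_df = CountVectorizer(min_df=0.5).fit(corpus)
kept_by_df = sorted(by_df.vocabulary_)
print(kept_by_df)

# max_features keeps only the most frequent terms across the whole corpus.
by_count = CountVectorizer(max_features=1).fit(corpus)
kept_by_count = sorted(by_count.vocabulary_)
print(kept_by_count)
```

With four documents, min_df=0.5 behaves like min_df=2 in the earlier example, so the same five terms survive.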

Alternatively, if you already have the document-term matrix or Tf-idf matrix and have a notion of what counts as sparse, define MIN_VAL_ALLOWED (a threshold on each term's column sum) and then do the following:

import numpy as np
from scipy.sparse import csr_matrix
MIN_VAL_ALLOWED = 2

X = csr_matrix([[7,8,0],
                [2,1,1],
                [5,5,0]])

# z is a boolean mask of the non-sparse terms (column sum above the threshold)
z = np.squeeze(np.asarray(X.sum(axis=0) > MIN_VAL_ALLOWED))

print(X[:, z].toarray())
# prints X without the third term (as it is sparse):
# [[7 8]
#  [2 1]
#  [5 5]]

(use X = X[:,z] so X remains a csr_matrix.)

If it is the minimum document frequency you wish to set a threshold on, binarize the matrix first, and then use it the same way:

import numpy as np
from scipy.sparse import csr_matrix

MIN_DF_ALLOWED = 2

X = csr_matrix([[7, 1.3, 0.9, 0],
                [2, 1.2, 0.8  , 1],
                [5, 1.5, 0  , 0]])

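# Create a binarized copy of the data (1 wherever a term occurs in a document)
B = csr_matrix(X, copy=True)
B[B > 0] = 1
# Column sums of B are the document frequencies
z = np.squeeze(np.asarray(B.sum(axis=0) > MIN_DF_ALLOWED))
print(X[:, z].toarray())
# prints
# [[7.  1.3]
#  [2.  1.2]
#  [5.  1.5]]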

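In this example, the third and fourth terms (columns) are gone, since they appear in only two documents and one document (rows), respectively. The comparison is strict, so a term must appear in more than MIN_DF_ALLOWED documents to be kept. Use MIN_DF_ALLOWED to set the threshold.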
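To get closer to the R call itself, the document-frequency idea can be wrapped in a small helper. This is only a sketch: remove_sparse_terms is a hypothetical name, not an existing sklearn or scipy function, and it assumes that tm's removeSparseTerms keeps a term when the number of documents containing it is strictly greater than n_docs * (1 - sparse), which is my reading of the tm source.

```python
import numpy as np
from scipy.sparse import csr_matrix

def remove_sparse_terms(X, sparse):
    """Hypothetical helper: drop columns (terms) that occur in too few rows (documents).

    Assumed rule: keep a term if its document frequency > n_docs * (1 - sparse).
    """
    if not 0 < sparse < 1:
        raise ValueError("sparse must be strictly between 0 and 1")
    X = csr_matrix(X)
    n_docs = X.shape[0]
    # Document frequency: number of rows with a non-zero entry in each column.
    doc_freq = np.asarray((X != 0).sum(axis=0)).ravel()
    keep = doc_freq > n_docs * (1 - sparse)
    return X[:, keep], keep

X = csr_matrix([[7, 1.3, 0.9, 0],
                [2, 1.2, 0.8, 1],
                [5, 1.5, 0, 0]])

X_loose, keep_loose = remove_sparse_terms(X, 0.5)    # threshold: more than 1.5 documents
X_strict, keep_strict = remove_sparse_terms(X, 0.2)  # threshold: more than 2.4 documents
print(keep_loose, X_loose.shape)
print(keep_strict, X_strict.shape)
```

The returned mask can also be applied to the list of term names, so the vocabulary stays aligned with the reduced matrix.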
