Equivalent of R's removeSparseTerms in Python

AnirudhJ

We are working on a data mining project and have used the removeSparseTerms function in the tm package in R for reducing the features of our document term matrix.

However, we are looking to port the code to Python. Is there a function in sklearn, nltk or some other package that provides the same functionality?

Thanks!

omerbp

If your data is plain text, you can use CountVectorizer in order to get this job done.

For example:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=2)
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
vectorizer = vectorizer.fit(corpus)
print(vectorizer.vocabulary_)
#prints a dict mapping the five kept terms to their column indices:
#document -> 0, first -> 1, is -> 2, the -> 3, this -> 4
X = vectorizer.transform(corpus)

Now X is the document-term matrix. (If you are into information retrieval, you may also want to consider Tf-idf term weighting.)

CountVectorizer gets you the document-term matrix in just a few lines.
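If you do want Tf-idf weights, a minimal sketch along the same lines might look like this. It assumes sklearn's TfidfVectorizer, which accepts the same min_df argument as CountVectorizer; the variable names tfidf and X_tfidf are just for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# min_df=2 drops terms that occur in fewer than two documents,
# just as in the CountVectorizer example.
tfidf = TfidfVectorizer(min_df=2)
X_tfidf = tfidf.fit_transform(corpus)

print(sorted(tfidf.vocabulary_))  # the terms that survived the filter
print(X_tfidf.shape)              # (number of documents, number of kept terms)
```

X_tfidf is a sparse matrix, so the column-filtering approach for an existing matrix applies to it unchanged.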

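Regarding sparsity, you can control these parameters:

min_df - the minimum document frequency allowed for a term in the document-term matrix. An integer is an absolute number of documents; a float is a proportion of the documents.
max_features - the maximum number of features allowed in the document-term matrix. The terms with the highest total frequency across the corpus are kept.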
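To see how the two parameters behave, here is a small sketch on the same four-sentence corpus. The variable names by_count, by_share and top_one are made up for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# Integer min_df: keep terms found in at least 3 documents.
by_count = CountVectorizer(min_df=3).fit(corpus)

# Float min_df: keep terms found in at least 75% of the documents
# (3 out of 4 here, so the result matches by_count).
by_share = CountVectorizer(min_df=0.75).fit(corpus)

# max_features: keep only the single term with the highest total count.
top_one = CountVectorizer(max_features=1).fit(corpus)

print(sorted(by_count.vocabulary_))
print(sorted(by_share.vocabulary_))
print(sorted(top_one.vocabulary_))
```

The float form of min_df is probably the closest in spirit to removeSparseTerms, since both express the cut-off relative to the number of documents.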

Alternatively, if you already have the document-term matrix or Tf-idf matrix and you have a notion of what counts as sparse, define MIN_VAL_ALLOWED and then do the following:

import numpy as np
from scipy.sparse import csr_matrix
MIN_VAL_ALLOWED = 2

X = csr_matrix([[7,8,0],
                [2,1,1],
                [5,5,0]])

#z is a boolean mask of the non-sparse terms (column sum above the threshold)
z = np.squeeze(np.asarray(X.sum(axis=0) > MIN_VAL_ALLOWED))

print(X[:,z].toarray())
#prints X without the third term (as it is sparse)
#[[7 8]
# [2 1]
# [5 5]]

(use X = X[:,z] so X remains a csr_matrix.)

If it is the minimum document frequency you wish to set a threshold on, binarize the matrix first, and then use it the same way:

import numpy as np
from scipy.sparse import csr_matrix

MIN_DF_ALLOWED = 2

X = csr_matrix([[7, 1.3, 0.9, 0],
                [2, 1.2, 0.8  , 1],
                [5, 1.5, 0  , 0]])

#Creating a binarized copy of the data (every non-zero entry becomes 1)
B = csr_matrix(X, copy=True)
B[B>0] = 1
#Column sums of B are the document frequencies
z = np.squeeze(np.asarray(B.sum(axis=0) > MIN_DF_ALLOWED))
print(X[:,z].toarray())
#prints
#[[7.  1.3]
# [2.  1.2]
# [5.  1.5]]

In this example, the third and fourth terms (columns) are gone, since they appear in only two documents and one document (rows), respectively, and neither count exceeds MIN_DF_ALLOWED. Use MIN_DF_ALLOWED to set the threshold.
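For something closer to the R call itself, which takes a maximal sparsity between 0 and 1 rather than a document count, a sketch like the one below could work. The function name remove_sparse_terms is hypothetical, and the strict inequality is an assumption based on my reading of how tm's removeSparseTerms treats the boundary, so check it against your R results before relying on it.

```python
import numpy as np
from scipy.sparse import csr_matrix


def remove_sparse_terms(X, sparse):
    """Keep the columns (terms) of X that occur in more than
    (1 - sparse) * n_documents rows (documents).

    Hypothetical helper; the strict '>' is an assumption about
    tm's removeSparseTerms, not something taken from its docs.
    """
    if not 0 < sparse < 1:
        raise ValueError("sparse must lie strictly between 0 and 1")
    X = csr_matrix(X)
    n_docs = X.shape[0]
    # Number of documents in which each term is non-zero.
    doc_freq = np.asarray((X != 0).sum(axis=0)).ravel()
    keep = doc_freq > n_docs * (1 - sparse)
    return X[:, keep], keep


X = csr_matrix([[7, 1.3, 0.9, 0],
                [2, 1.2, 0.8, 1],
                [5, 1.5, 0, 0]])

# Drop terms that are missing from more than 20% of the documents.
X_reduced, kept = remove_sparse_terms(X, sparse=0.2)
print(kept)                 # boolean mask over the original columns
print(X_reduced.toarray())  # only the columns where the mask is True
```

Returning the mask alongside the matrix makes it easy to filter a list of feature names with the same selection.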

Edited at 2021-02-20
