Equivalent of R's removeSparseTerms in Python


We are working on a data mining project and have used the removeSparseTerms function in the tm package in R for reducing the features of our document term matrix.

However, we are looking to port the code to Python. Is there a function in sklearn, nltk or some other package that provides the same functionality?



If your data is plain text, you can use CountVectorizer in order to get this job done.

For example:

from sklearn.feature_extraction.text import CountVectorizer

# min_df=2 keeps only terms that occur in at least two documents
vectorizer = CountVectorizer(min_df=2)
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
vectorizer = vectorizer.fit(corpus)
print(vectorizer.vocabulary_)
# prints a term -> column index mapping (key order may vary), e.g.
# {'this': 4, 'is': 2, 'the': 3, 'document': 0, 'first': 1}
X = vectorizer.transform(corpus)

Now X is the document-term matrix. (If you are into information retrieval, you may also want to consider Tf-idf term weighting.)

CountVectorizer gets you the document-term matrix in just a few lines.
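If you do want Tf-idf weights, here is a minimal sketch using sklearn's TfidfVectorizer, which accepts the same min_df argument. The variable names (tfidf, X_tfidf, row_norms) are just for illustration, and the default settings (L2-normalized rows) are assumed.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# Same pruning rule as above: drop terms that occur in fewer than 2 documents
tfidf = TfidfVectorizer(min_df=2)
X_tfidf = tfidf.fit_transform(corpus)  # sparse matrix of Tf-idf weights

print(X_tfidf.shape)  # (4, 5): four documents, five surviving terms

# With the default norm='l2', every non-empty row has unit Euclidean length
row_norms = np.sqrt(np.asarray(X_tfidf.multiply(X_tfidf).sum(axis=1)).ravel())
print(row_norms)
```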

Regarding the sparsity - you can control these parameters:

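  • min_df - the minimum document frequency allowed for a term in the document-term matrix. An integer is an absolute number of documents; a float is a proportion of the documents.
  • max_features - the maximum number of features allowed in the document-term matrix. The terms with the highest total counts are kept.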
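A small sketch of both parameters on the same toy corpus. The variable names are made up for the example; sorted(vocabulary_) is used only to get a stable listing of the kept terms.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# min_df as a float is a proportion of documents: 0.5 of 4 documents
# means a term must occur in at least 2 of them.
by_proportion = CountVectorizer(min_df=0.5).fit(corpus)
terms_by_proportion = sorted(by_proportion.vocabulary_)
print(terms_by_proportion)
# ['document', 'first', 'is', 'the', 'this']

# max_features keeps only the N terms with the highest total count
# across the corpus, regardless of how many documents contain them.
by_count = CountVectorizer(max_features=4).fit(corpus)
terms_by_count = sorted(by_count.vocabulary_)
print(terms_by_count)
# ['document', 'is', 'the', 'this']
```

Note that max_features ranks by total count, not by document frequency, so it is not a direct substitute for a sparsity threshold.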

Alternatively, if you already have the document-term matrix or Tf-idf matrix and have a notion of what counts as sparse, define MIN_VAL_ALLOWED and then do the following:

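import numpy as np
from scipy.sparse import csr_matrix

MIN_VAL_ALLOWED = 2  # example threshold; choose your own

# The third column's values are illustrative: a term with a low total count
X = csr_matrix([[7, 8, 0],
                [2, 1, 1],
                [5, 5, 0]])

# z is a boolean mask of the non-sparse terms (column sum above the threshold)
z = np.squeeze(np.asarray(X.sum(axis=0) > MIN_VAL_ALLOWED))

print(X[:, z].toarray())
# prints X without the third term (as it is sparse)
# [[7 8]
#  [2 1]
#  [5 5]]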

(Use X = X[:, z] so that X remains a csr_matrix.)

If it is the minimum document frequency you wish to set a threshold on, binarize the matrix first, and then use it the same way:

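import numpy as np
from scipy.sparse import csr_matrix

MIN_DF_ALLOWED = 2  # example threshold; choose your own

X = csr_matrix([[7, 1.3, 0.9, 0],
                [2, 1.2, 0.8, 1],
                [5, 1.5, 0, 0]])

# Create a binarized copy of the data: every non-zero entry becomes 1
B = csr_matrix(X, copy=True)
B[B > 0] = 1
# Column sums of B (not X) are the document frequencies
z = np.squeeze(np.asarray(B.sum(axis=0) > MIN_DF_ALLOWED))
print(X[:, z].toarray())
# [[7.  1.3]
#  [2.  1.2]
#  [5.  1.5]]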

In this example, the third and fourth terms (columns) are gone, since they appear in no more than two documents (rows): the third term appears in two and the fourth in only one. Use MIN_DF_ALLOWED to set the threshold.
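If you want something closer to the R call itself, the same idea can be wrapped in a helper that takes a sparsity level instead of a document count. This is only a sketch: remove_sparse_terms is a made-up name, not an sklearn or scipy function, and it assumes (from my reading of the tm documentation) that removeSparseTerms(x, sparse) keeps a term only when the fraction of documents missing it is below sparse.

```python
import numpy as np
from scipy.sparse import csr_matrix


def remove_sparse_terms(X, sparse):
    """Hypothetical helper: keep columns whose sparsity is below `sparse`.

    X is a documents-by-terms sparse matrix; `sparse` is in (0, 1).
    Returns the reduced matrix and the boolean mask of kept columns.
    """
    n_docs = X.shape[0]
    # Document frequency: number of rows with a non-zero entry per column
    doc_freq = np.asarray((X != 0).sum(axis=0)).ravel()
    # Assumed tm rule: keep a term if it occurs in more than
    # (1 - sparse) of the documents
    keep = doc_freq > n_docs * (1.0 - sparse)
    return X[:, keep], keep


X = csr_matrix([[7, 1.3, 0.9, 0],
                [2, 1.2, 0.8, 1],
                [5, 1.5, 0, 0]])

reduced, keep = remove_sparse_terms(X, sparse=0.5)
print(keep)           # [ True  True  True False]
print(reduced.shape)  # (3, 3)
```

With sparse=0.5 and three documents, a term has to appear in more than 1.5 documents, so only the fourth column (present in one document) is dropped. A lower value such as sparse=0.3 raises the bar to more than 2.1 documents and removes the third column as well.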

