We are working on a data mining project and have used the removeSparseTerms function in the tm package in R for reducing the features of our document term matrix.
However, we are looking to port the code to Python. Is there a function in sklearn, nltk or some other package that provides the same functionality?
Thanks!
If your data is plain text, you can use CountVectorizer in order to get this job done.
For example:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=2)
corpus = [
'This is the first document.',
'This is the second second document.',
'And the third one.',
'Is this the first document?',
]
vectorizer = vectorizer.fit(corpus)
print(vectorizer.vocabulary_)
#prints a dict mapping each kept term to its column index:
#document -> 0, first -> 1, is -> 2, the -> 3, this -> 4
X = vectorizer.transform(corpus)
Now X is the document-term matrix. (If you are into information retrieval, you may also want to consider Tf–idf term weighting.) CountVectorizer gets you the document-term matrix in a few lines.
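For the Tf–idf variant mentioned above, here is a minimal sketch using TfidfVectorizer from the same sklearn module, on the same toy corpus. It accepts the same min_df argument; the variable names are only for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# Same pruning rule as before: drop terms found in fewer than two documents.
tfidf = TfidfVectorizer(min_df=2)
X_tfidf = tfidf.fit_transform(corpus)  # sparse matrix of Tf-idf weights

kept_terms = sorted(tfidf.vocabulary_)
print(kept_terms)
print(X_tfidf.shape)  # (number of documents, number of kept terms)
```

The result is a sparse matrix with one row per document, so the column-filtering approach shown further down applies to it as well.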
Regarding the sparsity, you can control it through CountVectorizer parameters such as min_df (used above), max_df and max_features.
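A small sketch of controlling sparsity through CountVectorizer's min_df and max_df arguments, assuming the same toy corpus as above. When given as floats, both are interpreted as a proportion of documents rather than an absolute count, which is closer in spirit to a sparsity threshold.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# min_df as a float is a proportion of documents: keep terms found in
# at least 50% of the documents (here, at least 2 of 4).
by_fraction = CountVectorizer(min_df=0.5).fit(corpus)
kept_by_fraction = sorted(by_fraction.vocabulary_)

# max_df drops terms that are too common: here, terms found in more
# than 75% of the documents ('the' appears in all four).
no_common = CountVectorizer(min_df=0.5, max_df=0.75).fit(corpus)
kept_no_common = sorted(no_common.vocabulary_)

print(kept_by_fraction)  # ['document', 'first', 'is', 'the', 'this']
print(kept_no_common)    # ['document', 'first', 'is', 'this']
```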
Alternatively, if you already have the document-term matrix or Tf-idf matrix, and you have a notion of what counts as sparse, define MIN_VAL_ALLOWED and then do the following:
import numpy as np
from scipy.sparse import csr_matrix
MIN_VAL_ALLOWED = 2
X = csr_matrix([[7,8,0],
[2,1,1],
[5,5,0]])
z = np.squeeze(np.asarray(X.sum(axis=0) > MIN_VAL_ALLOWED)) #z is the non-sparse terms
print(X[:,z].toarray())
#prints X without the third term (as it is sparse):
#[[7 8]
# [2 1]
# [5 5]]
(Use X = X[:,z] so that X remains a csr_matrix.)
If it is the minimum document frequency you wish to set a threshold on, binarize the matrix first, and then use it the same way:
import numpy as np
from scipy.sparse import csr_matrix
MIN_DF_ALLOWED = 2
X = csr_matrix([[7, 1.3, 0.9, 0],
[2, 1.2, 0.8 , 1],
[5, 1.5, 0 , 0]])
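#Binarize a copy of the data so each entry only marks whether the term occurs
B = csr_matrix(X, copy=True)
B[B>0] = 1
#Column sums of B are the document frequencies
z = np.squeeze(np.asarray(B.sum(axis=0) > MIN_DF_ALLOWED))
print(X[:,z].toarray())
#prints (exact number formatting depends on the NumPy version):
#[[7.  1.3]
# [2.  1.2]
# [5.  1.5]]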
In this example, the third and fourth terms (columns) are gone, since they appear in no more than two documents (rows): the third term occurs in two documents and the fourth in only one. Use MIN_DF_ALLOWED to set the threshold.
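To get closer to the removeSparseTerms call from the question, the document-frequency filter can be wrapped in a small helper. This is only a sketch: remove_sparse_terms is a hypothetical name, not a library function, and it assumes that a term should be kept when it occurs in more than a fraction (1 - sparse) of the documents, which is my understanding of tm's behaviour and worth checking against the R output on your own data.

```python
import numpy as np
from scipy.sparse import csr_matrix


def remove_sparse_terms(X, sparse):
    """Hypothetical helper: drop columns (terms) that occur in too few rows.

    X is a document-term matrix (rows are documents, columns are terms).
    sparse is the maximal allowed sparsity, a float between 0 and 1.
    Returns the reduced matrix and the boolean mask of kept columns.
    """
    n_docs = X.shape[0]
    # Document frequency: number of rows in which each term is non-zero.
    doc_freq = np.asarray((X > 0).sum(axis=0)).ravel()
    # Assumed rule: keep terms present in more than (1 - sparse) of the docs.
    keep = doc_freq > n_docs * (1.0 - sparse)
    return X[:, keep], keep


X = csr_matrix([[7, 1.3, 0.9, 0],
                [2, 1.2, 0.8, 1],
                [5, 1.5, 0, 0]])

# With sparse=0.5 a term must appear in more than 1.5 of the 3 documents.
reduced, keep = remove_sparse_terms(X, sparse=0.5)
print(keep)           # only the fourth term (one document) is dropped
print(reduced.shape)  # (3, 3)
```

The returned mask can also be used to filter the list of term names so that the columns and the vocabulary stay aligned.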