Given the following list of documents:
docs = [
'feature one`feature two`feature three',
'feature one`feature two`feature four',
'feature one'
]
I want to use either of the vectorizer classes in scikit-learn (CountVectorizer or TfidfVectorizer) so that 'feature one', 'feature two', 'feature three', and 'feature four' are the four features represented in the matrix.
I tried this:
vec = CountVectorizer(token_pattern='(?u)\w+\s.\w.`')
But that returns only this:
['feature one`', 'feature two`']
In [295]: vec = CountVectorizer(token_pattern=r'(?u)\w+[\s\`]\w+')
In [296]: X = vec.fit_transform(docs)
In [297]: vec.get_feature_names()
Out[297]: ['feature four', 'feature one', 'feature three', 'feature two']
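To see why the first attempt fell short, the two patterns can be compared directly with the standard `re` module. This is a minimal sketch that assumes CountVectorizer applies `token_pattern` with `findall` semantics; the `vocabulary` helper is hypothetical and only mimics vocabulary building for this check.

```python
import re

docs = [
    'feature one`feature two`feature three',
    'feature one`feature two`feature four',
    'feature one',
]

# Pattern from the question: it needs exactly three characters after the
# whitespace AND a trailing backtick, so only 'one`' and 'two`' can match.
attempt = re.compile(r'(?u)\w+\s.\w.`')

# Pattern from the answer: two words joined by whitespace or a backtick.
working = re.compile(r'(?u)\w+[\s`]\w+')

def vocabulary(pattern):
    # Hypothetical helper: collect the distinct tokens found in all docs.
    return sorted({tok for doc in docs for tok in pattern.findall(doc)})

print(vocabulary(attempt))  # ['feature one`', 'feature two`']
print(vocabulary(working))  # ['feature four', 'feature one', 'feature three', 'feature two']
```

Note that the working pattern relies on every feature being exactly two words; a one-word or three-word feature would not be tokenized as a single unit.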
You may also want to consider using ngram_range=(2, 2), although it also produces bigrams that span the separator ('one feature', 'two feature'):
In [308]: vec = CountVectorizer(ngram_range=(2,2))
In [309]: X = vec.fit_transform(docs)
In [310]: vec.get_feature_names()
Out[310]:
['feature four',
'feature one',
'feature three',
'feature two',
'one feature',
'two feature']
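If the backtick is the only real delimiter, another option (not from the answer above, just a sketch) is to pass a custom `tokenizer` that splits on it, which avoids both the two-word assumption and the cross-boundary bigrams. The helper name `split_on_backtick` is hypothetical. The feature list is read from `vocabulary_` rather than `get_feature_names()`, since newer scikit-learn releases replaced that method with `get_feature_names_out()`.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    'feature one`feature two`feature three',
    'feature one`feature two`feature four',
    'feature one',
]

def split_on_backtick(doc):
    # Hypothetical helper: treat each backtick-separated chunk as one token.
    return [part.strip() for part in doc.split('`') if part.strip()]

# token_pattern=None signals that the regex is intentionally unused
# because a custom tokenizer is supplied.
vec = CountVectorizer(tokenizer=split_on_backtick, token_pattern=None)
X = vec.fit_transform(docs)

# Feature names ordered by column index, read from the fitted vocabulary.
features = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(features)     # ['feature four', 'feature one', 'feature three', 'feature two']
print(X.toarray())  # one row per document, one column per feature
```

The same `tokenizer` argument should work with TfidfVectorizer, which accepts the same tokenization parameters.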