Custom tokenizer for scikit-learn vectorizers

blacksite

Given the following list of documents:

docs = [
'feature one`feature two`feature three',
'feature one`feature two`feature four',
'feature one'
]

I want to use either of the vectorizer classes in scikit-learn (CountVectorizer or TfidfVectorizer) so that 'feature one', 'feature two', 'feature three', and 'feature four' are the four features represented in the matrix.

I tried this:

vec = CountVectorizer(token_pattern='(?u)\w+\s.\w.`')

But that returns only this:

['feature one`', 'feature two`']
MaxU
In [295]: vec = CountVectorizer(token_pattern='(?u)\w+[\s\`]\w+')

In [296]: X = vec.fit_transform(docs)

In [297]: vec.get_feature_names()
Out[297]: ['feature four', 'feature one', 'feature three', 'feature two']
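
To see why this pattern works where the one in the question did not, it can help to run both through Python's re module directly, since CountVectorizer extracts tokens by applying token_pattern as a regular expression to each document. The following is a minimal sketch using only the standard library; the variable names are illustrative and not part of the original answer.

```python
import re

docs = [
    'feature one`feature two`feature three',
    'feature one`feature two`feature four',
    'feature one',
]

# Pattern from the question: it requires a trailing backtick, so the last
# feature of each document (which has no backtick after it) never matches.
question_pattern = re.compile(r'(?u)\w+\s.\w.`')

# Pattern from the answer: two runs of word characters joined by a single
# whitespace character or backtick.
answer_pattern = re.compile(r'(?u)\w+[\s`]\w+')

question_tokens = [question_pattern.findall(d) for d in docs]
answer_tokens = [answer_pattern.findall(d) for d in docs]

print(question_tokens)
print(answer_tokens)
```

Note that the answer's pattern assumes every feature is exactly two words long; a one-word or three-word feature would likely be split or merged incorrectly.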

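You may also want to consider using ngram_range=(2,2) with the default token pattern. Note that this also produces bigrams that span the backtick boundary ('one feature' and 'two feature'):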

In [308]: vec = CountVectorizer(ngram_range=(2,2))

In [309]: X = vec.fit_transform(docs)

In [310]: vec.get_feature_names()
Out[310]:
['feature four',
 'feature one',
 'feature three',
 'feature two',
 'one feature',
 'two feature']
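
Since the features are delimited by backticks, another option is to pass a custom tokenizer that splits on the delimiter instead of describing the features with a regular expression. The sketch below is not from the original answer: split_on_backtick is a hypothetical helper, and token_pattern=None is passed on the assumption that a reasonably recent scikit-learn is installed, where this avoids the "token_pattern will not be used" warning. The feature list is read from vocabulary_ because get_feature_names() was replaced by get_feature_names_out() in newer releases.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    'feature one`feature two`feature three',
    'feature one`feature two`feature four',
    'feature one',
]


def split_on_backtick(text):
    # Hypothetical helper: split on the backtick delimiter and drop
    # empty or whitespace-only pieces.
    return [tok.strip() for tok in text.split('`') if tok.strip()]


# When a tokenizer is supplied, token_pattern is not used for tokenizing.
vec = CountVectorizer(tokenizer=split_on_backtick, token_pattern=None)
X = vec.fit_transform(docs)

# Column order of X follows the sorted vocabulary terms.
features = sorted(vec.vocabulary_)
print(features)
print(X.toarray())
```

Unlike the regular-expression approach, this should handle features with any number of words, as long as the backtick never appears inside a feature. The same tokenizer argument can be passed to TfidfVectorizer.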
