Using CountVectorizer to count occurrences of words from my own vocabulary in Python


ナイトレイン

Doc1: ['And that was the fallacy. Once I was free to talk with staff members']

Doc2: ['In the new, stripped-down, every-job-counts business climate, these human']

Doc3 : ['Another reality makes emotional intelligence ever more crucial']

Doc4: ['The globalization of the workforce puts a particular premium on emotional']

Doc5: ['As business changes, so do the traits needed to excel. Data tracking']

Here is a sample of my vocabulary:

my_vocabulary = ['was the fallacy', 'free to', 'stripped-down', 'ever more', 'of the workforce', 'the traits needed']

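The important point is that every entry in my vocabulary is a bigram or a trigram. My vocabulary contains every bigram and trigram that can occur in my document set; I have only shown a sample here. Given my application, this is what my vocabulary has to look like. I am trying to use CountVectorizer as follows: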

from sklearn.feature_extraction.text import CountVectorizer
doc_set = [Doc1, Doc2, Doc3, Doc4, Doc5]
vectorizer = CountVectorizer( vocabulary=my_vocabulary)
tf = vectorizer.fit_transform(doc_set)

I expect to get something like this:

print tf:
(0, 126)    1
(0, 6804)   1
(0, 5619)   1
(0, 5019)   2
(0, 5012)   1
(0, 999)    1
(0, 996)    1
(0, 4756)   4

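Here the first column is the document ID, the second column is the ID of the vocabulary entry, and the third column is the number of times that entry occurs in the document. But tf is empty. At the end of the day I could write code that goes through every entry in the vocabulary, counts its occurrences, and builds the matrix, but can I use CountVectorizer on this input to save time? Am I doing something wrong here? If CountVectorizer is not the right way to do this, any recommendations would be appreciated.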

KRKirov

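You can build a vocabulary of all possible bigrams and trigrams by specifying the ngram_range parameter of CountVectorizer. After fit_transform, you can view the vocabulary and the frequencies with the get_feature_names() method (get_feature_names_out() in scikit-learn 1.0 and later) and the toarray() method. The latter returns the frequency matrix for each document. More information: http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction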

from sklearn.feature_extraction.text import CountVectorizer

Doc1 = 'And that was the fallacy. Once I was free to talk with staff members'
Doc2 = 'In the new, stripped-down, every-job-counts business climate, these human'
Doc3 = 'Another reality makes emotional intelligence ever more crucial'
Doc4 = 'The globalization of the workforce puts a particular premium on emotional'
Doc5 = 'As business changes, so do the traits needed to excel. Data tracking'
doc_set = [Doc1, Doc2, Doc3, Doc4, Doc5]

vectorizer = CountVectorizer(ngram_range=(2, 3))
tf = vectorizer.fit_transform(doc_set)
vectorizer.vocabulary_
vectorizer.get_feature_names_out()  # named get_feature_names() before scikit-learn 1.0
tf.toarray()
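
The question asks for (document ID, term ID, count) entries rather than a dense array. A minimal sketch of one way to list them, assuming the COO view of the sparse matrix returned by fit_transform is acceptable; the names docs, index_to_term and triples are illustrative and do not come from the original post, and only two of the documents are used to keep the output short:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    'And that was the fallacy. Once I was free to talk with staff members',
    'Another reality makes emotional intelligence ever more crucial',
]

vectorizer = CountVectorizer(ngram_range=(2, 3))
tf = vectorizer.fit_transform(docs)

# Map column indices back to the n-grams they stand for.
index_to_term = {index: term for term, index in vectorizer.vocabulary_.items()}

# The COO format exposes parallel arrays of row, column and stored value.
coo = tf.tocoo()
triples = [
    (int(doc_id), index_to_term[int(term_id)], int(count))
    for doc_id, term_id, count in zip(coo.row, coo.col, coo.data)
]

for doc_id, term, count in triples:
    print(doc_id, term, count)
```

Printing tf directly shows the same information with column indices instead of the n-gram text.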

As for what you were trying to do, it will work if you train CountVectorizer on your vocabulary and then transform the documents.

my_vocabulary= ['was the fallacy', 'more crucial', 'particular premium', 'to excel', 'data tracking', 'another reality']

vectorizer = CountVectorizer(ngram_range=(2, 3))
vectorizer.fit_transform(my_vocabulary)
tf = vectorizer.transform(doc_set)

vectorizer.vocabulary_
Out[26]: 
{'another reality': 0,
 'data tracking': 1,
 'more crucial': 2,
 'particular premium': 3,
 'the fallacy': 4,
 'to excel': 5,
 'was the': 6,
 'was the fallacy': 7}

tf.toarray()
Out[25]: 
array([[0, 0, 0, 0, 1, 0, 1, 1],
       [0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 1, 0, 0]], dtype=int64)
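
The call from the question can probably also be made to work directly, without fitting on the vocabulary first. A likely reason tf came out empty is that CountVectorizer defaults to ngram_range=(1, 1), so only single words are extracted and no bigram or trigram in the vocabulary can match. The sketch below rests on two assumptions that are not stated in the original post: the documents are plain strings rather than one-element lists, and the vocabulary entries are written the way the default tokenizer produces them (lowercase, punctuation dropped, hence 'stripped down' instead of 'stripped-down').

```python
from sklearn.feature_extraction.text import CountVectorizer

doc_set = [
    'And that was the fallacy. Once I was free to talk with staff members',
    'In the new, stripped-down, every-job-counts business climate, these human',
    'Another reality makes emotional intelligence ever more crucial',
    'The globalization of the workforce puts a particular premium on emotional',
    'As business changes, so do the traits needed to excel. Data tracking',
]

# Assumption: entries are lowercase with punctuation removed, so that they
# match the n-grams built by the default tokenizer.
my_vocabulary = ['was the fallacy', 'free to', 'stripped down',
                 'ever more', 'of the workforce', 'the traits needed']

# ngram_range has to cover the n-gram sizes present in the vocabulary.
vectorizer = CountVectorizer(vocabulary=my_vocabulary, ngram_range=(2, 3))
tf = vectorizer.fit_transform(doc_set)

# Rows are documents; columns follow the order of my_vocabulary.
counts = tf.toarray()
print(counts)
```

Unlike fitting on the vocabulary strings, this keeps exactly the six listed entries as columns, so sub-n-grams such as 'was the' are not counted.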


Edited 2021-06-1
