python gensim: indices array has non-integer dtype (float64)


qmaruf

I am following the gensim tutorial to find similarities between texts. Here is the code:

from gensim import corpora, models, similarities
from gensim.models import hdpmodel, ldamodel
from itertools import izip

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

'''
documents = ["Human machine interface for lab abc computer applications",
              "bags loose tea water second ingredient tastes water",
              "The EPS user interface management system",
              "System and human system engineering testing of EPS",
              "Relation of user perceived response time to error measurement",
              "The generation of random binary unordered trees",
              "The intersection graph of paths in trees",
              "Graph minors IV Widths of trees and well quasi ordering",
              "Graph minors A survey",
              "red cow butter oil"]
'''
documents = ["Human machine interface for lab abc computer applications",
              "bags loose tea water second ingredient tastes water"]

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# remove words that appear only once
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once]
         for text in texts]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

#print corpus

tfidf = models.TfidfModel(corpus)

#print tfidf

corpus_tfidf = tfidf[corpus]

#print corpus_tfidf

lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)
lsi.print_topics(1)

lda = models.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=2)
lda.print_topics(1)

corpora.MmCorpus.serialize('dict.mm', corpus)
corpus = corpora.MmCorpus('dict.mm')
#print corpus

lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
doc = "human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]
#print vec_lsi

index = similarities.MatrixSimilarity(lsi[corpus])
index.save('dict.index')
index = similarities.MatrixSimilarity.load('dict.index')

sims = index[vec_lsi]
#print list(enumerate(sims))

sims = sorted(enumerate(sims),key=lambda item: -item[1])
for sim in sims:
  print documents[sim[0]], " ==> ", sim[1]

There are two document lists here. One has 10 texts and the other has 2; the longer one is commented out. With the first list everything works and the output is meaningful. With the second list (2 texts) I get the following message:

/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py:122: UserWarning: indices array has non-integer dtype (float64)
% self.indices.dtype.name )

What causes this error and how can I fix it? I am using a 64-bit machine.

Steve Barnes

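This is probably because your second list becomes [[], ['water', 'water']] once the singletons are removed, so the first document ends up empty. Attempting matrix operations on matrices with dimensions 0 and 1 can cause all sorts of problems.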
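
The reduction described above can be checked without gensim. The following is a minimal sketch that repeats the question's preprocessing on the two-document list; it swaps in `collections.Counter` for the `all_tokens.count(word)` loop, which is an assumption made here for brevity and not part of the original code.

```python
from collections import Counter

documents = [
    "Human machine interface for lab abc computer applications",
    "bags loose tea water second ingredient tastes water",
]
stoplist = set("for a of the and to in".split())

# Tokenize and drop stop words, as in the question.
texts = [[w for w in doc.lower().split() if w not in stoplist]
         for doc in documents]

# Count every token across all documents, then drop tokens seen only once.
freq = Counter(w for text in texts for w in text)
texts = [[w for w in text if freq[w] > 1] for text in texts]

print(texts)  # [[], ['water', 'water']]
```

Only "water" occurs more than once, so the first document loses all of its tokens and its bag-of-words vector is empty.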

Playing with your code:

>>> corpus = [dictionary.doc2bow(text) for text in texts]
>>> corpus
[[], [(0, 2)]]
>>> tfidf = models.TfidfModel(corpus)
2013-07-21 09:23:31,415 : INFO : collecting document frequencies
2013-07-21 09:23:31,415 : INFO : PROGRESS: processing document #0
2013-07-21 09:23:31,415 : INFO : calculating IDF weights for 2 documents and 1 features (1 matrix non-zeros)
>>> corpus = [[(1,)], [(0,2)]]
>>> tfidf = models.TfidfModel(corpus)
2013-07-21 09:24:16,452 : INFO : collecting document frequencies
2013-07-21 09:24:16,452 : INFO : PROGRESS: processing document #0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/tfidfmodel.py", line 96, in __init__
    self.initialize(corpus)
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/tfidfmodel.py", line 119, in initialize
    for termid, _ in bow:
ValueError: need more than 1 value to unpack
>>> corpus = [[(1,3)], [(0,2)]]
>>> tfidf = models.TfidfModel(corpus)
2013-07-21 09:24:26,892 : INFO : collecting document frequencies
2013-07-21 09:24:26,892 : INFO : PROGRESS: processing document #0
2013-07-21 09:24:26,892 : INFO : calculating IDF weights for 2 documents and 2 features (2 matrix non-zeros)
>>>
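
Note that the message in the question is a `UserWarning`, not an exception, so the script keeps running after it is printed. One way to find the call that emits it is to promote warnings to exceptions with the standard `warnings` module and read the traceback. The sketch below only illustrates that mechanism: `noisy_step` is a hypothetical stand-in for the library call and simply re-emits the same message text.

```python
import warnings


def noisy_step():
    # Hypothetical stand-in for the call that triggers the warning.
    warnings.warn("indices array has non-integer dtype (float64)", UserWarning)
    return "done"


raised = False
with warnings.catch_warnings():
    # Inside this block, any UserWarning is raised as an exception.
    warnings.simplefilter("error", UserWarning)
    try:
        noisy_step()
    except UserWarning as exc:
        raised = True
        print("warning promoted to exception:", exc)
```

Wrapping the real model-building calls in the same `catch_warnings` block would make the traceback point at the statement that produces the warning.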

As I said above, I think you need to make sure that corpus does not contain any empty lists before calling models.TfidfModel(corpus) on it.
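
A minimal sketch of such a check, assuming it is acceptable to simply discard documents that end up empty. The helper name `drop_empty_documents` is hypothetical and not part of gensim. It filters `documents` and `corpus` together so that the positions used later in `documents[sim[0]]` still line up.

```python
def drop_empty_documents(documents, corpus):
    """Return (documents, corpus) without the empty bag-of-words entries."""
    kept = [(doc, bow) for doc, bow in zip(documents, corpus) if len(bow) > 0]
    if not kept:
        raise ValueError("every document is empty after preprocessing")
    kept_docs, kept_corpus = zip(*kept)
    return list(kept_docs), list(kept_corpus)


documents = [
    "Human machine interface for lab abc computer applications",
    "bags loose tea water second ingredient tastes water",
]
corpus = [[], [(0, 2)]]  # the value shown in the session above

documents, corpus = drop_empty_documents(documents, corpus)
print(corpus)  # [[(0, 2)]]
```

The filtered `corpus` can then be passed to `models.TfidfModel(corpus)` as before. Whether a one-document, one-term corpus gives useful similarities is a separate question; relaxing the "remove words that appear only once" step for very small inputs may be the more practical fix.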


Edited on 2021-05-29
