Using Tf-Idf in a Keras Model

Mogambo

I have read the train, test and validation sentences into train_sentences, test_sentences and val_sentences.

Then I applied a Tf-IDF vectorizer to them.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=300)
vectorizer = vectorizer.fit(train_sentences)

X_train = vectorizer.transform(train_sentences)
X_val = vectorizer.transform(val_sentences)
X_test = vectorizer.transform(test_sentences)
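
(For reference, transform returns scipy sparse matrices of shape (number of sentences, 300); a Dense layer needs plain arrays, so they can be densified like this if required:)

# each TF-IDF matrix has shape (number of sentences, 300)
print(X_train.shape)

# scipy sparse matrices can be converted to dense numpy arrays
X_train_dense = X_train.toarray()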

My model looks like this:

from keras.models import Sequential
from keras.layers import Input, Flatten, Dense

model = Sequential()

model.add(Input(????))

model.add(Flatten())

model.add(Dense(256, activation='relu'))

model.add(Dense(32, activation='relu'))

model.add(Dense(8, activation='sigmoid'))

model.summary()

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

Usually, in the case of word2vec, we pass an embedding matrix to the Embedding layer.
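
For example, this is what I mean in the word2vec case (embedding_matrix stands for a precomputed word2vec matrix; the names here are only illustrative):

# embedding_matrix: hypothetical precomputed word2vec weights, shape (vocab_size, 300)
model.add(Embedding(vocab_size,
                    300,
                    weights=[embedding_matrix],
                    input_length=maxlen,
                    trainable=False))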

How can I use Tf-IDF with a Keras model? Please provide an example I can follow.

Thanks.

Mathias Mueller

I cannot think of a good reason to combine TF/IDF values with embedding vectors, but here is a possible solution: use the functional API, multiple Inputs and the concatenate function.

To concatenate layer outputs, their shapes must be aligned (except for the axis that is being concatenated). One approach is to average the embeddings and then concatenate the result with the vector of TF/IDF values.

Setup and some example data

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

from sklearn.datasets import fetch_20newsgroups

import numpy as np

import keras

from keras.models import Model
from keras.layers import Dense, Activation, concatenate, Embedding, Input

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# some sample training data
bunch = fetch_20newsgroups()
all_sentences = []

for document in bunch.data:
  sentences = document.split("\n")
  all_sentences.extend(sentences)

all_sentences = all_sentences[:1000]

X_train, X_test = train_test_split(all_sentences, test_size=0.1)
len(X_train), len(X_test)

vectorizer = TfidfVectorizer(max_features=300)
vectorizer = vectorizer.fit(X_train)

df_train = vectorizer.transform(X_train)

tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)

maxlen = 50

sequences_train = tokenizer.texts_to_sequences(X_train)
sequences_train = pad_sequences(sequences_train, maxlen=maxlen)

Model definition

vocab_size = len(tokenizer.word_index) + 1
embedding_size = 300

input_tfidf = Input(shape=(300,))
input_text = Input(shape=(maxlen,))

embedding = Embedding(vocab_size, embedding_size, input_length=maxlen)(input_text)

# this averaging method taken from:
# https://stackoverflow.com/a/54217709/1987598

mean_embedding = keras.layers.Lambda(lambda x: keras.backend.mean(x, axis=1))(embedding)

concatenated = concatenate([input_tfidf, mean_embedding])

dense1 = Dense(256, activation='relu')(concatenated)
dense2 = Dense(32, activation='relu')(dense1)
dense3 = Dense(8, activation='sigmoid')(dense2)

model = Model(inputs=[input_tfidf, input_text], outputs=dense3)

model.summary()

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

Model summary output

Model: "model_2"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_11 (InputLayer)           (None, 50)           0                                            
__________________________________________________________________________________________________
embedding_5 (Embedding)         (None, 50, 300)      633900      input_11[0][0]                   
__________________________________________________________________________________________________
input_10 (InputLayer)           (None, 300)          0                                            
__________________________________________________________________________________________________
lambda_1 (Lambda)               (None, 300)          0           embedding_5[0][0]                
__________________________________________________________________________________________________
concatenate_4 (Concatenate)     (None, 600)          0           input_10[0][0]                   
                                                                 lambda_1[0][0]                   
__________________________________________________________________________________________________
dense_5 (Dense)                 (None, 256)          153856      concatenate_4[0][0]              
__________________________________________________________________________________________________
dense_6 (Dense)                 (None, 32)           8224        dense_5[0][0]                    
__________________________________________________________________________________________________
dense_7 (Dense)                 (None, 8)            264         dense_6[0][0]                    
==================================================================================================
Total params: 796,244
Trainable params: 796,244
Non-trainable params: 0
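
To actually train this model, pass both inputs in the same order as in the Model definition. A minimal sketch, assuming hypothetical labels y_train of shape (len(X_train), 8); note that df_train is a scipy sparse matrix and is densified first:

# densify the sparse TF-IDF matrix for the Input(shape=(300,)) branch
X_tfidf_train = df_train.toarray()

# y_train is hypothetical here, with 8 targets per sentence
model.fit([X_tfidf_train, sequences_train], y_train, epochs=3, batch_size=32)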
