Can you add to a CountVectorizer in scikit-learn?

imichaeldotorg

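I'd like to create a CountVectorizer in scikit-learn from a corpus of text, and then later add more text to it (extending the original dictionary).

If I use transform(), it keeps the original vocabulary but does not add any new words. If I use fit_transform(), it simply rebuilds the vocabulary from scratch. See below: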

In [2]: count_vect = CountVectorizer()

In [3]: count_vect.fit_transform(["This is a test"])
Out[3]: 
<1x3 sparse matrix of type '<type 'numpy.int64'>'
    with 3 stored elements in Compressed Sparse Row format>

In [4]: count_vect.vocabulary_  
Out[4]: {u'is': 0, u'test': 1, u'this': 2}

In [5]: count_vect.transform(["This not is a test"])
Out[5]: 
<1x3 sparse matrix of type '<type 'numpy.int64'>'
    with 3 stored elements in Compressed Sparse Row format>

In [6]: count_vect.vocabulary_
Out[6]: {u'is': 0, u'test': 1, u'this': 2}

In [7]: count_vect.fit_transform(["This not is a test"])
Out[7]: 
<1x4 sparse matrix of type '<type 'numpy.int64'>'
    with 4 stored elements in Compressed Sparse Row format>

In [8]: count_vect.vocabulary_
Out[8]: {u'is': 0, u'not': 1, u'test': 2, u'this': 3}

I'd like the equivalent of an update() function. I'd want it to work something like this:

In [2]: count_vect = CountVectorizer()

In [3]: count_vect.fit_transform(["This is a test"])
Out[3]: 
<1x3 sparse matrix of type '<type 'numpy.int64'>'
    with 3 stored elements in Compressed Sparse Row format>

In [4]: count_vect.vocabulary_  
Out[4]: {u'is': 0, u'test': 1, u'this': 2}

In [5]: count_vect.update(["This not is a test"])
Out[5]: 
<1x4 sparse matrix of type '<type 'numpy.int64'>'
    with 4 stored elements in Compressed Sparse Row format>

In [6]: count_vect.vocabulary_
Out[6]: {u'is': 0, u'not': 1, u'test': 2, u'this': 3}

Is there a way to do this?

piman314

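The algorithms implemented in scikit-learn are designed to be fit on all of the data at once, which is necessary for most ML algorithms (though, interestingly, not for the application you describe), so there is no update functionality.

There is a way to get what you want by approaching it slightly differently: refit on the original corpus together with the new text. See the following code: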

from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
# Fit on the original corpus only.
count_vect.fit_transform(["This is a test"])
print(count_vect.vocabulary_)
# Refit on the original corpus plus the new text to rebuild the vocabulary.
count_vect.fit_transform(["This is a test", "This is not a test"])
print(count_vect.vocabulary_)

Which outputs:

{u'this': 2, u'test': 1, u'is': 0}
{u'this': 3, u'test': 2, u'is': 0, u'not': 1}
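If keeping the whole corpus around for refitting is not practical, one possible alternative is to grow the vocabulary yourself and hand it to a new vectorizer through the `vocabulary` constructor parameter. The sketch below is only an illustration under that assumption: `extend_vocabulary` is a hypothetical helper, not part of scikit-learn, and it appends unseen tokens at the end so that existing column indices stay unchanged (which means the result is not re-sorted alphabetically the way a fresh fit would be).

```python
from sklearn.feature_extraction.text import CountVectorizer


def extend_vocabulary(vectorizer, new_docs):
    """Hypothetical helper: return a new CountVectorizer whose vocabulary is
    the fitted vocabulary of `vectorizer` plus any unseen tokens in `new_docs`.
    """
    # Reuse the same preprocessing and tokenization as the fitted vectorizer.
    analyzer = vectorizer.build_analyzer()
    vocab = dict(vectorizer.vocabulary_)
    for doc in new_docs:
        for token in analyzer(doc):
            if token not in vocab:
                # Append new tokens so existing column indices do not move.
                vocab[token] = len(vocab)
    # Copy the original settings and plug in the extended, fixed vocabulary.
    params = vectorizer.get_params()
    params["vocabulary"] = vocab
    # With a fixed vocabulary, fit() only validates and stores it.
    return CountVectorizer(**params).fit(new_docs)


count_vect = CountVectorizer().fit(["This is a test"])
updated = extend_vocabulary(count_vect, ["This not is a test"])
counts = updated.transform(["This not is a test"])
print(updated.vocabulary_)
print(counts.shape)
```

Here "not" is given the next free index (3) rather than being slotted in alphabetically, so matrices produced earlier with the original three columns remain aligned with the first three columns of the new ones.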


Edited on 2021-06-16
