TFIDF Model Creation TypeError in Gensim

Question

TypeError: 'TfidfModel' object is not callable

Why can't I compute the TFIDF matrix for each doc after initializing the model?

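I started with 999 documents: 999 paragraphs of roughly 5-15 sentences each. After tokenizing everything with spaCy, I created the dictionary (~16k unique tokens) and the corpus (a list of lists of tuples).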

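Now I'm ready to create the tfidf matrix (and later the LDA and w2v matrices) for some ML. However, after initializing the tfidf model with my corpus (to compute the 'IDF') via tfidf = models.TfidfModel(corpus), I get the following error when I try to view the tfidf of a document with tfidf(corpus[5]): TypeError: 'TfidfModel' object is not callable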

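I am able to create this model with a different corpus of four documents, each containing a single sentence. There I can confirm that the expected corpus format is a list of lists of tuples: [doc1[(word1, count),(word2, count),...], doc2[(word3, count),(word4,count),...]...]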

from gensim import corpora, models, similarities

texts = [['teenager', 'martha', 'moxley'...], ['ok','like','kris','usual',...]...]
dictionary = corpora.Dictionary(texts)
>>> Dictionary(15937 unique tokens: ['teenager', 'martha', 'moxley']...)

corpus = [dictionary.doc2bow(text) for text in texts]
>>> [[(0, 2),(1, 2),(2, 1)...],[(3, 1),(4, 1)...]...]

tfidf = models.TfidfModel(corpus)
>>> TfidfModel(num_docs=999, num_nnz=86642)

tfidf(corpus[0])
>>> TypeError: 'TfidfModel' object is not callable

corpus[0]
>>> [(0, 2),(1, 2),(2, 1)...]

print(type(corpus),type(corpus[1]),type(corpus[1][3]))
>>> <class 'list'> <class 'list'> <class 'tuple'>

Answer 1

Instead of: tfidf(corpus[0])

Try: tfidf[corpus[0]]

Answer 2

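Expanding on @whs2k's answer: the square-bracket syntax forms a transformation wrapper around the corpus, creating a kind of lazy processing pipeline.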

I didn't get it until I read the note in this tutorial: https://radimrehurek.com/gensim/tut2.html

Calling model[corpus] only creates a wrapper around the old corpus document stream – actual conversions are done on-the-fly, during document iteration. We cannot convert the entire corpus at the time of calling corpus_transformed = model[corpus], because that would mean storing the result in main memory, and that contradicts gensim’s objective of memory-independence. If you will be iterating over the transformed corpus_transformed multiple times, and the transformation is costly, serialize the resulting corpus to disk first and continue using that.

But I still feel I don't fully understand the underlying Python list magic.
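
As far as I can tell, the brackets are not list-specific: Python routes obj[x] to obj.__getitem__(x) and obj(x) to obj.__call__(x), so a class that defines only the former can be indexed but not called. Below is a rough pure-Python sketch of that idea. ToyModel and LazyCorpus are hypothetical names invented for illustration, the scaling "transformation" is a placeholder, and none of this is gensim's actual implementation.

```python
class LazyCorpus:
    """Hypothetical wrapper: converts documents only while being iterated."""

    def __init__(self, model, corpus):
        self.model = model
        self.corpus = corpus

    def __iter__(self):
        for doc in self.corpus:
            # Each document is transformed on the fly, one at a time.
            yield self.model[doc]


class ToyModel:
    """Hypothetical stand-in for a transformation; not gensim's code."""

    def __init__(self, scale):
        self.scale = scale

    def __getitem__(self, item):
        # model[x] is sugar for model.__getitem__(x).
        # Simplistic check (an assumption of this sketch): a document is a
        # non-empty list whose first element is a (token_id, count) tuple.
        if item and isinstance(item[0], tuple):
            return [(token_id, count * self.scale) for token_id, count in item]
        # Anything else is treated as a corpus and wrapped, not converted.
        return LazyCorpus(self, item)


model = ToyModel(scale=0.5)
corpus = [[(0, 2), (1, 2)], [(2, 4)]]

single = model[corpus[0]]   # one document: converted immediately
wrapped = model[corpus]     # whole corpus: only wrapped, nothing computed yet
all_docs = list(wrapped)    # iteration triggers the actual conversions

# ToyModel defines no __call__, so parentheses fail as in the question.
try:
    model(corpus[0])
except TypeError as err:
    message = str(err)
    print("TypeError:", message)
```

Because the wrapper redoes the work on every pass, iterating it several times repeats the transformation, which is why the quoted note suggests serializing a costly result to disk first.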
