What is the default smartirs for gensim TfidfModel?

Question

Using gensim:

import pandas as pd
from gensim.models import TfidfModel
from gensim.corpora import Dictionary

sent0 = "The quick brown fox jumps over the lazy brown dog .".lower().split()
sent1 = "Mr brown jumps over the lazy fox .".lower().split()

dataset = [sent0, sent1]
vocab = Dictionary(dataset)
corpus = [vocab.doc2bow(sent) for sent in dataset] 
model = TfidfModel(corpus)

# To retrieve the same pd.DataFrame format.
documents_tfidf_lol = [{vocab[word_idx]:tfidf_value for word_idx, tfidf_value in sent} for sent in model[corpus]]
documents_tfidf = pd.DataFrame(documents_tfidf_lol)
documents_tfidf.fillna(0, inplace=True)

documents_tfidf

[out]:

        dog   mr     quick
0  0.707107  0.0  0.707107
1  0.000000  1.0  0.000000

If we compute the TF-IDF manually,

import math
from collections import Counter

import numpy as np
import pandas as pd

sent0 = "The quick brown fox jumps over the lazy brown dog .".lower().split()
sent1 = "Mr brown jumps over the lazy fox .".lower().split()

documents = pd.DataFrame.from_dict(list(map(Counter, [sent0, sent1])))
documents = documents.fillna(0)
# Normalize the TF. Note that apply() works column by column here,
# so each word's counts are divided by that word's total over all sentences.
documents = documents.apply(lambda x: x/sum(x))
documents.head()

# To compute the IDF for all words.
num_sentences, num_words = documents.shape

idf_vector = [] # Let's save an ordered list of IDFs w.r.t. the order of the column names.

for word in documents:
  # The denominator is the number of sentences in which the word occurs.
  word_idf = math.log(num_sentences/len(documents[word].to_numpy().nonzero()[0]))
  idf_vector.append(word_idf)

# Compute the TF-IDF table.
documents_tfidf = pd.DataFrame(documents.to_numpy() * np.array(idf_vector),
                               columns=list(documents))
documents_tfidf

[out]:

     .  brown       dog  fox  jumps  lazy        mr  over     quick  the
0  0.0    0.0  0.693147  0.0    0.0   0.0  0.000000   0.0  0.693147  0.0
1  0.0    0.0  0.000000  0.0    0.0   0.0  0.693147   0.0  0.000000  0.0

If we use math.log2 instead of math.log:

     .  brown  dog  fox  jumps  lazy   mr  over  quick  the
0  0.0    0.0  1.0  0.0    0.0   0.0  0.0   0.0    1.0  0.0
1  0.0    0.0  0.0  0.0    0.0   0.0  1.0   0.0    0.0  0.0

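It looks like gensim:

removes the non-salient words from the TF-IDF model, which is evident when we print(model[corpus])
may use a particular log base; log_2 seems to be involved
perhaps applies some normalization.

Looking at https://radimrehurek.com/gensim/models/tfidfmodel.html#gensim.models.tfidfmodel.TfidfModel , different SMART schemes output different values, but the docs do not make clear what the default is.

What is the default smartirs for gensim TfidfModel?

What other default parameters cause the difference between a natively implemented TF-IDF and gensim's?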

Answer 1

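The default value of smartirs is None, but if you follow the code, it is equivalent to ntc.

But how?

First, when you call model = TfidfModel(corpus), it computes the IDF of the corpus using a function called wglobal, which the documentation explains as follows:

wglobal is the global weighting function, and its default value is df2idf(). df2idf is a function that computes the IDF for a term with a given document frequency. The default arguments and formula of df2idf are: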

df2idf(docfreq, totaldocs, log_base=2.0, add=0.0)

It is implemented as:

idfs = add + np.log(float(totaldocs) / docfreq) / np.log(log_base)

This determines one of the smartirs letters: the document frequency weighting is inverse-document-frequency, or idf (the t in ntc).
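
As a rough illustration of the formula quoted above, the sketch below re-implements it with the standard library only. The name df2idf_sketch is made up for this example and is not gensim's function; the document frequencies are the ones from the two-sentence corpus in the question.

```python
import math

def df2idf_sketch(docfreq, totaldocs, log_base=2.0, add=0.0):
    """Stand-alone mirror of the quoted formula (hypothetical helper, not gensim's code)."""
    return add + math.log(float(totaldocs) / docfreq) / math.log(log_base)

# The corpus in the question has two documents.
idf_rare = df2idf_sketch(docfreq=1, totaldocs=2)    # term in one document, e.g. "dog"
idf_common = df2idf_sketch(docfreq=2, totaldocs=2)  # term in both documents, e.g. "brown"
idf_rare_ln = df2idf_sketch(docfreq=1, totaldocs=2, log_base=math.e)  # natural log instead

print(idf_rare, idf_common, idf_rare_ln)
```

With base 2, a term that occurs in only one of the two documents gets an IDF of 1, and a term that occurs in both gets 0. That is why most columns vanish in the manual computation as well.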

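wlocal defaults to the identity function. The term frequencies of the corpus pass through the identity function, which does nothing, and the corpus itself is returned. Hence the other smartirs letter, the term frequency weighting, is natural, or n. Now that we have the term frequency and the inverse document frequency, we can compute tfidf as their product.

normalize is True by default, which means that after computing the TF-IDF it normalizes the tfidf vector. Normalization is done with the l2-norm (Euclidean unit norm), which means our last smartirs letter is cosine, or c. This part is implemented as: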

# vec(term_id, value) is tfidf result
length = 1.0 * math.sqrt(sum(val ** 2 for _, val in vec))
normalize_by_length = [(termid, val / length) for termid, val in vec]
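
To see what this normalization does to the numbers in the question, here is a minimal stand-alone sketch. The function name l2_normalize and the term ids are assumptions made for illustration; the unnormalized weights of 1.0 are the log-base-2 values from the question's manual table.

```python
import math

def l2_normalize(vec):
    """Scale a list of (term_id, value) pairs to unit Euclidean length."""
    length = math.sqrt(sum(val ** 2 for _, val in vec))
    if length == 0.0:
        return list(vec)  # an all-zero vector has nothing to scale
    return [(term_id, val / length) for term_id, val in vec]

# Hypothetical term ids: 0 stands for "dog", 1 for "quick", 2 for "mr".
doc0 = [(0, 1.0), (1, 1.0)]  # two non-zero weights in the first sentence
doc1 = [(2, 1.0)]            # one non-zero weight in the second sentence

normalized0 = l2_normalize(doc0)
normalized1 = l2_normalize(doc1)
print(normalized0)
print(normalized1)
```

Two equal weights become 1/sqrt(2), about 0.707107, each, and a single weight becomes 1.0, which lines up with the gensim output shown in the question.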

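When you call model[corpus] or model.__getitem__(), the following happens:

__getitem__ has an eps parameter, a threshold that removes every entry whose tfidf value is smaller than eps. By default this value is 1e-12. As a result, when you print the vectors, only some of the entries appear.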
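
Putting the pieces together, the following is a sketch in plain Python of the defaults described in this answer: raw term counts, log-base-2 IDF, l2 normalization and the eps cut-off. It is not gensim's implementation; tfidf_sketch is a made-up name, and the exact order of the steps is an assumption.

```python
import math
from collections import Counter

def tfidf_sketch(docs, log_base=2.0, eps=1e-12):
    """Raw counts times idf, l2-normalized, with entries not above eps dropped."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    docfreq = Counter(term for doc in docs for term in set(doc))
    result = []
    for doc in docs:
        counts = Counter(doc)  # "natural" term frequency, i.e. raw counts
        weights = {
            term: tf * math.log(n_docs / docfreq[term]) / math.log(log_base)
            for term, tf in counts.items()
        }
        length = math.sqrt(sum(w ** 2 for w in weights.values()))
        if length > 0.0:
            weights = {term: w / length for term, w in weights.items()}
        # Keep only the entries whose absolute weight exceeds the threshold.
        result.append({term: w for term, w in weights.items() if abs(w) > eps})
    return result

sent0 = "The quick brown fox jumps over the lazy brown dog .".lower().split()
sent1 = "Mr brown jumps over the lazy fox .".lower().split()

vectors = tfidf_sketch([sent0, sent1])
for vec in vectors:
    print(vec)
```

Under these assumptions only quick and dog survive in the first sentence and only mr in the second, because every other word occurs in both sentences and therefore has an IDF of 0.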

python

nlp

information-retrieval

tf-idf

gensim