Understanding TfidfVectorizer output

I am testing TfidfVectorizer with a simple example, but I can't make sense of the results.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away",
          "Never compare an apple to an orange",
          "I prefer scikit-learn to Orange",
          "The scikit-learn docs are Orange and Blue"]
vect = TfidfVectorizer(min_df=1, stop_words="english")
tfidf = vect.fit_transform(corpus)

print(vect.get_feature_names())
print(tfidf.shape)
print(tfidf)

Output:

['apple', 'away', 'blue', 'compare', 'day', 'docs', 'doctor', 'keeps', 'learn', 'like', 'orange', 'prefer', 'scikit']
(5, 13)
  (0, 0)    0.5564505207186616
  (0, 9)    0.830880748357988
  ...

When I compute the tf-idf for the first sentence by hand, I get different results:

So:

What am I missing?

There are several issues with your calculation.

First, there is more than one convention for computing the TF (see the Wikipedia entry); scikit-learn does not normalize it by the document length. From the user guide:

[...] the term frequency, the number of times a term occurs in a given document [...]

So here TF("apple", Document_1) = 1, not 0.5.

Second, regarding the IDF definition - from the docs:

If smooth_idf=True (the default), the constant “1” is added to the numerator and denominator of the idf as if an extra document was seen containing every term in the collection exactly once, which prevents zero divisions: idf(t) = log [ (1 + n) / (1 + df(t)) ] + 1.

So here we have:

IDF("apple") = ln((5+1)/(3+1)) + 1 = 1.4054651081081644

Hence:

TF-IDF("apple") = 1 * 1.4054651081081644 = 1.4054651081081644
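The IDF arithmetic above can be checked directly, and the fitted vectorizer exposes the same values through its idf_ attribute (a sketch):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away",
          "Never compare an apple to an orange",
          "I prefer scikit-learn to Orange",
          "The scikit-learn docs are Orange and Blue"]

n = 5         # total number of documents
df_apple = 3  # 'apple' occurs in documents 0, 1 and 2

# smooth_idf=True (the default): idf(t) = ln((1 + n) / (1 + df(t))) + 1
idf_apple = np.log((1 + n) / (1 + df_apple)) + 1
print(idf_apple)  # 1.4054651081081644

# the fitted vectorizer stores the same per-term values in idf_
vect = TfidfVectorizer(min_df=1, stop_words="english").fit(corpus)
print(vect.idf_[vect.vocabulary_["apple"]])  # 1.4054651081081644
```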

Third, with the default setting norm='l2', an extra normalization takes place; again from the docs:

Normalization is “c” (cosine) when norm='l2', “n” (none) when norm=None.

Explicitly removing this extra normalization from your example, i.e.

vect = TfidfVectorizer(min_df=1, stop_words="english", norm=None)

gives for 'apple':

(0, 0)  1.4054651081081644

i.e. the value computed manually above.

For details on how the normalization affects the computation when norm='l2' (the default), see the Tf–idf term weighting section of the user guide; they themselves acknowledge that:

the tf-idfs computed in scikit-learn’s TfidfTransformer and TfidfVectorizer differ slightly from the standard textbook notation
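To see where the values in the original output (0.5564..., 0.8308...) come from, one can divide the unnormalized tf-idf row for the first document by its Euclidean norm. After stop-word removal that document contains the terms 'apple' (df = 3) and 'like' (df = 1); a sketch:

```python
import numpy as np

# Unnormalized tf-idf values for document 0 ("I'd like an apple"):
# TF is 1 for both terms, IDF uses the smoothed formula from the docs
tfidf_apple = 1 * (np.log((1 + 5) / (1 + 3)) + 1)  # 'apple': df = 3
tfidf_like  = 1 * (np.log((1 + 5) / (1 + 1)) + 1)  # 'like':  df = 1

row = np.array([tfidf_apple, tfidf_like])
row_l2 = row / np.linalg.norm(row)  # divide by the Euclidean (l2) norm

print(row_l2)  # [0.55645052 0.83088075], matching the default output
```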