Understanding TfidfVectorizer output
I am testing TfidfVectorizer with a simple example, but I cannot make sense of the results.
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["I'd like an apple",
"An apple a day keeps the doctor away",
"Never compare an apple to an orange",
"I prefer scikit-learn to Orange",
"The scikit-learn docs are Orange and Blue"]
vect = TfidfVectorizer(min_df=1, stop_words="english")
tfidf = vect.fit_transform(corpus)
print(vect.get_feature_names())
print(tfidf.shape)
print(tfidf)
Output:
['apple', 'away', 'blue', 'compare', 'day', 'docs', 'doctor', 'keeps', 'learn', 'like', 'orange', 'prefer', 'scikit']
(5, 13)
(0, 0) 0.5564505207186616
(0, 9) 0.830880748357988
...
I computed the tf-idf of the first sentence by hand and got different results:
- The first document ("I'd like an apple") contains only 2 terms after stop-word removal (per the vect.get_feature_names() output above): "like", "apple"
- TF("apple", Document_1) = 1/2 = 0.5
- TF("like", Document_1) = 1/2 = 0.5
- The word apple appears 3 times in the corpus.
- The word like appears 1 time in the corpus.
- IDF("apple") = ln(5/3) = 0.51082
- IDF("like") = ln(5/1) = 1.60943
So:
tfidf("apple") in Document 1 = 0.5 * 0.51082 = 0.255 != 0.5564
tfidf("like") in Document 1 = 0.5 * 1.60943 = 0.804 != 0.8308
What am I missing?
There are several issues with your calculation.
First, there are multiple conventions for computing the TF (see the Wikipedia entry); scikit-learn does not normalize it by the document length. From the user guide:
[...] the term frequency, the number of times a term occurs in a given document [...]
So, here TF("apple", Document_1) = 1, not 0.5.
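A quick way to confirm that the TF here is the raw count is CountVectorizer, which exposes the same term counts that TfidfVectorizer starts from (a minimal sketch using the corpus from the question):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away",
          "Never compare an apple to an orange",
          "I prefer scikit-learn to Orange",
          "The scikit-learn docs are Orange and Blue"]

# CountVectorizer produces the raw term counts that TfidfVectorizer
# then multiplies by the IDF weights
cv = CountVectorizer(min_df=1, stop_words="english")
counts = cv.fit_transform(corpus)

# "apple" occurs once in the first document: TF = 1, not 1/2
print(counts[0, cv.vocabulary_["apple"]])  # 1
```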
Second, regarding the IDF definition, from the docs:
If smooth_idf=True
(the default), the constant “1” is added to the numerator and denominator of the idf as if an extra document was seen containing every term in the collection exactly once, which prevents zero divisions: idf(t) = log [ (1 + n) / (1 + df(t)) ] + 1.
So, here we will have
IDF("apple") = ln((5+1)/(3+1)) + 1 = 1.4054651081081644
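This smoothed formula can be checked against the idf_ attribute of the fitted vectorizer (a sketch, assuming the corpus from the question):

```python
import math
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away",
          "Never compare an apple to an orange",
          "I prefer scikit-learn to Orange",
          "The scikit-learn docs are Orange and Blue"]

vect = TfidfVectorizer(min_df=1, stop_words="english")
vect.fit(corpus)

n, df_apple = 5, 3  # 5 documents, "apple" appears in 3 of them
idf_apple = math.log((1 + n) / (1 + df_apple)) + 1
print(idf_apple)                             # 1.4054651081081644
print(vect.idf_[vect.vocabulary_["apple"]])  # same value
```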
hence
TF-IDF("apple") = 1 * 1.4054651081081644 = 1.4054651081081644
Third, due to the default setting norm='l2', an additional normalization takes place; again from the docs:
Normalization is “c” (cosine) when norm='l2', “n” (none) when norm=None.
Explicitly removing this extra normalization from your example, i.e.
vect = TfidfVectorizer(min_df=1, stop_words="english", norm=None)
gives for 'apple'
(0, 0) 1.4054651081081644
i.e. the value computed manually.
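Putting the three pieces together, the 0.5564... and 0.8308... values from the original output can also be reproduced by hand: compute the unnormalized tf-idf of both terms in the first document, then divide by the L2 norm of that vector (a sketch, using the document frequencies of the corpus above):

```python
import math

n = 5  # number of documents
idf_apple = math.log((1 + n) / (1 + 3)) + 1  # df("apple") = 3
idf_like = math.log((1 + n) / (1 + 1)) + 1   # df("like") = 1

# unnormalized tf-idf vector of document 1 (both TFs are 1)
vec = [1 * idf_apple, 1 * idf_like]

# L2 (cosine) normalization applied by the default norm='l2'
l2 = math.sqrt(sum(x * x for x in vec))
print(vec[0] / l2)  # ~0.5564505 -- matches tfidf[0, 0]
print(vec[1] / l2)  # ~0.8308807 -- matches tfidf[0, 9]
```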
For details of how the normalization affects the calculations when norm='l2' (the default setting), see the Tf–idf term weighting section of the user guide; as they themselves admit:
the tf-idfs computed in scikit-learn’s TfidfTransformer and TfidfVectorizer differ slightly from the standard textbook notation