How is TF-IDF calculated by the scikit-learn TfidfVectorizer?
I run the code below to convert a matrix of text into a TF-IDF matrix.
from sklearn.feature_extraction.text import TfidfVectorizer

text = ['This is a string', 'This is another string',
        'TFIDF computation calculation',
        'TfIDF is the product of TF and IDF']
vectorizer = TfidfVectorizer(max_df=1.0, min_df=1, stop_words='english', norm=None)
X = vectorizer.fit_transform(text)
X_vovab = vectorizer.get_feature_names()
X_mat = X.todense()
X_idf = vectorizer.idf_
I get the following output:
X_vovab =
[u'calculation',
u'computation',
u'idf',
u'product',
u'string',
u'tf',
u'tfidf']
and X_mat =
([[ 0. , 0. , 0. , 0. , 1.51082562,
0. , 0. ],
[ 0. , 0. , 0. , 0. , 1.51082562,
0. , 0. ],
[ 1.91629073, 1.91629073, 0. , 0. , 0. ,
0. , 1.51082562],
[ 0. , 0. , 1.91629073, 1.91629073, 0. ,
1.91629073, 1.51082562]])
Now I don't understand how these scores are computed. My idea is that for text[0], only the score for 'string' is computed, and there is a score in the 5th column. But since TF-IDF is the product of the term frequency, 2, and the IDF, log(4/2), it should be 1.39 and not 1.51 as shown in the matrix. How is the TF-IDF score calculated in scikit-learn?
The exact computation formula is given in the docs:
The actual formula used for tf-idf is tf * (idf + 1) = tf + tf * idf, instead of tf * idf
and
Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once.
That means that 1.51082562 is obtained as 1.51082562 = 1 + ln((4+1)/(2+1)).
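A quick check of that arithmetic, using only the standard library (N = 4 documents, and 'string' appears in df = 2 of them):

```python
import math

N = 4   # number of documents in the corpus
df = 2  # number of documents containing 'string'

# smoothed idf: ln((N + 1) / (df + 1)) + 1
idf = math.log((N + 1) / (df + 1)) + 1
print(round(idf, 8))  # 1.51082562
```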
TF-IDF is done in multiple steps by scikit-learn's TfidfVectorizer, which in fact uses TfidfTransformer and inherits from CountVectorizer.
Let me summarize the steps it performs to make it more straightforward:
- tfs are computed by CountVectorizer's fit_transform()
- idfs are computed by TfidfTransformer's fit()
- tfidfs are computed by TfidfTransformer's transform()
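The three steps above can be reproduced explicitly. This sketch, reusing the corpus from the question, chains CountVectorizer and TfidfTransformer and compares the result against TfidfVectorizer:

```python
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer,
                                             TfidfVectorizer)

text = ['This is a string', 'This is another string',
        'TFIDF computation calculation',
        'TfIDF is the product of TF and IDF']

# Step 1: raw term counts (tfs)
cv = CountVectorizer(stop_words='english')
counts = cv.fit_transform(text)

# Steps 2-3: fit the idfs, then compute tf * idf
tt = TfidfTransformer(norm=None)
tfidf = tt.fit_transform(counts)

# The same result in one shot
direct = TfidfVectorizer(stop_words='english', norm=None).fit_transform(text)
print(np.allclose(tfidf.toarray(), direct.toarray()))  # True
```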
You can check the source code here.
Back to your example. Here is the computation of the tfidf weight for the 5th term of the vocabulary, in the first document (X_mat[0,4]):
First, the tf of 'string' in the first document:
tf = 1
Second, the idf of 'string', with smoothing enabled (the default behavior):
df = 2
N = 4
idf = ln((N + 1) / (df + 1)) + 1 = ln(5 / 3) + 1 = 1.5108256238
Finally, the tfidf weight for (document 0, feature 4):
tfidf(0,4) = tf * idf = 1 * 1.5108256238 = 1.5108256238
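That hand-worked value can be checked against the vectorizer directly. A sketch reusing the question's setup (feature index 4 is 'string' in the fitted vocabulary):

```python
import math
from sklearn.feature_extraction.text import TfidfVectorizer

text = ['This is a string', 'This is another string',
        'TFIDF computation calculation',
        'TfIDF is the product of TF and IDF']
vec = TfidfVectorizer(stop_words='english', norm=None)
X = vec.fit_transform(text)

tf = 1                                  # 'string' occurs once in document 0
idf = math.log((4 + 1) / (2 + 1)) + 1   # smoothed idf
print(abs(X[0, 4] - tf * idf) < 1e-9)   # True
```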
I noticed you chose not to normalize the tfidf matrix. Keep in mind that normalizing the tfidf matrix is a common and usually recommended approach, since most models expect the feature matrix (or design matrix) to be normalized.
By default, TfidfVectorizer L2-normalizes the output matrix as the last step of the computation. Normalizing it means its weights will all be between 0 and 1.
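To see the effect of the default L2 normalization, a small sketch with the question's corpus: after normalization, every document row has unit Euclidean length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

text = ['This is a string', 'This is another string',
        'TFIDF computation calculation',
        'TfIDF is the product of TF and IDF']

# norm='l2' is the default: each non-empty row is scaled to unit length
X = TfidfVectorizer(stop_words='english').fit_transform(text)
row_norms = np.linalg.norm(X.toarray(), axis=1)
print(np.allclose(row_norms, 1.0))  # True
```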
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
print(corpus)

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())

# print the term frequencies
z = X.toarray()
print(z)

vectorizer1 = TfidfVectorizer(min_df=1)
X1 = vectorizer1.fit_transform(corpus)

# print the idf values
idf = vectorizer1.idf_
print(dict(zip(vectorizer1.get_feature_names(), idf)))

# print the tfidf matrix
print(X1.toarray())

# formula:
# df = 2
# N = 4
# idf = ln((N + 1) / (df + 1)) + 1 = ln(5 / 3) + 1 = 1.5108256238
# tfidf(0,4) = tf * idf = 1 * 1.5108256238 = 1.5108256238