sklearn's TfidfVectorizer word frequency?
I have a question about how sklearn's TfidfVectorizer handles the frequency of words in each document.
The example code I've seen is:
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> corpus = [
...     'The dog ate a sandwich and I ate a sandwich',
...     'The wizard transfigured a sandwich'
... ]
>>> vectorizer = TfidfVectorizer(stop_words='english')
>>> print(vectorizer.fit_transform(corpus).todense())
[[ 0.75458397 0.37729199 0.53689271 0. 0. ]
[ 0. 0. 0.44943642 0.6316672 0.6316672 ]]
My question is: how do I interpret the numbers in this matrix? I understand that a 0 means a word, e.g. "wizard", appears 0 times in the first document, so it is 0. But how do I interpret the number 0.75458397? Is it the frequency of the word "ate" in the first document, or the frequency of "ate" across the whole corpus?
TF-IDF (which stands for "term frequency - inverse document frequency") does not give you a term's frequency in its representation.
TF-IDF gives high scores to terms that occur in only very few documents, and low scores to terms that occur in many documents, so it is roughly a measure of how discriminative a term is in a given document. Have a look at this resource for an excellent description of TF-IDF and a better idea of what it is doing.
If you just want counts, you need to use CountVectorizer instead.
I think you forgot that TF-IDF vectors are typically normalized, so their magnitude (length, or 2-norm) is always 1.
So the TF-IDF value of 0.75 is the frequency of "ate" multiplied by the inverse document frequency of "ate", divided by the magnitude of the TF-IDF vector.
Here are all the dirty details (skip down to tfidf0 = for the punch line):
from collections import Counter

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["The dog ate a sandwich and I ate a sandwich",
          "The wizard transfigured a sandwich"]
vectorizer = TfidfVectorizer(stop_words='english')
tfidfs = vectorizer.fit_transform(corpus)

# feature names in column order
columns = [k for (v, k) in sorted((v, k)
                                  for k, v in vectorizer.vocabulary_.items())]
tfidfs = pd.DataFrame(tfidfs.todense(),
                      columns=columns)
#     ate   dog  sandwich  transfigured  wizard
#0   0.75  0.38      0.54          0.00    0.00
#1   0.00  0.00      0.45          0.63    0.63

df = 1 / pd.DataFrame([vectorizer.idf_], columns=columns)
#     ate   dog  sandwich  transfigured  wizard
#0   0.71  0.71       1.0          0.71    0.71

corp = [txt.lower().split() for txt in corpus]
corp = [[w for w in d if w in vectorizer.vocabulary_] for d in corp]
tfs = pd.DataFrame([Counter(d) for d in corp]).fillna(0).astype(int)
tfs = tfs[columns]  # align column order with the tf-idf table
#     ate   dog  sandwich  transfigured  wizard
#0      2     1         2             0       0
#1      0     0         1             1       1

# The first document's TFIDF vector:
tfidf0 = tfs.iloc[0] * (1. / df)
tfidf0 = tfidf0 / np.linalg.norm(tfidf0)  # pd.np was removed in pandas 1.0+
#        ate       dog  sandwich  transfigured  wizard
#0  0.754584  0.377292  0.536893           0.0     0.0

tfidf1 = tfs.iloc[1] * (1. / df)
tfidf1 = tfidf1 / np.linalg.norm(tfidf1)
#    ate  dog  sandwich  transfigured    wizard
#0   0.0  0.0  0.449436      0.631667  0.631667
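The same calculation can be reproduced with plain numpy, making the formula explicit. With sklearn's defaults (smooth_idf=True, sublinear_tf=False, norm='l2'), the idf is ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. A sketch, assuming the same corpus and counts as above:

```python
import numpy as np

n_docs = 2
# raw term counts per document; columns: ate, dog, sandwich, transfigured, wizard
tf = np.array([[2, 1, 2, 0, 0],
               [0, 0, 1, 1, 1]])

df = np.count_nonzero(tf, axis=0)          # document frequency of each term
idf = np.log((1 + n_docs) / (1 + df)) + 1  # sklearn's smoothed idf

tfidf = tf * idf
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)  # l2-normalize each row
# tfidf now matches TfidfVectorizer's dense output:
# [[0.75458397 0.37729199 0.53689271 0.         0.        ]
#  [0.         0.         0.44943642 0.6316672  0.6316672 ]]
```

So 0.75458397 is count("ate", doc 0) = 2, times idf("ate") = ln(3/2) + 1 ≈ 1.4055, divided by the 2-norm of the first row.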
Just run the print below and you will see output along these lines:
print(vectorizer.fit_transform(corpus))
# Python 3 syntax; in Python 2, drop the parentheses
#(0, 1) 0.448320873199 Document 1, term = Dog
#(0, 3) 0.630099344518 Document 1, term = Sandwich
Note: this only applies if you have unigrams.
sklearn's TfidfVectorizer does not give you the counts directly.
To get the counts, you can use the TfidfVectorizer class methods inverse_transform and build_tokenizer:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'The dog ate a sandwich and I ate a sandwich',
    'The wizard transfigured a sandwich'
]
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)
X_words = vectorizer.inverse_transform(X)  # the words with tfidf > 0 in each document, instead of the tfidf values
tokenizer = vectorizer.build_tokenizer()   # the tokenizer function used by the vectorizer

for idx, words in enumerate(X_words):
    for word in words:
        count = tokenizer(corpus[idx]).count(word)
        print(idx, word, count)
Output:
0 dog 1
0 ate 2
0 sandwich 2
1 sandwich 1
1 wizard 1
1 transfigured 1
#0 means first sentence in corpus
This is a workaround; hope it helps someone :)
In the line X_words = tfidf.inverse_transform(X) it should be vectorizer instead of tfidf.
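An alternative way to recover counts without re-tokenizing by hand is to feed the fitted vocabulary into a CountVectorizer, so the count columns line up exactly with the tf-idf columns. A sketch under the same corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    'The dog ate a sandwich and I ate a sandwich',
    'The wizard transfigured a sandwich',
]

tfidf_vec = TfidfVectorizer(stop_words='english')
X = tfidf_vec.fit_transform(corpus)

# reuse the fitted vocabulary so the count columns match the tf-idf columns
count_vec = CountVectorizer(vocabulary=tfidf_vec.vocabulary_)
counts = count_vec.fit_transform(corpus)
print(counts.toarray())
# [[2 1 2 0 0]
#  [0 0 1 1 1]]
```

Because the vocabulary is fixed, the CountVectorizer needs no stop-word list of its own: words outside the vocabulary are simply ignored.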