Python gensim (TfidfModel): How is the Tf-Idf computed?
1) For the test texts below,
test=['test test', 'test toy']
the tf-idf scores [without normalization (smartirs: 'ntn')] are
[['test', 1.17]]
[['test', 0.58], ['toy', 1.58]]
This does not seem to match what I get when I compute it directly with
tfidf (w, d) = tf x idf
where idf(term)=log (total number of documents / number of documents containing term)
tf = number of instances of word in d document / total number of words of d document
Example:
doc 1: 'test test'
for "test" word
tf= 1
idf= log(2/2) = 0
tf-idf = 0
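Can someone show me the calculation using my test texts above?
2) When I switch to cosine normalization (smartirs: 'ntc'), I get
[['test', 1.0]]
[['test', 0.35], ['toy', 0.94]]
Can someone show me the calculation for this as well?
Thanks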
import numpy as np
from gensim import corpora, models
from gensim.utils import simple_preprocess

test = ['test test', 'test toy']

# Tokenize each document and build the vocabulary
texts = [simple_preprocess(doc) for doc in test]
mydict = corpora.Dictionary(texts)

# Bag-of-words vectors: lists of (token_id, count) pairs
mycorpus = [mydict.doc2bow(doc, allow_update=True) for doc in texts]

# 'ntn' = natural (raw count) tf, idf weighting, no normalization
tfidf = models.TfidfModel(mycorpus, smartirs='ntn')

for doc in tfidf[mycorpus]:
    print([[mydict[token_id], np.around(weight, decimals=2)] for token_id, weight in doc])
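If you want to know the implementation details of models.TfidfModel, you can check them directly in the GitHub repository for gensim. The particular calculation scheme corresponding to smartirs='ntn' is described on the Wikipedia page for SMART Information Retrieval System. The exact calculations differ from the ones you used, hence the difference in the results.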
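For example, regarding the particular discrepancy you point out:
idf= log(2/2) = 0
The scores you report are consistent with a base-2 idf of log2((N+1)/n_k) instead, where N is the number of documents and n_k is the number of documents containing the term:
idf= log2((2+1)/2) ≈ 0.585
Note also that with 'n' as the first letter of the scheme, tf is the raw term count (2 for "test" in doc 1), not the count divided by the document length, so tf-idf = 2 x 0.585 ≈ 1.17.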
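A minimal sketch of that calculation in plain Python, assuming 'ntn' means raw term counts for tf, idf = log2((N + 1) / n_k), and no normalization. The idf formula is inferred from the scores reported in the question, not taken from a specific gensim release, so check it against the version you have installed. The helper name `ntn_scores` is made up for illustration.

```python
import math
from collections import Counter

def ntn_scores(docs):
    """Hypothetical helper: raw-count tf times log2((N + 1) / df), no normalization."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)  # raw term counts within this document
        scores.append({t: c * math.log2((n_docs + 1) / df[t]) for t, c in counts.items()})
    return scores

for doc_scores in ntn_scores(['test test', 'test toy']):
    print({term: round(w, 2) for term, w in doc_scores.items()})
```

Rounded to two decimals this gives 1.17 for "test" in doc 1, and 0.58 and 1.58 for "test" and "toy" in doc 2, which matches the output in the question.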
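I suggest you check both the implementation and the page mentioned above, to make sure your manual check follows the implementation for the smartirs flag you chose.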
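As for the second part of the question ('ntc'), the reported numbers are consistent with taking the unnormalized weights and dividing each document vector by its Euclidean (L2) length. A small sketch under that assumption, again assuming idf = log2((N + 1) / n_k) as inferred from the reported scores; `cosine_normalize` is an illustrative name, not a gensim function.

```python
import math

def cosine_normalize(weights):
    """Divide each weight by the vector's L2 norm (illustrative helper)."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

# Unnormalized weights: raw count times log2((N + 1) / n_k), with N = 2 (assumed formula)
doc1 = {'test': 2 * math.log2(3 / 2)}
doc2 = {'test': 1 * math.log2(3 / 2), 'toy': 1 * math.log2(3 / 1)}

for doc in (doc1, doc2):
    print({term: round(w, 2) for term, w in cosine_normalize(doc).items()})
```

Doc 1 has a single term, so its normalized weight is 1.0. For doc 2 the norm is sqrt(0.585^2 + 1.585^2) ≈ 1.69, giving about 0.35 for "test" and 0.94 for "toy".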