How do I calculate TF-IDF of a query?

How do I calculate tf-idf for a query? I understand how to calculate tf-idf for a set of documents using the following definitions:

tf = occurrences in document / total words in document

idf = log(#documents / #documents where term occurs)

But I don't understand how that relates to queries.

For example, I read a resource that gives the following values for the query "life learning":

life | tf = .5 | idf = 1.405507153 | tf_idf = 0.702753576
learning | tf = .5 | idf = 1.405507153 | tf_idf = 0.702753576

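I understand the tf values: each term appears once out of two possible terms, so it is 1/2. But I have no idea where the idf comes from.
I would have thought #documents = 1 and occurrences = 1, so log(1) = 0 and the idf would be 0, but that does not seem to be the case. Is it based on whatever documents you are using? How do you calculate tf-idf for a query?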

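Only tf(life) depends on the query itself. The idf of a query term, however, depends on the background documents, so idf(life) = 1 + ln(3/2) ~= 1.405507153. This is why tf-idf is defined as the product of a local component (term frequency) and a global component (inverse document frequency).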
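The arithmetic in the answer above can be sketched in a few lines of Python. This is only an illustration: it assumes that "ln(3/2)" means a background corpus of 3 documents in which each query term occurs in 2, and the function name `query_tf_idf` and its parameters are made up for this example.

```python
import math

def query_tf_idf(query_terms, doc_freq, n_docs):
    """Weight each query term by tf (from the query) times idf (from the corpus)."""
    weights = {}
    for term in set(query_terms):
        # Local component: depends only on the query itself.
        tf = query_terms.count(term) / len(query_terms)
        # Global component: depends only on the background documents.
        # Uses the smoothed variant from the answer, idf = 1 + ln(N / df).
        idf = 1 + math.log(n_docs / doc_freq[term])
        weights[term] = tf * idf
    return weights

# Assumed corpus statistics: 3 documents, each query term occurs in 2 of them.
weights = query_tf_idf(["life", "learning"], {"life": 2, "learning": 2}, n_docs=3)
print(weights)
```

Under these assumptions both terms get the same weight, roughly 0.703, which agrees with the values quoted in the question to about three decimal places.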

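Suppose your query is best car insurance, your total vocabulary contains car, best, auto, insurance, and you have N=1,000,000 documents. Then your query looks like this:

One of your documents might be:

Now compute the cosine similarity between the TF-IDF vectors of the query and the document.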
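A minimal sketch of that last step, in plain Python. The vocabulary, the query, and N come from the answer; the document frequencies in `DOC_FREQ` and the sample document are invented purely for illustration, and the idf uses the question's definition log(N / df).

```python
import math

VOCAB = ["car", "best", "auto", "insurance"]
N_DOCS = 1_000_000
# Hypothetical document frequencies, invented for illustration only.
DOC_FREQ = {"car": 10_000, "best": 50_000, "auto": 5_000, "insurance": 1_000}

def tf_idf_vector(tokens):
    """Return one tf-idf weight per vocabulary term, in VOCAB order."""
    vec = []
    for term in VOCAB:
        tf = tokens.count(term) / len(tokens)
        idf = math.log(N_DOCS / DOC_FREQ[term])  # same idf for query and documents
        vec.append(tf * idf)
    return vec

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

query_vec = tf_idf_vector("best car insurance".split())
# Hypothetical document text, not taken from the original answer.
doc_vec = tf_idf_vector("car insurance auto insurance".split())
score = cosine_similarity(query_vec, doc_vec)
print(score)
```

The query and the document are vectorized with the same idf table, so the only thing that differs between them is the term-frequency part.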

Even though this question is marked as answered, I feel it was not answered completely. So in case someone needs this in the future:

But I have no idea where the idf comes from.

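This example, Project 3, part 2: Searching using TF-IDF, explains how to compute the cosine similarity between a query and a set of documents.

As @hypnoticpoisons stated, IDF is a global component, so a word's IDF is the same for every document: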

Note: technically, we are treating the query as if it were a new document. However, you should not recompute the IDF values: just use the ones you computed earlier.
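One way to follow that note is to compute the IDF table once from the document collection and then reuse it for the query. The sketch below assumes a tiny toy corpus invented for this example and the question's definition idf = log(N / df); `build_idf` and `tf_idf` are hypothetical helper names. Dropping query terms that never occur in the corpus is a choice made here, not something the quoted note prescribes.

```python
import math
from collections import Counter

def build_idf(docs):
    """Compute idf = ln(N / df) once, from the document collection only."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def tf_idf(tokens, idf):
    """Weight tokens with a precomputed idf table; unseen terms are dropped."""
    counts = Counter(tokens)
    return {t: (c / len(tokens)) * idf[t] for t, c in counts.items() if t in idf}

# Toy corpus, invented for illustration.
docs = [
    "the game of life is a game of learning".split(),
    "life is not always easy".split(),
    "never stop learning".split(),
]

idf = build_idf(docs)                         # computed once
doc_weights = [tf_idf(d, idf) for d in docs]  # documents use the table
query_weights = tf_idf(["life", "learning"], idf)  # the query reuses it unchanged
print(query_weights)
```

Because the query never contributes to the document counts, its presence does not change N or any document frequency, which is why the idf does not collapse to log(1) = 0.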

