Tfidfvectorizer - get features with weights from transform

Suppose I use the following for a single document:

text="bla agao haa"
singleTFIDF = TfidfVectorizer(analyzer='char_wb', ngram_range= 
(4,6),preprocessor=my_tokenizer, max_features=100).fit([text])

single=singleTFIDF.transform([text])
query = singleTFIDF.transform(["new coming document"])

If I understand this correctly, transform just uses the weights learned during fit. So for a new document, query contains the weight of each learned feature that occurs in that document; it looks like [[0, 0, 0.13, 0.4, 0]].

Since I am working with n-grams, I would also like to get the features of this new document, so that I know the weight of each feature in the new document.
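
For reference, a minimal sketch of how the feature names can be paired with the weights of the transformed row (this assumes the fitted singleTFIDF and query from the snippet above; get_feature_names_out is available in recent scikit-learn versions, older versions expose get_feature_names instead):

feature_names = singleTFIDF.get_feature_names_out()  # one n-gram per matrix column
weights = query.toarray()[0]                         # weights of the new document
# keep only the n-grams that actually occur in the new document
matched = {name: weight for name, weight in zip(feature_names, weights) if weight > 0}
print(matched)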

Edit:

In my example, I get the following arrays for single and query:

single
[[0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
  0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
  0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
  0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
  0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
  0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
  0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
  0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
  0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
  0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
  0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
  0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
  0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
  0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
  0.10721125 0.10721125 0.10721125]]
query
[[0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.57735027 0.57735027 0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.57735027 0.         0.
  0.         0.         0.        ]]

But this is strange, because in the learned corpus (single) all features have the weight 0.10721125. So how can one feature of the new document get a weight of 0.57735027?

Details on how Scikit-Learn calculates tfidf are given here; below is an example implementation using word n-grams.

import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Train the vectorizer
text="this is a simple example"
singleTFIDF = TfidfVectorizer(ngram_range=(1,2)).fit([text])
singleTFIDF.vocabulary_ # show the word-matrix position pairs

# Analyse the training string - text
single=singleTFIDF.transform([text])
single.toarray()  # displays the resulting matrix - all values are equal because all terms are present

# Analyse two new strings with the trained vectorizer
doc_1 = ['is this example working', 'hopefully it is a good example', 'no matching words here']

query = singleTFIDF.transform(doc_1)
query.toarray() # displays the resulting matrix - only matched terms have non-zero values

# Compute the cosine similarity between text and doc_1 - the second string has only two matching terms, therefore it has a lower similarity value
cos_similarity = cosine_similarity(single.A, query.A)

Output:

singleTFIDF.vocabulary_ 
Out[297]: 
{'this': 5,
 'is': 1,
 'simple': 3,
 'example': 0,
 'this is': 6,
 'is simple': 2,
 'simple example': 4}

single.toarray()
Out[299]: 
array([[0.37796447, 0.37796447, 0.37796447, 0.37796447, 0.37796447,
        0.37796447, 0.37796447]])

query.toarray()
Out[311]: 
array([[0.57735027, 0.57735027, 0.        , 0.        , 0.        ,
        0.57735027, 0.        ],
       [0.70710678, 0.70710678, 0.        , 0.        , 0.        ,
        0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        ]])

np.sum(np.square(query.toarray()), axis=1) # note how all rows with non-zero scores have been normalised to 1.
Out[3]: array([1., 1., 0.])

cos_similarity
Out[313]: array([[0.65465367, 0.53452248, 0.        ]])
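
As a quick sanity check of these numbers (my own arithmetic, not part of the original output): since every row is l2-normalised, the cosine similarity reduces to the number of shared terms times the product of the per-term weights.

import numpy as np
# the first string shares 3 of the 7 training terms, the second shares 2, the third shares none
print(3 * (1 / np.sqrt(7)) * (1 / np.sqrt(3)))  # 0.65465367
print(2 * (1 / np.sqrt(7)) * (1 / np.sqrt(2)))  # 0.53452248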

The new document has different weights because TfidfVectorizer normalises each row. In your example the training document has 87 active features, so after l2 normalisation each one gets the weight 1/sqrt(87) ≈ 0.10721125, while the new document matches only 3 features, each of which becomes 1/sqrt(3) ≈ 0.57735027. To switch this off, set the parameter norm to None; the default value of norm is 'l2'.
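
A minimal sketch of the effect of norm, reusing the toy example above (the exact values depend on your data): with norm=None the rows are no longer scaled to unit length, so the returned values are the raw tf*idf products and are directly comparable between the training document and new documents.

raw = TfidfVectorizer(ngram_range=(1, 2), norm=None).fit([text])  # same toy corpus as above
print(raw.transform([text]).toarray())
# matching terms get the same raw score (1.0) here, because each occurs exactly once in both strings
print(raw.transform(['is this example working']).toarray())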

To learn more about the effect of the norm, I suggest you have a look at my answer to this question.