What does the matrix returned by TfidfVectorizer.transform(['word1 word2 word3']) mean, and how is it computed?

Question

To get a tf-idf matrix, I fitted sklearn.feature_extraction.text.TfidfVectorizer on 50,000 documents:

from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(stop_words=stop_words_file_list, smooth_idf=True)
# Kept sparse here: the snippets further down call .toarray() on slices of crops_vect
crops_vect = vec.fit_transform(crops)

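I understand that each row of crops_vect is a document and each column is a term extracted from the whole corpus, so crops_vect[document_id1] is the tf-idf vector of that document as fitted on the corpus. What I do not understand is what vec.transform(['america strong']).toarray() means: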

np.where(vec.transform(['america strong']).toarray())
>>>(array([0, 0]), array([112609, 195997]))

[i for i in vec.transform(['america strong']).toarray()[0] if i != 0]
>>>[0.675671442580281, 0.7372028904456914]

[i for i in vec.transform(['strong']).toarray()[0] if i != 0]
>>>[1]

I also looked at the column for the term 'strong' in the corpus matrix:

np.array([i for i in crops_vect.T[195997].toarray()[0] ])
>>>array([0., 0., 0., ..., 0., 0., 0.])
np.where(np.array([i for i in crops_vect.T[195997].toarray()[0] ]))
>>>array([   20,   239,   250,   272,   303,   786,   797,   836,   924,
         1202,  1218,  1613,  1645,  1651,  1662,  1670,  1673,  1688,
         1691,  1697,  1721,  1728,  1766,  1780,  1849,  1935,  1975,
         1988,  1999,  2017,  2018,  2199,  2344,  2354,  2721,  2752,
         2775,  2785,  2788,  2809,  2818,  2826,  2830,  2841,  2844,
         .....]

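My questions are:

1) I can see that vec.transform(['strong']).toarray() != crops_vect.T[195997].toarray(). What does vec.transform(['strong']).toarray() mean?

2) What does vec.transform(['word1','word2']) represent? Is it equivalent to adding a new document ['word1','word2'] to the previously fitted tf-idf matrix and then computing the tf-idf values for that new document?

3) How is vec.transform(['word1','word2']) computed internally?

Thanks.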

Answer 1

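You used TfidfVectorizer(stop_words=stop_words_file_list, smooth_idf=True), so the idf is computed as:

idf(t) = ln[ (1 + n) / (1 + df(t)) ] + 1

where n is the number of fitted documents and df(t) is the number of those documents that contain t. (The form without the added 1s, ln[ n / df(t) ] + 1, is what smooth_idf=False uses.)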
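A quick way to check the idf formula is to fit on a tiny corpus and compare the fitted idf_ values with ln[ (1 + n) / (1 + df(t)) ] + 1, which is the smooth_idf=True form. This is only a sketch: the three documents below are made up and stand in for crops, and no stop-word list is used.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for `crops`.
docs = ["america strong", "america great", "america economy"]

vec = TfidfVectorizer(smooth_idf=True)
X = vec.fit_transform(docs)

n = X.shape[0]                        # number of fitted documents
df = (X.toarray() > 0).sum(axis=0)    # documents containing each term

# Smoothed idf, as used when smooth_idf=True.
manual_idf = np.log((1 + n) / (1 + df)) + 1

print(sorted(vec.vocabulary_, key=vec.vocabulary_.get))
print(np.allclose(manual_idf, vec.idf_))
```

Calling .toarray() is fine for a corpus this small; on 50,000 documents the matrix should stay sparse.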

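vec.transform(['word1','word2']) is given two separate one-word documents, so it returns two one-hot row vectors stacked vertically (assuming both words are in the fitted vocabulary). A row with a single nonzero entry L2-normalizes to 1, which is why vec.transform(['strong']) gives [1].

vec.transform(['word1 word2']) is a single document made of the two words. It is not added to the corpus: the df and idf of word1 and word2 come from the already fitted documents crops and stay fixed. Each entry is tf × idf, giving v1 and v2, and the row is then L2-normalized to v1/sqrt(v1^2 + v2^2), v2/sqrt(v1^2 + v2^2).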
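The computation described here can be reproduced by hand on a small example. This is a minimal sketch under assumptions: the toy corpus is invented, the default settings (norm='l2', no sublinear tf, no stop words) are used, and the variable names are only illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for `crops`.
docs = ["america strong", "america great", "america economy"]
vec = TfidfVectorizer(smooth_idf=True).fit(docs)
idf_before = vec.idf_.copy()

# Two strings -> two one-word documents -> two one-hot rows.
two_docs = vec.transform(["america", "strong"]).toarray()

# One string -> one document containing both words.
one_doc = vec.transform(["america strong"]).toarray()[0]

# Manual version: tf * idf per term, then L2 normalization.
i = vec.vocabulary_["america"]
j = vec.vocabulary_["strong"]
v = np.zeros(len(vec.vocabulary_))
v[i] = 1 * vec.idf_[i]   # tf = 1, idf learned during fit
v[j] = 1 * vec.idf_[j]
manual = v / np.linalg.norm(v)

print(two_docs)
print(np.allclose(manual, one_doc))
# transform() does not refit: the learned idf values are unchanged.
print(np.array_equal(idf_before, vec.idf_))
```

The same reasoning explains the output in the question: the two nonzero values for 'america strong' differ because the two terms have different idf values in the fitted corpus.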


python

scikit-learn

tfidfvectorizer