Is smooth_idf redundant?

From the scikit-learn documentation:

If smooth_idf=True (the default), the constant “1” is added to the numerator and denominator of the idf as if an extra document was seen containing every term in the collection exactly once, which prevents zero divisions: idf(d, t) = log [ (1 + n) / (1 + df(d, t)) ] + 1.
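
For concreteness, here is a minimal sketch checking that formula against the idf_ attribute a fitted vectorizer exposes (the two-document corpus is made up for illustration; get_feature_names_out assumes scikit-learn >= 1.0):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['apple mango', 'mango banana']  # n = 2 documents
vec = TfidfVectorizer(smooth_idf=True).fit(docs)

n = len(docs)
df = np.array([1, 1, 2])  # document frequencies of 'apple', 'banana', 'mango'
print(vec.get_feature_names_out())     # ['apple' 'banana' 'mango']
print(vec.idf_)                        # [1.40546511 1.40546511 1.        ]
print(np.log((1 + n) / (1 + df)) + 1)  # same values, computed by hand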

But why would df(d, t) = 0 ever happen? If a term doesn't appear in any document, the vocabulary wouldn't contain it in the first place, right?

This feature is useful with TfidfVectorizer. According to the documentation, the class accepts a predefined vocabulary. If a word in that vocabulary never occurs in the training data but shows up at transform time, smooth_idf lets it be handled gracefully:

from sklearn.feature_extraction.text import TfidfVectorizer

# 'orange' is in the vocabulary but never occurs in the training texts
train_texts = ['apple mango', 'mango banana']
test_texts = ['apple banana', 'mango orange']
vocab = ['apple', 'mango', 'banana', 'orange']

vectorizer1 = TfidfVectorizer(smooth_idf=True, vocabulary=vocab).fit(train_texts)
vectorizer2 = TfidfVectorizer(smooth_idf=False, vocabulary=vocab).fit(train_texts)
print(vectorizer1.transform(test_texts).todense())  # works okay
print(vectorizer2.transform(test_texts).todense())  # raises a ValueError

Output:

[[ 0.70710678  0.          0.70710678  0.        ]
 [ 0.          0.43016528  0.          0.90275015]]
...
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
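
The root cause is visible in the learned idf_ vectors. With smooth_idf=True, the unseen term 'orange' gets the finite idf log((1 + 2) / (1 + 0)) + 1 ≈ 2.10; with smooth_idf=False the formula is idf(d, t) = log [ n / df(d, t) ] + 1, so df = 0 produces inf, which then poisons the normalized tf-idf row. A quick check (assuming the snippet above has just run; column order follows vocab):

print(vectorizer1.idf_)
# [1.40546511 1.         1.40546511 2.09861229]  <- all finite
print(vectorizer2.idf_)
# [1.69314718 1.         1.69314718        inf]  <- 'orange' has df = 0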