How is the output for "good movie" in row 0 calculated?
Why is the result for "good movie" 0.707107? By my calculation it should be: 1/1 * ln(5/2) = 0.91629.

Code:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

texts = [
    "good movie", "not a good movie", "did not like",
    "i like it", "good one"
]
# using default tokenizer in TfidfVectorizer
tfidf = TfidfVectorizer(min_df=2, max_df=0.5, ngram_range=(1, 2))
features = tfidf.fit_transform(texts)
pd.DataFrame(
    features.todense(),
    # get_feature_names() was removed in sklearn 1.2; use get_feature_names_out()
    columns=tfidf.get_feature_names_out()
)
This happens because of the `norm` and `smooth_idf` parameters. By default, `smooth_idf=True` and `norm='l2'`. Disabling both shows the raw tf-idf values:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

texts = [
    "good movie", "not a good movie", "did not like",
    "i like it", "good one"
]
# using default tokenizer in TfidfVectorizer
tfidf = TfidfVectorizer(min_df=2, max_df=0.5, norm=None, smooth_idf=False,
                        ngram_range=(1, 2))
features = tfidf.fit_transform(texts)
pd.DataFrame(
    features.todense(),
    # get_feature_names() was removed in sklearn 1.2; use get_feature_names_out()
    columns=tfidf.get_feature_names_out()
)
Output:
good movie like movie not
0 1.916291 0.000000 1.916291 0.000000
1 1.916291 0.000000 1.916291 1.916291
2 0.000000 1.916291 0.000000 1.916291
3 0.000000 1.916291 0.000000 0.000000
4 0.000000 0.000000 0.000000 0.000000
With smooth_idf=False, sklearn computes idf with the formula log[n / df(t)] + 1. So it takes your value, ln(5/2) = 0.91629, and adds 1, giving 1.91629.
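You can verify that arithmetic directly with plain Python (n = 5 documents, and "good movie" appears in 2 of them):

```python
import math

# Unsmoothed idf: log[n / df(t)] + 1, natural log as in sklearn
n, df = 5, 2
idf = math.log(n / df) + 1  # ln(5/2) + 1
print(round(idf, 6))        # 1.916291 — matches the table above
```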
With smooth_idf=True (the default), the formula becomes log[(1 + n) / (1 + df(t))] + 1, so
tfidf = TfidfVectorizer(min_df=2, max_df=0.5, norm=None, smooth_idf=True, ngram_range=(1, 2))
outputs:
good movie like movie not
0 1.693147 0.000000 1.693147 0.000000
1 1.693147 0.000000 1.693147 1.693147
2 0.000000 1.693147 0.000000 1.693147
3 0.000000 1.693147 0.000000 0.000000
4 0.000000 0.000000 0.000000 0.000000
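As a sanity check, the 1.693147 entries in this table match the smoothed formula computed directly:

```python
import math

# Smoothed idf: log[(1 + n) / (1 + df(t))] + 1, with n = 5 docs, df = 2
n, df = 5, 2
idf_smooth = math.log((1 + n) / (1 + df)) + 1  # ln(6/3) + 1 = ln(2) + 1
print(round(idf_smooth, 6))                    # 1.693147
```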
So where does 0.707107 come from?
Look at the first row: 1.693147 appears twice (call it a). The l2 norm of that row is sqrt(a^2 + a^2) = sqrt(1.693147^2 + 1.693147^2) = sqrt(5.73349), which is about 2.3944. With norm='l2' (the default), each row is divided by its l2 norm, so 1.693147 / 2.3944 ≈ 0.707107, which is simply 1/sqrt(2).
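A minimal check of that normalization step:

```python
import math

a = math.log(2) + 1           # smoothed idf = 1.693147 for both nonzero terms in row 0
l2 = math.sqrt(a**2 + a**2)   # l2 norm of the row vector [a, 0, a, 0]
print(round(a / l2, 6))       # 0.707107, i.e. 1/sqrt(2)
```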